Random sampling without replacement in longitudinal data

My data are longitudinal:

VISIT ID VAR1 
1  001 ... 
1  002 ... 
1  003 ... 
1  004 ... 
... 
2  001 ... 
2  002 ... 
2  003 ... 
2  004 ...

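Our end goal is to pick 10% of each visit for testing. I tried PROC SURVEYSELECT to do simple random sampling (SRS) without replacement, using VISIT as the stratum, but the final sample contains duplicate IDs. For example, ID = 001 can be selected in both VISIT = 1 and VISIT = 2.

Is there a way to do this with SURVEYSELECT or another procedure (R is fine too)? Many thanks.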
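
A minimal Python sketch of why this happens, assuming toy data of 15 visits that all share the same 20 IDs (the function name `stratified_srs` and the data are hypothetical, and this is not the SURVEYSELECT call itself): sampling without replacement inside each visit places no constraint across visits, so the same ID can be drawn again in a later visit.

```python
import random


def stratified_srs(rows, rate=0.1, seed=None):
    """Draw an SRS without replacement inside each visit, independently per visit.

    rows: iterable of (visit, id) pairs, with ids unique within a visit.
    """
    rng = random.Random(seed)
    visits = {}
    for visit, id_ in rows:
        visits.setdefault(visit, []).append(id_)
    sample = []
    for visit, ids in sorted(visits.items()):
        k = round(len(ids) * rate)  # 10% of this visit
        sample.extend((visit, i) for i in rng.sample(ids, k))
    return sample


# Toy data (assumption): 15 visits, each containing ids 1..20.
rows = [(v, i) for v in range(1, 16) for i in range(1, 21)]
picked = stratified_srs(rows, seed=1)
n_unique = len({i for _, i in picked})
```

Here 2 IDs are drawn from each of 15 visits, so 30 rows are picked from only 20 distinct IDs and at least one ID must appear in more than one visit.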

Source

2017-09-15 Sailynette Garcia

So you want to take 10% from each visit, but all IDs in the final data set should be unique? – useR

Yes, exactly as you said. –

As long as the IDs are unique within each visit, you can use ave: 'dat$pickUp <- ave(as.numeric(dat$VISIT), dat$VISIT, FUN = function(x) sample(c(TRUE, FALSE), length(x), prob = c(.1, .9), replace = TRUE))'. – lmo

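This is possible with some fairly creative data step programming. The code below uses a greedy approach: it samples from each visit in turn, drawing only from IDs that have not been sampled in an earlier visit. If more than 90% of the IDs in a visit have already been sampled, fewer than 10% of that visit's rows are output. In the extreme case where every ID in a visit has already been sampled, no rows are output for that visit.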

/*Create some test data*/
data test_data;
    call streaminit(1);
    do visit = 1 to 1000;
        do id = 1 to ceil(rand('uniform')*1000);
            output;
        end;
    end;
run;

data sample;
    /*Create a hash object to keep track of unique IDs not sampled yet*/
    if 0 then set test_data;
    call streaminit(0);
    if _n_ = 1 then do;
        declare hash h();
        rc = h.definekey('id');
        rc = h.definedata('available');
        rc = h.definedone();
    end;
    /*Find out how many not-previously-sampled ids there are for the current visit*/
    do ids_per_visit = 1 by 1 until(last.visit);
        set test_data;
        by visit;
        if h.find() ne 0 then do;
            available = 1;
            rc = h.add();
        end;
        available_per_visit = sum(available_per_visit,available);
    end;
    /*Read through the current visit again, randomly sampling from the not-yet-sampled ids*/
    samprate = 0.1;
    number_to_sample = round(available_per_visit * samprate,1);
    do _n_ = 1 to ids_per_visit;
        set test_data;
        if available_per_visit > 0 then do;
            rc = h.find();
            if available = 1 then do;
                if rand('uniform') < number_to_sample/available_per_visit then do;
                    available = 0;
                    rc = h.replace();
                    samples_per_visit = sum(samples_per_visit,1);
                    output;
                    number_to_sample = number_to_sample - 1;
                end;
                available_per_visit = available_per_visit - 1;
            end;
        end;
    end;
run;

/*Check that there are no duplicate IDs*/
proc sort data = sample out = sample_dedup nodupkey;
    by id;
run;
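
For readers without SAS, the same greedy idea can be sketched in Python. This is an illustration under stated assumptions rather than a translation of the data step: the function name `greedy_unique_sample` and the toy data are hypothetical, IDs are assumed to be unique within a visit, and the per-visit sample size is rounded half up to approximate `round(x, 1)` in SAS.

```python
import math
import random
from collections import defaultdict


def greedy_unique_sample(rows, rate=0.1, seed=None):
    """Sample about `rate` of each visit so that no id is picked twice overall.

    rows: iterable of (visit, id) pairs, with ids unique within a visit.
    Visits are processed in sorted order; ids picked in an earlier visit
    are excluded from later visits (the greedy step).
    """
    rng = random.Random(seed)
    by_visit = defaultdict(list)
    for visit, id_ in rows:
        by_visit[visit].append(id_)

    used = set()   # ids already sampled in an earlier visit
    sample = []
    for visit in sorted(by_visit):
        available = [i for i in by_visit[visit] if i not in used]
        # Round half up (assumption: mirrors round(x, 1) in SAS).
        k = math.floor(len(available) * rate + 0.5)
        chosen = rng.sample(available, k)  # SRS without replacement
        used.update(chosen)
        sample.extend((visit, i) for i in chosen)
    return sample


# Toy data (assumption): 5 visits, each containing ids 1..40.
rows = [(visit, id_) for visit in range(1, 6) for id_ in range(1, 41)]
sample = greedy_unique_sample(rows, rate=0.1, seed=1)
ids = [id_ for _, id_ in sample]
```

As in the SAS version, the rate applies to the IDs still available in each visit, so later visits can contribute fewer than 10% of their rows.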

Source

2017-10-12 15:23:20 user667489
