2008-11-22 44 views
15

我试图测试一个特定的数据聚类偶然发生的可能性。一种可靠的方法是蒙特卡罗模拟,其中数据和组之间的关联被随机重新分配很多次(例如10,000次),并且使用聚类度量来比较实际数据和模拟以确定ap值。抽样没有替换的算法?

我已经得到了大部分的工作,指针将分组映射到数据元素,所以我打算随机重新分配指向数据的指针。问题:什么是快速取样而不替换的方法,以便每个指针在重复数据集中随机重新分配?

例如(这些数据只是一个简单的例子):

数据(N = 12个值) - 基团A:0.1,0.2,0.4/B组:0.5,0.6,0.8/C组:0.4,0.5 /组D:0.2,0.2,0.3,0.5

对于每个重复数据集,我将具有相同的簇大小(A = 3,B = 3,C = 2,D = 4 )和数据值,但会将值重新分配给群集。

为此,我可以生成1-12范围内的随机数,分配A组的第一个元素,然后生成1-11范围内的随机数并分配A组中的第二个元素,依此类推。指针重新分配很快,并且我将预先分配所有数据结构,但没有替换的抽样看起来像是一个可能在以前解决了很多次的问题。

首选逻辑或伪代码。

回答

5

看到我对这个问题的回答Unique (non-repeating) random numbers in O(1)?。同样的逻辑应该完成你正在寻找的东西。

+0

非常好!对不起,我没有看到答案时,我搜索SO(无取代取样,统计,算法等)。也许这会像一个元问题来引导像我这样的人到你的原始答案。干杯! – Argalatyr 2008-11-22 20:07:22

35

下面是根据Knuth的书Seminumeric Algorithms的算法3.4.2S进行抽样而无需替换的一些代码。

void SampleWithoutReplacement 
(
    int populationSize, // size of set sampling from 
    int sampleSize,  // size of each sample 
    vector<int> & samples // output, zero-offset indicies to selected items 
) 
{ 
    // Use Knuth's variable names 
    int& n = sampleSize; 
    int& N = populationSize; 

    int t = 0; // total input records dealt with 
    int m = 0; // number of items selected so far 
    double u; 

    while (m < n) 
    { 
     u = GetUniform(); // call a uniform(0,1) random number generator 

     if ((N - t)*u >= n - m) 
     { 
      t++; 
     } 
     else 
     { 
      samples[m] = t; 
      t++; m++; 
     } 
    } 
} 

有杰弗里·斯科特·维特在一个更有效,但更复杂的方法“为顺序随机抽样的高效算法,”在数学软件ACM交易,13(1),1987年3月,58-67。

+0

我还没有这本书,并且无法证明算法对我自己的正确性。我用java实现了它,并且检查了以统一概率抽样的人口项目。结果令人信服。看到这[gist](https://gist.github.com/10864989) – Alban 2014-04-16 12:28:49

+0

在Mathematica中不加批判地实施Vitter的方法D比内置算法快几个数量级。我在这里描述它:http://tinyurl.com/lbldlpq – 2014-09-06 23:46:04

0

描述了另一种无替换采样算法here

它与John D. Cook在他的回答和Knuth中描述的类似,但它有不同的假设:人口规模是未知的,但样本可以适应记忆。这个被称为“Knuth的算法S”。

引述rosettacode文章:

  1. 选择前n项的,因为他们成为可用的样本;
  2. 对于i> n的第i项,有n/i的随机机率。如果失败,样本保持不变。如果 不是,请随机(1/n)替换之前选择的样本中的一个样本。
  3. 对任何后续项目重复#2。
7

基于answer by John D. Cook的C++工作代码。

#include <random> 
#include <vector> 

double GetUniform() 
{ 
    static std::default_random_engine re; 
    static std::uniform_real_distribution<double> Dist(0,1); 
    return Dist(re); 
} 

// John D. Cook, https://stackoverflow.com/a/311716/15485 
void SampleWithoutReplacement 
(
    int populationSize, // size of set sampling from 
    int sampleSize,  // size of each sample 
    std::vector<int> & samples // output, zero-offset indicies to selected items 
) 
{ 
    // Use Knuth's variable names 
    int& n = sampleSize; 
    int& N = populationSize; 

    int t = 0; // total input records dealt with 
    int m = 0; // number of items selected so far 
    double u; 

    while (m < n) 
    { 
     u = GetUniform(); // call a uniform(0,1) random number generator 

     if ((N - t)*u >= n - m) 
     { 
      t++; 
     } 
     else 
     { 
      samples[m] = t; 
      t++; m++; 
     } 
    } 
} 

#include <iostream> 
int main(int,char**) 
{ 
    const size_t sz = 10; 
    std::vector<int> samples(sz); 
    SampleWithoutReplacement(10*sz,sz,samples); 
    for (size_t i = 0; i < sz; i++) { 
    std::cout << samples[i] << "\t"; 
    } 

    return 0; 
} 
2

@John D. Cook's answer的启发,我在Nim中写了一个实现。起初我很难理解它是如何工作的,所以我广泛地评论了一个例子。也许这有助于理解这个想法。另外,我已经稍微改变了变量名称。

iterator uniqueRandomValuesBelow*(N, M: int) = 
    ## Returns a total of M unique random values i with 0 <= i < N 
    ## These indices can be used to construct e.g. a random sample without replacement 
    assert(M <= N) 

    var t = 0 # total input records dealt with 
    var m = 0 # number of items selected so far 

    while (m < M): 
    let u = random(1.0) # call a uniform(0,1) random number generator 

    # meaning of the following terms: 
    # (N - t) is the total number of remaining draws left (initially just N) 
    # (M - m) is the number how many of these remaining draw must be positive (initially just M) 
    # => Probability for next draw = (M-m)/(N-t) 
    # i.e.: (required positive draws left)/(total draw left) 
    # 
    # This is implemented by the inequality expression below: 
    # - the larger (M-m), the larger the probability of a positive draw 
    # - for (N-t) == (M-m), the term on the left is always smaller => we will draw 100% 
    # - for (N-t) >> (M-m), we must get a very small u 
    # 
    # example: (N-t) = 7, (M-m) = 5 
    # => we draw the next with prob 5/7 
    # lets assume the draw fails 
    # => t += 1 => (N-t) = 6 
    # => we draw the next with prob 5/6 
    # lets assume the draw succeeds 
    # => t += 1, m += 1 => (N-t) = 5, (M-m) = 4 
    # => we draw the next with prob 4/5 
    # lets assume the draw fails 
    # => t += 1 => (N-t) = 4 
    # => we draw the next with prob 4/4, i.e., 
    # we will draw with certainty from now on 
    # (in the next steps we get prob 3/3, 2/2, ...) 
    if (N - t)*u >= (M - m).toFloat: # this is essentially a draw with P = (M-m)/(N-t) 
     # no draw -- happens mainly for (N-t) >> (M-m) and/or high u 
     t += 1 
    else: 
     # draw t -- happens when (M-m) gets large and/or low u 
     yield t # this is where we output an index, can be used to sample 
     t += 1 
     m += 1 

# example use 
for i in uniqueRandomValuesBelow(100, 5): 
    echo i 
1

当人口尺寸比样品大小大得多,上述算法变得低效的,因为它们具有复杂øÑÑ是种群规模。

当我是学生我写为均匀采样一些算法无需更换,其具有平均复杂度ø小号日志小号),其中小号是样本大小。下面是二进制树算法的代码,具有平均复杂ø小号日志小号)中,R:

# The Tree growing algorithm for uniform sampling without replacement 
# by Pavel Ruzankin 
quicksample = function (n,size) 
# n - the number of items to choose from 
# size - the sample size 
{ 
    s=as.integer(size) 
    if (s>n) { 
    stop("Sample size is greater than the number of items to choose from") 
    } 
    # upv=integer(s) #level up edge is pointing to 
    leftv=integer(s) #left edge is poiting to; must be filled with zeros 
    rightv=integer(s) #right edge is pointig to; must be filled with zeros 
    samp=integer(s) #the sample 
    ordn=integer(s) #relative ordinal number 

    ordn[1L]=1L #initial value for the root vertex 
    samp[1L]=sample(n,1L) 
    if (s > 1L) for (j in 2L:s) { 
    curn=sample(n-j+1L,1L) #current number sampled 
    curordn=0L #currend ordinal number 
    v=1L #current vertice 
    from=1L #how have come here: 0 - by left edge, 1 - by right edge 
    repeat { 
     curordn=curordn+ordn[v] 
     if (curn+curordn>samp[v]) { #going down by the right edge 
     if (from == 0L) { 
      ordn[v]=ordn[v]-1L 
     } 
     if (rightv[v]!=0L) { 
      v=rightv[v] 
      from=1L 
     } else { #creating a new vertex 
      samp[j]=curn+curordn 
      ordn[j]=1L 
      # upv[j]=v 
      rightv[v]=j 
      break 
     } 
     } else { #going down by the left edge 
     if (from==1L) { 
      ordn[v]=ordn[v]+1L 
     } 
     if (leftv[v]!=0L) { 
      v=leftv[v] 
      from=0L 
     } else { #creating a new vertex 
      samp[j]=curn+curordn-1L 
      ordn[j]=-1L 
      # upv[j]=v 
      leftv[v]=j 
      break 
     } 
     } 
    } 
    } 
    return(samp) 
} 

的这个算法的复杂性中讨论: Rouzankin,P. S .; Voytishek,A. V.关于随机选择算法的代价。蒙特卡洛方法Appl。 5(1999),no。 1,39-54。 http://dx.doi.org/10.1515/mcma.1999.5.1.39

如果您发现该算法有用,请进行参考。

另见: P. Gupta,G. P. Bhattacharjee。 (1984)一种高效的无替换随机抽样算法。国际计算机数学杂志16:4,201-209页。 DOI:10.1080/00207168408803438

Teuhola,J.和Nevalainen,1982年。两个有效的算法,用于无替代随机抽样。/IJCM /,11(2):127-140。 DOI:10.1080/00207168208803304

在过去的论文作者使用哈希表,并声称他们的算法有Ø小号)的复杂性。还有一个更快的哈希表算法,它将很快在pqR(相当快的R)中实现: https://stat.ethz.ch/pipermail/r-devel/2017-October/075012.html