抽样没有替换的算法？

我试图测试一个特定的数据聚类偶然发生的可能性。一种可靠的方法是蒙特卡罗模拟，其中数据和组之间的关联被随机重新分配很多次（例如10,000次），并且使用聚类度量来比较实际数据和模拟以确定ap值。抽样没有替换的算法？

我已经得到了大部分的工作，指针将分组映射到数据元素，所以我打算随机重新分配指向数据的指针。问题：什么是快速取样而不替换的方法，以便每个指针在重复数据集中随机重新分配？

例如（这些数据只是一个简单的例子）：

数据（N = 12个值） - 基团A：0.1，0.2，0.4/B组：0.5，0.6，0.8/C组：0.4,0.5 /组D：0.2,0.2,0.3,0.5

对于每个重复数据集，我将具有相同的簇大小（A = 3，B = 3，C = 2，D = 4 ）和数据值，但会将值重新分配给群集。

为此，我可以生成1-12范围内的随机数，分配A组的第一个元素，然后生成1-11范围内的随机数并分配A组中的第二个元素，依此类推。指针重新分配很快，并且我将预先分配所有数据结构，但没有替换的抽样看起来像是一个可能在以前解决了很多次的问题。

首选逻辑或伪代码。

来源

2008-11-22 Argalatyr

看到我对这个问题的回答Unique (non-repeating) random numbers in O(1)?。同样的逻辑应该完成你正在寻找的东西。

来源

2008-11-22 20:00:24

非常好！对不起，我没有看到答案时，我搜索SO（无取代取样，统计，算法等）。也许这会像一个元问题来引导像我这样的人到你的原始答案。干杯! – Argalatyr 2008-11-22 20:07:22

下面是根据Knuth的书Seminumeric Algorithms的算法3.4.2S进行抽样而无需替换的一些代码。

void SampleWithoutReplacement 
(
    int populationSize, // size of set sampling from 
    int sampleSize,  // size of each sample 
    vector<int> & samples // output, zero-offset indicies to selected items 
) 
{ 
    // Use Knuth's variable names 
    int& n = sampleSize; 
    int& N = populationSize; 

    int t = 0; // total input records dealt with 
    int m = 0; // number of items selected so far 
    double u; 

    while (m < n) 
    { 
     u = GetUniform(); // call a uniform(0,1) random number generator 

     if ((N - t)*u >= n - m) 
     { 
      t++; 
     } 
     else 
     { 
      samples[m] = t; 
      t++; m++; 
     } 
    } 
}

有杰弗里·斯科特·维特在一个更有效，但更复杂的方法“为顺序随机抽样的高效算法，”在数学软件ACM交易，13（1），1987年3月，58-67。

来源

2008-11-22 20:08:14

我还没有这本书，并且无法证明算法对我自己的正确性。我用java实现了它，并且检查了以统一概率抽样的人口项目。结果令人信服。看到这[gist]（https://gist.github.com/10864989） – Alban 2014-04-16 12:28:49

在Mathematica中不加批判地实施Vitter的方法D比内置算法快几个数量级。我在这里描述它：http://tinyurl.com/lbldlpq – 2014-09-06 23:46:04

描述了另一种无替换采样算法here。

它与John D. Cook在他的回答和Knuth中描述的类似，但它有不同的假设：人口规模是未知的，但样本可以适应记忆。这个被称为“Knuth的算法S”。

引述rosettacode文章：

选择前n项的，因为他们成为可用的样本;

对于i> n的第i项，有n/i的随机机率。如果失败，样本保持不变。如果不是，请随机（1/n）替换之前选择的样本中的一个样本。

对任何后续项目重复＃2。

来源

2014-04-16 12:42:22 Alban

基于answer by John D. Cook的C++工作代码。

#include <random> 
#include <vector> 

double GetUniform() 
{ 
    static std::default_random_engine re; 
    static std::uniform_real_distribution<double> Dist(0,1); 
    return Dist(re); 
} 

// John D. Cook, https://stackoverflow.com/a/311716/15485 
void SampleWithoutReplacement 
(
    int populationSize, // size of set sampling from 
    int sampleSize,  // size of each sample 
    std::vector<int> & samples // output, zero-offset indicies to selected items 
) 
{ 
    // Use Knuth's variable names 
    int& n = sampleSize; 
    int& N = populationSize; 

    int t = 0; // total input records dealt with 
    int m = 0; // number of items selected so far 
    double u; 

    while (m < n) 
    { 
     u = GetUniform(); // call a uniform(0,1) random number generator 

     if ((N - t)*u >= n - m) 
     { 
      t++; 
     } 
     else 
     { 
      samples[m] = t; 
      t++; m++; 
     } 
    } 
} 

#include <iostream> 
int main(int,char**) 
{ 
    const size_t sz = 10; 
    std::vector<int> samples(sz); 
    SampleWithoutReplacement(10*sz,sz,samples); 
    for (size_t i = 0; i < sz; i++) { 
    std::cout << samples[i] << "\t"; 
    } 

    return 0; 
}

来源

2014-08-07 10:15:47

受@John D. Cook's answer的启发，我在Nim中写了一个实现。起初我很难理解它是如何工作的，所以我广泛地评论了一个例子。也许这有助于理解这个想法。另外，我已经稍微改变了变量名称。

iterator uniqueRandomValuesBelow*(N, M: int) = 
    ## Returns a total of M unique random values i with 0 <= i < N 
    ## These indices can be used to construct e.g. a random sample without replacement 
    assert(M <= N) 

    var t = 0 # total input records dealt with 
    var m = 0 # number of items selected so far 

    while (m < M): 
    let u = random(1.0) # call a uniform(0,1) random number generator 

    # meaning of the following terms: 
    # (N - t) is the total number of remaining draws left (initially just N) 
    # (M - m) is the number how many of these remaining draw must be positive (initially just M) 
    # => Probability for next draw = (M-m)/(N-t) 
    # i.e.: (required positive draws left)/(total draw left) 
    # 
    # This is implemented by the inequality expression below: 
    # - the larger (M-m), the larger the probability of a positive draw 
    # - for (N-t) == (M-m), the term on the left is always smaller => we will draw 100% 
    # - for (N-t) >> (M-m), we must get a very small u 
    # 
    # example: (N-t) = 7, (M-m) = 5 
    # => we draw the next with prob 5/7 
    # lets assume the draw fails 
    # => t += 1 => (N-t) = 6 
    # => we draw the next with prob 5/6 
    # lets assume the draw succeeds 
    # => t += 1, m += 1 => (N-t) = 5, (M-m) = 4 
    # => we draw the next with prob 4/5 
    # lets assume the draw fails 
    # => t += 1 => (N-t) = 4 
    # => we draw the next with prob 4/4, i.e., 
    # we will draw with certainty from now on 
    # (in the next steps we get prob 3/3, 2/2, ...) 
    if (N - t)*u >= (M - m).toFloat: # this is essentially a draw with P = (M-m)/(N-t) 
     # no draw -- happens mainly for (N-t) >> (M-m) and/or high u 
     t += 1 
    else: 
     # draw t -- happens when (M-m) gets large and/or low u 
     yield t # this is where we output an index, can be used to sample 
     t += 1 
     m += 1 

# example use 
for i in uniqueRandomValuesBelow(100, 5): 
    echo i

来源

2015-05-13 11:51:42 bluenote10

当人口尺寸比样品大小大得多，上述算法变得低效的，因为它们具有复杂ø（Ñ）Ñ是种群规模。

当我是学生我写为均匀采样一些算法无需更换，其具有平均复杂度ø（小号日志小号），其中小号是样本大小。下面是二进制树算法的代码，具有平均复杂ø（小号日志小号）中，R：

# The Tree growing algorithm for uniform sampling without replacement 
# by Pavel Ruzankin 
quicksample = function (n,size) 
# n - the number of items to choose from 
# size - the sample size 
{ 
    s=as.integer(size) 
    if (s>n) { 
    stop("Sample size is greater than the number of items to choose from") 
    } 
    # upv=integer(s) #level up edge is pointing to 
    leftv=integer(s) #left edge is poiting to; must be filled with zeros 
    rightv=integer(s) #right edge is pointig to; must be filled with zeros 
    samp=integer(s) #the sample 
    ordn=integer(s) #relative ordinal number 

    ordn[1L]=1L #initial value for the root vertex 
    samp[1L]=sample(n,1L) 
    if (s > 1L) for (j in 2L:s) { 
    curn=sample(n-j+1L,1L) #current number sampled 
    curordn=0L #currend ordinal number 
    v=1L #current vertice 
    from=1L #how have come here: 0 - by left edge, 1 - by right edge 
    repeat { 
     curordn=curordn+ordn[v] 
     if (curn+curordn>samp[v]) { #going down by the right edge 
     if (from == 0L) { 
      ordn[v]=ordn[v]-1L 
     } 
     if (rightv[v]!=0L) { 
      v=rightv[v] 
      from=1L 
     } else { #creating a new vertex 
      samp[j]=curn+curordn 
      ordn[j]=1L 
      # upv[j]=v 
      rightv[v]=j 
      break 
     } 
     } else { #going down by the left edge 
     if (from==1L) { 
      ordn[v]=ordn[v]+1L 
     } 
     if (leftv[v]!=0L) { 
      v=leftv[v] 
      from=0L 
     } else { #creating a new vertex 
      samp[j]=curn+curordn-1L 
      ordn[j]=-1L 
      # upv[j]=v 
      leftv[v]=j 
      break 
     } 
     } 
    } 
    } 
    return(samp) 
}

的这个算法的复杂性中讨论： Rouzankin，P. S .; Voytishek，A. V.关于随机选择算法的代价。蒙特卡洛方法Appl。 5（1999），no。 1，39-54。 http://dx.doi.org/10.1515/mcma.1999.5.1.39

如果您发现该算法有用，请进行参考。

另见： P. Gupta，G. P. Bhattacharjee。（1984）一种高效的无替换随机抽样算法。国际计算机数学杂志16：4，201-209页。 DOI：10.1080/00207168408803438

Teuhola，J.和Nevalainen，1982年。两个有效的算法，用于无替代随机抽样。/IJCM /，11（2）：127-140。 DOI：10.1080/00207168208803304

在过去的论文作者使用哈希表，并声称他们的算法有Ø（小号）的复杂性。还有一个更快的哈希表算法，它将很快在pqR（相当快的R）中实现： https://stat.ethz.ch/pipermail/r-devel/2017-October/075012.html

来源

2017-10-18 09:38:13

抽样没有替换的算法？

回答

相关问题