Spark :: KMeans调用两次takeSample（）？

我有很多数据，我已经尝试过基数分区[20k，200k +]。Spark :: KMeans调用两次takeSample（）？

我把它叫做这样的：

from pyspark.mllib.clustering import KMeans, KMeansModel 
C0 = KMeans.train(first, 8192, initializationMode='random', maxIterations=10, seed=None) 
C0 = KMeans.train(second, 8192, initializationMode='random', maxIterations=10, seed=None)

，我看到initRandom()调用takeSample()一次。

然后takeSample()实现似乎并没有自称或类似的东西，所以我希望KMeans()调用takeSample()一次。那么为什么监视器显示两个takeSample() s每KMeans()？

注：我执行更KMeans()，他们都援引2个takeSample() S，不管数据是.cache()“d与否。

此外，分区不影响数takeSample()叫的数量，这是不断为2

我使用星火1.6.2（我不能升级）和我的应用程序是在Python，如果那很重要！

我把这个给星火开发者的邮件列表，所以我更新：1日takeSample()

详情：次takeSample()

详情：

可以看到执行相同的代码。

来源

2016-08-17 gsamaras

_{如星火的邮件列表建议由Shivaram卡塔拉曼：}

我觉得takeSample本身运行多个任务，如果样品的第一遍收集的量是远远不够的。应该解释何时发生这种情况的注释和代码路径在GitHub 。您也可以通过来确认这一点，检查logWarning是否显示在您的日志中。

// If the first sample didn't turn out large enough, keep trying to take samples; 
// this shouldn't happen often because we use a big multiplier for the initial size 
var numIters = 0 
while (samples.length < num) { 
    logWarning(s"Needed to re-sample due to insufficient sample size. Repeat #$numIters") 
    samples = this.sample(withReplacement, fraction, rand.nextInt()).collect() 
    numIters += 1 
}

然而，正如人们可以看到，第二评论说，这不应该经常发生，而且它总是发生在我身上，所以如果任何人有另外一种想法，请让我知道。

也有人认为这是UI的问题，takeSample()实际上只被调用过一次，但那只是热空气。

来源

2016-09-01 23:03:39 gsamaras

Spark :: KMeans调用两次takeSample（）？

回答

相关问题