I want to model a genetics problem we're trying to solve, building up to it in steps. I can successfully run the PiAverage example from the Spark samples. That example "throws darts" at a circle (10^6 throws in our case) and counts the number that "land in the circle" to estimate pi. Can I easily repeat/nest a SparkContext.parallelize in Apache Spark?
Suppose I want to repeat that process 1000 times (in parallel) and average all the estimates. I'm trying to see the best approach: should there be two parallelize calls? Nested calls? Is there some way to chain map or reduce calls together? I can't see it.
I'd like to know the wisdom of something like the idea below. I thought of tracking the resulting estimates with an accumulator. jsc is my JavaSparkContext, and the full code of a single run is at the end of the question. Thanks for any input!
Accumulator<Double> accum = jsc.accumulator(0.0);
// make a list 1000 long to pass to parallelize (no for loops in Spark, right?)
List<Integer> numberOfEstimates = new ArrayList<Integer>(HOW_MANY_ESTIMATES);
// pass this "dummy list" to parallelize, which then
// calls a pieceOfPi method to produce each individual estimate,
// accumulating the estimates. pieceOfPi would contain a
// parallelize call too, with the individual test in the code at the end
jsc.parallelize(numberOfEstimates).foreach(new VoidFunction<Integer>() {
    public void call(Integer i) {
        accum.add(pieceOfPi(jsc, numList, slices, HOW_MANY_ESTIMATES));
    }
});
// get the total of the pi estimates and print their average
double totalPi = accum.value();
// output the average of averages
System.out.println("The average of " + HOW_MANY_ESTIMATES
        + " estimates of Pi is " + totalPi / HOW_MANY_ESTIMATES);
It doesn't seem like a matrix or the other answers I've seen here answer this specific question. I've done several searches, but I don't see how to do this without "parallelizing a parallelize." Is that a bad idea?
(And yes, I realize mathematically I could just do more estimates and effectively get the same result :) I'm trying to build the structure my boss wants. Thanks again!
I've put my whole single-test program here in case that helps; it's what I was using to test the accumulator. The core of this would become pieceOfPi():
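To make the structure I'm after concrete, here is a plain-Java sketch (no Spark) with a hypothetical pieceOfPi helper: the outer loop over estimates is the only part that would become the single parallelize call, while the dart-throwing inner loop stays as ordinary local Java inside each task, so no SparkContext would be needed on the workers. The seed is only there so runs are reproducible:

```java
import java.util.Random;

public class PiSketch {

    // hypothetical pieceOfPi: one complete Monte Carlo estimate of pi,
    // computed with a plain loop -- note it needs no SparkContext
    static double pieceOfPi(Random rng, int howManyDarts) {
        long insideCircle = 0;
        for (int d = 0; d < howManyDarts; d++) {
            double x = rng.nextDouble();
            double y = rng.nextDouble();
            if (x * x + y * y < 1) {
                insideCircle++;
            }
        }
        return 4.0 * insideCircle / howManyDarts;
    }

    public static void main(String[] args) {
        final int HOW_MANY_DARTS = 100000;
        final int HOW_MANY_ESTIMATES = 100;
        Random rng = new Random(42); // seeded only for reproducibility
        double total = 0.0;
        // this outer loop is the only part that would become a parallelize call
        for (int i = 0; i < HOW_MANY_ESTIMATES; i++) {
            total += pieceOfPi(rng, HOW_MANY_DARTS);
        }
        System.out.println("Average of " + HOW_MANY_ESTIMATES
                + " estimates of Pi: " + total / HOW_MANY_ESTIMATES);
    }
}
```

In Spark terms this would presumably be a single jsc.parallelize(numberOfEstimates).map(...) with the body of pieceOfPi inside the map function, followed by a reduce (or an accumulator) to sum the estimates -- rather than a parallelize inside a parallelize.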
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.Accumulable;
import org.apache.spark.Accumulator;
import org.apache.spark.SparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.storage.StorageLevel;
import org.apache.spark.SparkConf;
public class PiAverage implements Serializable {

    public static void main(String[] args) {
        PiAverage pa = new PiAverage();
        pa.go();
    }

    public void go() {
        // should be a parameter, like all these finals should be:
        // int slices = (args.length == 1) ? Integer.parseInt(args[0]) : 2;
        final int SLICES = 16;
        // how many "darts" are thrown at the circle to get one single pi estimate
        final int HOW_MANY_DARTS = 1000000;
        // how many "dartboards" to collect to average the pi estimate,
        // which we hope converges on the real pi
        final int HOW_MANY_ESTIMATES = 1000;

        SparkConf sparkConf = new SparkConf().setAppName("PiAverage")
                .setMaster("local[4]");
        JavaSparkContext jsc = new JavaSparkContext(sparkConf);

        // set up a "dummy" ArrayList of size HOW_MANY_DARTS -- how many darts to throw
        List<Integer> throwsList = new ArrayList<Integer>(HOW_MANY_DARTS);
        for (int i = 0; i < HOW_MANY_DARTS; i++) {
            throwsList.add(i);
        }

        // set up a "dummy" ArrayList of size HOW_MANY_ESTIMATES
        // (unused in this single run; it is meant for the planned outer parallelize)
        List<Integer> numberOfEstimates = new ArrayList<Integer>(HOW_MANY_ESTIMATES);
        for (int i = 0; i < HOW_MANY_ESTIMATES; i++) {
            numberOfEstimates.add(i);
        }

        JavaRDD<Integer> dataSet = jsc.parallelize(throwsList, SLICES);

        // count how many darts land inside the unit quarter-circle
        long insideCircle = dataSet.filter(new Function<Integer, Boolean>() {
            public Boolean call(Integer i) {
                double x = Math.random();
                double y = Math.random();
                return x * x + y * y < 1;
            }
        }).count();

        System.out.println("A single estimate of Pi from " + HOW_MANY_DARTS
                + " darts is " + 4 * insideCircle / (double) HOW_MANY_DARTS);

        jsc.stop();
        jsc.close();
    }
}
Just for a little background on this question: my boss saw a construct that creates an RDD and then assigns the output of a map function to another RDD variable. He wanted to know why the "extra RDD" was needed, since the map generates a new RDD anyway. That's probably a separate question, but it prompted my question of whether a series of maps can be chained but with different iteration counts, like the {for j do} loop I would otherwise write. – JimLohse
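Update: to illustrate the map-chaining point in the comment above, here is a plain java.util.stream analogy (not Spark, just an illustration): each map returns a new stream, so the intermediate can either be given its own name or chained straight through, and the result is identical. As far as I can tell, RDD map calls compose the same way, and the "extra RDD" variable is just a name for the intermediate:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ChainDemo {
    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(1, 2, 3);

        // style 1: name the intermediate (the "extra RDD" style)
        List<Integer> doubled = input.stream()
                .map(x -> x * 2)
                .collect(Collectors.toList());
        List<Integer> named = doubled.stream()
                .map(x -> x + 1)
                .collect(Collectors.toList());

        // style 2: chain the maps directly -- same result
        List<Integer> chained = input.stream()
                .map(x -> x * 2)
                .map(x -> x + 1)
                .collect(Collectors.toList());

        System.out.println(named);   // [3, 5, 7]
        System.out.println(chained); // [3, 5, 7]
    }
}
```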