1

Broadcasting a random forest model in PySpark

I'm using Spark 1.4.1. When I try to broadcast a random forest model, I get this error:

Traceback (most recent call last): 
    File "/gpfs/haifa/home/d/a/davidbi/codeBook/Nice.py", line 358, in <module> 
        broadModel = sc.broadcast(model) 
    File "/opt/apache/spark-1.4.1-bin-hadoop2.4_doop/python/lib/pyspark.zip/pyspark/context.py", line 698, in broadcast 
    File "/opt/apache/spark-1.4.1-bin-hadoop2.4_doop/python/lib/pyspark.zip/pyspark/broadcast.py", line 70, in __init__ 
    File "/opt/apache/spark-1.4.1-bin-hadoop2.4_doop/python/lib/pyspark.zip/pyspark/broadcast.py", line 78, in dump 
    File "/opt/apache/spark-1.4.1-bin-hadoop2.4_doop/python/lib/pyspark.zip/pyspark/context.py", line 252, in __getnewargs__ 
Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transforamtion. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063. 

The example code I'm trying to execute:

sc = SparkContext(appName= "Something") 
model = RandomForest.trainRegressor(sc.parallelize(data), categoricalFeaturesInfo=categorical, numTrees=100, featureSubsetStrategy="auto", impurity='variance', maxDepth=4) 
broadModel = sc.broadcast(model) 

If anyone could help me with this, I would greatly appreciate it. Thank you very much!

+0

Is there a reason you need to broadcast the entire model? The model can make predictions on an input RDD. – Magsol

+0

There are multiple models (in my case, each model defines a group). Each sample needs a prediction from every model to find the group that fits it best. I'm working with big data, so I need to ship the models out to the mappers. – dadibiton

Answer

1

The short answer is that this isn't possible with PySpark: prediction goes through callJavaFunc, which uses the SparkContext, hence the error. It can be done using the Scala API, though.
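To illustrate the mechanism with a minimal pure-Python sketch (the classes below are stand-ins I made up, not actual PySpark code): pyspark's SparkContext defines __getnewargs__ to raise exactly this exception whenever an object holding a context reference gets pickled, and sc.broadcast pickles its argument. The model wrapper keeps such a reference, so the pickle step fails:

```python
import pickle

class FakeSparkContext:
    """Stand-in for pyspark.SparkContext: refuses to be pickled,
    mirroring the guard described in SPARK-5063."""
    def __getnewargs__(self):
        # Called by pickle (protocol >= 2) when serializing this object.
        raise Exception(
            "It appears that you are attempting to reference SparkContext "
            "from a broadcast variable, action, or transformation.")

class FakeModel:
    """Stand-in for a Python model wrapper that keeps a reference
    to the context."""
    def __init__(self, sc):
        self.sc = sc

model = FakeModel(FakeSparkContext())
try:
    # sc.broadcast(model) pickles the model internally; pickling the
    # model pickles its sc attribute, which raises.
    pickle.dumps(model)
except Exception as e:
    print("pickling failed:", e)
```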

In Python you can use the same approach as with a single model, meaning model.predict followed by zip:

models = [model1, model2, model3] 

predictions = [ 
    model.predict(testData.map(lambda x: x.features)) for model in models] 

def flatten(x): 
    if isinstance(x[0], tuple): 
        return tuple(list(x[0]) + [x[1]]) 
    else: 
        return x 

(testData 
    .map(lambda lp: lp.label) 
    .zip(reduce(lambda p1, p2: p1.zip(p2).map(flatten), predictions))) 
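In plain Python terms (replacing RDDs with lists so the shapes are easy to see; the values below are illustrative), the zip/flatten chain above builds one tuple per sample of the form (label, (pred1, pred2, ..., predN)):

```python
from functools import reduce  # top-level builtin in Python 2, as on Spark 1.4

def flatten(x):
    # After the first zip, the left element is already a tuple of
    # accumulated predictions; append the new prediction to it.
    if isinstance(x[0], tuple):
        return tuple(list(x[0]) + [x[1]])
    return x

labels = [0.0, 1.0]
# One prediction list per model (standing in for each model.predict RDD).
predictions = [[0.1, 0.9], [0.2, 0.8], [0.3, 0.7]]

combined = reduce(
    lambda p1, p2: [flatten(t) for t in zip(p1, p2)], predictions)
rows = list(zip(labels, combined))
print(rows)  # [(0.0, (0.1, 0.2, 0.3)), (1.0, (0.9, 0.8, 0.7))]
```

Each row now carries the label followed by one prediction per model, which is what you need to pick the best-fitting group per sample.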

If you want to learn more about the source of the problem, see: How to use Java/Scala function from an action or a transformation?