How do I convert a Hive table to MLlib LabeledPoint?

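I used Impala to build a table containing a target and several hundred features. I would like to use Spark MLlib to train a model. I understand that in order to run a distributed supervised model through Spark, the data needs to be in one of several formats. LabeledPoint seems the most intuitive to me. What is the most efficient way to convert a Hive table to LabeledPoints using PySpark?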


2016-02-23 ADJ

The best solution to this problem is probably to use the ml library and its models, since they operate directly on DataFrames.

http://spark.apache.org/docs/latest/api/python/pyspark.ml.html?highlight=ml#module-pyspark.ml.classification
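As a rough illustration of the DataFrame-based route, here is a minimal sketch. The helper names `feature_columns` and `fit_logistic_regression` are hypothetical, and the sketch assumes every non-target column is numeric and free of nulls, which may not hold for your table.

```python
def feature_columns(columns, target_col):
    """Return every column name except the target, preserving order."""
    return [c for c in columns if c != target_col]


def fit_logistic_regression(df, target_col):
    """Fit a pyspark.ml LogisticRegression directly on a DataFrame.

    Assumes all feature columns are numeric and contain no nulls.
    """
    # Imported lazily so feature_columns() can be used without Spark installed.
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import VectorAssembler

    # Pack the feature columns into a single vector column named "features".
    assembler = VectorAssembler(
        inputCols=feature_columns(df.columns, target_col),
        outputCol="features",
    )
    # Cast the label to double, the type pyspark.ml estimators expect.
    prepared = assembler.transform(
        df.withColumn(target_col, df[target_col].cast("double"))
    )
    lr = LogisticRegression(featuresCol="features", labelCol=target_col)
    return lr.fit(prepared)
```

With a DataFrame loaded from Hive, something like `fit_logistic_regression(hc.table("MyTable"), "MyTargetCol")` would return a fitted model without building an RDD of LabeledPoints by hand.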

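However, the ml API has not yet reached feature parity with mllib, and something you need may be missing. So we solved this in our workflow by calling map on the DataFrame retrieved from the Hive context.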

from pyspark import SparkContext, HiveContext 
from pyspark.mllib.regression import LabeledPoint 
from pyspark.mllib.classification import LogisticRegressionWithLBFGS 

table_name = "MyTable" 
target_col = "MyTargetCol" 

sc = SparkContext() 
hc = HiveContext(sc) 

# get the table from the hive context 
df = hc.table(table_name) 

# reorder columns so that we know the index of the target column 
df = df.select(target_col, *[col for col in df.columns if col != target_col])

# map over the underlying RDD to produce an rdd of labeled points
# (df.rdd.map works on Spark 1.x and 2.x; DataFrame.map was removed in 2.0)
rdd_of_labeled_points = df.rdd.map(lambda row: LabeledPoint(row[0], row[1:]))

# use the rdd as input to a model 
model = LogisticRegressionWithLBFGS.train(rdd_of_labeled_points)

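Keep in mind that any time you map in Python, the data has to be marshalled from the JVM to the Python VM, and performance suffers as a result. We found the performance hit from using map to be negligible for our data, but your mileage may vary.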


2016-02-23 20:59:35 dayman
