
Are there any examples of how to convert an RDD to a DataFrame, and a DataFrame back to an RDD, in PySpark 1.6.1? Does toDF() not work in 1.6.1? How do I convert an RDD into a DataFrame in PySpark 1.6.1?

For example, I have an RDD like this:

data = sc.parallelize([('a','b','c', 1,4), ('o','u','w', 9,3), ('s','q','a', 8,6), ('l','g','z', 8,3),
                       ('a','b','c', 9,8), ('s','q','a', 10,10), ('l','g','z', 20,20), ('o','u','w', 77,77)])

Answers


If, for some reason, you cannot use the .toDF() method, the solution I propose is this:

data = sqlContext.createDataFrame(sc.parallelize([('a','b','c', 1,4), ('o','u','w', 9,3), ('s','q','a', 8,6), ('l','g','z', 8,3),
                                                  ('a','b','c', 9,8), ('s','q','a', 10,10), ('l','g','z', 20,20), ('o','u','w', 77,77)]))

This will create a DF with columns named "_n", where n is the column number. If you want to rename the columns, I suggest you look at this post: How to change dataframe column names in pyspark?. However, all you need to do is:

data_named = data.selectExpr("_1 as One", "_2 as Two", "_3 as Three", "_4 as Four", "_5 as Five") 
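
As an aside, a minimal alternative sketch (assuming the raw RDD is still bound to a separate variable, here called rdd): you can pass the column names directly to createDataFrame as the schema and skip the selectExpr step entirely:

# Assumption: `rdd` holds the raw tuples from sc.parallelize(...) above.
# A plain list of names is accepted as the schema; column types are inferred from the data.
data_named = sqlContext.createDataFrame(rdd, ['One', 'Two', 'Three', 'Four', 'Five'])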

Now, let's look at the DF:

data_named.show() 

This will output:

+---+---+-----+----+----+
|One|Two|Three|Four|Five|
+---+---+-----+----+----+
|  a|  b|    c|   1|   4|
|  o|  u|    w|   9|   3|
|  s|  q|    a|   8|   6|
|  l|  g|    z|   8|   3|
|  a|  b|    c|   9|   8|
|  s|  q|    a|  10|  10|
|  l|  g|    z|  20|  20|
|  o|  u|    w|  77|  77|
+---+---+-----+----+----+
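
To go in the other direction (DataFrame back to RDD), here is a minimal sketch assuming the data_named frame built above; the .rdd attribute gives you an RDD of Row objects, which you can map back to plain tuples if needed:

# .rdd exposes the DataFrame as an RDD of Row objects
rdd_back = data_named.rdd

# map each Row to a plain tuple to recover the original shape
tuples_back = rdd_back.map(tuple)
print(tuples_back.take(2))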

Edit: Give it another try, because you should be able to use .toDF() in Spark 1.6.1.


I see no reason why rdd.toDF would not work in PySpark for Spark 1.6.1. Please check the toDF() documentation in the Spark 1.6.1 Python API: https://spark.apache.org/docs/1.6.1/api/python/pyspark.sql.html#pyspark.sql.SQLContext

Following your example:

rdd = sc.parallelize([('a','b','c', 1,4), ('o','u','w', 9,3), ('s','q','a', 8,6), ('l','g','z', 8,3),
                      ('a','b','c', 9,8), ('s','q','a', 10,10), ('l','g','z', 20,20), ('o','u','w', 77,77)])

#rdd to dataframe
df = rdd.toDF()
## can provide column names like: df2 = df.toDF('col1', 'col2', 'col3', 'col4', 'col5')

#dataframe to rdd
rdd2 = df.rdd
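
As a quick sanity check (a sketch assuming the rdd, df and rdd2 defined above), you can inspect both sides of the round trip:

# columns default to _1 .. _5 when no names are given
df.printSchema()
df.show(3)

# df.rdd yields Row objects; take() pulls a few back to the driver
print(rdd2.take(2))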