
PySpark - convert long type to array type (LongType)

I read a dataframe from a CSV file as follows:

df1= 
category value Reference value 
count   1  1 
n_timer   20  40,20 
frames   54  56 
timer   8  3,6,7 
pdf    99  100,101,22 
zip    10  10,11,12 

But Spark reads the columns in as long and string types, whereas I want both as array type (ArrayType(LongType)) so that I can intersect the two columns and get the output.

I want the dataframe to be read like below:

category value Reference value 
count  [1]  [1] 
n_timer  [20] [40,20] 
frames  [54] [56] 
timer  [8]  [3,6,7] 
pdf   [99] [100,101,22] 
zip   [10] [10,11,12] 

Please suggest a solution.

Answers

# check below code 
from pyspark.sql import SparkSession 
from pyspark.sql.functions import split 

spark = SparkSession.builder.getOrCreate() 

# build the example dataframe; both numeric columns start out as strings 
df1 = spark.createDataFrame( 
    [("count", "1", "1"), ("n_timer", "20", "40,20"), ("frames", "54", "56"), 
     ("timer", "8", "3,6,7"), ("pdf", "99", "100,101,22"), ("zip", "10", "10,11,12")], 
    ["category", "value", "Reference_value"]) 
df1.show() 

# split each comma-separated string into array<string>, then cast to array<long> 
df1 = df1.withColumn("Reference_value", split("Reference_value", r",\s*").cast("array<long>")) 
df1 = df1.withColumn("value", split("value", r",\s*").cast("array<long>")) 
df1.show() 
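
To verify the conversion, check the schema; after the casts both columns should be reported as arrays of longs:

df1.printSchema() 
# root 
#  |-- category: string (nullable = true) 
#  |-- value: array (nullable = true) 
#  |    |-- element: long (containsNull = true) 
#  |-- Reference_value: array (nullable = true) 
#  |    |-- element: long (containsNull = true) 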

Input df1= 
+--------+-----+---------------+ 
|category|value|Reference_value| 
+--------+-----+---------------+ 
| count| 1|    1| 
| n_timer| 20|   40,20| 
| frames| 54|    56| 
| timer| 8|   3,6,7| 
|  pdf| 99|  100,101,22| 
|  zip| 10|  10,11,12| 
+--------+-----+---------------+ 

Output df2= 
+--------+-----+---------------+ 
|category|value|Reference_value| 
+--------+-----+---------------+ 
| count| [1]|   [1]| 
| n_timer| [20]|  [40, 20]| 
| frames| [54]|   [56]| 
| timer| [8]|  [3, 6, 7]| 
|  pdf| [99]| [100, 101, 22]| 
|  zip| [10]| [10, 11, 12]| 
+--------+-----+---------------+ 
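
Since the goal stated in the question is to intersect the two columns, here is a minimal follow-up sketch, assuming Spark 2.4 or later, where array_intersect is available:

from pyspark.sql.functions import array_intersect 

# keep only the values that appear in both array columns 
df2 = df1.withColumn("common", array_intersect("value", "Reference_value")) 
df2.show() 
# e.g. "common" is [20] for n_timer, [10] for zip, and [] for frames 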

Define a class with the value and reference columns typed as arrays, and use it with an encoder..

How to do it in Java:

Dataset<sample> sampleDim = sqlContext.read().csv(filePath).as(Encoders.bean(sample.class));

You can use the same approach in Python.
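
For reference, Spark's CSV reader cannot load array columns directly, so the Python analogue of this bean-encoder approach is to read every column as a string and apply the same split-and-cast conversion as in the answer above. A minimal sketch, assuming a header row and a hypothetical file path data.csv:

from pyspark.sql import SparkSession 
from pyspark.sql.functions import split 

spark = SparkSession.builder.getOrCreate() 

# "data.csv" is a hypothetical path; the CSV reader yields string columns, 
# and the arrays are produced afterwards with split + cast 
df = (spark.read.option("header", True).csv("data.csv") 
        .withColumn("value", split("value", r",\s*").cast("array<long>")) 
        .withColumn("Reference_value", 
                    split("Reference_value", r",\s*").cast("array<long>"))) 
df.printSchema() 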