2016-03-07 41 views
4

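Convert a vector column in a DataFrame back to an array column

I have a DataFrame with two columns, one of which (called dist) is a dense vector. How can I convert it back into an array column of integers?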

+---+-----+ 
| id| dist| 
+---+-----+ 
|1.0|[2.0]| 
|2.0|[4.0]| 
|3.0|[6.0]| 
|4.0|[8.0]| 
+---+-----+ 

I tried several variants of the following UDF, but it returns a type mismatch error:

val toInt4 = udf[Int, Vector]({ (a) => (a)}) 

val result = df.withColumn("dist", toDf4(df("dist"))).select("dist") 
+0

What is the "standard" column? –

+0

An array, for example – ulrich

+0

So you evidently want to merge all columns into a single vector, right? –

Answers

5

I think the easiest way is to drop down to the RDD API, do the conversion there, and then come back to a DataFrame.

import org.apache.spark.mllib.linalg.DenseVector 
import org.apache.spark.sql.DataFrame 
import org.apache.spark.rdd.RDD 
import sqlContext.implicits._ // brings toDF into scope outside the spark-shell

// The original data. 
val input: DataFrame = 
    sc.parallelize(1 to 4) 
    .map(i => i.toDouble -> new DenseVector(Array(i.toDouble * 2))) 
    .toDF("id", "dist") 

// Turn it into an RDD for manipulation. 
val inputRDD: RDD[(Double, DenseVector)] = 
    input.rdd.map(row => row.getAs[Double]("id") -> row.getAs[DenseVector]("dist"))

// Change the DenseVector into an integer array. 
val outputRDD: RDD[(Double, Array[Int])] = 
    inputRDD.mapValues(_.toArray.map(_.toInt)) 

// Go back to a DataFrame. 
val output = outputRDD.toDF("id", "dist") 
output.show 

You get:

+---+----+ 
| id|dist| 
+---+----+ 
|1.0| [2]| 
|2.0| [4]| 
|3.0| [6]| 
|4.0| [8]| 
+---+----+ 
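
If you would rather stay in the DataFrame API, a UDF can do the same conversion. The sketch below is not part of the original answer: the names VectorToArray, toIntSeq and toIntArray are made up for illustration, and it assumes the column holds org.apache.spark.mllib.linalg vectors, as in the example above. The UDF in the question fails to compile because it is declared as udf[Int, Vector] while its body returns the Vector itself; here the declared result type matches what the function returns.

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.functions.udf

object VectorToArray {
  // Plain function: each Double in the vector is truncated to an Int.
  def toIntSeq(v: Vector): Seq[Int] = v.toArray.map(_.toInt).toSeq

  // UDF wrapper; Spark maps the Seq[Int] result to an array<int> column.
  val toIntArray = udf((v: Vector) => toIntSeq(v))
}

// Usage (assumes `input` is the DataFrame built above):
// val output = input.withColumn("dist", VectorToArray.toIntArray(input("dist")))
```

Note that toInt truncates, so 2.9 becomes 2. If the vectors come from the newer ml pipeline API, the import would need to be org.apache.spark.ml.linalg.Vector instead.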
4

In Spark 2.0, you can do something like this:

import org.apache.spark.mllib.linalg.DenseVector 
import org.apache.spark.sql.functions.udf 

val vectorHead = udf{ x:DenseVector => x(0) } 
df.withColumn("firstValue", vectorHead(df("vectorColumn"))) 
+0

@pwb2103 mentioned that the first line should be import org.apache.spark.ml.linalg.DenseVector –

6

I struggled for a while to get the answer from @ThomasLuechtefeld working, and kept running into this very frustrating error:

org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(features_scaled)' due to data type mismatch: argument 1 requires vector type, however, '`features_scaled`' is of vector type. 

It turns out I needed to import DenseVector from the ml package rather than from the mllib package.

So this worked for me:

import org.apache.spark.ml.linalg.DenseVector 
import org.apache.spark.sql.functions._ 

val vectorToColumn = udf{ (x:DenseVector, index: Int) => x(index) } 
myDataframe.withColumn("clusters_scaled",vectorToColumn(col("features_scaled"),lit(0))) 

Yes, the only difference is the first line. This should definitely be a comment, but I don't have the reputation. Sorry!
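
The error message reads oddly ("requires vector type, however ... is of vector type") because Spark 2.x ships two unrelated vector classes that share the same simple name, one in org.apache.spark.ml.linalg and one in org.apache.spark.mllib.linalg. As a minimal sketch of how the two relate (the object and value names are made up for illustration), a value can be converted between them explicitly:

```scala
import org.apache.spark.ml.linalg.{Vector => MLVector, Vectors => MLVectors}
import org.apache.spark.mllib.linalg.{Vector => OldVector, Vectors => OldVectors}

object VectorPackages {
  // A vector from the newer ml package (what ml pipeline stages produce).
  val mlVec: MLVector = MLVectors.dense(2.0, 4.0)

  // ml -> mllib
  val oldVec: OldVector = OldVectors.fromML(mlVec)

  // mllib -> ml
  val backToMl: MLVector = oldVec.asML
}
```

For a whole DataFrame column, MLUtils.convertVectorColumnsToML in org.apache.spark.mllib.util should perform the mllib-to-ml conversion, though importing the vector class that matches the column, as above, is usually the simpler fix.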