Convert a vector column in a DataFrame back into an array column

I have a DataFrame with two columns, one of which (called dist) is a dense vector. How can I convert it back into an array column of integers?

+---+-----+ 
| id| dist| 
+---+-----+ 
|1.0|[2.0]| 
|2.0|[4.0]| 
|3.0|[6.0]| 
|4.0|[8.0]| 
+---+-----+

I tried several variants of the following UDF, but it returns a type mismatch error:

val toInt4 = udf[Int, Vector]({ (a) => (a)}) 

val result = df.withColumn("dist", toInt4(df("dist"))).select("dist")

2016-03-07 ulrich

What is the "standard" column? –

An array, for example – ulrich

So you apparently want to merge all the columns into a single vector, right? –

I think the easiest way is to drop down to the RDD API, do the conversion there, and then go back to a DataFrame.

import org.apache.spark.mllib.linalg.DenseVector 
import org.apache.spark.sql.DataFrame 
import org.apache.spark.rdd.RDD 
import sqlContext.implicits._ // brings toDF into scope

// The original data. 
val input: DataFrame = 
    sc.parallelize(1 to 4) 
    .map(i => i.toDouble -> new DenseVector(Array(i.toDouble * 2))) 
    .toDF("id", "dist") 

// Turn it into an RDD for manipulation. 
val inputRDD: RDD[(Double, DenseVector)] = 
    input.rdd.map(row => row.getAs[Double]("id") -> row.getAs[DenseVector]("dist")) 

// Change the DenseVector into an integer array. 
val outputRDD: RDD[(Double, Array[Int])] = 
    inputRDD.mapValues(_.toArray.map(_.toInt)) 

// Go back to a DataFrame. 
val output = outputRDD.toDF("id", "dist") 
output.show

You get:

+---+----+ 
| id|dist| 
+---+----+ 
|1.0| [2]| 
|2.0| [4]| 
|3.0| [6]| 
|4.0| [8]| 
+---+----+

2016-03-07 23:41:25

In Spark 2.0, you can do this:

import org.apache.spark.mllib.linalg.DenseVector 
import org.apache.spark.sql.functions.udf 

val vectorHead = udf{ x:DenseVector => x(0) } 
df.withColumn("firstValue", vectorHead(df("vectorColumn")))

2016-09-22 20:51:24

As @pwb2103 points out, the first line should be import org.apache.spark.ml.linalg.DenseVector –

I struggled for a while to get @ThomasLuechtefeld's answer working, and kept running into this very frustrating error:

org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(features_scaled)' due to data type mismatch: argument 1 requires vector type, however, '`features_scaled`' is of vector type.

It turned out I needed to import DenseVector from the ml package rather than the mllib package.

So this worked for me:

import org.apache.spark.ml.linalg.DenseVector 
import org.apache.spark.sql.functions._ 

val vectorToColumn = udf{ (x:DenseVector, index: Int) => x(index) } 
myDataframe.withColumn("clusters_scaled",vectorToColumn(col("features_scaled"),lit(0)))

Yes, the only difference is the first line. This really should have been a comment, but I don't have the reputation. Sorry!

2016-10-24 20:48:17 pwb2103
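
The two UDF answers above each pull a single element out of the vector. To get the whole vector back as an integer array column, which is what the question asks for, the same UDF approach can probably be extended as in the sketch below. It assumes Spark 2.x or later and the ml (not mllib) vector types. The names spark and vectorToIntArray, and the sample data mirroring the table in the question, are my own illustrations and not part of the original thread.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("vector-to-array")
  .getOrCreate()
import spark.implicits._

// Sample data shaped like the table in the question.
val df = Seq(
  (1.0, Vectors.dense(2.0)),
  (2.0, Vectors.dense(4.0)),
  (3.0, Vectors.dense(6.0)),
  (4.0, Vectors.dense(8.0))
).toDF("id", "dist")

// Accept the general Vector type so dense and sparse vectors both work,
// then truncate each Double to Int.
val vectorToIntArray = udf { v: Vector => v.toArray.map(_.toInt) }

// Overwrite the vector column with an array<int> column.
val result = df.withColumn("dist", vectorToIntArray(col("dist")))
result.printSchema()
result.show()
```

Note that toInt truncates, so a value like 2.9 becomes 2. On Spark 3.0 and later there is also a built-in org.apache.spark.ml.functions.vector_to_array, which returns an array of doubles and avoids a hand-written UDF; the result would still need a cast if integers are required.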
