How do I convert an ArrayType column to a DenseVector in a PySpark DataFrame?

I get the following error when trying to build an ML Pipeline:

pyspark.sql.utils.IllegalArgumentException: 'requirement failed: Column features must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually ArrayType(DoubleType,true).'

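The features column contains an array of floating-point values. It sounds like I need to convert these to some type of vector (it is not sparse, so a DenseVector?). Is there a way to do this directly on the DataFrame, or do I need to convert it to an RDD?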

Answer

You can use a UDF:

udf(lambda vs: Vectors.dense(vs), VectorUDT()) 

Imports for Spark < 2.0:

from pyspark.mllib.linalg import Vectors, VectorUDT 

Imports for Spark 2.0+:

from pyspark.ml.linalg import Vectors, VectorUDT 

Note that these classes are not compatible with each other, even though their implementations are identical.

It is also possible to extract the individual features and assemble them with VectorAssembler. Assuming the input column is called features:

from pyspark.ml.feature import VectorAssembler 

n = ... # Size of features 

assembler = VectorAssembler(
    inputCols=["features[{0}]".format(i) for i in range(n)], 
    outputCol="features_vector") 

assembler.transform(df.select(
    "*", *(df["features"].getItem(i) for i in range(n)) 
)) 