2017-06-01

I want to run the following pyspark code (Spark 2.1.1), but the column type must be org.apache.spark.ml.linalg.VectorUDT:

from pyspark.ml.feature import PCA

bankPCA = PCA(k=3, inputCol="features", outputCol="pcaFeatures")
pcaModel = bankPCA.fit(bankDf)
pcaResult = pcaModel.transform(bankDf).select("label", "pcaFeatures")
pcaResult.show(truncate=False)

But I get this error:

requirement failed: Column features must be of type org.apache.spark.ml.linalg.VectorUDT@... but was actually org.apache.spark.mllib.linalg.VectorUDT@...

Answer


You can find an example here:

from pyspark.ml.feature import PCA 
from pyspark.ml.linalg import Vectors 

data = [(Vectors.sparse(5, [(1, 1.0), (3, 7.0)]),), 
    (Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),), 
    (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)] 
df = spark.createDataFrame(data, ["features"]) 

pca = PCA(k=3, inputCol="features", outputCol="pcaFeatures") 
model = pca.fit(df) 

... other code ... 

As you can see above, df is a DataFrame containing vectors built with Vectors.sparse() and Vectors.dense() imported from pyspark.ml.linalg.

Perhaps your bankDf contains vectors imported from pyspark.mllib.linalg.

So the vectors in your DataFrame must be created using the import

from pyspark.ml.linalg import Vectors 

instead of:

from pyspark.mllib.linalg import Vectors 

You may also find this stackoverflow question interesting.
