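PCA analysis with PySpark
I am using PySpark to run a PCA analysis, but I get an error caused by the format of the data read from the CSV file. What should I do? Can you help me?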
from __future__ import print_function
from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql import SparkSession
from pyspark import SparkConf, SparkContext
from pyspark.sql.functions import udf
import pandas as pd
import numpy as np
from numpy import array
conf = SparkConf().setAppName("building a warehouse")
sc = SparkContext(conf=conf)
if __name__ == "__main__":
    spark = SparkSession\
        .builder\
        .appName("PCAExample")\
        .getOrCreate()
    data = sc.textFile('dataset.csv') \
        .map(lambda line: line.split(','))\
        .collect()
    #create a data frame from data read from csv file
    df = spark.createDataFrame(data, ["features"])
    #convert data to vector udt
    df.show()
    pca = PCA(k=3, inputCol="features", outputCol="pcaFeatures")
    model = pca.fit(df)
    result = model.transform(df).select("pcaFeatures")
    result.show(truncate=False)
    spark.stop()
Here is the error I get:
File "C:/spark/spark-2.1.0-bin-hadoop2.7/bin/pca_bigdata.py", line 38, in <module>
model = pca.fit(df)
pyspark.sql.utils.IllegalArgumentException: u'requirement failed: Column features must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually StringType.'
Can you provide a sample of the file? Thanks. – Keith
It contains data like this: 15,447176933288574,58783,89453125,117,73371124267578,0,0,0,30145,232421875,127,86238861083984,30113,59375,126,52108001708984,512,08636474609375,514,4246826171875,571 ,90142822265625,573,742431640625,586,60888671875,571,6429443359375 ,, –
Your numbers are still being read as strings, not floats. Do the map like this: `data = sc.textFile("dataset.csv").map(lambda line: [float(k) for k in line.split(',')])` –
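Building on the comment above: converting to floats is probably not enough on its own, because `PCA` expects the input column to be a vector (`VectorUDT`), not a list or a string. Below is a minimal sketch of one way to do it. It assumes that every row of `dataset.csv` has the same number of numeric fields, and that empty fields (such as the trailing `,,` in the sample) can simply be skipped. `parse_line` and `run_pca` are hypothetical helper names, not part of PySpark.

```python
def parse_line(line):
    """Turn one CSV line into a list of floats.

    Assumption: empty fields (e.g. from a trailing ',,') are skipped.
    """
    return [float(tok) for tok in line.split(",") if tok.strip()]


def run_pca(spark, path, k=3):
    """Read a numeric CSV file and return a DataFrame with the PCA projection.

    Assumption: all rows have the same number of fields, and at least k of them.
    """
    # Imported here so that parse_line can be used without Spark installed.
    from pyspark.ml.feature import PCA
    from pyspark.ml.linalg import Vectors

    rows = (
        spark.sparkContext.textFile(path)
        .map(parse_line)
        # Wrap each row in a one-element tuple so it becomes a single
        # "features" column holding a dense vector.
        .map(lambda values: (Vectors.dense(values),))
    )
    df = spark.createDataFrame(rows, ["features"])

    pca = PCA(k=k, inputCol="features", outputCol="pcaFeatures")
    model = pca.fit(df)
    return model.transform(df).select("pcaFeatures")
```

With an existing session this would be called as `run_pca(spark, "dataset.csv").show(truncate=False)`. Note that there is no `.collect()` before `createDataFrame`, so the data stays distributed rather than being pulled to the driver first.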