
I have sparse vectors like this in PySpark that I want to convert to dense vectors:

>>> countVectors.rdd.map(lambda vector: vector[1]).collect() 
[SparseVector(13, {0: 1.0, 2: 1.0, 3: 1.0, 6: 1.0, 8: 1.0, 9: 1.0, 10: 1.0, 12: 1.0}), SparseVector(13, {0: 1.0, 1: 1.0, 2: 1.0, 4: 1.0}), SparseVector(13, {0: 1.0, 1: 1.0, 3: 1.0, 4: 1.0, 7: 1.0}), SparseVector(13, {1: 1.0, 2: 1.0, 5: 1.0, 11: 1.0})] 

I am trying to convert these sparse vectors to dense vectors in PySpark 2.0.0 like this:

>>> frequencyVectors = countVectors.rdd.map(lambda vector: vector[1]) 
>>> frequencyVectors.map(lambda vector: Vectors.dense(vector)).collect() 

I get this error:

16/12/26 14:03:35 ERROR Executor: Exception in task 0.0 in stage 13.0 (TID 13) 
org.apache.spark.api.python.PythonException: Traceback (most recent call last): 
    File "/opt/BIG-DATA/spark-2.0.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 172, in main 
    process() 
    File "/opt/BIG-DATA/spark-2.0.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 167, in process 
    serializer.dump_stream(func(split_index, iterator), outfile) 
    File "/opt/BIG-DATA/spark-2.0.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream 
    vs = list(itertools.islice(iterator, batch)) 
    File "<stdin>", line 1, in <lambda> 
    File "/opt/BIG-DATA/spark-2.0.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/mllib/linalg/__init__.py", line 878, in dense 
    return DenseVector(elements) 
    File "/opt/BIG-DATA/spark-2.0.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/mllib/linalg/__init__.py", line 286, in __init__ 
    ar = np.array(ar, dtype=np.float64) 
    File "/opt/BIG-DATA/spark-2.0.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/ml/linalg/__init__.py", line 701, in __getitem__ 
    raise ValueError("Index %d out of bounds." % index) 
ValueError: Index 13 out of bounds. 

How can I achieve this conversion? Is something wrong here?

Answer


This solved my problem. Judging from the traceback, Vectors.dense was resolving to pyspark.mllib.linalg, while the vectors themselves are pyspark.ml.linalg.SparseVector instances; converting explicitly with toArray(), which returns a numpy array, sidesteps the mismatch between the two modules:

from pyspark.ml.linalg import DenseVector
frequencyDenseVectors = frequencyVectors.map(lambda vector: DenseVector(vector.toArray()))
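
For reference, here is a minimal, self-contained sketch of the same conversion; the SparkSession setup and the sample vectors are stand-ins for the question's CountVectorizer pipeline, not part of the original answer:

from pyspark.sql import SparkSession
from pyspark.ml.linalg import SparseVector, DenseVector

spark = SparkSession.builder.getOrCreate()

# Stand-in for the CountVectorizer output shown in the question.
sparse = spark.sparkContext.parallelize([
    SparseVector(13, {0: 1.0, 2: 1.0, 3: 1.0, 6: 1.0, 8: 1.0, 9: 1.0, 10: 1.0, 12: 1.0}),
    SparseVector(13, {0: 1.0, 1: 1.0, 2: 1.0, 4: 1.0}),
])

# toArray() returns a numpy array, which DenseVector wraps directly,
# so the SparseVector is never indexed element by element.
dense = sparse.map(lambda v: DenseVector(v.toArray()))

print(dense.first())
# DenseVector([1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0])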