2016-09-18 82 views
0

pyspark: expand a DenseVector into a tuple in an RDD

I have the following RDD, where each record is a tuple of (bigint, vector):

myRDD.take(5) 

[(1, DenseVector([9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432])), 
(1, DenseVector([9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432])), 
(0, DenseVector([5.0, 20.0, 0.3444, 0.3295, 54.3122, 4.0])), 
(1, DenseVector([9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432])), 
(1, DenseVector([9.2463, 2.0, 0.392, 0.3381, 162.6437, 7.9432]))] 

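How can I expand the dense vector so that its elements become part of the outer tuple? That is, I would like the above to become: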

[(1, 9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432), 
(1, 9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432), 
(0, 5.0, 20.0, 0.3444, 0.3295, 54.3122, 4.0), 
(1, 9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432), 
(1, 9.2463, 2.0, 0.392, 0.3381, 162.6437, 7.9432)] 

Thanks!

+1

Hint: 'Vector' is iterable. Everything else is just basic Python (argument unpacking may be useful, but is not required). – zero323

+0

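Thanks zero323! I tried newRDD = myRDD.map(lambda x: (x[0], tuple(x[1]))), which does expand the DenseVector into a tuple, but I still end up with a tuple inside a tuple, like: (1, (1, 9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432)). Any hints on turning this nested tuple into a single flat tuple? Thanks! – Edamame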

Answers

1

Well, since pyspark.ml.linalg.DenseVector (or its mllib counterpart) is iterable (it provides the __len__ and __getitem__ methods), you can treat it like any other Python collection, for example:

def as_tuple(kv): 
    """ 
    >>> as_tuple((1, DenseVector([9.25, 1.0, 0.31, 0.31, 162.37]))) 
    (1, 9.25, 1.0, 0.31, 0.31, 162.37) 
    """ 
    k, v = kv 
    # Use *v.toArray() instead if you want to support SparseVector as well.
    return (k, *v) 
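
To see why the unpacking works, here is a minimal sketch that runs without Spark. FakeVector is a hypothetical stand-in (it is not part of PySpark) that implements only __len__ and __getitem__, the two methods mentioned above. On a real RDD you would presumably pass as_tuple to myRDD.map rather than to the built-in map used here.

```python
class FakeVector:
    """Hypothetical stand-in for DenseVector: only __len__ and __getitem__."""

    def __init__(self, values):
        self._values = list(values)

    def __len__(self):
        return len(self._values)

    def __getitem__(self, i):
        # The IndexError raised past the end is what stops Python's iteration.
        return self._values[i]


def as_tuple(kv):
    k, v = kv
    return (k, *v)  # unpacking inside a tuple display needs Python 3.5+


records = [
    (1, FakeVector([9.2463, 1.0, 0.392])),
    (0, FakeVector([5.0, 20.0, 0.3444])),
]
# With Spark this line would be something like: flat_rdd = myRDD.map(as_tuple)
flat = list(map(as_tuple, records))
print(flat)
```

Each record comes out as one flat tuple, with the key first and the vector elements after it.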

For Python 2, replace:

(k, *v) 

with:

from itertools import chain 

tuple(chain([k], v)) 

or:

(k,) + tuple(v) 
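
As a quick check that the three spellings are interchangeable, the sketch below builds the same tuple each way. A plain list stands in for the vector (an assumption made so the example runs without Spark), and the starred form is included only for comparison, since it needs Python 3.5 or later.

```python
from itertools import chain

k, v = 1, [9.2463, 1.0, 0.392]  # a plain list stands in for the vector

via_chain = tuple(chain([k], v))   # works on Python 2 and 3
via_concat = (k,) + tuple(v)       # works on Python 2 and 3
via_star = (k, *v)                 # Python 3.5+ only

print(via_chain == via_concat == via_star)
```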

If you want to convert the values to Python (not NumPy) scalars, use the following instead of v:

v.toArray().tolist() 
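
The difference is easiest to see by inspecting the element types. In this sketch a plain NumPy array stands in for the result of v.toArray() (an assumption made so the example runs without Spark).

```python
import numpy as np

arr = np.array([9.2463, 1.0, 0.392])  # stand-in for v.toArray()

numpy_scalars = (1, *arr)             # elements stay numpy.float64
python_scalars = (1, *arr.tolist())   # elements become built-in float

print(type(numpy_scalars[1]).__name__, type(python_scalars[1]).__name__)
```

Note that numpy.float64 is a subclass of float, so isinstance checks will not tell the two apart; compare the exact type if the distinction matters.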

+0

'k, v = kv' is a destructuring (unpacking) assignment. You can use 'kv[0]' and 'kv[1]' instead, but I find unpacking more elegant and easier to read. – zero323