PySpark: converting an RDD to a RowMatrix

I have an RDD of the form (id1, id2, score). The top 5 rows look like:

[(41955624, 42044497, 3.913625989045223e-06), 
(41955624, 42039940, 0.0001018890937469129), 
(41955624, 42037797, 7.901647831291928e-05), 
(41955624, 42011137, -0.00016191403038589588), 
(41955624, 42006663, -0.0005302800991148567)]

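I want to compute the similarity between the id2 members based on their scores. I would like to use RowMatrix.columnSimilarities, but I first need to convert the RDD to a RowMatrix. I want the matrix to be structured as id1 x id2, i.e. the row ids come from id1 and the column ids come from id2.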

If my data were small, I could convert it to a PySpark DataFrame and then use pivot, like this:

rdd_df.groupBy("id1").pivot("id2").sum("score")

But that borks with more than 10,000 distinct id2 values, and I have far more than that.

The naive rdd_Mat = la.RowMatrix(rdd) brings the data in as a 3-column matrix, which is not what I want.

Many thanks.

2017-08-10 efreeman

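The structure of your data is closer to that of a CoordinateMatrix, which is basically a wrapper around an RDD of (row, column, value) tuples. Because of that, you can easily create a CoordinateMatrix from your existing RDD.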

from pyspark.mllib.linalg.distributed import CoordinateMatrix 

cmat=CoordinateMatrix(yourRDD)

Also, since you originally asked about a RowMatrix, PySpark provides an easy way to convert between matrix types, which gives you the RowMatrix you want:

rmat=cmat.toRowMatrix()
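
For intuition about what the question is ultimately after: RowMatrix.columnSimilarities computes cosine similarities between the columns of the matrix. The following is a minimal plain-Python sketch of that computation on (row, column, value) triples, with no Spark involved. The helper name column_cosine_similarities and the toy entries are made up for illustration, and summing duplicate coordinates is an assumption of this sketch, not a statement about Spark's behaviour.

```python
import math
from collections import defaultdict
from itertools import combinations


def column_cosine_similarities(entries):
    """Cosine similarity for every pair of columns in a sparse matrix.

    `entries` is an iterable of (row, col, value) triples. Returns a dict
    keyed by (col_a, col_b) with col_a < col_b. Hypothetical helper, not
    part of PySpark.
    """
    # Group values by column: columns[col][row] = value.
    columns = defaultdict(dict)
    for row, col, value in entries:
        # Assumption of this sketch: duplicate coordinates are summed.
        columns[col][row] = columns[col].get(row, 0.0) + value

    # Euclidean norm of each column.
    norms = {
        col: math.sqrt(sum(v * v for v in rows.values()))
        for col, rows in columns.items()
    }

    sims = {}
    for a, b in combinations(sorted(columns), 2):
        # The dot product only needs rows present in column a.
        dot = sum(v * columns[b].get(r, 0.0) for r, v in columns[a].items())
        if norms[a] > 0 and norms[b] > 0:
            sims[(a, b)] = dot / (norms[a] * norms[b])
    return sims


# Made-up toy data: column 1 is exactly twice column 0.
toy_entries = [
    (0, 0, 1.0), (0, 1, 2.0),
    (1, 0, 3.0), (1, 1, 6.0), (1, 2, 1.0),
]
sims = column_cosine_similarities(toy_entries)
```

Here sims[(0, 1)] comes out as 1.0 up to floating-point rounding, because the two columns are parallel. On real data the distributed version is the one to use; this sketch only shows what the numbers mean.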


2017-08-10 23:28:58 DavidWayne

Thanks. I found I had to do an intermediate step of converting the IDs to consecutive integers, to avoid creating a matrix with 40MM columns. – efreeman
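
The intermediate step mentioned in the comment above could look roughly like the following. This is a plain-Python sketch on a small list, using the five rows from the question. The helper names build_index and remap_entries are hypothetical. On a real RDD the same idea would presumably be expressed with distinct() and zipWithIndex() followed by joins, which is an assumption about the approach rather than the commenter's actual code.

```python
def build_index(ids):
    """Map each distinct raw ID to a consecutive integer starting at 0."""
    return {raw: idx for idx, raw in enumerate(sorted(set(ids)))}


def remap_entries(entries):
    """Rewrite (id1, id2, score) triples as (row, col, score) triples.

    Row and column indices are dense, so the matrix has only as many
    columns as there are distinct id2 values. Hypothetical helper.
    """
    row_index = build_index(e[0] for e in entries)
    col_index = build_index(e[1] for e in entries)
    remapped = [(row_index[i], col_index[j], s) for i, j, s in entries]
    return remapped, row_index, col_index


# The five rows shown in the question.
entries = [
    (41955624, 42044497, 3.913625989045223e-06),
    (41955624, 42039940, 0.0001018890937469129),
    (41955624, 42037797, 7.901647831291928e-05),
    (41955624, 42011137, -0.00016191403038589588),
    (41955624, 42006663, -0.0005302800991148567),
]
remapped, row_index, col_index = remap_entries(entries)
```

With the raw IDs, the matrix needs a column index as large as the largest id2 value (about 42 million here). After remapping, these five rows need only 5 columns. Keeping col_index around allows the similarity results to be translated back to the original id2 values.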

You're welcome. If this answer solved your problem, please consider accepting it by clicking the check mark. No obligation. – DavidWayne
