Apache Spark Python cosine similarity over DataFrames

For a recommender system, I need to compute the cosine similarity between all the columns of an entire Spark DataFrame.

In pandas I used to do this:

import sklearn.metrics as metrics 
import pandas as pd 
df = pd.DataFrame(...some dataframe over here :D ...)
metrics.pairwise.cosine_similarity(df.T,df.T)

This generates the similarity matrix between the columns (because I pass the transpose).

Is there a way to do the same thing in Spark (Python)?

(I need to apply this to a matrix with millions of rows and thousands of columns, which is why I need to do it in Spark.)

Source

2017-05-11 Valerio Storch

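You can use the built-in columnSimilarities() method on a RowMatrix. It can either compute the exact cosine similarities or estimate them using the DIMSUM method, which is considerably faster on larger datasets. The only difference in usage is that for the latter you have to specify a threshold.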

Here is a small reproducible example:

from pyspark.mllib.linalg.distributed import RowMatrix

# sc is an existing SparkContext (for example, the one the pyspark shell provides)
rows = sc.parallelize([(1, 2, 3), (4, 5, 6), (7, 8, 9), (10, 11, 12)])

# Convert to RowMatrix 
mat = RowMatrix(rows) 

# Calculate exact and approximate similarities 
exact = mat.columnSimilarities() 
approx = mat.columnSimilarities(0.05) 

# Output 
exact.entries.collect() 
[MatrixEntry(0, 2, 0.991935352214), 
MatrixEntry(1, 2, 0.998441152599), 
MatrixEntry(0, 1, 0.997463284056)]
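
To read this output: each MatrixEntry(i, j, value) gives the cosine similarity between column i and column j of the matrix. Only pairs with i < j appear, because columnSimilarities() returns an upper-triangular matrix (the similarity is symmetric, and every column has similarity 1 with itself). The plain-Python sketch below does not use Spark; it just recomputes the three values from the same four rows as a sanity check. The helper name cosine is made up for this illustration.

```python
from itertools import combinations
from math import sqrt

# Same four rows as in the Spark example above.
rows = [(1, 2, 3), (4, 5, 6), (7, 8, 9), (10, 11, 12)]

# Transpose so that each tuple holds one column of the matrix.
columns = list(zip(*rows))


def cosine(u, v):
    """Cosine similarity of two equal-length sequences (hypothetical helper)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


# Only pairs with i < j, mirroring the upper-triangular entries Spark returns.
similarities = {
    (i, j): cosine(columns[i], columns[j])
    for i, j in combinations(range(len(columns)), 2)
}

for (i, j), value in sorted(similarities.items()):
    print(i, j, value)
```

The key (0, 2) corresponds to MatrixEntry(0, 2, ...) above, that is, the similarity between the first and third columns. This only works for data that fits in memory, so it is a way to check the meaning of the entries, not a replacement for the Spark call.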

Source

2017-05-11 17:46:42 mtoto

How can I do this for rows instead of columns? – Charleslmh

@mtoto Do you know how to implement the same thing in Scala? https://stackoverflow.com/questions/47010126/calculate-cosine-similarity-spark-dataframe

Can you explain the MatrixEntry results? For example, what do the 0 and 2 mean?
