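Given a column of dense vectors that contain NaN entries, I want to compute the correlation between the vector components (columns). Is there a way to do this without disassembling the vectors to clean up the values? In other words, how do I compute a Spark correlation on columns that contain nulls?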
#pyspark
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.mllib.linalg import Vectors as MlVectors # (
from pyspark.mllib.stat import Statistics


def get_data():
    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [
            (Vectors.dense(1., 3., 2.), 0),
            (Vectors.dense(None, 4., 1.), 1),
            (Vectors.dense(3., None, 0.), 2),
            (Vectors.dense(4., 12., None), 3),
            (Vectors.dense(5., 0., 1.), 5),
            (Vectors.dense(6., -1., 0.), 6)], ["features", "foo"])
    return df


def correlation(df):
    digestible_data = df.select("features").rdd.map(lambda row: MlVectors.dense(row[0]))
    print(Statistics.corr(digestible_data))


if __name__ == '__main__':
    correlation(get_data())

# OUTPUT:
# [[ 1. nan nan]
# [ nan 1. nan]
# [ nan nan 1.]]
I am only interested in the last column (row) of the output matrix, but that is beside the point of the question.
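To make the desired result concrete, here is a minimal plain-Python sketch of what I mean by a correlation that ignores NaN entries: for each pair of columns, only the rows where both values are present are used (pairwise deletion). This is not a Spark API. The helper name `pairwise_corr` is hypothetical and only illustrates the semantics on the same six rows as above.

```python
import math


def pairwise_corr(rows):
    """Pearson correlation matrix with pairwise deletion of NaN values.

    Hypothetical helper for illustration only; it is not part of Spark.
    """
    n_cols = len(rows[0])
    result = [[float("nan")] * n_cols for _ in range(n_cols)]
    for i in range(n_cols):
        for j in range(i, n_cols):
            # Keep only the rows where both columns have a real value.
            pairs = [(r[i], r[j]) for r in rows
                     if not math.isnan(r[i]) and not math.isnan(r[j])]
            if len(pairs) < 2:
                continue  # not enough data, leave NaN
            mean_x = sum(x for x, _ in pairs) / len(pairs)
            mean_y = sum(y for _, y in pairs) / len(pairs)
            cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
            var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
            var_y = sum((y - mean_y) ** 2 for _, y in pairs)
            if var_x == 0 or var_y == 0:
                continue  # constant column, correlation undefined
            result[i][j] = result[j][i] = cov / math.sqrt(var_x * var_y)
    return result


nan = float("nan")
rows = [
    [1., 3., 2.],
    [nan, 4., 1.],
    [3., nan, 0.],
    [4., 12., nan],
    [5., 0., 1.],
    [6., -1., 0.],
]
matrix = pairwise_corr(rows)
for line in matrix:
    print(line)
```

With this definition every entry of the matrix is a finite number for the sample data, because each pair of columns still shares four complete rows. What I am looking for is a way to get the same kind of result from Spark while keeping the data in vector form.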