
Feature normalization algorithm in Spark

I am trying to understand Spark's normalization algorithm. My small test set contains 5 vectors:

{0.95, 0.018, 0.0, 24.0, 24.0, 14.4, 70000.0}, 
{1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 70000.0}, 
{-1.0, -1.0, -1.0, -1.0, -1.0, -1.0, 70000.0}, 
{-0.95, 0.018, 0.0, 24.0, 24.0, 14.4, 70000.0}, 
{0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 70000.0}, 

I expected new Normalizer().transform(vectors) to create a JavaRDD in which every vector's features are normalized as (v - mean) / stdev, computed per feature: across all values of feature-0, across all values of feature-1, and so on.

The resulting set is:

[-1.4285714276967932E-5,-1.4285714276967932E-5,-1.4285714276967932E-5,-1.4285714276967932E-5,-1.4285714276967932E-5,-1.4285714276967932E-5,0.9999999993877552] 
[1.357142668768307E-5,2.571428214508371E-7,0.0,3.428570952677828E-4,3.428570952677828E-4,2.057142571606697E-4,0.9999998611976999] 
[-1.357142668768307E-5,2.571428214508371E-7,0.0,3.428570952677828E-4,3.428570952677828E-4,2.057142571606697E-4,0.9999998611976999] 
[1.4285714276967932E-5,1.4285714276967932E-5,1.4285714276967932E-5,1.4285714276967932E-5,1.4285714276967932E-5,1.4285714276967932E-5,0.9999999993877552] 
[0.0,0.0,0.0,0.0,0.0,0.0,1.0] 

Note that the original value 70000.0, identical in every vector, ends up with a different "normalized" value in each row. Also, how, for example, is 1.357142668768307E-5 calculated when the feature's values are 0.95, 1, -1, -0.95, 0? What is more, if I remove one of the features, the results are different. I could not find any documentation on this behaviour.

What I am really asking is: how do I correctly standardize all the vectors in the RDD?


Are you sure your input is correct? If you calculate it by hand, what do you think the stdev is? –

Answer


Your expectation is simply incorrect. As is clearly stated in the official documentation, "Normalizer scales individual samples to have unit L^p norm", where the default value of p is 2. Ignoring issues of numerical precision:

import org.apache.spark.mllib.feature.Normalizer
import org.apache.spark.mllib.linalg.Vectors

val rdd = sc.parallelize(Seq(
    Vectors.dense(0.95, 0.018, 0.0, 24.0, 24.0, 14.4, 70000.0), 
    Vectors.dense(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 70000.0), 
    Vectors.dense(-1.0, -1.0, -1.0, -1.0, -1.0, -1.0, 70000.0), 
    Vectors.dense(-0.95, 0.018, 0.0, 24.0, 24.0, 14.4, 70000.0), 
    Vectors.dense(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 70000.0))) 

// The default Normalizer uses p = 2, i.e. each vector is divided by its own L2 norm
val normalizer = new Normalizer()
val transformed = normalizer.transform(rdd)
transformed.map(_.toArray.sum).collect 
// Array[Double] = Array(1.0009051182149054, 1.000085713673417, 
// 0.9999142851020933, 1.00087797536153, 1.0)
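
To see where a value like 1.357142668768307E-5 comes from: Normalizer divides every component of a vector by that same vector's L2 norm, which in this data set is dominated by the 70000.0 entry. Here is a minimal sketch of the hand calculation (plain Scala, no Spark required; the variable names are my own):

// First input vector from the question
val v = Array(0.95, 0.018, 0.0, 24.0, 24.0, 14.4, 70000.0)

// L2 norm = square root of the sum of squared components, roughly 70000.0097 here
val norm = math.sqrt(v.map(x => x * x).sum)

// Every component is divided by that norm,
// so 0.95 / 70000.0097... ≈ 1.357142668768307E-5
val normalized = v.map(_ / norm)

Because each vector has its own norm, the shared 70000.0 entry maps to a slightly different value in every row, and dropping a feature changes the norm and therefore all of the outputs.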

MLlib does not provide the functionality you need here, but you can use StandardScaler from ML:

import org.apache.spark.ml.feature.StandardScaler 

// toDF needs the SQL implicits in scope (already imported in the spark-shell)
import sqlContext.implicits._

val df = rdd.map(Tuple1(_)).toDF("features")

val scaler = new StandardScaler() 
    .setInputCol("features") 
    .setOutputCol("scaledFeatures") 
    .setWithStd(true) 
    .setWithMean(true) 

val transformedDF = scaler.fit(df).transform(df) 

transformedDF.select($"scaledFeatures").show(5, false)

// +--------------------------------------------------------------------------------------------------------------------------+ 
// |scaledFeatures                           | 
// +--------------------------------------------------------------------------------------------------------------------------+ 
// |[0.9740388301169303,0.015272022105217588,0.0,1.0938637007095298,1.0938637007095298,1.0910691283447955,0.0]    | 
// |[1.0253040317020319,1.4038947727833362,1.414213562373095,-0.6532797101459693,-0.6532797101459693,-0.6010982697825494,0.0] | 
// |[-1.0253040317020319,-1.4242574689236265,-1.414213562373095,-0.805205224133404,-0.805205224133404,-0.8536605680105113,0.0]| 
// |[-0.9740388301169303,0.015272022105217588,0.0,1.0938637007095298,1.0938637007095298,1.0910691283447955,0.0]    | 
// |[0.0,-0.010181348070145075,0.0,-0.7292424671396867,-0.7292424671396867,-0.7273794188965303,0.0]       | 
// +--------------------------------------------------------------------------------------------------------------------------+ 
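
These numbers agree with the (v - mean) / stdev the question expected: for feature-0 the values are 0.95, 1.0, -1.0, -0.95 and 0.0, so the mean is 0 and the sample stdev is about 0.9753, which gives 0.95 / 0.9753 ≈ 0.974, matching the first row above. If you need an RDD of vectors rather than a DataFrame afterwards, one possible way back is sketched below (assuming Spark 1.x, where the ML pipeline still uses org.apache.spark.mllib.linalg.Vector; on 2.x it would be org.apache.spark.ml.linalg.Vector):

import org.apache.spark.mllib.linalg.Vector

// Pull the scaled column back out as an RDD[Vector]
val scaledRdd = transformedDF
    .select("scaledFeatures")
    .rdd
    .map(_.getAs[Vector](0))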

**Note: StandardScaler and Normalizer are different animals!** StandardScaler works column-wise across the vector column, subtracting the mean and then dividing by the stdev. Normalizer works on each vector separately and divides it by its norm. – WillemM


@WillemM Yes, they are. That is exactly the issue here. – zero323