2015-02-10 122 views
2

Spark MLlib: apply a function to all the elements of a RowMatrix

I have a RowMatrix xw:

scala> xw 
res109: org.apache.spark.mllib.linalg.distributed.RowMatrix = [email protected] 

and I would like to apply a function to each of its elements:

f(x)=exp(-x*x)

The type of the matrix rows can be seen by inspecting the first one:

scala> xw.rows.first 

res110: org.apache.spark.mllib.linalg.Vector = [0.008930720313311474,0.017169380001300985,-0.013414238595719104,0.02239106636801034,0.023009502628798143,0.02891937604244297,0.03378470969100948,0.03644030110678057,0.0031586143217048825,0.011230244437457062,0.00477455053405408,0.020251682490519785,-0.005429788421130285,0.011578489275815267,0.0019301805575977788,0.022513736483645713,0.009475039307158668,0.019457912132044935,0.019209006632742498,-0.029811133879879596] 

My main problem is that I cannot use map on a Vector:

scala> xw.rows.map(row => row.map(e => breeze.numerics.exp(e))) 
<console>:44: error: value map is not a member of org.apache.spark.mllib.linalg.Vector 
       xw.rows.map(row => row.map(e => breeze.numerics.exp(e))) 
            ^

scala> 

How can I solve this?

Answers

6

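This assumes you know you actually have a DenseVector (which seems to be the case here). You can call toArray on the vector, which gives you an array that has a map method, and then convert the result back to a DenseVector with Vectors.dense: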

xw.rows.map{row => Vectors.dense(row.toArray.map{e => breeze.numerics.exp(e)})}
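The line above applies plain exp, whereas the question asks for f(x)=exp(-x*x). As a minimal sketch of the same toArray approach with the requested function, wrapped back into a RowMatrix: the helper names f and applyElementwise are hypothetical (they do not appear in the original answer), and scala.math.exp is used here instead of breeze.numerics.exp.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Hypothetical helper: the function from the question, f(x) = exp(-x*x).
def f(x: Double): Double = math.exp(-x * x)

// Hypothetical helper: applies g to every element of every row and
// builds a new RowMatrix from the transformed rows.
// toArray materializes each row as a dense array, so this is best
// suited to data that is already dense.
def applyElementwise(m: RowMatrix, g: Double => Double): RowMatrix =
  new RowMatrix(m.rows.map(row => Vectors.dense(row.toArray.map(g))))
```

With the xw from the question, usage would look like applyElementwise(xw, f), which returns a new RowMatrix and leaves xw unchanged.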

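You could do this for a SparseVector as well; it would be mathematically correct, but converting to an array could be extremely inefficient. Another option is to call row.copy and then use foreachActive, which makes sense for both dense and sparse vectors. However, copy may not be implemented for the particular Vector class you are using, and you cannot mutate the data if you do not know the type of the vector. If you really need to support both sparse and dense vectors, I would do something like this: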

import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vectors}

xw.rows.map {
  case denseVec: DenseVector =>
    Vectors.dense(denseVec.toArray.map { e => breeze.numerics.exp(e) })
  case sparseVec: SparseVector =>
    // Only the stored values of the sparse vector are updated; the indices remain the same.
    // Caveat: implicit zeros stay zero, which is exact only when f(0) = 0 (exp(0) is 1).
    val newValues: Array[Double] = sparseVec.values.map { e => breeze.numerics.exp(e) }
    Vectors.sparse(sparseVec.size, sparseVec.indices, newValues)
}
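One caveat with the sparse branch: because exp(0) = 1, a function such as exp or exp(-x*x) maps the implicit zeros of a sparse vector to nonzero values, so the exact result is dense in general. The following is a sketch under that observation, not part of the original answer: the hypothetical helper mapVector keeps the sparse representation only when the function maps 0 to 0, and otherwise falls back to toArray, which is defined on the Vector trait and therefore works for any implementation.

```scala
import org.apache.spark.mllib.linalg.{SparseVector, Vector, Vectors}

// Hypothetical helper: applies g to every element of v, including
// the implicit zeros of a sparse vector.
def mapVector(v: Vector, g: Double => Double): Vector = v match {
  // Sparsity is preserved only when g(0) == 0, because then the
  // implicit zeros are unchanged by g.
  case sv: SparseVector if g(0.0) == 0.0 =>
    Vectors.sparse(sv.size, sv.indices, sv.values.map(g))
  // Otherwise go through a dense array, so every element
  // (stored or implicit) is transformed.
  case other =>
    Vectors.dense(other.toArray.map(g))
}
```

For the matrix in the question this could be used as xw.rows.map(row => mapVector(row, x => math.exp(-x * x))). The fallback branch costs memory proportional to the full vector size, which is unavoidable when the result really is dense.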
+0

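Thanks for the answer. So for the Vectors.dense case, you suggest that I use the line of code provided? Could you write out the code for the second part of the answer? I am a Scala beginner, so it is not too easy to follow. – Donbeo 2015-02-12 15:36:16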

+0

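@Donbeo I updated the answer a bit. If you are sure you have DenseVectors, go with the first answer. If you might have sparse or dense vectors, you can use the second one, but note that even this does not handle other possible implementations of Vector (for example, it does not handle 'VectorUDT'). – 2015-02-12 17:58:13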