2017-01-19

I have the code below. How do I correctly build TF-IDF sentence vectors in Apache Spark using Java?

    import java.util.Arrays;
    import java.util.List;

    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.mllib.feature.HashingTF;
    import org.apache.spark.mllib.feature.IDF;
    import org.apache.spark.mllib.feature.IDFModel;
    import org.apache.spark.mllib.linalg.Vector;
    import org.apache.spark.sql.SparkSession;

    public class TfIdfExample {
        public static void main(String[] args) {
            // SparkSingleton and KMeansProcessor are helper classes from my project (not shown)
            JavaSparkContext sc = SparkSingleton.getContext();
            SparkSession spark = SparkSession.builder()
                .config("spark.sql.warehouse.dir", "spark-warehouse")
                .getOrCreate();
            JavaRDD<List<String>> documents = sc.parallelize(Arrays.asList(
                Arrays.asList("this is a sentence".split(" ")),
                Arrays.asList("this is another sentence".split(" ")),
                Arrays.asList("this is still a sentence".split(" "))), 2);

            HashingTF hashingTF = new HashingTF();
            documents.cache();
            JavaRDD<Vector> featurizedData = hashingTF.transform(documents);
            // alternatively, CountVectorizer can also be used to get term frequency vectors

            IDF idf = new IDF();
            IDFModel idfModel = idf.fit(featurizedData);

            featurizedData.cache();

            JavaRDD<Vector> tfidfs = idfModel.transform(featurizedData);
            System.out.println(tfidfs.collect());
            KMeansProcessor kMeansProcessor = new KMeansProcessor();
            JavaPairRDD<Vector, Integer> result = kMeansProcessor.Process(tfidfs);
            result.collect().forEach(System.out::println);
        }
    }

I need to get vectors for k-means, but instead I am getting these vectors:

[(1048576,[489554,540177,736740,894973],[0.28768207245178085,0.0,0.0,0.0]), 
    (1048576,[455491,540177,736740,894973],[0.6931471805599453,0.0,0.0,0.0]), 
    (1048576,[489554,540177,560488,736740,894973],[0.28768207245178085,0.0,0.6931471805599453,0.0,0.0])] 

After k-means runs, I get this:

((1048576,[489554,540177,736740,894973],[0.28768207245178085,0.0,0.0,0.0]),1) 
((1048576,[489554,540177,736740,894973],[0.28768207245178085,0.0,0.0,0.0]),0) 
((1048576,[489554,540177,736740,894973],[0.28768207245178085,0.0,0.0,0.0]),1) 
((1048576,[455491,540177,736740,894973],[0.6931471805599453,0.0,0.0,0.0]),1) 
((1048576,[489554,540177,560488,736740,894973],[0.28768207245178085,0.0,0.6931471805599453,0.0,0.0]),1) 
((1048576,[455491,540177,736740,894973],[0.6931471805599453,0.0,0.0,0.0]),0) 
((1048576,[455491,540177,736740,894973],[0.6931471805599453,0.0,0.0,0.0]),1) 
((1048576,[489554,540177,560488,736740,894973],[0.28768207245178085,0.0,0.6931471805599453,0.0,0.0]),0) 
((1048576,[489554,540177,560488,736740,894973],[0.28768207245178085,0.0,0.6931471805599453,0.0,0.0]),1) 

But I think it is not working correctly, because the tf-idf output should look different. I think MLlib already has ready-made methods for this, but I tested the documentation examples and did not get what I need, and I could not find a custom Spark solution. Has anyone worked with this and can tell me what I am doing wrong? Am I perhaps using the MLlib functionality incorrectly?

Answer


What you get back from TF-IDF is a SparseVector.

To understand the values better, let me start with the TF vectors:

(1048576,[489554,540177,736740,894973],[1.0,1.0,1.0,1.0]) 
(1048576,[455491,540177,736740,894973],[1.0,1.0,1.0,1.0]) 
(1048576,[489554,540177,560488,736740,894973],[1.0,1.0,1.0,1.0,1.0]) 

For example, the TF vector corresponding to the first sentence is a 1048576-component (= 2^20) vector with 4 non-zero values at indices 489554, 540177, 736740 and 894973; all other values are zero and therefore not stored in the sparse vector representation.
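To make the (size, indices, values) triple concrete, here is a minimal plain-Java sketch (no Spark dependency; the class and method names are mine, for illustration only) of how a sparse vector expands to its dense form:

```java
import java.util.Arrays;

public class SparseDemo {
    // Expand a sparse (size, indices, values) triple into a dense array,
    // the way MLlib's SparseVector is interpreted: unlisted positions are zero.
    static double[] toDense(int size, int[] indices, double[] values) {
        double[] dense = new double[size]; // all zeros by default
        for (int i = 0; i < indices.length; i++) {
            dense[indices[i]] = values[i];
        }
        return dense;
    }

    public static void main(String[] args) {
        // A small analogue of (1048576,[489554,...],[1.0,...]) with size 8
        double[] dense = toDense(8, new int[]{1, 4, 6}, new double[]{1.0, 1.0, 1.0});
        System.out.println(Arrays.toString(dense));
        // -> [0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0]
    }
}
```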

The dimensionality of the feature vectors equals the number of buckets you hash into: 1048576 = 2^20 buckets in your case.
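The bucket index for each term comes from the hashing trick: hash the term and reduce it modulo the number of buckets. A hedged sketch in plain Java (the exact hash function Spark uses internally may differ from String.hashCode in your Spark version, so the printed indices are illustrative, not guaranteed to match your output):

```java
public class HashingTrickDemo {
    // Non-negative modulo, needed because Java's % can return negative values.
    static int nonNegativeMod(int x, int mod) {
        int raw = x % mod;
        return raw + (raw < 0 ? mod : 0);
    }

    // Map a term to a bucket index in [0, numFeatures).
    static int termIndex(String term, int numFeatures) {
        return nonNegativeMod(term.hashCode(), numFeatures);
    }

    public static void main(String[] args) {
        int numFeatures = 1 << 20; // 1048576 buckets, the HashingTF default
        for (String term : "this is a sentence".split(" ")) {
            System.out.println(term + " -> " + termIndex(term, numFeatures));
        }
    }
}
```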

For a corpus of this size, you should consider reducing the number of buckets:

     HashingTF hashingTF = new HashingTF(32); 

Powers of 2 are recommended, to minimize the number of hash collisions.

Next, you apply the IDF weights:

(1048576,[489554,540177,736740,894973],[0.28768207245178085,0.0,0.0,0.0]) 
(1048576,[455491,540177,736740,894973],[0.6931471805599453,0.0,0.0,0.0]) 
(1048576,[489554,540177,560488,736740,894973],[0.28768207245178085,0.0,0.6931471805599453,0.0,0.0]) 

If we look at the first sentence again, we get 3 zeros. This is expected, because the terms "this", "is" and "sentence" appear in every document of the corpus, so by definition of IDF their weights equal zero.

Why are the zero values still kept in the (sparse) vector? Because in the current implementation the size of the vector is kept the same, and only the values are multiplied by the IDF.
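The exact numbers can be reproduced from MLlib's smoothed IDF formula, idf(t) = log((m + 1) / (df(t) + 1)), where m is the number of documents and df(t) is the number of documents containing term t. A quick check against the vectors above (m = 3; the class name is mine):

```java
public class IdfCheck {
    // Smoothed IDF as documented for Spark MLlib's IDF.
    static double idf(long df, long m) {
        return Math.log((m + 1.0) / (df + 1.0));
    }

    public static void main(String[] args) {
        long m = 3; // three documents in the corpus
        System.out.println(idf(3, m)); // "this"/"is"/"sentence", in all 3 docs -> 0.0
        System.out.println(idf(2, m)); // "a", in 2 docs -> 0.28768207245178085
        System.out.println(idf(1, m)); // "another"/"still", in 1 doc -> 0.6931471805599453
    }
}
```

These are exactly the non-zero values in your output, which confirms the transform is working as designed.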


Thanks, but what do you mean by "I assume the printout is truncated"? I copy-pasted everything from the console. I think tf-idf is not giving me the real vectors. I tried 'new HashingTF(32);' and the indices in the first tuple became smaller, but I don't understand why I get 0.0 for some of the values in the second tuple.


I ran your example, and these values should indeed be equal to zero. I have added more details and an explanation link above; let me know if it helps.


One more question about this vector: '(1048576,[489554,540177,736740,894973],[0.28768207245178085,0.0,0.0,0.0])'. For the values '[0.28768207245178085,0.0,0.0,0.0]', does tf-idf apply the IDF after the TF?