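How do I reduce by key with 3 values?

I am trying to iterate over an RDD of a text file, count each unique word in the file, and then accumulate all the words that follow each unique word, along with their counts. So far, this is what I have: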

import org.apache.spark.{SparkConf, SparkContext}

// Connect to the Spark driver
val conf = new SparkConf().setAppName("WordStats").setMaster("local")
val sparkContext = new SparkContext(conf) // creates a new SparkContext object

// Load the specified file into an RDD of lines
val lines = sparkContext.textFile(System.getProperty("user.dir") + "/" + "basketball_words_only.txt")

// Split each line into words and emit (word, followingWord, 1) triples
val words = lines.flatMap(line => {
    val wordList = line.split(" ")
    for {i <- 0 until wordList.length - 1}
    yield (wordList(i), wordList(i + 1), 1)
})

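In case I have not been clear so far, what I am trying to do is accumulate the set of words that follow each word in the file, along with the number of times each of those words follows its preceding word, in the form:

(PrecedingWord, (FollowingWord, numberOfTimesWordFollows))

whose data type is (String, (String, Int))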

Source

2017-04-23 JGT

You probably want something along these lines:

(for { 
    line <- lines 
    Array(word1, word2) <- line.split("\\s+").sliding(2) 
} yield ((word1, word2), 1)) 
.reduceByKey(_ + _) 
.map({ case ((word1, word2), count) => (word1, (word2, count)) })
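
To check the logic without a cluster, here is a minimal sketch of the same pairing-and-counting idea on plain Scala collections. It is not the Spark code itself: `groupBy` plus a sum stands in for `reduceByKey`, and the object name `FollowerCounts`, the method `pairCounts`, and the sample sentences are made up for illustration.

```scala
// Plain Scala collections stand in for the RDD here; no Spark is required.
// FollowerCounts and pairCounts are hypothetical names used for illustration.
object FollowerCounts {
  /** Counts how often each word is directly followed by each other word. */
  def pairCounts(lines: Seq[String]): Map[(String, String), Int] =
    lines
      .flatMap { line =>
        // sliding(2) yields adjacent word pairs; a one-word line yields a
        // single one-element window, which the pattern match filters out.
        line.trim.split("\\s+").sliding(2).collect {
          case Array(word1, word2) => ((word1, word2), 1)
        }
      }
      .groupBy(_._1) // local analogue of reduceByKey on the composite key
      .map { case (pair, ones) => pair -> ones.map(_._2).sum }

  def main(args: Array[String]): Unit = {
    val sample = Seq("the ball hit the rim", "the ball") // made-up input
    pairCounts(sample)
      .toSeq // keep every (word1, word2) pair; a Map would collapse by word1
      .map { case ((word1, word2), count) => (word1, (word2, count)) }
      .foreach(println)
  }
}
```

The idea is the same as in the RDD version: fold the two words into one composite key so that only the count is left as the value to reduce, then reshape afterwards. Note the `.toSeq` before the final `map`; mapping a `Map` directly to `(word1, ...)` tuples would silently keep only one follower per word.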

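By the way, you may want to make sure that each "line" in the lines RDD corresponds to one sentence, so that you do not count word pairs across sentence boundaries. Also, if you have not already, you may want to look at a natural language processing library such as OpenNLP or CoreNLP for sentence boundary detection, tokenization, and so on.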
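
If pulling in OpenNLP or CoreNLP is too much to start with, the JDK's own `java.text.BreakIterator` can do rough, rule-based sentence splitting. This is a swapped-in alternative, not what those libraries do; `SentenceSplit` and `sentences` are hypothetical names, and an English locale is assumed.

```scala
import java.text.BreakIterator
import java.util.Locale
import scala.collection.mutable.ArrayBuffer

// SentenceSplit and sentences are hypothetical names used for illustration.
object SentenceSplit {
  /** Splits text into sentences with the JDK's rule-based BreakIterator. */
  def sentences(text: String): Seq[String] = {
    val boundaries = BreakIterator.getSentenceInstance(Locale.US) // assumes English
    boundaries.setText(text)
    val out = ArrayBuffer.empty[String]
    var start = boundaries.first()
    var end = boundaries.next()
    // Walk consecutive boundaries until the iterator reports DONE.
    while (end != BreakIterator.DONE) {
      val sentence = text.substring(start, end).trim
      if (sentence.nonEmpty) out += sentence
      start = end
      end = boundaries.next()
    }
    out.toSeq
  }
}
```

Assuming the `lines` RDD from the question, something like `lines.flatMap(line => SentenceSplit.sentences(line))` would then give one sentence per element, though a sentence that spans two physical lines of the file would still be split at the line break.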

Source

2017-04-23 09:41:11
