I'm trying to iterate over an RDD of a text file, count each unique word in the file, and then accumulate, for each unique word, all of the words that follow it together with their counts. This is what I have so far. How do I reduce by key when the values are 3-tuples?
// connecting to the Spark driver
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("WordStats").setMaster("local")
val spark = new SparkContext(conf) //Creates a new SparkContext object
//Loads the specified file into an RDD
val lines = spark.textFile(System.getProperty("user.dir") + "/" + "basketball_words_only.txt")
//Splits the file into individual words
//Splits each line into consecutive word pairs, emitting (precedingWord, followingWord, 1)
val words = lines.flatMap { line =>
  val wordList = line.split(" ")
  for (i <- 0 until wordList.length - 1)
    yield (wordList(i), wordList(i + 1), 1)
}
In case I haven't been clear so far: what I'm trying to do is accumulate, for each word in the file, the set of words that follow it, along with the number of times each following word occurs after its preceding word, in the form:
(PrecedingWord,(FollowingWord,numberOfTimesWordFollows))
where the data type is (String, (String, Int)).
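One common pattern for this (a sketch, not necessarily the only approach): re-key each `(prev, next, 1)` triple as `((prev, next), 1)`, call `reduceByKey(_ + _)` to count each pair, then map `((prev, next), n)` to `(prev, (next, n))` — i.e. `words.map { case (a, b, n) => ((a, b), n) }.reduceByKey(_ + _).map { case ((a, b), n) => (a, (b, n)) }`. The same aggregation logic can be illustrated with plain Scala collections, with no Spark dependency (the object and method names here are hypothetical, for illustration only):

```scala
object WordPairStats {
  // Mirrors the Spark pipeline: group by (prev, next) and sum the counts
  // (like reduceByKey), then re-key the results by the preceding word.
  def followerCounts(pairs: Seq[(String, String, Int)]): Map[String, Seq[(String, Int)]] =
    pairs
      .groupBy { case (prev, next, _) => (prev, next) }   // one group per word pair
      .map { case ((prev, next), occs) => (prev, next, occs.map(_._3).sum) }
      .toSeq
      .groupBy { case (prev, _, _) => prev }              // re-key by preceding word
      .map { case (prev, triples) =>
        prev -> triples.map { case (_, next, n) => (next, n) }
      }

  def main(args: Array[String]): Unit = {
    val pairs = Seq(("the", "ball", 1), ("the", "ball", 1), ("the", "hoop", 1))
    println(followerCounts(pairs))
  }
}
```

On the RDD itself, grouping by `(prev, next)` first keeps the shuffle keyed on the pair, so you never need to reduce a 3-tuple directly; the third element only ever participates as the value being summed.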