
Comparing lines in a file and removing duplicates with Spark and Scala: suppose I have this file and I want to remove the duplicate lines:

buy sansa view sell product player charger world charge player charger receive 
oldest daughter teen daughter player christmas so daughter life line listen sooo hold 
thourghly sansa view delete song time wont wont connect-computer computer put time 
oldest daughter teen daughter player christmas so daughter life line listen sooo hold 
oldest daughter teen daughter player christmas so daughter life line listen sooo hold 

This is the expected output:

buy sansa view sell product player charger world charge player charger receive 
oldest daughter teen daughter player christmas so daughter life line listen sooo hold 
thourghly sansa view delete song time wont wont connect-computer computer put time 

Is there any solution for this in Scala and Spark?

Answers


You appear to be working with the file line by line, so textFile will read it in correctly as an RDD of strings, one element per line. After that, distinct reduces the RDD down to the set of unique lines.

sc.textFile("yourfile.txt") 
    .distinct 
    .saveAsTextFile("distinct.txt") 
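Note that distinct does not preserve the original line order, and saveAsTextFile writes a directory of part files rather than a single text file. A minimal sketch (the path names are placeholders) that collapses the result into one part file before writing:

sc.textFile("yourfile.txt")   // assumes an existing SparkContext named sc
  .distinct()
  .coalesce(1)                      // merge to a single partition so only one part file is written
  .saveAsTextFile("distinct_out")   // creates a directory distinct_out containing part-00000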

You can achieve what you want with the reduceByKey function.

You can use this code:

val textFile = spark.textFile("hdfs://...")   // spark is the SparkContext here
val uLine = textFile
  .map(line => (line, 1))          // pair each line with a count of 1
  .reduceByKey(_ + _)              // sum the counts for identical lines
  .map { case (line, _) => line }  // drop the counts, keeping one copy of each line
uLine.saveAsTextFile("hdfs://...")
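As a side note on this design, the (line, 1) pairs are only there so reduceByKey can collapse identical lines; if you also want to know how often each line occurred, a small variation (the output path is a placeholder) keeps the counts instead of discarding them:

// hypothetical variation: write "count<TAB>line" instead of dropping the counts
textFile.map(line => (line, 1))
  .reduceByKey(_ + _)
  .map { case (line, count) => s"$count\t$line" }
  .saveAsTextFile("hdfs://...")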

Or you can simply use:

val uLine = spark.textFile("hdfs://...").distinct 
uLine.saveAsTextFile("hdfs://...")
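For completeness, a sketch of a self-contained standalone job (assuming Spark 1.x; the object name and paths are placeholders) could look like this:

import org.apache.spark.{SparkConf, SparkContext}

// minimal standalone sketch; "input.txt" and "dedup_output" are placeholder paths
object RemoveDuplicateLines {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("RemoveDuplicateLines")
    val sc = new SparkContext(conf)

    sc.textFile("input.txt")
      .distinct()
      .saveAsTextFile("dedup_output")

    sc.stop()
  }
}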