2015-09-14

How to merge two text files and convert them to a CSV file in Scala

I export a DataFrame using the following code:

df.select("A", "b", "C", "D","E") 
    .write.format("com.databricks.spark.csv") 
    .save("newiris.csv") 

I get two text files, as follows:

part-00000

5.1,3.5,1.4,0.2,Iris-setosa 
4.9,3,1.4,0.2,Iris-setosa 
4.7,3.2,1.3,0.2,Iris-setosa 
4.6,3.1,1.5,0.2,Iris-setosa 
5,3.6,1.4,0.2,Iris-setosa 
5.4,3.9,1.7,0.4,Iris-setosa 

part-00001

6.7,3,5,1.7,Iris-versicolor 
6,2.9,4.5,1.5,Iris-versicolor 
5.7,2.6,3.5,1,Iris-versicolor 
5.5,2.4,3.8,1.1,Iris-versicolor 
5.5,2.4,3.7,1,Iris-versicolor 
5.8,2.7,3.9,1.2,Iris-versicolor 

Now I would like to combine them into a single file, like this:

5.1,3.5,1.4,0.2,Iris-setosa 
4.9,3,1.4,0.2,Iris-setosa 
4.7,3.2,1.3,0.2,Iris-setosa 
4.6,3.1,1.5,0.2,Iris-setosa 
5,3.6,1.4,0.2,Iris-setosa 
5.4,3.9,1.7,0.4,Iris-setosa 
6.7,3,5,1.7,Iris-versicolor 
6,2.9,4.5,1.5,Iris-versicolor 
5.7,2.6,3.5,1,Iris-versicolor 
5.5,2.4,3.8,1.1,Iris-versicolor 
5.5,2.4,3.7,1,Iris-versicolor 
5.8,2.7,3.9,1.2,Iris-versicolor 

and then convert it to CSV. How can I do this in Scala?

Answers

1

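The Scala pieces needed here are scala.io.Source to read each file and get its lines, ++ to append the lines of part-00001 to those of part-00000, and a foreach loop to go through the combined data and write it to a file. The file I/O is the same as in Java.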

scala> import java.io._ 

scala> import scala.io.Source 

scala> val part0 = Source.fromFile("part-00000.txt").getLines 
part0: Iterator[String] = non-empty iterator 

scala> val part1 = Source.fromFile("part-00001.txt").getLines 
part1: Iterator[String] = non-empty iterator 

scala> val part2 = part0.toList ++ part1.toList 
part2: List[String] = List(5.1,3.5,1.4,0.2,Iris-setosa, 4.9,3,1.4,0.2,Iris-setosa, 4.7,3.2,1.3,0.2,Iris-setosa, 4.6,3.1,1.5,0.2,Iris-setosa, 5,3.6,1.4,0.2,Iris-setosa, 5.4,3.9,1.7,0.4,Iris-setosa, 6.7,3,5,1.7,Iris-versicolor, 6,2.9,4.5,1.5,Iris-versicolor, 5.7,2.6,3.5,1,Iris-versicolor, 5.5,2.4,3.8,1.1,Iris-versicolor, 5.5,2.4,3.7,1,Iris-versicolor, 5.8,2.7,3.9,1.2,Iris-versicolor) 

scala> val part00002 = new File("part-00002") 
part00002: java.io.File = part-00002 

scala> val bw = new BufferedWriter(new FileWriter(part00002)) 
bw: java.io.BufferedWriter = java.io.BufferedWriter@...

scala> part2.foreach(p => bw.write(p + "\n")) 


scala> bw.close 

Check the file:

brian:/tmp/ $ cat part-00002                
5.1,3.5,1.4,0.2,Iris-setosa 
4.9,3,1.4,0.2,Iris-setosa 
4.7,3.2,1.3,0.2,Iris-setosa 
4.6,3.1,1.5,0.2,Iris-setosa 
5,3.6,1.4,0.2,Iris-setosa 
5.4,3.9,1.7,0.4,Iris-setosa 
6.7,3,5,1.7,Iris-versicolor 
6,2.9,4.5,1.5,Iris-versicolor 
5.7,2.6,3.5,1,Iris-versicolor 
5.5,2.4,3.8,1.1,Iris-versicolor 
5.5,2.4,3.7,1,Iris-versicolor 
5.8,2.7,3.9,1.2,Iris-versicolor 
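
Outside the REPL, the same steps (read the lines, append them in order, write them out) can be wrapped in a small function that also closes the readers and the writer. This is only a minimal sketch: the names MergeParts and mergeFiles are hypothetical and not part of the session above, and it assumes the part files are plain text and should be concatenated in the order given.

```scala
import java.io.{BufferedWriter, File, FileWriter}
import scala.io.Source

object MergeParts {
  // Hypothetical helper (not from the REPL session): writes the lines of
  // each input file, in the order given, to a single output file.
  def mergeFiles(inputs: Seq[String], output: String): Unit = {
    val bw = new BufferedWriter(new FileWriter(new File(output)))
    try {
      inputs.foreach { path =>
        val src = Source.fromFile(path)
        try {
          // Write each line followed by a newline, as in the foreach above.
          src.getLines().foreach(line => bw.write(line + "\n"))
        } finally src.close() // release the input file handle
      }
    } finally bw.close() // flush and close the output even on failure
  }
}
```

A call such as MergeParts.mergeFiles(Seq("part-00000", "part-00001"), "newiris.csv") would then write the combined rows directly to a file with a .csv name; the output name here is just the one used in the question.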

Thank you very much! When I run val part00002 = new File("part-00002") I get the error "not found: type File". Do I need to define File, or import something? – Tong

'import java.io._' should do it. – Brian

Thanks! It works perfectly. One more question: if part-00000 and part-00001 were in CSV format, would this operation be easier? – Tong
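
On the CSV follow-up: the part files shown in the question already hold comma-separated rows, so the merged file can be read as CSV as it stands. A rough way to check this is to parse each line into its five fields. The sketch below is an assumption-laden illustration, not part of the thread: CsvCheck, IrisRow and parseLine are hypothetical names, and the naive split on commas assumes no field is quoted or contains a comma.

```scala
import scala.util.Try

object CsvCheck {
  // Hypothetical row type for the five columns selected in the question.
  final case class IrisRow(a: Double, b: Double, c: Double, d: Double, e: String)

  // Naive parser: splits on commas and trims surrounding spaces.
  // Returns None when the line does not have exactly five fields or
  // when one of the first four fields is not a number.
  def parseLine(line: String): Option[IrisRow] =
    line.trim.split(",").map(_.trim) match {
      case Array(a, b, c, d, e) =>
        Try(IrisRow(a.toDouble, b.toDouble, c.toDouble, d.toDouble, e)).toOption
      case _ => None
    }
}
```

Mapping parseLine over the lines of the merged file and checking that every result is defined would confirm that each row has the expected shape.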