2016-09-08 88 views
3

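Convert a Scala list to a DataFrame or Dataset

I am new to Scala. I am trying to convert a Scala list (which holds the results of some computations on a source DataFrame) to a DataFrame or Dataset. I have not found any direct way to do this. I tried the following approaches to convert my list to a Dataset, but they do not seem to work. The three situations are shown below.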

Can someone give me some hints on how to do this conversion? Thanks.

import org.apache.spark.sql.{DataFrame, Row, SQLContext, DataFrameReader} 
import java.sql.{Connection, DriverManager, ResultSet, Timestamp} 
import scala.collection._ 

case class TestPerson(name: String, age: Long, salary: Double) 
var tom = new TestPerson("Tom Hanks",37,35.5) 
var sam = new TestPerson("Sam Smith",40,40.5) 

val PersonList = mutable.MutableList[TestPerson]() 

//Adding data in list 
PersonList += tom 
PersonList += sam 

//Situation 1: Trying to create dataset from List of objects:- Result:Error 
//Throwing error 
var personDS = Seq(PersonList).toDS() 
/* 
ERROR: 
error: Unable to find encoder for type stored in a Dataset. Primitive types 
    (Int, String, etc) and Product types (case classes) are supported by  
importing sqlContext.implicits._ Support for serializing other types will 
be added in future releases. 
    var personDS = Seq(PersonList).toDS() 

*/ 
//Situation 2: Trying to add data 1-by-1 :- Result: not working as desired. 
//The last record overwrites any existing data in the DS 
var personDS = Seq(tom).toDS() 
personDS = Seq(sam).toDS() 

personDS += sam //not working. throwing error 


//Situation 3: Working. However, I have consolidated data in the list 
//which I want to convert to a DS. Looping over the list to pass its 
//elements here one by one would work, but it adds an extra loop to the 
//code, which I want to avoid. 
var personDS = Seq(tom,sam).toDS() 
scala> personDS.show() 
+---------+---+------+ 
|     name|age|salary| 
+---------+---+------+ 
|Tom Hanks| 37|  35.5| 
|Sam Smith| 40|  40.5| 
+---------+---+------+ 
+0

What are your Spark and Scala versions?

+0

The Spark version is 1.6.1. – Leo

Answers

6

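Try it without Seq. PersonList is already a sequence of TestPerson, so wrapping it in Seq(...) yields a Seq[MutableList[TestPerson]], and Spark has no encoder for that element type. Call toDS() on the list itself: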

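import scala.collection.mutable 
// In the Spark 1.6 shell, sqlContext is predefined; its implicits provide toDS() and toDF() 
import sqlContext.implicits._ 

case class TestPerson(name: String, age: Long, salary: Double) 
val tom = TestPerson("Tom Hanks",37,35.5) 
val sam = TestPerson("Sam Smith",40,40.5) 
val PersonList = mutable.MutableList[TestPerson]() 
PersonList += tom 
PersonList += sam 

// Convert the list directly to a Dataset 
val personDS = PersonList.toDS() 
println(personDS.getClass) 
personDS.show() 

// Convert the same list directly to a DataFrame 
val personDF = PersonList.toDF() 
println(personDF.getClass) 
personDF.show() 
personDF.select("name", "age").show() 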

Output:

class org.apache.spark.sql.Dataset 

+---------+---+------+ 
|     name|age|salary| 
+---------+---+------+ 
|Tom Hanks| 37|  35.5| 
|Sam Smith| 40|  40.5| 
+---------+---+------+ 

class org.apache.spark.sql.DataFrame 

+---------+---+------+ 
|     name|age|salary| 
+---------+---+------+ 
|Tom Hanks| 37|  35.5| 
|Sam Smith| 40|  40.5| 
+---------+---+------+ 

+---------+---+ 
|     name|age| 
+---------+---+ 
|Tom Hanks| 37| 
|Sam Smith| 40| 
+---------+---+ 
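The nesting problem behind Situation 1 can be seen without Spark at all. The following is only a minimal sketch in plain Scala: ListBuffer stands in for MutableList, and the object name SeqNestingSketch and its helper buildList are made up for illustration.

```scala
import scala.collection.mutable.ListBuffer

case class TestPerson(name: String, age: Long, salary: Double)

// Hypothetical helper object, not part of the original question or answer.
object SeqNestingSketch {
  // Builds the same two-person list used in the question.
  def buildList(): ListBuffer[TestPerson] = {
    val personList = ListBuffer[TestPerson]()
    personList += TestPerson("Tom Hanks", 37, 35.5)
    personList += TestPerson("Sam Smith", 40, 40.5)
    personList
  }

  def main(args: Array[String]): Unit = {
    val personList = buildList()

    // Seq(personList) is a one-element sequence whose single element is the
    // whole list, so its element type is a collection, not TestPerson.
    val nested: Seq[ListBuffer[TestPerson]] = Seq(personList)
    println(nested.size) // prints 1

    // The list itself already is a sequence of TestPerson.
    val flat: Seq[TestPerson] = personList.toList
    println(flat.size) // prints 2
  }
}
```

Since the element type of Seq(PersonList) is a collection rather than a case class, the implicit encoder lookup in toDS() fails, which matches the error shown in the question.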

Also, make sure to move the declaration of the case class TestPerson outside the scope of your object.

+0

Thanks for the solution above; it works for Dataset. My ultimate goal is to get the data into a DataFrame. I used the command "scala> val RowsDF = sc.parallelize(personDS).toDF()" but got the error ":51: error: type mismatch; found: org.apache.spark.sql.Dataset[TestPerson] required: Seq[?] val RowsDF = sc.parallelize(personDS).toDF()" – Leo

+0

I got this: scala> val RowsDF = personDS.toDF() RowsDF: org.apache.spark.sql.DataFrame = [name: string, age: bigint, salary: double] – Leo
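For the follow-up above and for Situation 2 in the question, here is a rough sketch of one possible approach, assuming a Spark 1.6 spark-shell session where sqlContext is predefined. The names tomDS, samDS, combinedDS and combinedDF are made up for illustration. A Dataset cannot be appended to in place, so the sketch builds a new one with union and then converts it with toDF(); sc.parallelize is not needed because the data is already distributed.

```scala
// Assumption: this runs in a Spark 1.6 spark-shell, where `sqlContext` exists.
import sqlContext.implicits._

case class TestPerson(name: String, age: Long, salary: Double)

// Two single-row Datasets, as in Situation 2 of the question.
val tomDS = Seq(TestPerson("Tom Hanks", 37, 35.5)).toDS()
val samDS = Seq(TestPerson("Sam Smith", 40, 40.5)).toDS()

// Datasets are immutable: "adding" a record means creating a new Dataset
// that contains the rows of both inputs.
val combinedDS = tomDS.union(samDS)

// A Dataset converts to a DataFrame directly.
val combinedDF = combinedDS.toDF()
combinedDF.printSchema()
```

If all the records are already in a local list, calling toDS() or toDF() once on that list is simpler than combining row by row.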
