Converting an RDD to a DataFrame

Hi, I'm new to Spark and I'm trying to convert an RDD to a DataFrame. The RDD is read from a folder that contains many .txt files, and each file holds a piece of text. Assume my RDD is this:

val data = sc.textFile("data")

I want to convert the data into a DataFrame like this:

+------------+------+
|text        | code |
+------------+------+
|data of txt1| 1.0  |
|data of txt2| 1.0  |
+------------+------+

So the column "text" should hold the raw data of each txt file, and the column "code" should be 1.0. Any help would be appreciated.

Source

2016-01-28 luis

Have you even tried looking at the documentation? – Niemand

val data = sc.textFile("data.txt")

// The schema is encoded in a string
val schemaString = "text code"

// Import Row.
import org.apache.spark.sql.Row

// Import Spark SQL data types
import org.apache.spark.sql.types.{StructType, StructField, StringType}

// Generate the schema based on the string of schema
val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))

// Convert records of the RDD (data) to Rows.
// This assumes every line is comma-separated as "text,code".
val rowRDD = data.map(_.split(",")).map(p => Row(p(0), p(1).trim))

// Apply the schema to the RDD.
val dataDataFrame = sqlContext.createDataFrame(rowRDD, schema)

// Register the DataFrame as a table.
dataDataFrame.registerTempTable("data")

// SQL statements can be run by using the sql methods provided by sqlContext.
// The schema only defines the columns "text" and "code", so select one of those.
val results = sqlContext.sql("SELECT text FROM data")
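The snippet above stores both columns as strings. If "code" should be numeric, as in the table in the question, a typed schema might look roughly like the sketch below. The object name `TypedSchemaExample`, the helper `toDataFrame`, and the in-memory sample texts are made up for illustration; they are not part of the answer.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

object TypedSchemaExample {
  // "text" is a string column, "code" a double column.
  val schema: StructType = StructType(Seq(
    StructField("text", StringType, nullable = true),
    StructField("code", DoubleType, nullable = false)))

  // Hypothetical helper: turns plain strings into rows with a constant code of 1.0.
  def toDataFrame(sqlContext: SQLContext, texts: Seq[String]): DataFrame = {
    val rowRDD = sqlContext.sparkContext.parallelize(texts).map(t => Row(t, 1.0))
    sqlContext.createDataFrame(rowRDD, schema)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("typed-schema").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    // Sample strings standing in for the contents of the txt files.
    toDataFrame(sqlContext, Seq("data of txt1", "data of txt2")).show()
    sc.stop()
  }
}
```

The value placed in each Row has to match the declared type, which is why the code is written as the Double literal 1.0 rather than the string "1.0".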

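Adding the data from all the files is not a good idea, because all of the data will be loaded into memory. Reading one file at a time would be a better approach.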

However, depending on your use case, if you do need the data from all the files, you will have to append the RDDs in some way.
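One way to append RDDs is `SparkContext.union`. The following is only a rough sketch under that assumption: the object name `AppendRdds` and the helper `readAll` are invented for illustration, and the file paths are expected on the command line.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object AppendRdds {
  // Hypothetical helper: reads each path into its own RDD,
  // then appends them all into a single RDD with SparkContext.union.
  def readAll(sc: SparkContext, paths: Seq[String]): RDD[String] =
    sc.union(paths.map(p => sc.textFile(p)))

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("append-rdds").setMaster("local[*]")
    val sc = new SparkContext(conf)
    // The file paths are taken from the command-line arguments.
    val all = readAll(sc, args.toSeq)
    println(all.count())
    sc.stop()
  }
}
```

Nothing is read until an action such as `count()` runs, so the union itself is cheap to build.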

Hope that answers your question! Cheers! :)

Source

2016-01-28 11:29:41 Kaushal

Thanks for the edit @Sumit – Kaushal

With Spark SQL you can use the "toDF" method

http://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection

In your case, do this:

case class Data(text: String, code: Float)

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// this is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._

// sc.textFile returns an RDD[String] with one element per line,
// so each line becomes "text" and "code" is set to the constant 1.0
val data = sc.textFile("data")
val dataFrame = data.map(line => Data(line, 1.0f)).toDF()
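Note that `sc.textFile` gives one record per line, while the question asks for one row per file. A possible variation, sketched here as an assumption rather than as part of the answer, uses `sc.wholeTextFiles`, which returns (path, content) pairs with one pair per file. The object name `WholeFilesToDF` and the helper `filesToDF` are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}

// Top-level case class so Spark can derive the schema by reflection.
case class Data(text: String, code: Float)

object WholeFilesToDF {
  // Hypothetical helper: one row per file in `dir`, with the file content
  // as "text" and the constant 1.0 as "code".
  def filesToDF(sqlContext: SQLContext, dir: String): DataFrame = {
    import sqlContext.implicits._
    sqlContext.sparkContext
      .wholeTextFiles(dir)                             // RDD[(path, content)]
      .map { case (_, content) => Data(content, 1.0f) }
      .toDF()
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("whole-files-to-df").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    // "data" is the folder name used in the question.
    filesToDF(sqlContext, "data").show()
    sc.stop()
  }
}
```

Each file is loaded as a single string, so this is best suited to many small files rather than a few very large ones.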

Source

2016-01-29 12:55:43 rhernando
