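I want to load a tab-delimited file that contains two timestamp columns and generate a computed column: the difference, in days, between one of those columns and the current timestamp. I have already converted the RDD to a SchemaRDD by calling registerTempTable() on it. After that I have pretty much hit a wall, because every subsequent operation depends on this computed field. Is it possible in Apache Spark to do a date diff between a timestamp column and the current timestamp?
Here is what I have done so far. Thanks for your help!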
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
val conf = new SparkConf().setMaster("local[2]").setAppName("CookieSummary")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
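// Brings the implicit conversion from an RDD of case classes to a SchemaRDD into scope
import sqlContext.createSchemaRDD
// Both timestamp columns are read as plain strings and cast to TIMESTAMP in SQL below
case class CookieDates(CLPartnerSyncCreateDT: String, CookieSyncRequestDT: String)
val cookies = sc.textFile("/Users/shubhro/Documents/dataFiles/clean/worker1.01012015.1420081201_sub.tsv").map(_.split("\t")).map(p => CookieDates(p(0), p(1)))
cookies.registerTempTable("cookies")
// This only casts the two columns; the day difference against "now" is the part I am missing
val allCookies = sqlContext.sql("SELECT CAST(CLPartnerSyncCreateDT AS TIMESTAMP), CAST(CookieSyncRequestDT AS TIMESTAMP) FROM cookies")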
allCookies.collect().foreach(println)
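One direction that might work is to compute the difference in plain Scala before registering the table, rather than in SQL. The following is only a minimal sketch under assumptions: `DateDiffSketch` and `daysBetween` are hypothetical names, and it assumes the timestamp strings use the `yyyy-MM-dd HH:mm:ss` layout that `java.sql.Timestamp.valueOf` accepts, which I have not confirmed against the actual file.

```scala
import java.sql.Timestamp
import java.util.concurrent.TimeUnit

object DateDiffSketch {
  // Whole days elapsed between a timestamp string and a reference instant.
  // Assumption: `ts` is formatted as "yyyy-MM-dd HH:mm:ss" (optionally with
  // fractional seconds), which is what Timestamp.valueOf parses.
  // Partial days are truncated, so 9.5 days yields 9.
  def daysBetween(ts: String, now: Timestamp): Long =
    TimeUnit.MILLISECONDS.toDays(now.getTime - Timestamp.valueOf(ts).getTime)
}
```

In the job above, this could be applied inside the existing `map` step, for example by adding a third field to the case class and filling it with `DateDiffSketch.daysBetween(p(0), now)`, where `now` is a `Timestamp` captured once on the driver so that every row is compared against the same instant.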