
Date and interval addition in Spark SQL: I want to run a simple SQL query against a DataFrame from the spark-shell. The query adds an interval of 1 week to a certain date, like this:

The original query:

scala> spark.sql("select Cast(table1.date2 as Date) + interval 1 week from table1").show() 

Now, when I ran a test:

scala> spark.sql("select Cast('1999-09-19' as Date) + interval 1 week from table1").show() 

I get the correct result:

+----------------------------------------------------------------------------+ 
|CAST(CAST(CAST(1999-09-19 AS DATE) AS TIMESTAMP) + interval 1 weeks AS DATE)| 
+----------------------------------------------------------------------------+ 
|                                                                  1999-09-26| 
+----------------------------------------------------------------------------+ 

(simply adding 7 days to the 19th gives the 26th)

But when I just change the year to 1997 instead of 1999, the result changes!

scala> spark.sql("select Cast('1997-09-19' as Date) + interval 1 week from table1").show() 

+----------------------------------------------------------------------------+ 
|CAST(CAST(CAST(1997-09-19 AS DATE) AS TIMESTAMP) + interval 1 weeks AS DATE)| 
+----------------------------------------------------------------------------+ 
|                                                                  1997-09-25| 
+----------------------------------------------------------------------------+ 

Why did the result change? Shouldn't it be the 26th rather than the 25th?

So, is this a bug in Spark SQL related to some kind of loss in an intermediate calculation, or am I missing something?

Answer


This is most likely a problem with the conversion to local time. INTERVAL casts the data to TIMESTAMP and then back to DATE:

scala> spark.sql("SELECT CAST('1997-09-19' AS DATE) + INTERVAL 1 weeks").explain 
== Physical Plan == 
*Project [10130 AS CAST(CAST(CAST(1997-09-19 AS DATE) AS TIMESTAMP) + interval 1 weeks AS DATE)#19] 
+- Scan OneRowRelation[] 

(note the second and third CASTs), and Spark is known to be inconsistent when handling timestamps.
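To see the local-time dependence directly, here is a minimal sketch (assuming Spark 2.2+, where the spark.sql.session.timeZone option exists; Africa/Cairo is only an illustrative zone whose DST happened to end during that week in 1997):

// No DST transition during that week in UTC, so the result is as expected.
spark.conf.set("spark.sql.session.timeZone", "UTC")
spark.sql("SELECT CAST('1997-09-19' AS DATE) + INTERVAL 1 WEEK").show()
// 1997-09-26

// In a zone whose clocks fell back during that week, local midnight plus
// 7 * 24 hours lands at 23:00 of the previous day, so the DATE comes out
// one day early.
spark.conf.set("spark.sql.session.timeZone", "Africa/Cairo")
spark.sql("SELECT CAST('1997-09-19' AS DATE) + INTERVAL 1 WEEK").show()
// 1997-09-25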

DATE_ADD should exhibit more stable behavior:

scala> spark.sql("SELECT DATE_ADD(CAST('1997-09-19' AS DATE), 7)").explain 
== Physical Plan == 
*Project [10130 AS date_add(CAST(1997-09-19 AS DATE), 7)#27] 
+- Scan OneRowRelation[] 
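For instance (a small sketch; DATE_ADD works purely in day counts on the DATE type, so no TIMESTAMP round trip and no time-zone arithmetic is involved; the plus_one_week alias is just for readability):

spark.sql("SELECT DATE_ADD(CAST('1997-09-19' AS DATE), 7) AS plus_one_week").show()
// +-------------+
// |plus_one_week|
// +-------------+
// |   1997-09-26|
// +-------------+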

It is also inconsistent in another way: if you have a cluster that spans two time zones, timestamp-to-date conversion falls apart completely (unless you use methods that take an explicit time zone every time).
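A minimal sketch of the explicit-time-zone methods that comment alludes to, using the built-in to_utc_timestamp / from_utc_timestamp SQL functions (note: the rendering produced by show() still depends on the session time zone, and Africa/Cairo is again only an illustrative zone):

// Interpret a naive timestamp as wall-clock time in a named zone and
// shift it to UTC, instead of relying on each executor's JVM default.
spark.sql("SELECT to_utc_timestamp(CAST('1997-09-19 00:00:00' AS TIMESTAMP), 'Africa/Cairo')").show()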