
Try as I might, I cannot create a case class with enough precision to handle DecimalType(38,0). How can I work with BigInts in a Dataset?

I have tried:

case class BigId(id: scala.math.BigInt) 

This runs into an ExpressionEncoder error: https://issues.apache.org/jira/browse/SPARK-20341

I have also tried:

case class BigId(id: java.math.BigDecimal) 

But this runs into an error where the only precision allowed is DecimalType(38,18). I have even created a custom encoder, borrowing heavily from the Spark source code. The biggest change is that I default the schema for java.math.BigDecimal to DecimalType(38,0); I could find no reason to change the serializer or deserializer. When I pass my custom encoder to Dataset.as or Dataset.map, I get the following stack trace:

User class threw exception: org.apache.spark.sql.AnalysisException: Cannot up cast `id` from decimal(38,0) to decimal(38,18) as it may truncate 
The type path of the target object is: 
- field (class: "java.math.BigDecimal", name: "id") 
- root class: "BigId" 
You can either add an explicit cast to the input data or choose a higher precision type of the field in the target object; 
org.apache.spark.sql.AnalysisException: Cannot up cast `id` from decimal(38,0) to decimal(38,18) as it may truncate 
The type path of the target object is: 
- field (class: "java.math.BigDecimal", name: "id") 
- root class: "BigId" 
You can either add an explicit cast to the input data or choose a higher precision type of the field in the target object; 
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveUpCast$$fail(Analyzer.scala:1998) 
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$34$$anonfun$applyOrElse$14.applyOrElse(Analyzer.scala:2020) 
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$34$$anonfun$applyOrElse$14.applyOrElse(Analyzer.scala:2015) 
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286) 
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286) 
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) 
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:285) 
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:291) 
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:291) 
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328) 
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186) 
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326) 
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:291) 
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:291) 
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:291) 
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5$$anonfun$apply$11.apply(TreeNode.scala:357) 
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) 
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) 
    at scala.collection.immutable.List.foreach(List.scala:381) 
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) 
    at scala.collection.immutable.List.map(List.scala:285) 
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:355) 
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186) 
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326) 
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:291) 
    at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionDown$1(QueryPlan.scala:235) 
    at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:245) 
    at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$7.apply(QueryPlan.scala:254) 
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186) 
    at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:254) 
    at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:223) 
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$34.applyOrElse(Analyzer.scala:2015) 
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$34.applyOrElse(Analyzer.scala:2011) 
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61) 
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61) 
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) 
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60) 
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$.apply(Analyzer.scala:2011) 
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$.apply(Analyzer.scala:1996) 
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85) 
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82) 
    at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124) 
    at scala.collection.immutable.List.foldLeft(List.scala:84) 
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82) 
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74) 
    at scala.collection.immutable.List.foreach(List.scala:381) 
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74) 
    at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.resolveAndBind(ExpressionEncoder.scala:244) 
    at org.apache.spark.sql.Dataset.<init>(Dataset.scala:210) 
    at org.apache.spark.sql.Dataset.<init>(Dataset.scala:167) 
    at org.apache.spark.sql.Dataset$.apply(Dataset.scala:59) 
    at org.apache.spark.sql.Dataset.as(Dataset.scala:359) 

I can confirm that both my input DataFrame.schema and my encoder.schema have DecimalType(38,0) precision. I have also removed any import spark.implicits._ to confirm that the DataFrame methods are using my custom encoder.
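The analyzer message itself suggests one way out: add an explicit cast to the input data so the column already matches the decimal(38,18) that the built-in java.math.BigDecimal encoder expects. A minimal sketch of that workaround (the function name is mine, and note that decimal(38,18) leaves only 20 integer digits, which is precisely the range I am trying not to give up):

    import org.apache.spark.sql.{DataFrame, Dataset, Encoders}
    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.types.DecimalType

    case class BigId(id: java.math.BigDecimal)

    // Widen the column to the decimal(38,18) that the default java.math.BigDecimal
    // encoder expects, so the analyzer no longer sees a potentially truncating up cast.
    // Caveat: decimal(38,18) can only hold 20 integer digits, so very large ids overflow.
    def asBigIds(df: DataFrame): Dataset[BigId] =
      df.select(col("id").cast(DecimalType(38, 18)).as("id"))
        .as[BigId](Encoders.product[BigId])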

At this point, the simplest option seems to be to just pass the id around as a String. That seems wasteful.

Answer


While I admire your initiative in defining a custom encoder, it is unnecessary. Your value is an ID, not something you intend to use as a number. In other words, you will not be doing calculations with it. You are simply converting a String into a BigId for a perceived optimization.

As the legendary Donald Knuth once wrote: "Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil."

So address efficiency problems when they actually occur. Right now you have a solution in search of a problem, and not even a working solution at that, after significant time spent that could have gone into the quality of your analysis.

As for the efficiency of using String in general, rely on the Tungsten optimizations the Spark team has worked so hard on under the hood, and keep your eye on the ball.
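If it helps, a minimal sketch of what that looks like (df stands in for your input DataFrame with its decimal(38,0) id column; the names are illustrative):

    import org.apache.spark.sql.{DataFrame, Dataset, Encoders}
    import org.apache.spark.sql.functions.col

    case class BigId(id: String)

    // Keep the id as a String end to end: cast the decimal column once and let the
    // built-in String encoder (and Tungsten) do the rest.
    def asStringIds(df: DataFrame): Dataset[BigId] =
      df.select(col("id").cast("string").as("id"))
        .as[BigId](Encoders.product[BigId])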


Indeed. When I started, storing the id as a BigInt did not seem like a hard problem; it appeared to be supported by Spark. In fact, it works as long as I stick to untyped transformations (basically the DataFrame API). It also seems to be [not easy](https://issues.apache.org/jira/browse/SPARK-18484) to override the default precision and scale. –
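For instance, an untyped aggregation over the decimal(38,0) column goes through without ever asking for an encoder (a sketch; df is the input DataFrame):

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{col, countDistinct}

    // Untyped operations never need an encoder for the id column,
    // so the decimal(38,0) type passes through untouched.
    def countIds(df: DataFrame): Long =
      df.agg(countDistinct(col("id"))).first().getLong(0)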


You are still focusing on how difficult the task is, which is the wrong focus. My point is that even if it were easy, it would not be necessary. It would be a useful exercise if you needed a number rather than a 'String', as in the JIRA issue you cite. Meanwhile, don't you think that if Spark needed to support 'BigInt', the thousands of people who have used Spark for analytics over the years would have needed it by now and it would already be there? And doesn't representing an ID as a 'BigDecimal' strike you as a code smell? – Vidya