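How to implement a "cross join" in Spark?

We plan to move our Apache Pig code to the new Spark platform.

Pig has a "Bag/Tuple/Field" concept and behaves much like a relational database. Pig supports CROSS/INNER/OUTER joins.

For a CROSS JOIN, we can use alias = CROSS alias, alias [, alias …] [PARTITION BY partitioner] [PARALLEL n];

However, now that we are moving to Spark, I couldn't find any counterpart in the Spark API. Do you have any ideas?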

2014-07-21 Shawn Guo

It isn't ready yet, but Spork (Pig on Spark) is currently being built, so you may not need to change any of your code – aaronman

It's oneRDD.cartesian(anotherRDD).
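
For anyone porting from Pig, here is a minimal sketch of how that call might look end to end. It assumes a local SparkContext and made-up sample data; the object name CartesianSketch and the helper crossPairs are illustrative only and not part of Spark.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CartesianSketch {
  // Hypothetical helper: pairs every element of `left` with every element of `right`
  // using RDD.cartesian, then collects the pairs to the driver.
  def crossPairs(sc: SparkContext, left: Seq[Int], right: Seq[String]): Array[(Int, String)] = {
    val leftRdd = sc.parallelize(left)
    val rightRdd = sc.parallelize(right)
    leftRdd.cartesian(rightRdd).collect()
  }

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for this sketch.
    val conf = new SparkConf().setMaster("local[*]").setAppName("cartesian-sketch")
    val sc = new SparkContext(conf)
    try {
      val pairs = crossPairs(sc, Seq(1, 2, 3), Seq("a", "b"))
      println(pairs.length) // 3 * 2 = 6 pairs
      pairs.foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

The result has left.size * right.size elements, so this gets expensive quickly on large inputs.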

2014-07-21 10:32:11

Thanks, cartesian join is another name for cross join –

Here is the recommended version for Spark 2.x Datasets and DataFrames:

scala> val ds1 = spark.range(10) 
ds1: org.apache.spark.sql.Dataset[Long] = [id: bigint] 

scala> ds1.cache.count 
res1: Long = 10 

scala> val ds2 = spark.range(10) 
ds2: org.apache.spark.sql.Dataset[Long] = [id: bigint] 

scala> ds2.cache.count 
res2: Long = 10 

scala> val crossDS1DS2 = ds1.crossJoin(ds2) 
crossDS1DS2: org.apache.spark.sql.DataFrame = [id: bigint, id: bigint] 

scala> crossDS1DS2.count 
res3: Long = 100
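
Both columns of the result above are named id, which can be awkward to reference later. One possible workaround, sketched here on the assumption that renaming the inputs is acceptable, is to rename before joining. The names left_id and right_id, the object CrossJoinRenamed, and the helper crossRanges are made up for this example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object CrossJoinRenamed {
  // Hypothetical helper: cross joins two ranges after giving each column a distinct name.
  def crossRanges(spark: SparkSession, n: Long, m: Long): DataFrame = {
    val left = spark.range(n).toDF("left_id")
    val right = spark.range(m).toDF("right_id")
    left.crossJoin(right) // result has n * m rows
  }

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for this sketch.
    val spark = SparkSession.builder().master("local[*]").appName("crossjoin-renamed").getOrCreate()
    try {
      val crossed = crossRanges(spark, 10L, 10L)
      crossed.printSchema()
      println(crossed.count()) // 10 * 10 = 100
    } finally {
      spark.stop()
    }
  }
}
```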

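Alternatively, you can use the traditional join syntax with no join condition. In Spark 2.x this fails with the following error unless the spark.sql.crossJoin.enabled configuration option is set: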

scala> val crossDS1DS2 = ds1.join(ds2) 
crossDS1DS2: org.apache.spark.sql.DataFrame = [id: bigint, id: bigint] 

scala> crossDS1DS2.count 
org.apache.spark.sql.AnalysisException: Detected cartesian product for INNER join between logical plans 
... 
Join condition is missing or trivial. 
Use the CROSS JOIN syntax to allow cartesian products between these relations.;

To avoid the error when using the plain join syntax, enable cross joins in the configuration:

spark.conf.set("spark.sql.crossJoin.enabled", true)
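
The error message also points at the SQL CROSS JOIN syntax. A rough sketch of that route follows, assuming Spark 2.1 or later; as far as I know the explicit CROSS JOIN keyword is accepted there without changing the configuration. The view names t1 and t2, the column aliases, and the helper crossViaSql are chosen just for this example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object SqlCrossJoin {
  // Hypothetical helper: registers two temporary views and cross joins them in SQL.
  def crossViaSql(spark: SparkSession): DataFrame = {
    spark.range(10).createOrReplaceTempView("t1")
    spark.range(10).createOrReplaceTempView("t2")
    // Explicit CROSS JOIN, with aliases so the two id columns stay distinguishable.
    spark.sql("SELECT t1.id AS id1, t2.id AS id2 FROM t1 CROSS JOIN t2")
  }

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for this sketch.
    val spark = SparkSession.builder().master("local[*]").appName("sql-cross-join").getOrCreate()
    try {
      println(crossViaSql(spark).count()) // 10 * 10 = 100
    } finally {
      spark.stop()
    }
  }
}
```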

Related: spark.sql.crossJoin.enabled for Spark 2.x

2017-03-02 05:15:45 Garren
