Joining DataFrames in Spark with Java

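First of all, thank you for reading my question.

My problem is the following: in Spark with Java, I load the data from two CSV files into two DataFrames.

These DataFrames hold the following information.

DataFrame airport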

Id | Name | City 
----------------------- 
1 | Barajas | Madrid

DataFrame airport_city_state

City | state 
---------------- 
Madrid | España

I want to join these two DataFrames so that the result looks like this:

DataFrame result

Id | Name | City | state 
-------------------------- 
1 | Barajas | Madrid | España

where dfairport.city = dfairport_city_state.city

But I cannot work out the syntax to do the join correctly. Here is some of the code showing how I created the variables:

// Load the CSV files; the loader must be told that there is a header and which delimiter is used
Dataset<Row> dfairport = Load.Csv(sqlContext, data_airport);
Dataset<Row> dfairport_city_state = Load.Csv(sqlContext, data_airport_city_state);


// Rename the columns of the CSV DataFrame to match the columns in the database
// Once the names match, we can insert them
// Note: withColumnRenamed returns a new Dataset; unless the result is assigned, the rename has no effect
dfairport
    .withColumnRenamed("leg_key", "id")
    .withColumnRenamed("leg_name", "name")
    .withColumnRenamed("leg_city", "city");

dfairport_city_state
    .withColumnRenamed("city", "ciudad")
    .withColumnRenamed("state", "estado");

Source

2017-03-26 Alejandro Reina

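First of all, thank you very much for your reply.

I have tried both of the solutions, but neither works; I get the following error: The method dfairport_city_state(String) is undefined for the type ETL_Airport

I cannot access a specific column of the DataFrame to join on.

Edit: I have now managed to do the join. I am posting it here in case it helps someone else:

Thanks for everything, and regards.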

// Join the tables on the city they share
Dataset<Row> joined = dfairport.join(dfairport_city_state, dfairport.col("leg_city").equalTo(dfairport_city_state.col("city")));
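The join above keeps both city columns, whereas the desired result has only one. The following is a minimal sketch of one way to get there, assuming Spark 2.x or later on the classpath and the column names from the tables in the question (Id, Name, City, state). The class name AirportJoinSketch, the method joinOnCity and the in-memory sample rows are hypothetical and stand in for the CSV loading.

```java
import java.util.Arrays;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class AirportJoinSketch {

    // Inner-joins airports to their state on City, then drops the
    // right-hand City column so that only one City column remains.
    static Dataset<Row> joinOnCity(Dataset<Row> airports, Dataset<Row> cityStates) {
        return airports
            .join(cityStates, airports.col("City").equalTo(cityStates.col("City")))
            .drop(cityStates.col("City"));
    }

    public static void main(String[] args) {
        // Local session for illustration only (assumption: no cluster needed).
        SparkSession spark = SparkSession.builder()
            .appName("AirportJoinSketch")
            .master("local[1]")
            .getOrCreate();

        // Sample rows copied from the tables in the question.
        StructType airportSchema = new StructType()
            .add("Id", DataTypes.IntegerType)
            .add("Name", DataTypes.StringType)
            .add("City", DataTypes.StringType);
        Dataset<Row> airports = spark.createDataFrame(
            Arrays.asList(RowFactory.create(1, "Barajas", "Madrid")), airportSchema);

        StructType cityStateSchema = new StructType()
            .add("City", DataTypes.StringType)
            .add("state", DataTypes.StringType);
        Dataset<Row> cityStates = spark.createDataFrame(
            Arrays.asList(RowFactory.create("Madrid", "España")), cityStateSchema);

        // Resulting columns: Id, Name, City, state
        joinOnCity(airports, cityStates).show();

        spark.stop();
    }
}
```

When both DataFrames use the same name for the key column, airports.join(cityStates, "City") is a shorter alternative: it joins on that column and keeps a single copy of it, although the key column then comes first in the result.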


Source

2017-03-27 10:26:41

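You can use the join method with a column name to join two DataFrames, for example:

Dataset<Row> dfairport = Load.Csv(sqlContext, data_airport);
Dataset<Row> dfairport_city_state = Load.Csv(sqlContext, data_airport_city_state);

// Note: dfairport_city_state("City") is Scala syntax; in Java a column is obtained with dfairport_city_state.col("City")
Dataset<Row> joined = dfairport.join(dfairport_city_state, dfairport_city_state("City"));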

There is also an overloaded version that lets you specify the join type as a third argument, for example:

Dataset<Row> joined = dfairport.join(dfairport_city_state, dfairport_city_state("City"), "left_outer");

Here is more on joins.
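To make the effect of the join type argument concrete, here is a small sketch in plain Java collections, not Spark, that mimics the difference between an inner join and a "left_outer" join on the city column. The class JoinTypeSketch, its join method and the second airport row are hypothetical. The map lookup also assumes each city appears at most once in the right-hand table, which a real join does not require.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class JoinTypeSketch {

    // Joins airport rows {id, name, city} against a city -> state lookup.
    // keepUnmatched = false mimics an inner join;
    // keepUnmatched = true mimics a left outer join (missing state is null).
    static List<String> join(List<String[]> airports,
                             Map<String, String> stateByCity,
                             boolean keepUnmatched) {
        List<String> out = new ArrayList<>();
        for (String[] airport : airports) {
            String state = stateByCity.get(airport[2]);
            if (state != null || keepUnmatched) {
                out.add(airport[0] + " | " + airport[1] + " | " + airport[2] + " | " + state);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> airports = new ArrayList<>();
        airports.add(new String[] {"1", "Barajas", "Madrid"});
        // Hypothetical row whose city has no entry on the right-hand side.
        airports.add(new String[] {"2", "Heathrow", "London"});

        Map<String, String> stateByCity = new LinkedHashMap<>();
        stateByCity.put("Madrid", "España");

        // Inner: only the Madrid row survives.
        System.out.println(join(airports, stateByCity, false));
        // Left outer: the London row is kept with a null state.
        System.out.println(join(airports, stateByCity, true));
    }
}
```

In Spark terms, the first call corresponds to the default join type and the second to passing "left_outer", where airports without a matching city are kept and their state column is null.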

Source

2017-03-26 20:07:13
