
Pig - JOIN not working

I have a problem with a JOIN in Pig. I'll start by giving you some background. Here is my code:

-- START file loading 
start_file = LOAD 'dir/start_file.csv' USING PigStorage(';') as (PARTRANGE:chararray,  COD_IPUSER:chararray); 

-- trim 
A = FOREACH start_file GENERATE TRIM(PARTRANGE) AS PARTRANGE, TRIM(COD_IPUSER) AS COD_IPUSER; 

dump A; 

This gives the following output:

(79.92.147.88,20140310) 
(79.92.147.88,20140310) 
(109.31.67.3,20140310) 
(109.31.67.3,20140310) 
(109.7.229.143,20140310) 
(109.8.114.133,20140310) 
(77.198.79.99,20140310) 
(77.200.174.171,20140310) 
(77.200.174.171,20140310) 
(109.17.117.212,20140310) 

Then I load the other file:

-- Load the Hadopi search file 
file2 = LOAD 'dir/file2.csv' USING PigStorage(';') as (IP_RECHERCHEE:chararray, DATE_HADO:chararray); 

dump file2; 

The output looks like this:

(2014/03/10 00:00:00,79.92.147.88) 
(2014/03/10 00:00:01,79.92.147.88) 
(2014/03/10 00:00:00,192.168.2.67) 

Now I want to do a left outer join. Here is the code:

result = JOIN file2 by IP_RECHERCHEE LEFT OUTER, A by COD_IPUSER; 
dump result; 

The output looks like this:

(2014/03/10 00:00:00,79.92.147.88,,) 
(2014/03/10 00:00:00,192.168.2.67,,) 
(2014/03/10 00:00:01,79.92.147.88,,) 

All of the records from file2 are there, which is good, but none of the records from start_file show up. It's as if the join silently failed.

Do you see where the problem is?

Thanks.

Answers

Answer (score 2)

You have mislabeled your fields in file2. You declared the first field as the IP and the second as the date, but as the dump shows, it is the other way around. Try FOREACH file2 GENERATE IP_RECHERCHEE and you will see which field you are actually joining on.
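
A minimal sketch of that check (the alias name check is just for illustration):

-- Project only the field the schema claims is the IP, then dump it 
check = FOREACH file2 GENERATE IP_RECHERCHEE; 
dump check; 
-- With the current schema this prints the dates (2014/03/10 00:00:00, ...), 
-- not the IP addresses, which confirms the columns are swapped. 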

Answer (score 1)

The result is as expected. You asked for a LEFT OUTER JOIN, which looks for matches between the IP_RECHERCHEE field of file2 and the COD_IPUSER field of A. 
Since there are no matches, it returns every row of file2 and fills in nulls for the fields of A. 
Obviously 2014/03/10 00:00:00 != 20140310.
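
If you want to confirm that the two key sets really never overlap, one quick sanity check (the aliases below are hypothetical) is to project both key columns and inner-join them; an empty result means no row would ever match:

keys_file2 = FOREACH file2 GENERATE IP_RECHERCHEE AS k; 
keys_A = FOREACH A GENERATE COD_IPUSER AS k; 
overlap = JOIN keys_file2 BY k, keys_A BY k; 
dump overlap;   -- empty with the mislabeled schema: dates never equal IPs 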

Answer (score 1)

Your field names are wrong, so you are joining on the wrong fields. It looks like you want to join on the IP address. Load the files like this:

start_file = LOAD 'dir/start_file.csv' USING PigStorage(';') as (IP:chararray, PARTRANGE:chararray); 

A = FOREACH start_file GENERATE TRIM(IP) AS IP, TRIM(PARTRANGE) AS PARTRANGE; 

file2 = LOAD 'dir/file2.csv' USING PigStorage(';') as (DATE_HADO:chararray, IP:chararray); 
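
Then, presumably, the same left outer join as before, now on the relabeled IP fields (a sketch; both relations simply expose a field named IP here):

result = JOIN file2 BY IP LEFT OUTER, A BY IP; 
dump result; 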

With that, I get this:

(2014/03/10 00:00:00,192.168.2.67,,) 
(2014/03/10 00:00:00,79.92.147.88,79.92.147.88,20140310) 
(2014/03/10 00:00:00,79.92.147.88,79.92.147.88,20140310) 
(2014/03/10 00:00:01,79.92.147.88,79.92.147.88,20140310) 
(2014/03/10 00:00:01,79.92.147.88,79.92.147.88,20140310)