2017-02-02 199 views

NOT IN clause in Pig

I am trying to do in Pig what this SQL does:

select * from A where A.ID NOT IN (select id from B)

sourcenew = LOAD 'hdfs://HADOOPMASTER:54310/DVTTest/Source.txt' USING PigStorage(',') as (ID:int,Name:chararray,FirstName:chararray ,LastName:chararray,Vertical_Name:chararray ,Vertical_ID:chararray,Gender:chararray,DOB:chararray,Degree_Percentage:chararray ,Salary:chararray,StateName:chararray); 
destnew = LOAD 'hdfs://HADOOPMASTER:54310/DVTTest/Destination.txt' USING PigStorage(',') as (ID:int,Name:chararray,FirstName:chararray ,LastName:chararray,Vertical_Name:chararray ,Vertical_ID:chararray,Gender:chararray,DOB:chararray,Degree_Percentage:chararray ,Salary:chararray,StateName:chararray); 
c= FOREACH destnew GENERATE ID; 
D=FILTER sourcenew BY NOT ID (c.ID); 
org.apache.pig.tools.pigscript.parser.ParseException: Encountered " <PATH> "D=FILTER "" at line 1, column 1. 
Was expecting one of: 
<EOF> 
"cat" ... 
"clear" ...<EOF> 

Any help resolving this error would be appreciated. I get it when the last line is executed.

Think about grouping the two relations by ID, then filtering for the groups that have no match. – 54l3d
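The grouping idea in the comment above could be sketched roughly as follows. This is only an illustration, assuming sourcenew and destnew are loaded exactly as in the question; the aliases grouped, unmatched and not_in_dest are hypothetical names. COGROUP produces one row per ID with a bag of matching rows from each relation, and the built-in IsEmpty keeps the IDs whose destnew bag is empty.

```pig
-- Assumes sourcenew and destnew were loaded as in the question.
-- One row per ID: (group, {sourcenew rows}, {destnew rows})
grouped = COGROUP sourcenew BY ID, destnew BY ID;

-- Keep only the IDs that never occur in destnew
unmatched = FILTER grouped BY IsEmpty(destnew);

-- Turn the bag of source rows back into ordinary rows
not_in_dest = FOREACH unmatched GENERATE FLATTEN(sourcenew);
DUMP not_in_dest;
```

Because the filter requires the destnew bag to be empty, every surviving group has at least one sourcenew row, so FLATTEN does not drop anything that should be kept.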

Answers

1

Use a LEFT OUTER JOIN and filter for nulls.

sourcenew = LOAD 'hdfs://HADOOPMASTER:54310/DVTTest/Source.txt' USING PigStorage(',') as (ID:int,Name:chararray,FirstName:chararray ,LastName:chararray,Vertical_Name:chararray ,Vertical_ID:chararray,Gender:chararray,DOB:chararray,Degree_Percentage:chararray ,Salary:chararray,StateName:chararray); 
destnew = LOAD 'hdfs://HADOOPMASTER:54310/DVTTest/Destination.txt' USING PigStorage(',') as (ID:int,Name:chararray,FirstName:chararray ,LastName:chararray,Vertical_Name:chararray ,Vertical_ID:chararray,Gender:chararray,DOB:chararray,Degree_Percentage:chararray ,Salary:chararray,StateName:chararray); 
c = FOREACH destnew GENERATE ID; 
d = JOIN sourcenew BY ID LEFT OUTER,destnew by ID; 
e = FILTER d BY destnew::ID IS NULL; -- after a JOIN, use relation::field; destnew.ID would be read as a scalar

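Note: I wrote a sample script with a couple of test files; the following is a working solution. In your case, check whether the data is being loaded correctly from your files.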

test1.txt

1 abc 
2 def 
3 ghi 
4 jkl 
5 mno 
6 pqr 
7 stu 
8 vwx 
1 abc 
2 def 
3 ghi 
4 jkl 
1 abc 
2 def 
3 ghi 
1 abc 
2 def 

test2.txt

1 
2 
3 
4 

Script

A = LOAD 'test1.txt' USING PigStorage('\t') AS (aid:int,name:chararray); 
B = LOAD 'test2.txt' USING PigStorage('\t') AS (bid:int); 
C = JOIN A BY aid LEFT OUTER,B BY bid; 
D = FILTER C BY bid is null; 
DUMP D; 

So, in the example above, records 5, 6, 7 and 8 should be in the result, because those IDs are not in test2.txt.
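After the filter, every surviving row still carries the bid column from B, which is always null. A possible follow-up, sketched here with the same test files and a hypothetical alias E, is to project only the columns that came from A:

```pig
A = LOAD 'test1.txt' USING PigStorage('\t') AS (aid:int, name:chararray);
B = LOAD 'test2.txt' USING PigStorage('\t') AS (bid:int);

-- Left outer join: rows of A with no match in B get a null bid
C = JOIN A BY aid LEFT OUTER, B BY bid;
D = FILTER C BY bid IS NULL;

-- Keep only A's columns; bid is null in every remaining row
E = FOREACH D GENERATE aid, name;
DUMP E;
```

This assumes the files really are tab-delimited, as PigStorage('\t') expects. If they are not, aid and bid load as null and the join result will look wrong.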

ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias d. Backend error: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Scalar has more than one row in the output. 1st : (1), 2nd :(2) (common cause: "JOIN" then "FOREACH ... GENERATE foo.bar" should be "foo::bar") @inquisitive_mind – Vickyster
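The hint at the end of that error message points at the likely cause: Pig reads relation.field as a scalar projection of the whole relation, which only works when that relation has exactly one row, whereas relation::field names a column in the joined schema. A minimal sketch of the difference, using the answer's test files (test1.txt and test2.txt, assumed tab-delimited):

```pig
A = LOAD 'test1.txt' USING PigStorage('\t') AS (aid:int, name:chararray);
B = LOAD 'test2.txt' USING PigStorage('\t') AS (bid:int);
C = JOIN A BY aid LEFT OUTER, B BY bid;

-- B.bid would be a scalar projection of relation B; with more than one
-- row in B it fails at runtime with "Scalar has more than one row":
-- D = FILTER C BY B.bid IS NULL;

-- B::bid refers to the bid column that the join brought in from B
D = FILTER C BY B::bid IS NULL;
DUMP D;
```

The same reading would apply to an expression such as sourcenew.ID == c.ID inside a FILTER, since both sides use the dot form on multi-row relations.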

I even tried d = FILTER sourcenew BY NOT(sourcenew.ID == c.ID); – Vickyster

@Vickyster, I have edited the answer and also included an example. Hope it helps. –