2013-06-29 42 views

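Pig Latin (filtering a 2nd data source in a foreach loop)

I have 2 data sources. One contains a list of API calls and the other contains all the related authentication events. Each API call can have multiple auth events, and I want to find the auth event that:
a) contains the same "identifier" as the API call
b) occurred within one second after the API call
c) is the closest to the API call, after the above filtering.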

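I had planned to loop through each ApiCall event in a foreach loop and then use filter statements on authEvents to find the correct one. However, that does not appear to be possible (see "USING Filter in a Nested FOREACH in PIG").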

Would anyone be able to suggest other ways to achieve this? If it helps, here is the Pig script I tried to use:

apiRequests = LOAD '/Documents/ApiRequests.txt' AS (api_fileName:chararray, api_requestTime:long, api_timeFromLog:chararray, api_call:chararray, api_leadString:chararray, api_xmlPayload:chararray, api_sourceIp:chararray, api_username:chararray, api_identifier:chararray); 
authEvents = LOAD '/Documents/AuthEvents.txt' AS (auth_fileName:chararray, auth_requestTime:long, auth_timeFromLog:chararray, auth_call:chararray, auth_leadString:chararray, auth_xmlPayload:chararray, auth_sourceIp:chararray, auth_username:chararray, auth_identifier:chararray); 
specificApiCall = FILTER apiRequests BY api_call == 'CSGetUser';     -- Get all events for this specific call 
match = foreach specificApiCall {            -- Now try to get the closest matching auth event 
     filtered1 = filter authEvents by auth_identifier == api_identifier;  -- Only use auth events that have the same identifier (this will return several) 
     filtered2 = filter filtered1 by (auth_requestTime-api_requestTime)<1000; -- Further refine by using auth events within a second of the api call's time 
     sorted = order filtered2 by auth_requestTime;       -- Get the auth event that's closest to the api call 
     limited = limit sorted 1; 
     generate limited; 
     }; 
dump match; 

Answer


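A nested FOREACH is not meant for working with a second relation while you iterate over the first. It is for the case where your relation contains a bag and you want to work with that bag as if it were its own relation. You cannot use apiRequests and authEvents at the same time unless you first do some kind of join or grouping that puts all the information you need into a single relation.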
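
To illustrate what a nested FOREACH is suited for, here is a minimal sketch that groups authEvents (as loaded in the question's script) and then works on the resulting bag. The aliases authById, earliestAuth, by_time and first_one are made up for this example.

```pig
-- GROUP produces one row per identifier, holding a bag that is named authEvents
authById = GROUP authEvents BY auth_identifier;
earliestAuth = FOREACH authById {
    -- inside the block, the bag can be ordered and limited like a relation
    by_time = ORDER authEvents BY auth_requestTime;
    first_one = LIMIT by_time 1;
    GENERATE FLATTEN(first_one);
};
```

Everything the nested block touches lives in the single relation authById, which is why this works where the question's script does not.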

Conceptually, your task works well with a JOIN and a FILTER, if you don't need to restrict yourself to a single auth event:

allPairs = JOIN specificApiCall BY api_identifier, authEvents BY auth_identifier; 
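-- keep only auth events that occur after the API call and less than one second later
match = FILTER allPairs BY (auth_requestTime-api_requestTime)>=0 AND (auth_requestTime-api_requestTime)<1000; 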

Now that all the information is together, you can do a GROUP match BY api_identifier followed by a nested FOREACH to pick out a single event.
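
The GROUP plus nested FOREACH step just described might look like the sketch below. It reuses specificApiCall and authEvents from the question's script; the aliases pairs, inWindow, perCall and closest are invented here. Grouping on the pair (api_identifier, api_requestTime) rather than on api_identifier alone is an assumption, made so that two API calls sharing an identifier each keep their own match.

```pig
pairs = JOIN specificApiCall BY api_identifier, authEvents BY auth_identifier;
-- keep auth events that happen after the API call and less than one second later
inWindow = FILTER pairs BY (auth_requestTime - api_requestTime) >= 0L
                       AND (auth_requestTime - api_requestTime) < 1000L;
-- assumption: identifier plus request time is enough to tell API calls apart
perCall = GROUP inWindow BY (api_identifier, api_requestTime);
closest = FOREACH perCall {
    by_time = ORDER inWindow BY auth_requestTime;   -- earliest remaining auth event first
    nearest = LIMIT by_time 1;
    GENERATE FLATTEN(nearest);
};
```

Because the window filter runs before the LIMIT, the row that survives for each API call is the nearest auth event inside the window, not just the earliest event for that identifier.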

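However, if you use the COGROUP operator, you can do this in one step. COGROUP is similar to JOIN but without the cross product: you get two bags, each containing the grouped records from one of the relations. Use this to pick out the nearest auth event: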

cogrp = COGROUP specificApiCall BY api_identifier, authEvents BY auth_identifier; 
singleAuth = FOREACH cogrp { 
    auth_sorted = ORDER authEvents BY auth_requestTime; 
    auth_1 = LIMIT auth_sorted 1; 
    GENERATE FLATTEN(specificApiCall), FLATTEN(auth_1); 
    }; 

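Then FILTER to keep only those within 1 second. Note that the LIMIT above runs before this filter, so the event being tested is the earliest auth event for each identifier: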

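-- keep only auth events that occur after the API call and less than one second later
match = FILTER singleAuth BY (auth_requestTime-api_requestTime)>=0 AND (auth_requestTime-api_requestTime)<1000; 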

Thanks 小熊, I used COGROUP and it worked a treat. You're the best! – Hinchy