2016-07-18

I want to sum the RW column for each port per hour in an Apache Pig script

Time     ID  Name   RW
-------- --- ------ ----
14:57:01 000 Port0  1340
14:57:01 001 Port1    13

14:58:01 000 Port0   864
14:58:01 001 Port1    36

14:59:01 000 Port0  1394
14:59:01 001 Port1    22

15:57:01 000 Port0  1340
15:57:01 001 Port1    13

15:58:01 000 Port0   864
15:58:01 001 Port1    36

15:59:01 000 Port0  1394
15:59:01 001 Port1    22
.
.
.

20:57:01 000 Port0  1340
20:57:01 001 Port1    13

20:58:01 000 Port0   864
20:58:01 001 Port1    36

20:59:01 000 Port0  1394
20:59:01 001 Port1    22

My script is:

data = LOAD 'hdfs:/data/data.txt' USING PigStorage(',') AS (time:chararray, id:chararray, name:chararray, read:int, write:int, rw:int); 
runs = FOREACH data GENERATE time, name, rw; 

How can I do this?


Can you show what you have tried? – mhatch

Answer


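You have to generate a new column that extracts the hour from the time column, then group by hour and port name, and then compute the sum of rw for each group.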

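data = LOAD 'hdfs:/data/data.txt' USING PigStorage(',') AS (time:chararray, id:chararray, name:chararray, read:int, write:int, rw:int);
-- time is loaded as chararray, so convert it to datetime before calling GetHour (assumes 'HH:mm:ss' values)
runs = FOREACH data GENERATE GetHour(ToDate(time, 'HH:mm:ss')) AS hour, name, rw;
grouped = GROUP runs BY (hour, name);
-- the bag inside each group is named after the grouped relation, runs
port_total = FOREACH grouped GENERATE FLATTEN(group) AS (hour, name), SUM(runs.rw) AS rw_total;
DUMP port_total;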

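I got: 'Could not infer the matching function for org.apache.pig.builtin.GetHour as multiple or none of them fit. Please use an explicit cast.' Does the time in my data file have to be in a 'datetime' format such as 'yyyy/MM/dd HH:mm:ss'? – agamil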


@agamil You are loading time as a chararray. Cast it to datetime. I have edited the answer. –


@agamil timestamp –
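
On the format question raised in the comments: if the time column only ever holds zero-padded 'HH:mm:ss' strings like the sample rows, one way to avoid datetime parsing entirely is to take the first two characters as the hour. The following is a minimal sketch under that assumption, and it also assumes the file really is comma-separated with the six fields declared in the LOAD statement. The alias rw_total is a name introduced here for illustration.

```pig
-- Sketch: derive the hour without datetime parsing.
-- Assumes every time value is a zero-padded 'HH:mm:ss' string, as in the sample rows.
data = LOAD 'hdfs:/data/data.txt' USING PigStorage(',')
       AS (time:chararray, id:chararray, name:chararray, read:int, write:int, rw:int);

-- Characters 0-1 of '14:57:01' are the hour; cast the substring to int.
-- If the file held full timestamps instead, GetHour(ToDate(time, 'yyyy/MM/dd HH:mm:ss'))
-- would be the datetime-based equivalent.
runs = FOREACH data GENERATE (int)SUBSTRING(time, 0, 2) AS hour, name, rw;

grouped = GROUP runs BY (hour, name);

-- Inside the FOREACH, the bag is named after the grouped relation (runs), not data.
port_total = FOREACH grouped GENERATE FLATTEN(group) AS (hour, name), SUM(runs.rw) AS rw_total;

DUMP port_total;
```

If the rw field carries the RW values shown in the question, hour 14 should total 3598 for Port0 (1340 + 864 + 1394) and 71 for Port1 (13 + 36 + 22). The order of the dumped tuples is not guaranteed.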