2014-01-13

Hive collect_set crashes query

I have the following table:

hive> describe tv_counter_stats; 
OK 
day  string 
event string 
query_id  string 
userid string 
headers  string 

And I want to run the following query:

hive -e 'SELECT 
    day, 
    event, 
    query_id, 
    COUNT(1) AS count, 
    COLLECT_SET(userid) 
FROM 
    tv_counter_stats 
GROUP BY 
    day, 
    event, 
    query_id;' > counter_stats_data.csv 

However, this query fails, while the following query works fine:

hive -e 'SELECT 
    day, 
    event, 
    query_id, 
    COUNT(1) AS count 
FROM 
    tv_counter_stats 
GROUP BY 
    day, 
    event, 
    query_id;' > counter_stats_data.csv 

i.e. the same query with the collect_set call removed. So my question: does anyone know why collect_set might fail in this case?

UPDATE: The error message says:

Diagnostic Messages for this Task: 

FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask 
MapReduce Jobs Launched: 
Job 0: Map: 3 Reduce: 1 Cumulative CPU: 10.49 sec HDFS Read: 109136387 HDFS Write: 0 FAIL 
Total MapReduce CPU Time Spent: 10 seconds 490 msec 

java.lang.Throwable: Child Error 
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:250) 
Caused by: java.io.IOException: Task process exit with nonzero status of 1. 
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:237) 

Error: GC overhead limit exceeded 
java.lang.Throwable: Child Error 
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:250) 
Caused by: java.io.IOException: Task process exit with nonzero status of 1. 
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:237) 

Error: GC overhead limit exceeded 

UPDATE 2: I changed the query so that it now looks like this:

hive -e ' 
SET mapred.child.java.opts="-server -Xmx1g -XX:+UseConcMarkSweepGC"; 
SELECT 
    day, 
    event, 
    query_id, 
    COUNT(1) AS count, 
    COLLECT_SET(userid) 
FROM 
    tv_counter_stats 
GROUP BY 
    day, 
    event, 
    query_id;' > counter_stats_data.csv 

However, I then get the following error:

Diagnostic Messages for this Task: 
java.lang.Throwable: Child Error 
     at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:250) 
Caused by: java.io.IOException: Task process exit with nonzero status of 1. 
     at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:237) 


FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask 
MapReduce Jobs Launched: 
Job 0: Map: 3 Reduce: 1 HDFS Read: 0 HDFS Write: 0 FAIL 
Total MapReduce CPU Time Spent: 0 msec 
Could you add the failure message? –

OK, I added the error message – toom

Answers

This is probably a memory problem, since collect_set aggregates data in memory.

Try increasing the heap size and enabling the concurrent GC, e.g. by setting Hadoop's mapred.child.java.opts to -Xmx1g -XX:+UseConcMarkSweepGC.

This answer has more information on the "GC overhead limit exceeded" error.

Thx for the answer. I updated my question (see update 2) – toom

I had the exact same problem and came across this question, so I figured I'd share the solution I found.

The underlying problem is most likely that Hive tries to do the aggregation on the mapper side, and the heuristics it uses to manage the in-memory hash maps are thrown off by "wide and shallow" data, i.e. in your case, data with very few user_id values per day/event/query_id group.

I found an article that explains various ways to address this, but most of them are just optimizations over the full nuclear option: disabling mapper-side aggregation entirely.

Using set hive.map.aggr = false; should do the trick.
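For reference, applied to the query from the question it would look like this (a sketch; hive.map.aggr is a standard Hive setting, enabled by default in most versions):

```sql
-- Disable map-side aggregation entirely, so the GROUP BY is done
-- only in the reducers and no per-mapper hash table is built.
SET hive.map.aggr = false;

SELECT
    day,
    event,
    query_id,
    COUNT(1) AS count,
    COLLECT_SET(userid)
FROM
    tv_counter_stats
GROUP BY
    day,
    event,
    query_id;
```

If disabling map-side aggregation costs too much performance, a less drastic option is to rein in the map-side hash table instead, e.g. by lowering hive.map.aggr.hash.percentmemory (the fraction of the heap the hash table may use; the default varies by Hive version), before turning aggregation off completely.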