2013-01-31 136 views

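Hive: combining column values based on a condition

I would like to know whether it is possible to combine column values based on a condition. Let me explain.

Let's say my data looks like this: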

Id name offset 
1 Jan 100 
2 Janssen 104 
3 Klaas 150 
4 Jan 160 
5 Janssen 164 

My output should look like this:

Id fullname offsets 
1 Jan Janssen [ 100, 160 ] 

I want the name values of two rows to be merged when the offsets of the two rows are no more than 1 character apart.
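
The merge rule can be sketched in plain Java as follows. This is only an illustration under stated assumptions: the sample table has no length column, so the sketch assumes a name ends at its offset plus its string length, and it reads "no more than 1 character apart" as a gap of 0 or 1 characters. The class `AdjacentNames` and the method `merge` are hypothetical names, and unlike a pairwise comparison this sketch also chains three or more adjacent names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class AdjacentNames {
    // Hypothetical helper: merges adjacent names and collects the start
    // offset of every occurrence of each (possibly merged) full name.
    static Map<String, List<Integer>> merge(String[] names, int[] offsets) {
        Map<String, List<Integer>> result = new LinkedHashMap<>();
        int i = 0;
        while (i < names.length) {
            String fullName = names[i];
            int start = offsets[i];
            // Assumption: a name ends at offset + name length.
            int end = offsets[i] + names[i].length();
            // Assumption: "adjacent" means a gap of 0 or 1 characters.
            while (i + 1 < names.length
                    && offsets[i + 1] - end >= 0
                    && offsets[i + 1] - end <= 1) {
                i++;
                fullName = fullName + " " + names[i];
                end = offsets[i] + names[i].length();
            }
            result.computeIfAbsent(fullName, k -> new ArrayList<>()).add(start);
            i++;
        }
        return result;
    }

    public static void main(String[] args) {
        String[] names = {"Jan", "Janssen", "Klaas", "Jan", "Janssen"};
        int[] offsets = {100, 104, 150, 160, 164};
        // prints {Jan Janssen=[100, 160], Klaas=[150]}
        System.out.println(merge(names, offsets));
    }
}
```

Names that have no adjacent neighbour (here `Klaas`) are kept as single-name entries; they can be filtered out if only merged names are wanted.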

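My question is whether this kind of data manipulation is possible and, if so, whether someone could share some code and an explanation.

Please be gentle, but this small piece of code returns roughly what I want: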

    import java.io.File;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Scanner;

    public class Main {
        public static void main(String[] args) {
            List<String> persons = new ArrayList<String>();

            // Sample lines from entities.txt
            // USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Berkowitz,PERSON,9,10660
            // USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Marottoli,PERSON,9,10685
            File file = new File("entities.txt");

            // Read the file line by line with a Scanner and compare each
            // line with the one before it. try-with-resources closes the Scanner.
            try (Scanner scanner = new Scanner(file)) {
                String previous = null;
                while (scanner.hasNextLine()) {
                    String current = scanner.nextLine();
                    if (previous == null) {
                        // First line of a new comparison: nothing to compare with yet.
                        previous = current;
                        continue;
                    }
                    String[] prev = previous.split(",");
                    String[] curr = current.split(",");
                    // x = length + offset of the previous entity, y = offset of the current one.
                    int x = Integer.parseInt(prev[3].trim()) + Integer.parseInt(prev[4].trim());
                    int y = Integer.parseInt(curr[4].trim());
                    if (y - x == 1) {
                        // Exactly one character apart: merge both names and move past both lines.
                        persons.add(prev[1] + " " + curr[1]);
                        previous = scanner.hasNextLine() ? scanner.nextLine() : null;
                    } else {
                        persons.add(prev[1]);
                        previous = current;
                    }
                }
                // Do not drop the final line when it was not merged.
                if (previous != null) {
                    persons.add(previous.split(",")[1]);
                }
            } catch (Exception e) {
                e.printStackTrace();
            }

            for (String person : persons) {
                System.out.println(person);
            }
        }
    }

Working sample data:

USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Richard,PERSON,7,2732 
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Marottoli,PERSON,9,2740 
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Marottoli,PERSON,9,2756 
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Marottoli,PERSON,9,3093 
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Marottoli,PERSON,9,3195 
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Berkowitz,PERSON,9,3220 
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Berkowitz,PERSON,9,10660 
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Marottoli,PERSON,9,10685 
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Lea,PERSON,3,10858 
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Lea,PERSON,3,11063 
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Ken,PERSON,3,11186 
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Marottoli,PERSON,9,11234 
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Berkowitz,PERSON,9,17073 
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Lea,PERSON,3,17095 
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Stephanie,PERSON,9,17330 
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Putt,PERSON,4,17340 

which produces this output:

Richard Marottoli 
Marottoli 
Marottoli 
Marottoli 
Berkowitz 
Berkowitz 
Marottoli 
Lea 
Lea 
Ken 
Marottoli 
Berkowitz 
Lea 
Stephanie Putt 

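Kind regards

I'm a bit confused about how the output is derived, but I think this is very similar to [this question](http://stackoverflow.com/questions/14028796/reduce-a-set-of-rows-hive-to-another-set-of-rows), which I answered using a custom map/reduce in Hive. You just need to provide an appropriate reduce script. – libjack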

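I have updated my question with a piece of Java code, sample data, and the output. I would like to convert my Java code to Hive code. Any idea whether this is possible? – Tinuz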

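Sorry, your additional code still doesn't make clear what you are trying to accomplish. The newer code and data look as if you want to load a table into Hive and extract a column (which is certainly possible), whereas the earlier example combines rows in some way. – libjack

Answer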

1

Create the table over your data with the query below:

    drop table if exists default.stack;
    create external table default.stack
    (
        junk string,
        name string,
        cat  string,
        len  int,
        off  int
    )
    ROW FORMAT DELIMITED
    FIELDS terminated by ','
    STORED AS INPUTFORMAT
        'org.apache.hadoop.mapred.TextInputFormat'
    OUTPUTFORMAT
        'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    location 'hdfs://nameservice1/....';

Then use the following query to get your desired output:

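    -- For each start offset, keep the longest (merged) name.
    select max(name), off
    from (
        select case when b.name is not null
                    then concat(b.name, " ", a.name)  -- a row ends right before this one: merge
                    else a.name
               end as name,
               case when b.off1 is not null
                    then b.off1                       -- use the start offset of the first name
                    else a.off
               end as off
        from default.stack a
        left outer join (
            -- len + off + 1 is the offset at which an adjacent next name would start
            select name,
                   len + off + 1 as off,
                   off as off1
            from default.stack
        ) b
        on a.off = b.off
    ) merged
    group by off
    order by off;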

I have tested it, and it produces the result you want.
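
To see why grouping by the start offset and taking `max(name)` yields the merged name, the query's logic can be mimicked in memory. The following Java sketch is an illustration only, not Hive code: the class `JoinSketch` and the method `combine` are hypothetical names, and the hash lookup assumes at most one row ends directly before any given offset, whereas a real join could match several.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class JoinSketch {
    // Mimics the query on rows (name, len, off); returns start offset -> max(name).
    static TreeMap<Integer, String> combine(String[] names, int[] lens, int[] offs) {
        // Subquery b: index each row by len + off + 1, the offset at which
        // an adjacent next name would start (assumes one row per key).
        Map<Integer, Integer> rowByNextStart = new HashMap<>();
        for (int i = 0; i < names.length; i++) {
            rowByNextStart.put(lens[i] + offs[i] + 1, i);
        }

        TreeMap<Integer, String> byOffset = new TreeMap<>();
        for (int i = 0; i < names.length; i++) {
            // Left outer join on a.off = b.off.
            Integer b = rowByNextStart.get(offs[i]);
            String name = (b != null) ? names[b] + " " + names[i] : names[i];
            int off = (b != null) ? offs[b] : offs[i];
            // group by off, keeping max(name) in string order.
            byOffset.merge(off, name, (x, y) -> x.compareTo(y) >= 0 ? x : y);
        }
        return byOffset; // TreeMap iterates in ascending key order, like "order by off"
    }

    public static void main(String[] args) {
        String[] names = {"Richard", "Marottoli", "Marottoli", "Stephanie", "Putt"};
        int[] lens = {7, 9, 9, 9, 4};
        int[] offs = {2732, 2740, 2756, 17330, 17340};
        // prints {2732=Richard Marottoli, 2756=Marottoli, 17330=Stephanie Putt}
        System.out.println(combine(names, lens, offs));
    }
}
```

The first name of a pair appears twice under the same start offset, once alone ("Richard") and once merged ("Richard Marottoli"). Because the merged string extends the single name, it sorts higher, so `max(name)` keeps the merged form.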