2013-02-04

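How to read Mahout-generated sequence files in Pig

How can I read sequence files generated by Mahout in Pig? I suspect there is a UDF for this, but I haven't been able to find one yet.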

Have you checked out the [Mahout support](https://github.com/kevinweil/elephant-bird/blob/master/mahout/src/main/java/com/twitter/elephantbird/pig/mahout/VectorWritableConverter.java) in [elephant-bird](https://github.com/kevinweil/elephant-bird/)?

Answer

I ended up using elephant-bird (v2.2.3) like this:

register '/usr/share/dse/mahout/mahout-core-0.6-job.jar'; 
register './elephant-bird-2.2.3.jar'; 

%declare SEQFILE_LOADER 'com.twitter.elephantbird.pig.load.SequenceFileLoader'; 
%declare LONG_CONVERTER 'com.twitter.elephantbird.pig.util.LongWritableConverter'; 
%declare INT_CONVERTER 'com.twitter.elephantbird.pig.util.IntWritableConverter'; 
%declare VECTOR_CONVERTER 'com.twitter.elephantbird.pig.mahout.VectorWritableConverter'; 
%declare TEXT_CONVERTER 'com.twitter.elephantbird.pig.util.TextConverter'; 

.... 

sets = LOAD '$INPUT_SETS' USING $SEQFILE_LOADER ('-c $INT_CONVERTER', '-c $VECTOR_CONVERTER') AS (thing_id:int, recommendations:chararray); 

...
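
To sanity-check the load before building on it, something like the following could be appended. This is only a sketch: it assumes the `sets` relation from the script above has been loaded successfully, and the alias `sample_sets` is a name made up here for illustration.

```pig
-- Print the schema Pig has recorded for the relation
-- (this reflects the AS clause of the LOAD statement above).
DESCRIBE sets;

-- Take a handful of records so the check stays cheap on large inputs.
-- 'sample_sets' is a hypothetical alias, not part of the original script.
sample_sets = LIMIT sets 5;

-- Print the sampled records to see what the converters actually produced.
DUMP sample_sets;
```

Since the script refers to `$INPUT_SETS`, the parameter has to be supplied when the script is run, for example with Pig's `-param INPUT_SETS=...` command-line option.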