Reading Hadoop SequenceFiles with Hive

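I have some mapped data from Common Crawl that I have stored in SequenceFile format. I have tried repeatedly to use this data "as is" with Hive so that I can query and sample it at various stages. But I always get the following error in my job output: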

LazySimpleSerDe: expects either BytesWritable or Text object!

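I even built a simple (small) dataset of [Text, LongWritable] records, but that fails as well. If I output the data to text format and then create a table on it, it works fine: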

hive> create external table page_urls_1346823845675 
    >  (pageurl string, xcount bigint) 
    >  location 's3://mybucket/text-parse/1346823845675/'; 
OK 
Time taken: 0.434 seconds 
hive> select * from page_urls_1346823845675 limit 10; 
OK 
http://0-italy.com/tag/package-deals 643 NULL 
http://011.hebiichigo.com/d63e83abff92df5f5913827798251276/d1ca3aaf52b41acd68ebb3bf69079bd1.html 9 NULL 
http://01fishing.com/fly-fishing-knots/ 3437 NULL 
http://01fishing.com/flyin-slab-creek/ 1005 NULL 
...

I tried using a custom InputFormat:

// My custom input class--very simple 
import org.apache.hadoop.io.LongWritable; 
import org.apache.hadoop.io.Text; 
import org.apache.hadoop.mapred.SequenceFileInputFormat; 
public class UrlXCountDataInputFormat extends 
    SequenceFileInputFormat<Text, LongWritable> { }

I then create the table with:

create external table page_urls_1346823845675_seq 
    (pageurl string, xcount bigint) 
    stored as inputformat 'my.package.io.UrlXCountDataInputFormat' 
    outputformat 'org.apache.hadoop.mapred.SequenceFileOutputFormat' 
    location 's3://mybucket/seq-parse/1346823845675/';

But I still get the same SerDe error.

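I'm sure there is something very basic that I'm missing here, but I can't seem to get it right. Also, I must be able to parse the SequenceFiles in place (i.e. I cannot convert my data to text), so I need to figure out the SequenceFile approach for future parts of my project.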

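Solution: As @Mark Grover pointed out below, the problem is that Hive ignores the key by default. With only one column (i.e. just the value), the SerDe was unable to map my second column.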

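The solution was to use a custom InputFormat that is a lot more complex than the one I used originally. I tracked down a link to a gist, from an answer about using the keys instead of the values, and then modified it to suit my needs: take the key and value from the internal SequenceFile.Reader and combine them into the final BytesWritable. I.e. something like this (from the custom reader, since that is where all the hard work happens):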

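// I used generics so I can reuse all of this with 
// other output files with just a small amount 
// of additional code ... 
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.util.ReflectionUtils;

public abstract class HiveKeyValueSequenceFileReader<K,V> implements RecordReader<K, BytesWritable> { 

    // Excerpt only: these field declarations are inferred from how next()
    // uses them; the constructor that sets them up and the remaining
    // RecordReader methods are not shown.
    protected SequenceFile.Reader in;
    protected Configuration conf;
    protected long end;
    protected boolean more = true;

    @SuppressWarnings("unchecked")
    public synchronized boolean next(K key, BytesWritable value) throws IOException { 
        if (!more) return false; 

        long pos = in.getPosition(); 
        // Instantiate the file's real value type; Hive only sees the combined BytesWritable.
        V trueValue = (V) ReflectionUtils.newInstance(in.getValueClass(), conf); 
        boolean remaining = in.next((Writable) key, (Writable) trueValue); 
        if (remaining) combineKeyValue(key, trueValue, value); 
        // Stop once we are past the end of the split and have seen a sync marker.
        if (pos >= end && in.syncSeen()) { 
            more = false; 
        } else { 
            more = remaining; 
        } 
        return more; 
    } 

    protected abstract void combineKeyValue(K key, V trueValue, BytesWritable newValue); 

} 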

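// from my final implementation 
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

public class UrlXCountDataReader extends HiveKeyValueSequenceFileReader<Text,LongWritable> { 
    @Override 
    protected void combineKeyValue(Text key, LongWritable trueValue, BytesWritable newValue) { 
        // Join key and value with Ctrl-A ('\001'), Hive's default field
        // delimiter, then copy the encoded bytes into the output value.
        StringBuilder builder = new StringBuilder(); 
        builder.append(key); 
        builder.append('\001'); 
        builder.append(trueValue.get()); 
        byte[] bytes = builder.toString().getBytes(StandardCharsets.UTF_8); 
        newValue.set(bytes, 0, bytes.length); 
    } 
}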

And with that, I get all of my columns!

http://0-italy.com/tag/package-deals 643 
http://011.hebiichigo.com/d63e83abff92df5f5913827798251276/d1ca3aaf52b41acd68ebb3bf69079bd1.html 9 
http://01fishing.com/fly-fishing-knots/ 3437 
http://01fishing.com/flyin-slab-creek/ 1005 
http://01fishing.com/pflueger-1195x-automatic-fly-reels/ 1999
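
The byte layout that combineKeyValue produces can be checked without Hadoop or Hive. The following is only a minimal sketch in plain Java: the class name CombinedRowSketch and its combine and split helpers are hypothetical, and split merely imitates a delimiter-based row parser rather than calling any Hive code. It assumes the table uses Hive's default field delimiter, Ctrl-A.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Hadoop or Hive.
public class CombinedRowSketch {
    // Assumption: the table uses Hive's default field delimiter, Ctrl-A.
    static final char FIELD_DELIMITER = '\u0001';

    // Builds "<pageurl>\u0001<xcount>" as UTF-8 bytes, the same layout
    // that combineKeyValue writes into the BytesWritable.
    public static byte[] combine(String pageUrl, long xcount) {
        String row = pageUrl + FIELD_DELIMITER + xcount;
        return row.getBytes(StandardCharsets.UTF_8);
    }

    // Splits a combined row on the delimiter, imitating what a
    // delimiter-based row parser would see. The -1 limit keeps
    // trailing empty fields.
    public static String[] split(byte[] row) {
        String text = new String(row, StandardCharsets.UTF_8);
        return text.split(String.valueOf(FIELD_DELIMITER), -1);
    }

    public static void main(String[] args) {
        byte[] row = combine("http://01fishing.com/fly-fishing-knots/", 3437L);
        String[] fields = split(row);
        System.out.println(fields.length + " fields: "
                + fields[0] + " | " + Long.parseLong(fields[1]));
    }
}
```

If a key could itself contain the delimiter character, the split would yield extra fields, so this layout relies on the URLs never containing Ctrl-A.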

Source

2012-11-02 codingmonk

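Found a more detailed discussion about using the keys instead of the values here: an Apache Hive thread (http://mail-archives.apache.org/mod_mbox/hive-user/201204.mbox/%[email protected].com%3E), which led me to this gist (https://gist.github.com/2421795), which has a custom format and reader. Using these two links plus other information allowed me to build the above. – codingmonk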

Not sure whether this affects you, but Hive ignores keys when reading SequenceFiles. You may need to create a custom InputFormat (unless you can find one online :-))

Reference: http://mail-archives.apache.org/mod_mbox/hive-user/200910.mbox/%[email protected]%3E

Source

2012-11-04 19:08:21

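Yes, that seems to be my problem. I.e. it ignores the key, then goes on to look for a second column and can't find one. I will post more details, including everything I had to do, since that link contains several different solutions. – codingmonk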
