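Why is writing a Hadoop SequenceFile so much slower than reading it?

I am converting some custom files to Hadoop SequenceFiles using the Java API.

I read byte arrays from a local file and append them to a sequence file as Index (Integer) - Data (byte[]) pairs: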

InputStream in = new BufferedInputStream(new FileInputStream(localSource)); 
FileSystem fs = FileSystem.get(URI.create(hDFSDestinationDirectory),conf); 
Path sequenceFilePath = new Path(hDFSDestinationDirectory + "/"+ "data.seq"); 

IntWritable key = new IntWritable(); 
BytesWritable value = new BytesWritable(); 
SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, 
      sequenceFilePath, key.getClass(), value.getClass()); 

for (int i = 1; i <= nz; i++) { 
    byte[] imageData = new byte[nx * ny * 2]; 
    in.read(imageData); 

    key.set(i); 
    value.set(imageData, 0, imageData.length); 
    writer.append(key, value); 
} 
IOUtils.closeStream(writer); 
in.close();

I do exactly the reverse to restore the file to its original format:

for (int i = 1; i <= nz; i++) { 
    reader.next(key, value); 
    int byteLength = value.getLength(); 
    byte[] tempValue = value.getBytes(); 
    out.write(tempValue, 0, byteLength); 
    out.flush(); 
}

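I noticed that writing to the SequenceFile takes almost an order of magnitude longer than reading from it. I expected writing to be slower than reading, but is a difference this large normal? Why?

More info: the byte arrays I read are 2 MB each (nx = ny = 1024 and nz = 128).
I am testing in pseudo-distributed mode.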


2012-03-02 fgrollio

What does "an order of magnitude" mean in units of time? – 2012-03-04 16:19:30

"More than ten times" – fgrollio 2012-03-06 08:06:37

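You are reading from the local disk and writing to HDFS. When you write to HDFS, your data is probably being replicated, so it is physically written two or three times, depending on what you have set for the replication factor.

So you are not just writing, you are writing two to three times the amount of data. Your writes also go over the network; your reads do not.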


2012-03-02 14:29:59

I'm testing in pseudo-distributed mode, so I have no replication and no network traffic. Thanks for pointing it out, though. – fgrollio 2012-03-02 15:05:53

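Are nx and ny constants?

One reason you might be seeing this is that every iteration of your for loop creates a new byte array. This requires the JVM to allocate heap space for it. If the array is large enough, that is going to be expensive, and eventually you will run into GC. I'm not too sure what HotSpot might do to optimize this away, though.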
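To get a rough feel for the allocation cost described above, here is a minimal sketch that compares a fresh 2 MB array per iteration against one reused array. The class name `BufferReuseDemo` and its methods are made up for illustration, and a single `System.nanoTime()` measurement like this is not a proper benchmark: JIT warm-up and GC timing can easily dominate, so treat the numbers as indicative at best.

```java
public class BufferReuseDemo {
    // 2 MB per chunk, matching nx * ny * 2 with nx = ny = 1024 from the question
    static final int CHUNK = 1024 * 1024 * 2;

    /** Allocates a new buffer on every iteration, like the loop in the question. */
    public static long freshBuffers(int iterations) {
        long checksum = 0;
        for (int i = 1; i <= iterations; i++) {
            byte[] buf = new byte[CHUNK];
            buf[0] = (byte) i;      // touch the buffer so it is actually used
            checksum += buf[0];
        }
        return checksum;
    }

    /** Allocates one buffer up front and reuses it for every iteration. */
    public static long reusedBuffer(int iterations) {
        long checksum = 0;
        byte[] buf = new byte[CHUNK];
        for (int i = 1; i <= iterations; i++) {
            buf[0] = (byte) i;
            checksum += buf[0];
        }
        return checksum;
    }

    public static void main(String[] args) {
        int nz = 128; // number of chunks, as in the question

        long t0 = System.nanoTime();
        long a = freshBuffers(nz);
        long t1 = System.nanoTime();
        long b = reusedBuffer(nz);
        long t2 = System.nanoTime();

        System.out.println("fresh buffers: " + (t1 - t0) / 1_000 + " us (checksum " + a + ")");
        System.out.println("reused buffer: " + (t2 - t1) / 1_000 + " us (checksum " + b + ")");
    }
}
```

Both methods do the same work apart from the allocation, so any difference in timing comes from creating (and later collecting) the arrays.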

My suggestion is to create a single BytesWritable and reuse it:

// use DataInputStream so you can call readFully() 
DataInputStream in = new DataInputStream(new FileInputStream(localSource)); 
FileSystem fs = FileSystem.get(URI.create(hDFSDestinationDirectory),conf); 
Path sequenceFilePath = new Path(hDFSDestinationDirectory + "/"+ "data.seq"); 

IntWritable key = new IntWritable(); 
// create a BytesWritable, which can hold the maximum possible number of bytes 
BytesWritable value = new BytesWritable(new byte[maxPossibleSize]); 
// grab a reference to the value's underlying byte array 
byte byteBuf[] = value.getBytes(); 
SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, 
     sequenceFilePath, key.getClass(), value.getClass()); 

for (int i = 1; i <= nz; i++) { 
    // work out how many bytes to read - if this is a constant, move outside the for loop 
    int imageDataSize = nx * ny * 2; 
    // read in bytes to the byte array 
    in.readFully(byteBuf, 0, imageDataSize); 

    key.set(i); 
    // set the actual number of bytes used in the BytesWritable object 
    value.setSize(imageDataSize); 
    writer.append(key, value); 
} 

IOUtils.closeStream(writer); 
in.close();
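
The `readFully()` comment in the code above is worth a closer look: `InputStream.read(byte[])` is allowed to return fewer bytes than the array holds, while `DataInputStream.readFully()` keeps reading until the requested range is filled (or throws `EOFException`). The sketch below illustrates the difference with a hypothetical `trickle()` stream that delivers at most 3 bytes per call; `ReadFullyDemo` and its method names are assumptions for illustration only.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

public class ReadFullyDemo {
    /** Hypothetical stream that hands out at most 3 bytes per read() call. */
    static InputStream trickle(byte[] data) {
        return new ByteArrayInputStream(data) {
            @Override
            public synchronized int read(byte[] b, int off, int len) {
                return super.read(b, off, Math.min(len, 3));
            }
        };
    }

    /** A single read(): returns however many bytes the stream chose to deliver. */
    public static int plainRead(byte[] data, byte[] buf) {
        try (InputStream in = trickle(data)) {
            return in.read(buf);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** readFully(): loops internally until buf is full, or throws EOFException. */
    public static int fullRead(byte[] data, byte[] buf) {
        try (DataInputStream in = new DataInputStream(trickle(data))) {
            in.readFully(buf, 0, buf.length);
            return buf.length;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = new byte[10];
        System.out.println("read():      " + plainRead(data, new byte[8]) + " bytes");
        System.out.println("readFully(): " + fullRead(data, new byte[8]) + " bytes");
    }
}
```

With the 10-byte source and 8-byte buffer in `main`, the plain `read()` fills only 3 bytes while `readFully()` fills all 8. A loop that ignores the return value of `read()` can therefore append partially filled records without noticing.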


2012-03-21 00:48:56

Yes, nx and nz are constants. I'll try this, thanks for the detailed answer. – fgrollio 2012-03-27 14:20:43

fgrollio, did it help improve performance? – 2013-02-26 15:10:49
