java.lang.OutOfMemoryError while running a Hadoop job

I have an input file (~31 GB) containing consumer reviews of some products, which I am trying to lemmatize so that I can count the resulting lemmas. The approach is broadly similar to the WordCount example that ships with Hadoop. I have four classes doing the processing: StanfordLemmatizer [wraps the lemmatization goodies from Stanford's CoreNLP package v3.3.0], WordCount [the driver], WordCountMapper [the mapper] and WordCountReducer [the reducer].
I have tested the program on a subset (a few MB) of the original dataset and it ran fine. Unfortunately, when I run the job on the full ~31 GB dataset, it fails. I checked the job's logs, and they contain this:
java.lang.OutOfMemoryError: Java heap space
    at edu.stanford.nlp.sequences.ExactBestSequenceFinder.bestSequence(ExactBestSequenceFinder.java:109)
    [...]
Any suggestions on how to handle this?
Note: I am using the Yahoo! VM that comes pre-configured with hadoop-0.18.0. I have also tried the solution of allocating more heap mentioned in this thread: out of Memory Error in Hadoop
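Concretely, I bumped the task JVM heap in the driver along these lines (a minimal sketch, assuming a standard old-API JobConf setup; the -Xmx value is just what I tried, not a recommendation):

import org.apache.hadoop.mapred.JobConf;

JobConf conf = new JobConf(WordCount.class);
conf.setJobName("lemmacount");
// mapred.child.java.opts holds the JVM options passed to the spawned
// map/reduce task processes; the default heap in this Hadoop generation
// is a modest -Xmx200m, if I remember right.
conf.set("mapred.child.java.opts", "-Xmx1024m");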
The WordCountMapper code:
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
public class WordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final IntWritable one = new IntWritable(1);
  private final Text word = new Text();
  private final StanfordLemmatizer slem = new StanfordLemmatizer();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString();
    // Only lines carrying the summary/text of a review are processed.
    if (line.matches("^review/(summary|text).*")) {
      // Strip the "review/summary:" or "review/text:" prefix (plus the
      // character after the colon), lower-case the rest, and emit a count
      // of 1 for every lemma the lemmatizer returns.
      for (String lemma : slem.lemmatize(line.replaceAll("^review/(summary|text):.", "").toLowerCase())) {
        word.set(lemma);
        output.collect(word, one);
      }
    }
  }
}
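One workaround I am considering: as far as I understand, the usual trigger for an OOM inside ExactBestSequenceFinder.bestSequence is an extremely long token sequence, since the tagger's memory use grows with the length of the "sentence" it tags. So I could skip (or truncate) pathologically long review lines before handing them to the lemmatizer. A minimal sketch of the modified map() below; MAX_CHARS and the Skipped counter are hypothetical, not tuned values, and the fields are the same as in the class above:

// Hypothetical counter for visibility into how many records get skipped.
private static enum Skipped { LONG_RECORDS }

// Arbitrary cutoff on how much text to tag in one go; not a tuned value.
private static final int MAX_CHARS = 100000;

public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
    throws IOException {
  String line = value.toString();
  if (!line.matches("^review/(summary|text).*")) {
    return; // not a summary/text line, nothing to do
  }
  String text = line.replaceAll("^review/(summary|text):.", "").toLowerCase();
  if (text.length() > MAX_CHARS) {
    // Keep a tally of dropped records so the job stats stay honest.
    reporter.incrCounter(Skipped.LONG_RECORDS, 1);
    return;
  }
  for (String lemma : slem.lemmatize(text)) {
    word.set(lemma);
    output.collect(word, one);
  }
}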
Thanks Professor Manning for the detailed explanation and suggestions. I will try them out and see if I can manage some workarounds :) – Aditya