使用MapReduce作业调用StanfordCoreNLP API

我想要使用MapReduce处理大量文档，这个想法是将文件拆分为映射器中的文档并在还原器阶段应用stanford coreNLP注释器。我有一个相当简单（标准）管道的“标记化，ssplit，pos，引理，ner”，并且reducer只是调用一个函数，将这些annotators应用到reducer传递的值并返回注释（as字符串列表），但是生成的输出是垃圾。使用MapReduce作业调用StanfordCoreNLP API

我观察到，如果我从映射程序中调用注释函数，那么作业会返回预期的输出结果，但是会打败整个并行性游戏。当我忽略reducer中获取的值并仅将注释器应用于虚拟字符串时，该作业也会返回预期的输出。

这可能表明在进程中存在一些线程安全问题，但我无法弄清楚在哪里，我的注释函数是同步的，而管道是私有的。

有人可以提供一些关于如何解决这个问题的指针？

-Angshu

编辑：

这是我减速的样子，希望这增加了更清晰

public static class Reduce extends MapReduceBase implements Reducer<Text, Text, Text, Text> { 
    public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException { 
     while (values.hasNext()) { 
      output.collect(key, new Text(se.getExtracts(values.next().toString()).toString()));    
     } 
    } 
}

这是获得提取码：

final StanfordCoreNLP pipeline; 
public instantiatePipeline(){ 
    Properties props = new Properties(); 
    props.put("annotators", "tokenize, ssplit, pos, lemma, ner"); 

} 


synchronized List<String> getExtracts(String l){ 
    Annotation document = new Annotation(l); 

    ArrayList<String> ret = new ArrayList<String>(); 

    pipeline.annotate(document); 

    List<CoreMap> sentences = document.get(SentencesAnnotation.class); 
    int sid = 0; 
    for(CoreMap sentence:sentences){ 
     sid++; 
     for(CoreLabel token: sentence.get(TokensAnnotation.class)){ 
      String word = token.get(TextAnnotation.class); 
      String pos = token.get(PartOfSpeechAnnotation.class); 
      String ner = token.get(NamedEntityTagAnnotation.class); 
      String lemma = token.get(LemmaAnnotation.class); 

      Timex timex = token.get(TimeAnnotations.TimexAnnotation.class); 

      String ex = word+","+pos+","+ner+","+lemma; 
      if(timex!=null){ 
       ex = ex+","+timex.tid(); 
      } 
      else{ 
       ex = ex+","; 
      } 
      ex = ex+","+sid; 
      ret.add(ex); 
     } 
    }

来源

2014-06-28 Angshu

我想你需要提供更多关于你的实现的细节（如你的代码）。 – Daniel

把我的代码，将不胜感激任何指针。 – Angshu

基于实现，我认为如果有任何问题，应该在第一个代码（MapReduce的代码）中。如果不是斯坦福的注释，你可以调用一个简单的函数吗？ – Daniel

我解决了这个问题，实际上问题是在文件中的文本编码从（将其转换为文本导致进一步的损坏，我猜）这是造成标记化和溢出垃圾的问题。我正在清理输入字符串并应用严格的UTF-8编码，而且现在工作正常。

来源

2014-06-30 06:46:17 Angshu

选择它作为答案！ :) – Daniel

使用MapReduce作业调用StanfordCoreNLP API

回答

相关问题