Stanford NLP: OutOfMemoryError

The pipeline.annotate method gets slower with every file it reads. Eventually, I get an OutOfMemoryError.

The pipeline is initialized once:

protected void initializeNlp()
{
    Log.getLogger().debug("Starting Stanford NLP");

    // Creates a StanfordCoreNLP object with the annotators listed below:
    // tokenization, sentence splitting, POS tagging, lemmatization, NER,
    // RegexNER, dependency parsing, natural logic, and OpenIE.
    Properties props = new Properties();

    props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, regexner, depparse, natlog, openie");
    props.setProperty("regexner.mapping", namedEntityPropertiesPath);

    pipeline = new StanfordCoreNLP(props);

    Log.getLogger().debug("\n\n\nStarted Stanford NLP Successfully\n\n\n");
}

I then process each file using the same instance of the pipeline (as recommended on SO and elsewhere by Stanford).

public void processFile(Path file)
{
    try
    {
        Instant start = Instant.now();

        Annotation document = new Annotation(cleanString);
        Log.getLogger().info("ANNOTATE");
        pipeline.annotate(document);
        long millis = Duration.between(start, Instant.now()).toMillis();
        Log.getLogger().info("Annotation Duration in millis: " + millis);

        AnalyzedFile af = AnalyzedFileFactory.getAnalyzedFile(AnalyzedFileFactory.GENERIC_JOB_POST, file);

        processSentences(af, document);

        Log.getLogger().info("\n\n\nFile Processing Complete\n\n\n\n\n");

        long millis1 = Duration.between(start, Instant.now()).toMillis();
        Log.getLogger().info("Total Duration in millis: " + millis1);

        allFiles.put(file.toUri().toString(), af);
    }
    catch (Exception e)
    {
        Log.getLogger().debug(e.getMessage(), e);
    }
}

To be clear, I expect the problem is with my configuration. However, I am certain that the stall and the memory issue occur in the pipeline.annotate(document) call.

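After processing each file, I dispose of all references to Stanford-NLP objects other than the pipeline (e.g., CoreLabel). That is, I do not keep references to any Stanford objects in my code beyond the method level.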
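One way to check a claim like this is to log the used heap after each file and see whether it keeps climbing. The following is a minimal sketch using only the standard `Runtime` API; the class name `HeapProbe` and the method `usedHeapBytes` are hypothetical helpers, not part of Stanford CoreNLP, and the number is only an estimate because it includes garbage that has not been collected yet.

```java
final class HeapProbe {
    /**
     * Returns an estimate of the heap currently in use, in bytes.
     * The figure includes objects that are unreachable but not yet collected.
     */
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) {
        // In the real program, call this after each processFile(...) and log the value.
        long usedMiB = usedHeapBytes() / (1024 * 1024);
        System.out.println("Heap in use (MiB): " + usedMiB);
    }
}
```

If the logged value rises steadily from file to file even after garbage collections, something is still holding references to the per-file analysis.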

Any tips or guidance would be greatly appreciated.


2016-06-18 Jake

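OK, the last sentence of the question made me go back and double-check. The answer is that I was still keeping references to CoreMap objects in my own classes. In other words, I was holding in memory all the trees, tokens, and other analysis for every sentence in my corpus.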

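In short, keep StanfordNLP CoreMaps for a given number of sentences and then dispose of them. (I expect a hard-core computational linguist would say there is rarely any need to keep a CoreMap once it has been analyzed, but I have to declare my beginner status here.)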
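The idea of keeping analysis for only a given number of sentences can be sketched with a small bounded buffer. This is an illustration under assumptions, not the code from the original project: `RecentWindow` and `RecentWindowDemo` are hypothetical names, and plain strings stand in for the CoreMap objects (or smaller summaries copied out of them) that real code would store.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/**
 * Keeps at most `capacity` of the most recently added items.
 * Older items are dropped from the buffer, so once nothing else references
 * them the garbage collector can reclaim their memory.
 */
final class RecentWindow<T> {
    private final int capacity;
    private final Deque<T> items = new ArrayDeque<>();

    RecentWindow(int capacity) {
        if (capacity < 1) {
            throw new IllegalArgumentException("capacity must be >= 1");
        }
        this.capacity = capacity;
    }

    /** Adds an item and evicts the oldest ones beyond the capacity. */
    void add(T item) {
        items.addLast(item);
        while (items.size() > capacity) {
            items.removeFirst();
        }
    }

    int size() {
        return items.size();
    }

    /** Returns a copy of the retained items, oldest first. */
    List<T> snapshot() {
        return new ArrayList<>(items);
    }
}

final class RecentWindowDemo {
    public static void main(String[] args) {
        // Strings stand in for per-sentence analysis objects.
        RecentWindow<String> window = new RecentWindow<>(3);
        for (int i = 1; i <= 5; i++) {
            window.add("sentence-" + i);
        }
        System.out.println(window.snapshot()); // prints [sentence-3, sentence-4, sentence-5]
    }
}
```

A buffer like this only helps if it is the sole long-lived holder of the objects; a second reference kept in a map or field elsewhere would still pin them in memory.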


2016-06-18 19:54:35 Jake
