How do I split the output of PTBTokenizer into sentences?

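I know I can use DocumentPreprocessor to split text into sentences. But it does not provide enough information if I want to convert the tokenized text back to the original text. So I have to use PTBTokenizer, which has an invertible option.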

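However, PTBTokenizer just returns an iterator over all the tokens (CoreLabels) in the document. It does not split the document into sentences.

The documentation says:

The output of PTBTokenizer can be post-processed to divide a text into sentences.

But this is clearly not trivial.

Is there a class in the Stanford NLP library that takes a sequence of CoreLabels as input and outputs sentences? This is what I mean: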

List<List<CoreLabel>> split(List<CoreLabel> documentTokens);
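
To make the requested shape concrete, here is a minimal, naive sketch of such a `split` method in plain Java. It is not Stanford's implementation: `Token` is a hypothetical stand-in for `CoreLabel`, and the boundary rule (every ".", "?" or "!" token ends a sentence) is an assumption made for illustration. It ignores cases such as closing quotes or brackets that follow the final period, which is part of why the post-processing is not trivial.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class NaiveSentenceSplitter {

    // Hypothetical stand-in for CoreLabel: only the token text is kept.
    static final class Token {
        final String word;
        Token(String word) { this.word = word; }
    }

    // Assumption: each of these tokens always ends a sentence.
    private static final Set<String> BOUNDARIES = Set.of(".", "?", "!");

    // Groups a flat token list into sentences, cutting after each boundary token.
    static List<List<Token>> split(List<Token> documentTokens) {
        List<List<Token>> sentences = new ArrayList<>();
        List<Token> current = new ArrayList<>();
        for (Token token : documentTokens) {
            current.add(token);
            if (BOUNDARIES.contains(token.word)) {
                sentences.add(current);
                current = new ArrayList<>();
            }
        }
        // Keep trailing tokens that are not followed by a boundary token.
        if (!current.isEmpty()) {
            sentences.add(current);
        }
        return sentences;
    }

    // Convenience helper for building test input.
    static List<Token> tokens(String... words) {
        List<Token> result = new ArrayList<>();
        for (String w : words) {
            result.add(new Token(w));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Token> doc = tokens("I", "am", "a", "sentence", ".",
                                 "I", "am", "another", "sentence", ".");
        System.out.println(split(doc).size() + " sentences");
    }
}
```

Because the sentence lists hold the same token objects as the input, any per-token information (such as the whitespace kept by an invertible tokenizer) would survive the split unchanged.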


2015-11-13 Yuhuan Jiang

I suggest using the StanfordCoreNLP class. Here is some sample code:

import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TextAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

public class PipelineExample {

    public static void main(String[] args) {
        // Build a pipeline that tokenizes, splits sentences, and tags parts of speech.
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        // Annotate the raw text.
        String text = " I am a sentence. I am another sentence.";
        Annotation annotation = new Annotation(text);
        pipeline.annotate(annotation);
        System.out.println(annotation.get(TextAnnotation.class));

        // Each CoreMap is one sentence; its TokensAnnotation holds that sentence's CoreLabels.
        List<CoreMap> sentences = annotation.get(SentencesAnnotation.class);
        for (CoreMap sentence : sentences) {
            System.out.println(sentence.get(TokensAnnotation.class));
            for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
                // Check that the fields needed to invert the tokenization are populated.
                System.out.println(token.after() != null);
                System.out.println(token.before() != null);
                System.out.println(token.beginPosition());
                System.out.println(token.endPosition());
            }
        }
    }
}


2015-11-13 01:55:49 StanfordNLPHelp

Do `before()`, `after()`, `beginPosition()` and `endPosition()` work (i.e., not just return `null`) on the tokens in the resulting `CoreMap`s?

Yes, all of those are set. – StanfordNLPHelp

Thanks. It works!
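
Since the point of the question is to get back to the original text, the following sketch shows how the whitespace fields mentioned above could be used for that. It is a plain-Java illustration, not CoreNLP code: `Token` is a hypothetical stand-in for an invertible `CoreLabel`, and its three fields mirror what `originalText()`, `before()` and `after()` are understood to return. The sketch assumes each token's `after` equals the next token's `before`, so only the first token's `before` is emitted.

```java
import java.util.List;

class InvertibleTokens {

    // Hypothetical stand-in for an invertible CoreLabel.
    static final class Token {
        final String originalText; // the token exactly as it appeared in the input
        final String before;       // whitespace immediately preceding the token
        final String after;        // whitespace immediately following the token

        Token(String originalText, String before, String after) {
            this.originalText = originalText;
            this.before = before;
            this.after = after;
        }
    }

    // Rebuilds the source text covered by a run of tokens (a sentence or a whole document).
    static String reconstruct(List<Token> tokens) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.size(); i++) {
            Token t = tokens.get(i);
            if (i == 0) {
                sb.append(t.before); // leading whitespace is emitted once, for the first token
            }
            sb.append(t.originalText).append(t.after);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Token> sentence = List.of(
                new Token("I", " ", " "),
                new Token("am", " ", ""),
                new Token(".", "", ""));
        System.out.println("[" + reconstruct(sentence) + "]");
    }
}
```

Applied to the token list of a single sentence, the same method would return that sentence's original text, including the whitespace around it.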
