2017-07-29 65 views

Answers

0

One way to do this:

// Wrap the input text in a Reader and treat it as plain text (no XML markup).
Reader reader = new StringReader(paragraphText);
DocumentPreprocessor documentPreprocessor = new DocumentPreprocessor(reader, DocumentPreprocessor.DocType.Plain);

// Have the PTB tokenizer drop untokenizable characters without logging a warning.
TokenizerFactory<? extends HasWord> factory = PTBTokenizer.factory();
factory.setOptions("untokenizable=noneDelete");
documentPreprocessor.setTokenizerFactory(factory);

Source: https://github.com/stanfordnlp/CoreNLP/issues/103#issuecomment-157793500

1

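If you are working directly with a tokenizer, the answer Denis Kulagin gave is good. If you are operating at the higher level of a StanfordCoreNLP pipeline, you can simply set the property (or the equivalent command-line option):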

tokenize.options = untokenizable=noneDelete 

(to silently delete all unknown characters), or, to silently keep them:

tokenize.options = untokenizable=noneKeep
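For the pipeline case, the property can also be set programmatically. Below is a minimal sketch that only builds the java.util.Properties object. The class UntokenizableOptions and its build method are hypothetical helpers, not part of CoreNLP, and the "annotators" value is an assumption about which annotators you want. The six mode names are the ones the PTBTokenizer documentation lists for the untokenizable option.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

// Hypothetical helper (not part of CoreNLP): it only builds pipeline properties.
class UntokenizableOptions {
    // none/first/all: how many warnings are logged for untokenizable characters.
    // Delete/Keep: whether those characters are dropped or kept as
    // single-character tokens.
    static final List<String> MODES = Arrays.asList(
        "noneDelete", "firstDelete", "allDelete",
        "noneKeep", "firstKeep", "allKeep");

    static Properties build(String mode) {
        // Reject typos early instead of passing an unknown value downstream.
        if (!MODES.contains(mode)) {
            throw new IllegalArgumentException("Unknown untokenizable mode: " + mode);
        }
        Properties props = new Properties();
        // Assumption: only tokenization is needed; extend the list as required.
        props.setProperty("annotators", "tokenize");
        props.setProperty("tokenize.options", "untokenizable=" + mode);
        return props;
    }

    public static void main(String[] args) {
        Properties props = build("noneDelete");
        // With CoreNLP on the classpath, the properties would be passed on as:
        //   StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        System.out.println(props.getProperty("tokenize.options"));
    }
}
```

Swapping "noneDelete" for "noneKeep" gives the second variant shown above. The first*/all* modes behave the same way but log a warning for the first or for every untokenizable character.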