TokensRegex: tokens are null after retokenization

I am trying to use Stanford NLP's TokensRegex to find dimensions (e.g. 100x120) in a text. My plan is to first retokenize the input to further split up these tokens (using the example provided in retokenize.rules.txt) and then search for the new pattern. After retokenization, however, only null values are left in place of the original strings:
The top level annotation
[Text=100x120 Tokens=[null-1, null-2, null-3] Sentences=[100x120]]
The retokenization itself seems to work fine (the result is 3 tokens), but the values are lost. What can I do to keep the original values in the token list?

My retokenize.rules.txt file (as in the demo):
tokens = { type: "CLASS", value:"edu.stanford.nlp.ling.CoreAnnotations$TokensAnnotation" }
options.matchedExpressionsAnnotationKey = tokens;
options.extractWithTokens = TRUE;
options.flatten = TRUE;
ENV.defaults["ruleType"] = "tokens"
ENV.defaultStringPatternFlags = 2
ENV.defaultResultAnnotationKey = tokens
{ pattern: (/\d+(x|X)\d+/), result: Split($0[0], /x|X/, TRUE) }
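For reference, the split that this rule is meant to perform (cutting "100x120" into three tokens while keeping the "x" separator, since the third argument of Split is TRUE) can be sketched in plain Java with zero-width lookarounds; the class and method names here are hypothetical, just to illustrate the expected token boundaries:

```java
public class SplitDimension {
    // Split "100x120" into {"100", "x", "120"}, keeping the delimiter itself,
    // mirroring what Split($0[0], /x|X/, TRUE) should produce.
    static String[] splitKeepingDelimiter(String s) {
        // Zero-width lookarounds cut between a digit and the x/X separator
        // in both directions, so no characters are consumed by the split.
        return s.split("(?<=\\d)(?=[xX])|(?<=[xX])(?=\\d)");
    }

    public static void main(String[] args) {
        System.out.println(String.join("|", splitKeepingDelimiter("100x120")));
        // prints: 100|x|120
    }
}
```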
The main method:
public static void main(String[] args) throws IOException {
    //...
    String text = "100x120";
    Properties properties = new Properties();
    properties.setProperty("tokenize.language", "de");
    properties.setProperty("annotators", "tokenize,retokenize,ssplit,pos,lemma,ner");
    properties.setProperty("customAnnotatorClass.retokenize", "edu.stanford.nlp.pipeline.TokensRegexAnnotator");
    properties.setProperty("retokenize.rules", "retokenize.rules.txt");
    StanfordCoreNLP stanfordPipeline = new StanfordCoreNLP(properties);
    runPipeline(stanfordPipeline, text);
}
And the pipeline:
public static void runPipeline(StanfordCoreNLP pipeline, String text) {
Annotation annotation = new Annotation(text);
pipeline.annotate(annotation);
out.println();
out.println("The top level annotation");
out.println(annotation.toShorterString());
//...
}
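One post-processing idea I have been considering is to copy each token's surface form back into its null value field after annotation. The sketch below uses a minimal stand-in class rather than the actual CoreLabel API (in a real pipeline the equivalent would be reading the word and writing the value on each token in the TokensAnnotation list), so all names here are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

public class RestoreValues {
    // Minimal stand-in for a token: a surface word plus a separate value field
    // that may be left null by retokenization.
    static class Token {
        final String word;
        String value;
        Token(String word, String value) { this.word = word; this.value = value; }
    }

    // Copy the surface word back into any token whose value is null.
    static void restoreValues(List<Token> tokens) {
        for (Token t : tokens) {
            if (t.value == null) {
                t.value = t.word;
            }
        }
    }

    public static void main(String[] args) {
        List<Token> tokens = new ArrayList<>(List.of(
                new Token("100", null), new Token("x", null), new Token("120", null)));
        restoreValues(tokens);
        StringBuilder sb = new StringBuilder();
        for (Token t : tokens) sb.append(t.value).append(' ');
        System.out.println(sb.toString().trim());
        // prints: 100 x 120
    }
}
```

This only papers over the symptom, though; I would still prefer a rules-file or annotator option that preserves the values in the first place.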