TokensRegex: tokens are null after retokenization

I am trying to use Stanford NLP's TokensRegex to find dimensions (e.g. 100x120) in a text. My plan is to first retokenize the input to further split up these tokens (using the example provided in retokenize.rules.txt) and then search for the new pattern. After retokenization, however, only null values are left in place of the original strings:
The top level annotation
[Text=100x120 Tokens=[null-1, null-2, null-3] Sentences=[100x120]]
The retokenization itself seems to work fine (the result is 3 tokens), but the values are lost. What can I do to keep the original values in the token list?

My retokenize.rules.txt file (as in the demo):
tokens = { type: "CLASS", value:"edu.stanford.nlp.ling.CoreAnnotations$TokensAnnotation" }
options.matchedExpressionsAnnotationKey = tokens;
options.extractWithTokens = TRUE;
options.flatten = TRUE;
ENV.defaults["ruleType"] = "tokens"
ENV.defaultStringPatternFlags = 2
ENV.defaultResultAnnotationKey = tokens
{ pattern: (/\d+(x|X)\d+/), result: Split($0[0], /x|X/, TRUE) }
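For reference, the split that this rule is meant to perform (cutting "100x120" into three tokens while keeping the "x" separator, since the third argument of Split is TRUE) can be sketched in plain Java with zero-width lookarounds; the class and method names here are hypothetical, just to illustrate the expected token boundaries:

```java
public class SplitDimension {
    // Split "100x120" into {"100", "x", "120"}, keeping the delimiter itself,
    // mirroring what Split($0[0], /x|X/, TRUE) should produce.
    static String[] splitKeepingDelimiter(String s) {
        // Zero-width lookarounds cut between a digit and the x/X separator
        // in both directions, so no characters are consumed by the split.
        return s.split("(?<=\\d)(?=[xX])|(?<=[xX])(?=\\d)");
    }

    public static void main(String[] args) {
        System.out.println(String.join("|", splitKeepingDelimiter("100x120")));
        // prints: 100|x|120
    }
}
```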
The main method:
public static void main(String[] args) throws IOException {
    //...
    String text = "100x120";
    Properties properties = new Properties();
    properties.setProperty("tokenize.language", "de");
    properties.setProperty("annotators", "tokenize,retokenize,ssplit,pos,lemma,ner");
    properties.setProperty("customAnnotatorClass.retokenize", "edu.stanford.nlp.pipeline.TokensRegexAnnotator");
    properties.setProperty("retokenize.rules", "retokenize.rules.txt");
    StanfordCoreNLP stanfordPipeline = new StanfordCoreNLP(properties);
    runPipeline(stanfordPipeline, text);
}
And the pipeline:
public static void runPipeline(StanfordCoreNLP pipeline, String text) {
Annotation annotation = new Annotation(text);
pipeline.annotate(annotation);
out.println();
out.println("The top level annotation");
out.println(annotation.toShorterString());
//...
}
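One post-processing idea I have been considering is to copy each token's surface form back into its null value field after annotation. The sketch below uses a minimal stand-in class rather than the actual CoreLabel API (in a real pipeline the equivalent would be reading the word and writing the value on each token in the TokensAnnotation list), so all names here are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

public class RestoreValues {
    // Minimal stand-in for a token: a surface word plus a separate value field
    // that may be left null by retokenization.
    static class Token {
        final String word;
        String value;
        Token(String word, String value) { this.word = word; this.value = value; }
    }

    // Copy the surface word back into any token whose value is null.
    static void restoreValues(List<Token> tokens) {
        for (Token t : tokens) {
            if (t.value == null) {
                t.value = t.word;
            }
        }
    }

    public static void main(String[] args) {
        List<Token> tokens = new ArrayList<>(List.of(
                new Token("100", null), new Token("x", null), new Token("120", null)));
        restoreValues(tokens);
        StringBuilder sb = new StringBuilder();
        for (Token t : tokens) sb.append(t.value).append(' ');
        System.out.println(sb.toString().trim());
        // prints: 100 x 120
    }
}
```

This only papers over the symptom, though; I would still prefer a rules-file or annotator option that preserves the values in the first place.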