
About the Stanford CoreNLP Chinese model

How do I use the Chinese model? I downloaded "stanford-corenlp-3.5.2-models-chinese.jar" onto my classpath and added

<dependency> 
    <groupId>edu.stanford.nlp</groupId> 
    <artifactId>stanford-corenlp</artifactId> 
    <version>3.5.2</version> 
    <classifier>models-chinese</classifier> 
</dependency> 

to my pom.xml file. In addition, my input.txt is

因出席中国大陆阅兵引发争议的国民党前主席连战今晚金婚宴,立法院长王金平说,已向连战恭喜,等一下回南部。 连战夫妇今晚的50周年金婚纪念宴,正值连战赴陆出席阅兵引发争议之际,社会关注会否受到影响。 包括国民党主席朱立伦、副主席郝龙斌等人已分别对外表示另有行程,无法出席。

Then I run the pipeline with the command

java -cp "*" -Xmx1g edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-chinese.properties -annotators segment,ssplit -file input.txt 

The result is as follows, but the segmented output comes out garbled. How do I solve this problem?

C:\stanford-corenlp-full-2015-04-20>java -cp "*" -Xmx1g edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-chinese.properties -annotators segment,ssplit -file input.txt 
Registering annotator segment with class edu.stanford.nlp.pipeline.ChineseSegmenterAnnotator 
Adding annotator segment 
Loading Segmentation Model ... Loading classifier from edu/stanford/nlp/models/segmenter/chinese/ctb.gz ... Loading Chinese dictionaries from 1 file: 
    edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz 
Done. Unique words in ChineseDictionary is: 423200. 
done [22.9 sec]. 

Ready to process: 1 files, skipped 0, total 1 
Processing file C:\stanford-corenlp-full-2015-04-20\input.txt ... writing to C:\stanford-corenlp-full-2015-04-20\input.txt.xml { 
    Annotating file C:\stanford-corenlp-full-2015-04-20\input.txt Adding Segmentation annotation ... INFO: TagAffixDetector: useChPos=false | useCTBChar2=true | usePKChar2=false 
INFO: TagAffixDetector: building TagAffixDetector from edu/stanford/nlp/models/segmenter/chinese/dict/character_list and edu/stanford/nlp/models/segmenter/chinese/dict/in.ctb 
Loading character dictionary file from edu/stanford/nlp/models/segmenter/chinese/dict/character_list 
Loading affix dictionary from edu/stanford/nlp/models/segmenter/chinese/dict/in.ctb 
?]?X?u????j???\?L??o??????????e?D?u?s???????B?b?A??k?|???????????A?w?V?s?????A? 
[email protected]?U?^?n???C 
?s?????????50?g?~???B?????b?A????s??u???X?u?\?L??o???????A???|???`?|?_????v?T?C 
?]?A?????D?u?????B??D?u?q?s?y???H?w???O??~???t????{?A?L?k?X?u?C 

---> 
[?, ], ?, X, ?u????j???, \, ?L??o??????????e?D?u?s???????B?b?A??k?|???????????A? 
[email protected]?U?^?n???C, , , , ?s?????????, 50, ?, g?, ~, ???B?????b?A????s??u 
???X?u?, \, ?L??o???????A???, |, ???, `, ?, |, ?_????v?T?C, , , , ?, ], ?, A???? 
?D?u???, ??, B??D?u?q, ?, s?y???H?w???O??, ~, ???t????, {, ?, A?L?k?X?u?C] 

} 
Processed 1 documents 
Skipped 0 documents, error annotating 0 documents 
Annotation pipeline timing information: 
ChineseSegmenterAnnotator: 0.1 sec. 
TOTAL: 0.1 sec. for 34 tokens at 485.7 tokens/sec. 
Pipeline setup: 0.0 sec. 
Total time for StanfordCoreNLP pipeline: 0.1 sec. 

Answers


I edited your question to change the command to the one that you actually used to produce the output shown. It looks like you worked out that the former command:

java -cp "*" -Xmx1g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref -file input.txt input.xml 

ran the English analysis pipeline, and that didn't work very well for Chinese text....

CoreNLP's support for Chinese in v3.5.2 is still a little rough, and will hopefully be a bit smoother in the next release. But from here you need to:

  • Specify a properties file for Chinese that points at the appropriate models (if no properties file is given, CoreNLP defaults to English): -props StanfordCoreNLP-chinese.properties
  • At present, Chinese word segmentation is done not by the tokenize annotator but by segment, which is specified as a custom annotator in StanfordCoreNLP-chinese.properties. (Maybe we'll unify the two in a future release...)
  • The current dcoref annotator only works for English. There is Chinese coreference, but it is not fully integrated into the pipeline; if you want to use it, you currently have to write some code, as explained here. So let's delete it. (Again, this should be better integrated in the future.)
  • At that point things run, but the ugly stderr output you show appears because the segmenter has VERBOSE turned on by default and your console's character encoding does not match our Chinese output. VERBOSE should really be off by default, but you can turn it off with: -segment.verbose false
  • We have no Chinese lemmatizer, so you may as well delete that annotator too.
  • Also, CoreNLP needs more than 1 GB of RAM for these models. Try 2 GB.

At this point, all should be good! With the command:

java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-chinese.properties -annotators segment,ssplit,pos,ner,parse -segment.verbose false -file input.txt

you get the output in input.txt.xml. (I'm not posting it here, since it's a couple of thousand lines long....)
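
For completeness, here is a rough programmatic equivalent of that v3.5.2 command. This is a minimal sketch rather than anything from the original answer: it assumes the CoreNLP and models-chinese jars are on the classpath and that StanfordCoreNLP-chinese.properties can be read as a classpath resource (which is how the -props flag above locates it).

import java.io.PrintStream;
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

public class ChineseSegmentDemo {
  public static void main(String[] args) throws Exception {
    // Start from the bundled Chinese defaults (this also declares the custom
    // "segment" annotator), then override the annotator list and verbosity,
    // mirroring the v3.5.2 command-line flags above.
    Properties props = new Properties();
    props.load(ChineseSegmentDemo.class.getClassLoader()
        .getResourceAsStream("StanfordCoreNLP-chinese.properties"));
    props.setProperty("annotators", "segment, ssplit, pos, ner, parse");
    props.setProperty("segment.verbose", "false");

    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    Annotation doc = new Annotation("因出席中国大陆阅兵引发争议的国民党前主席连战今晚金婚宴。");
    pipeline.annotate(doc);

    // Write UTF-8 explicitly so the segmented Chinese is not mangled by the
    // platform-default console encoding (the garbling visible in the question).
    PrintStream out = new PrintStream(System.out, true, "UTF-8");
    for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
      for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
        out.println(token.word() + "\t" + token.tag());
      }
    }
  }
}

Loading StanfordCoreNLP-chinese.properties first matters: it is what declares segment as a custom annotator and points the other annotators at the Chinese models, so setting only the annotators property would fall back to the English defaults.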

Update for CoreNLP v3.8.0: If using the (current in 2017) CoreNLP v3.8.0, then there are a few changes/advances: (i) we now use the tokenize annotator for all languages, and no custom annotator needs to be loaded for Chinese; (ii) verbose segmentation is correctly turned off by default; (iii) [negative progress] the annotator requirements now demand that lemma come before ner, even though it does nothing for Chinese; and (iv) coreference is now available for Chinese as coref, which requires the prior annotator mention, and whose statistical model needs quite a lot of memory. Putting that all together, you're now good with this command:

java -cp "*" -Xmx3g edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-chinese.properties -annotators tokenize,ssplit,pos,lemma,ner,parse,mention,coref -file input.txt
