2017-08-24 39 views

I am using the Python wrapper for Stanford NLP to find named entities. The code is: How to find the indices of named entities from Stanford NLP

sentence = "Mr. Jhon was noted to have a cyst at his visit back in 2011." 
result = nlp.ner(sentence) 

for ne in result: 
    if ne[1] == 'PERSON': 
        print(ne) 

The output is a list of tuples, like: (u'Jhon', u'PERSON')

But it does not give the character indices of the named entities, the way spaCy and other NLP tools do, for example:

>> namefinder = NameFinder.getNameFinder("spaCy") 
>> entities = namefinder.find(sentences) 
List(List((PERSON,0,13), (DURATION,15,27), (DATE,76,83)), 
    List((PERSON,4,10), (LOCATION,77,86), (ORGANIZATION,26,39)), 
    List((PERSON,0,13), (DURATION,16,28), (ORGANIZATION,52,80))) 

Answer


I used nltk for this. I adapted the answer from here. The key point is to call the span_tokenize() method of WordPunctTokenizer to build a separate list, which I call spans, that holds the character span of each token.

from nltk.tag import StanfordNERTagger 
from nltk.tokenize import WordPunctTokenizer 

# Initialize Stanford NLP with the path to the model and the NER .jar 
st = StanfordNERTagger(r"C:\stanford-corenlp\stanford-ner\classifiers\english.all.3class.distsim.crf.ser.gz", 
     r"C:\stanford-corenlp\stanford-ner\stanford-ner.jar", 
     encoding='utf-8') 

sentence = "Mr. Jhon was noted to have a cyst at his visit back in 2011." 

tokens = WordPunctTokenizer().tokenize(sentence) 

# We have to compute the token spans in a separate list 
# Notice that span_tokenize(sentence) returns a generator 
spans = list(WordPunctTokenizer().span_tokenize(sentence)) 

# enumerate will help us keep track of the token index in the token lists 
# enumerate will help us keep track of the token index in the token lists 
for i, ner in enumerate(st.tag(tokens)): 
    if ner[1] == "PERSON": 
        print(spans[i], ner) 
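To illustrate the idea without needing the Stanford jars or nltk installed, here is a minimal sketch: it computes each token's (start, end) character span with a regex that mimics WordPunctTokenizer, then pairs those spans with NER labels. The `tagged` list is a hypothetical stand-in for what `st.tag(tokens)` would return.

```python
import re

def span_tokenize(sentence):
    """Yield (start, end) character offsets for word and punctuation
    tokens, mimicking WordPunctTokenizer.span_tokenize()."""
    return [m.span() for m in re.finditer(r"\w+|[^\w\s]+", sentence)]

sentence = "Mr. Jhon was noted to have a cyst at his visit back in 2011."
spans = span_tokenize(sentence)
tokens = [sentence[s:e] for s, e in spans]

# Hypothetical stand-in tags: pretend the tagger labeled 'Jhon' as PERSON.
tagged = [(tok, "PERSON" if tok == "Jhon" else "O") for tok in tokens]

# zip keeps each tag aligned with the span of the same token.
for (start, end), (tok, label) in zip(spans, tagged):
    if label == "PERSON":
        print((start, end), tok)  # (4, 8) Jhon
```

Because the spans index back into the original string, `sentence[start:end]` always recovers the exact surface form of the entity.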