I have installed spaCy and the en_core_web_sm data. When I run my code, which is supposed to extract person names from random news articles, only about 50% of the extracted data is correct; the rest contains problems and errors. How can I improve the quality of spaCy's results?
import spacy
import io
from spacy.en import English
from spacy.parts_of_speech import NOUN
from spacy.parts_of_speech import ADP as PREP

nlp = English()

# Read and parse the article text first ('article.txt' is a placeholder);
# doc must exist before doc.ents can be read
with io.open('article.txt', encoding='utf-8') as f:
    text = f.read()
doc = nlp(text)

for entity in doc.ents:
    if entity.label_ == 'PERSON':
        print(entity.label, entity.label_, ' '.join(t.orth_ for t in entity))
For this article, for example: http://www.abc.net.au/news/2015-10-30/is-nauru-virtually-a-failed-state/6869648 I get these results:
(377, u'PERSON', u'Lukas Coch)\\nMap')
(377, u'PERSON', u'\\"never')
(377, u'PERSON', u'Julie Bishop')
(377, u'PERSON', u'Tanya Plibersek')
(377, u'PERSON', u'Mr Eames')
(377, u'PERSON', u'DFAT')
(377, u'PERSON', u'2015Andrew Wilkie')
(377, u'PERSON', u'Daniel Th\xfcrer')
(377, u'PERSON', u'Australian Aid')
(377, u'PERSON', u'Nauru')
(377, u'PERSON', u'Rule')
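I notice that several of the bad results (`Lukas Coch)\nMap`, `\"never`, `2015Andrew Wilkie`) still contain scraping artifacts: literal `\n` sequences glued to names, stray escaped quotes, and a year fused to a name. One thing I could try before blaming the model is normalizing the text before passing it to `nlp`. A minimal sketch (the regexes are my guesses at the artifacts in my scraped text, not part of any spaCy API):

```python
import re

def clean_article(text):
    # Turn literal "\n" sequences and real newlines into spaces,
    # so fragments like "Coch)\nMap" are split apart.
    text = text.replace('\\n', ' ').replace('\n', ' ')
    text = re.sub(r'\s+', ' ', text)
    # Insert a space where a year is fused to a following name,
    # e.g. "2015Andrew" -> "2015 Andrew".
    text = re.sub(r'(\d{4})([A-Z][a-z])', r'\1 \2', text)
    # Drop stray escaped quotes like \" left over from scraping.
    text = text.replace('\\"', '"')
    return text.strip()
```

Would cleaning the input like this be expected to noticeably improve the entity recognizer, or is the problem mainly in the model itself?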
How can the quality of these results be increased?
Would the full en_core_web_md model help?
Or are NLP libraries like this always going to be worse than deep-learning frameworks such as TensorFlow?