2016-12-16

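Multi-word expression recognition in spaCy

I have a text along with index entries, some of which denote important multi-word expressions (MWEs) that occur in the text (for example, "spongy bone" in a biology text). I want to use these entries to build a custom matcher in spaCy so that I can identify occurrences of the MWEs in the text. An additional requirement is that each matched occurrence must preserve the lemmatized form and the POS tag of every word that makes up the MWE.

I have looked at the existing spaCy examples that do something similar, but I can't seem to get the patterns right.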

Answer


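The spaCy documentation is not very explicit about using the Matcher class with multiple phrases, but there is a multi-phrase matching example in the GitHub repo.

I recently faced the same challenge and got it working with the code below. My text file contains one record per line, with the phrase and its description separated by '::'.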
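To make the record format concrete, here is a minimal sketch of the parsing step on its own, using only the standard library. The helper name `parse_rules` and the sample records are hypothetical, and lowercasing the phrases is an assumption made so that lookups can be case-insensitive.

```python
import io


def parse_rules(lines):
    """Parse 'phrase::description' records into a dict keyed by lowercased phrase."""
    rules = {}
    for line in lines:
        line = line.strip()
        if "::" not in line:
            continue  # skip blank or malformed records
        # Split on the first '::' only, so a description may itself contain '::'.
        phrase, desc = line.split("::", 1)
        rules[phrase.strip().lower()] = desc.strip()
    return rules


# Hypothetical sample records, standing in for the real file.
sample = io.StringIO(
    "spongy bone::Porous bone tissue found at the ends of long bones\n"
    "\n"
    "Compact bone::Dense outer layer of bone\n"
)
rules = parse_rules(sample)
print(rules)
```

Any iterable of lines works here, so an open file handle can be passed in place of the `StringIO` object.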

import spacy
from spacy.matcher import PhraseMatcher

# spaCy v3 API; needs the model: python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')
doc = nlp('Your text here')

# Build a dictionary from lowercased phrase to description,
# reading one 'phrase::description' record per line
rules_dict = dict()
with open('textfile', 'r', encoding='utf8') as f:
    for line in f:
        if '::' not in line:
            continue  # skip blank or malformed lines
        phrase, desc = line.rstrip('\n').split('::', 1)
        rules_dict[phrase.strip().lower()] = desc.strip()

# Match case-insensitively (on the LOWER attribute) using the PhraseMatcher class
matcher = PhraseMatcher(nlp.vocab, attr='LOWER')
for phrase in rules_dict:
    # Register each phrase under its own key so it can be recovered from match_id;
    # make_doc only tokenizes, which is all a LOWER pattern needs
    matcher.add(phrase, [nlp.make_doc(phrase)])

result = list()
for match_id, start, end in matcher(doc):
    label = nlp.vocab.strings[match_id]  # the phrase string used as the match key
    # doc[start:end] is the matched span; its tokens keep their lemma_ and pos_
    result.append({"start": start, "end": end, "phrase": label, "desc": rules_dict[label]})
print(result)
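The question also asks that each match keep the lemma and POS tag of the words inside the MWE. One way to get this, sketched here against the spaCy v3 API, is to slice the `Doc` with the match offsets and read the attributes from the tokens of the resulting span. The helper names `build_matcher` and `mwe_occurrences` are hypothetical, and matching on the `LOWER` attribute is an assumption made to get case-insensitive matches.

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Doc


def build_matcher(nlp, phrases):
    """Build a case-insensitive PhraseMatcher from plain phrase strings."""
    matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
    # make_doc only tokenizes, which is all a LOWER-based pattern needs.
    matcher.add("MWE", [nlp.make_doc(p) for p in phrases])
    return matcher


def mwe_occurrences(doc, matcher):
    """Return each match with the lemma and POS tag of every constituent token."""
    found = []
    for _match_id, start, end in matcher(doc):
        span = doc[start:end]  # a view onto the original tokens
        found.append({
            "start": start,
            "end": end,
            "text": span.text,
            "tokens": [(tok.text, tok.lemma_, tok.pos_) for tok in span],
        })
    return found
```

The lemmas and POS tags are only filled in when the `Doc` comes from a pipeline that includes a tagger and a lemmatizer, for example `nlp = spacy.load("en_core_web_sm")` followed by `mwe_occurrences(nlp(text), build_matcher(nlp, ["spongy bone"]))`. Because the span is a view onto the document rather than a copy, nothing about the tokens is lost by matching.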