Spacy NLP - chunking with regular expressions

spaCy includes noun_chunks functionality for retrieving a set of noun phrases. The function english_noun_chunks (attached below) uses word.pos == NOUN.

from spacy.symbols import NOUN  # part-of-speech ID for nouns

def english_noun_chunks(doc):
    # Dependency labels that mark a token as the head of a noun phrase
    labels = ['nsubj', 'dobj', 'nsubjpass', 'pcomp', 'pobj',
              'attr', 'root']
    np_deps = [doc.vocab.strings[label] for label in labels]
    conj = doc.vocab.strings['conj']
    np_label = doc.vocab.strings['NP']
    for i in range(len(doc)):
        word = doc[i]
        if word.pos == NOUN and word.dep in np_deps:
            yield word.left_edge.i, word.i + 1, np_label
        elif word.pos == NOUN and word.dep == conj:
            head = word.head
            while head.dep == conj and head.head.i < head.i:
                head = head.head
            # If the head is an NP, and we're coordinated to it, we're an NP
            if head.dep in np_deps:
                yield word.left_edge.i, word.i + 1, np_label
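
For reference, a minimal usage sketch (it assumes a 2016-era English model is installed; the sentence is just an illustration):

import spacy

nlp = spacy.load('en')
doc = nlp(u'The quick brown fox jumps over the lazy dog')
# english_noun_chunks yields (start, end, label) tuples, not Span objects
for start, end, label in english_noun_chunks(doc):
    print(doc[start:end])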

I want to get chunks from a sentence that match a certain regular expression. For example: zero or more adjectives followed by one or more nouns.

{(<JJ>)*(<NN | NNS | NNP>)+} 
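
To make the intent concrete, here is a minimal sketch that translates this NLTK-style tag pattern into a plain regular expression over '<TAG>' signature strings (the signature convention mirrors the answer below; the example signatures are hypothetical):

import re

# Plain-regex equivalent of the chunk grammar above
pattern = re.compile(r'(<JJ>)*(<NN>|<NNS>|<NNP>)+')

# Hypothetical tag signatures for illustration
for sig in ['<NN>', '<JJ><JJ><NNS>', '<JJ>', '<DT><NN>']:
    print(sig, bool(pattern.match(sig)))
# '<NN>' and '<JJ><JJ><NNS>' match; '<JJ>' alone and '<DT><NN>' do not.
# Note: match() only anchors at the start; append '$' to require a full match.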

Is this possible without rewriting the english_noun_chunks function?

Answer


You could rewrite this function without losing any performance, since it is implemented in pure Python. But why not simply filter the chunks after retrieving them?

import re
import spacy

def filtered_chunks(doc, pattern):
    for chunk in doc.noun_chunks:
        # Join the tokens' fine-grained tags into a signature like '<JJ><NN>'
        signature = ''.join(['<%s>' % w.tag_ for w in chunk])
        if pattern.match(signature) is not None:
            yield chunk

nlp = spacy.load('en')
doc = nlp(u'Great work!')
pattern = re.compile(r'(<JJ>)*(<NN>|<NNS>|<NNP>)+')

print(list(filtered_chunks(doc, pattern)))
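
For u'Great work!', spaCy should tag 'Great' as JJ and 'work' as NN, giving the chunk the signature '<JJ><NN>'; the pattern accepts it, so the printed list should contain the span 'Great work'.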

What about the fact that this function gets translated to C by Cython? – Serendipity


You're right, the file has a '.pyx' extension, so if you rewrite it you will lose some performance. But do you need to rewrite it at all, or can you simply filter the final result? –
