Spacy NLP - chunking with regular expressions

spaCy includes noun_chunks functionality for retrieving a set of noun phrases. The function english_noun_chunks (attached below) uses word.pos == NOUN.

from spacy.symbols import NOUN  # part-of-speech ID for nouns

def english_noun_chunks(doc):
    # Dependency labels that mark a token as the head of a noun phrase
    labels = ['nsubj', 'dobj', 'nsubjpass', 'pcomp', 'pobj',
              'attr', 'root']
    np_deps = [doc.vocab.strings[label] for label in labels]
    conj = doc.vocab.strings['conj']
    np_label = doc.vocab.strings['NP']
    for i in range(len(doc)):
        word = doc[i]
        if word.pos == NOUN and word.dep in np_deps:
            yield word.left_edge.i, word.i + 1, np_label
        elif word.pos == NOUN and word.dep == conj:
            head = word.head
            while head.dep == conj and head.head.i < head.i:
                head = head.head
            # If the head is an NP, and we're coordinated to it, we're an NP
            if head.dep in np_deps:
                yield word.left_edge.i, word.i + 1, np_label
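
For reference, a minimal usage sketch (it assumes a 2016-era English model is installed; the sentence is just an illustration):

import spacy

nlp = spacy.load('en')
doc = nlp(u'The quick brown fox jumps over the lazy dog')
# english_noun_chunks yields (start, end, label) tuples, not Span objects
for start, end, label in english_noun_chunks(doc):
    print(doc[start:end])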

I want to get chunks from a sentence that match a certain regular expression. For example: zero or more adjectives followed by one or more nouns.

{(<JJ>)*(<NN | NNS | NNP>)+} 
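
To make the intent concrete, here is a minimal sketch that translates this NLTK-style tag pattern into a plain regular expression over '<TAG>' signature strings (the signature convention mirrors the answer below; the example signatures are hypothetical):

import re

# Plain-regex equivalent of the chunk grammar above
pattern = re.compile(r'(<JJ>)*(<NN>|<NNS>|<NNP>)+')

# Hypothetical tag signatures for illustration
for sig in ['<NN>', '<JJ><JJ><NNS>', '<JJ>', '<DT><NN>']:
    print(sig, bool(pattern.match(sig)))
# '<NN>' and '<JJ><JJ><NNS>' match; '<JJ>' alone and '<DT><NN>' do not.
# Note: match() only anchors at the start; append '$' to require a full match.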

Is this possible without rewriting the english_noun_chunks function?

Answer


You could rewrite this function without losing any performance, since it is implemented in pure Python. But why not simply filter the chunks after retrieving them?

import re
import spacy

def filtered_chunks(doc, pattern):
    for chunk in doc.noun_chunks:
        # Join the tokens' fine-grained tags into a signature like '<JJ><NN>'
        signature = ''.join(['<%s>' % w.tag_ for w in chunk])
        if pattern.match(signature) is not None:
            yield chunk

nlp = spacy.load('en')
doc = nlp(u'Great work!')
pattern = re.compile(r'(<JJ>)*(<NN>|<NNS>|<NNP>)+')

print(list(filtered_chunks(doc, pattern)))
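
For u'Great work!', spaCy should tag 'Great' as JJ and 'work' as NN, giving the chunk the signature '<JJ><NN>'; the pattern accepts it, so the printed list should contain the span 'Great work'.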

What about the fact that this function gets translated to C by Cython? – Serendipity


You're right, the file has a '.pyx' extension, so if you rewrite it you will lose some performance. But do you need to rewrite it at all, or can you simply filter the final result? –
