
In a sentence that contains a hashtag, such as a tweet, spaCy's tokenizer splits the hashtag into two tokens. How can I get spaCy to tokenize the hashtag as a single token?

import spacy 
nlp = spacy.load('en') 
doc = nlp(u'This is a #sentence.') 
[t for t in doc] 

Output:

[This, is, a, #, sentence, .] 

I would like the hashtag to be tokenized like this:

[This, is, a, #sentence, .] 

Is this possible?

Thanks

Answers

  1. You can do some pre- and post-processing of the string, which lets you bypass the '#'-based tokenization, and it is easy to implement. e.g.
>>> import re
>>> import spacy
>>> nlp = spacy.load('en')
>>> sentence = u'This is my twitter update #MyTopic'
>>> parsed = nlp(sentence)
>>> [token.text for token in parsed]
[u'This', u'is', u'my', u'twitter', u'update', u'#', u'MyTopic']
>>> new_sentence = re.sub(r'#(\w+)', r'ZZZPLACEHOLDERZZZ\1', sentence)
>>> new_sentence
u'This is my twitter update ZZZPLACEHOLDERZZZMyTopic'
>>> parsed = nlp(new_sentence)
>>> [token.text for token in parsed]
[u'This', u'is', u'my', u'twitter', u'update', u'ZZZPLACEHOLDERZZZMyTopic']
>>> [x.replace(u'ZZZPLACEHOLDERZZZ','#') for x in [token.text for token in parsed]]
[u'This', u'is', u'my', u'twitter', u'update', u'#MyTopic']
  2. You could try setting custom separators in spaCy's tokenizer. I'm not aware of a built-in way to do that; a possible sketch is shown below.
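As a rough illustration of option 2 (not from the original answer): a minimal sketch that removes '#' from the tokenizer's prefix characters so it is never split off, assuming a spaCy version where nlp.Defaults.prefixes and spacy.util.compile_prefix_regex are available (roughly v2.x):

import spacy
from spacy.util import compile_prefix_regex

nlp = spacy.load('en')

# Rebuild the prefix regex without '#', so the tokenizer no longer splits it
# off the start of a token (assumes the v2.x Defaults/util API).
prefixes = [p for p in nlp.Defaults.prefixes if p != '#']
nlp.tokenizer.prefix_search = compile_prefix_regex(prefixes).search

doc = nlp(u'This is a #sentence.')
print([t.text for t in doc])  # expected: ['This', 'is', 'a', '#sentence', '.']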

UPDATE: You can use a regex to find the spans of the tokens you want to keep as single tokens, and then retokenize them with the span.merge method mentioned here: https://spacy.io/docs/api/span#merge

Merge example:

>>> import spacy 
>>> import re 
>>> nlp = spacy.load('en') 
>>> my_str = u'Tweet hashtags #MyHashOne #MyHashTwo' 
>>> parsed = nlp(my_str) 
>>> [(x.text,x.pos_) for x in parsed] 
[(u'Tweet', u'PROPN'), (u'hashtags', u'NOUN'), (u'#', u'NOUN'), (u'MyHashOne', u'NOUN'), (u'#', u'NOUN'), (u'MyHashTwo', u'PROPN')] 
>>> indexes = [m.span() for m in re.finditer('#\w+',my_str,flags=re.IGNORECASE)] 
>>> indexes 
[(15, 25), (26, 36)] 
>>> for start,end in indexes: 
...  parsed.merge(start_idx=start,end_idx=end) 
... 
#MyHashOne 
#MyHashTwo 
>>> [(x.text,x.pos_) for x in parsed] 
[(u'Tweet', u'PROPN'), (u'hashtags', u'NOUN'), (u'#MyHashOne', u'NOUN'), (u'#MyHashTwo', u'PROPN')] 
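For reference, the same merge can also be written with the retokenize() context manager available in newer spaCy releases (v2.1+), where Doc.merge is deprecated; this is a sketch, not part of the original answer:

import re
import spacy

nlp = spacy.load('en')
my_str = u'Tweet hashtags #MyHashOne #MyHashTwo'
doc = nlp(my_str)

# Merge each '#word' match into a single token (assumes spaCy >= 2.1).
with doc.retokenize() as retokenizer:
    for m in re.finditer(r'#\w+', my_str):
        span = doc.char_span(*m.span())
        if span is not None:
            retokenizer.merge(span)

print([(t.text, t.pos_) for t in doc])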

This is more of an add-on to the great answer by @DhruvPathak, and a shameless copy from the GitHub thread linked below (and @csvance's even better answer there). spaCy (since v2.0) has the add_pipe method, which means you can define @DhruvPathak's great answer in a function and add that step (conveniently) to your nlp processing pipeline, as below.

Quote starts here:

def hashtag_pipe(doc):
    merged_hashtag = False
    while True:
        for token_index, token in enumerate(doc):
            if token.text == '#':
                if token.head is not None:
                    start_index = token.idx
                    end_index = start_index + len(token.head.text) + 1
                    if doc.merge(start_index, end_index) is not None:
                        merged_hashtag = True
                        break
        if not merged_hashtag:
            break
        merged_hashtag = False
    return doc

nlp = spacy.load('en') 
nlp.add_pipe(hashtag_pipe) 

doc = nlp("twitter #hashtag") 
assert len(doc) == 2 
assert doc[0].text == 'twitter' 
assert doc[1].text == '#hashtag' 

Quote ends here; check out the full thread: how to add hashtags to the part of speech tagger #503

PS It's obvious when you read the code, but for the copy & pasters: don't disable the parser :) (the merge relies on token.head from the dependency parse to find the word following '#').
