NLTK custom tokenizer and tagger

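Here is my requirement. I want to tokenize and tag a paragraph in a way that lets me achieve the following.

Dates and time periods should be identified and tagged as DATE and TIME.
Known phrases in the paragraph should be identified and tagged as CUSTOM.
The remaining content should be tokenized and tagged by NLTK's default word_tokenize and pos_tag functions.

For example, the following sentence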

"They all like to go there on 5th November 2010, but I am not interested."

should be tokenized and tagged as follows, where the custom phrase is "I am not interested".

[('They', 'PRP'), ('all', 'VBP'), ('like', 'IN'), ('to', 'TO'), ('go', 'VB'), 
('there', 'RB'), ('on', 'IN'), ('5th November 2010', 'DATE'), (',', ','), 
('but', 'CC'), ('I am not interested', 'CUSTOM'), ('.', '.')]

Any suggestions would be helpful.

Source

2010-10-14 Software Enthusiastic

How did you solve this problem? I have a similar use case where I need to tag known phrases in different sentences with custom tags. – AgentX 2017-07-17 09:38:20

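The proper answer is to compile a large dataset tagged the way you want, then train a machine-learned chunker on it. If that is too time-consuming, the easy way is to run the POS tagger and post-process its output with regular expressions. Getting the longest match is the hard part here: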

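import re
from nltk import pos_tag, word_tokenize

s = "They all like to go there on 5th November 2010, but I am not interested."

# "..." stands for the remaining month names.
# The month and year are optional so that a partial date ("5th", "5th November")
# also matches; the token-by-token loop below relies on that.
DATE = re.compile(r'^[1-9][0-9]?(th|st|nd|rd)?( (January|...)( [12][0-9][0-9][0-9])?)?$')

def custom_tagger(sentence):
    tagged = pos_tag(word_tokenize(sentence))
    phrase = []          # tokens of the date candidate collected so far
    date_found = False   # True once the candidate has matched DATE

    i = 0
    while i < len(tagged):
        (w, t) = tagged[i]
        phrase.append(w)
        in_date = DATE.match(' '.join(phrase))
        date_found |= bool(in_date)
        if date_found and not in_date:              # end of date found
            yield (' '.join(phrase[:-1]), 'DATE')
            phrase = []                             # tagged[i] is examined again as a fresh token
            date_found = False
        elif date_found and i == len(tagged) - 1:   # date runs to the end of the sentence
            yield (' '.join(phrase), 'DATE')
            return
        else:
            i += 1
            if not in_date:
                yield (w, t)
                phrase = []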

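TODO: expand the DATE regex, insert code to search for CUSTOM phrases, make this more sophisticated by matching POS tags as well as tokens, and decide whether 5th on its own should count as a date. (Probably not, so filter out dates of length one that only contain an ordinal.)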
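One possible way to cover the CUSTOM part of the TODO is sketched below. This is not the code from the answer but a different technique: instead of growing a candidate token by token, it tries the longest window of tokens first at each position. The function name `merge_spans`, the full month list (filling in for the elided `January|...`) and the choice to require a month, so that a bare ordinal like "5th" is never a DATE, are all assumptions made for illustration. It works on an already tagged list, so it runs without NLTK.

```python
import re

# Assumption: the full month list replaces the elided "January|..." pattern.
MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
# The month is required here, so a lone ordinal such as "5th" is not a date.
DATE = re.compile(r"[1-9][0-9]?(st|nd|rd|th)? (%s)( [12][0-9]{3})?" % MONTHS)


def merge_spans(tagged, custom_phrases):
    """Merge dates and known phrases in a list of (token, tag) pairs.

    At each position the longest window of tokens that is a DATE or a
    known phrase wins; every other token keeps the tag it came with.
    """
    # Assumption: phrases are split on whitespace, so they must tokenize
    # the same way as the input (no contractions or attached punctuation).
    known = {tuple(p.split()) for p in custom_phrases}
    # A date spans at most 3 tokens (day, month, year).
    max_len = max([3] + [len(p) for p in known])

    out, i = [], 0
    while i < len(tagged):
        for n in range(min(max_len, len(tagged) - i), 0, -1):  # longest first
            words = tuple(w for w, _ in tagged[i:i + n])
            text = " ".join(words)
            if DATE.fullmatch(text):
                label = "DATE"
            elif words in known:
                label = "CUSTOM"
            else:
                continue
            out.append((text, label))
            i += n
            break
        else:
            # No window matched: keep the tagger's output for this token.
            out.append(tagged[i])
            i += 1
    return out


# With NLTK installed, the input would come from the default tagger:
#   merge_spans(pos_tag(word_tokenize(s)), ["I am not interested"])
```

Trying the longest window first sidesteps the longest-match bookkeeping, at the cost of re-joining a few tokens per position. Matching on POS tags as well as tokens, as the TODO suggests, is not covered here.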

Source

2010-10-14 13:33:53

Thanks for sharing the code. Let me try this and I will get back to you soon... – 2010-10-16 05:28:36

You should use nltk.RegexpParser to achieve your goal.

Reference: http://nltk.googlecode.com/svn/trunk/doc/book/ch07.html#code-chunker1

Source

2010-10-14 20:39:11 Neodawn

Let me go through it and I will get back to you... – 2010-10-18 06:53:37
