Extracting specific information from text

I want to pull some data out of text files. I've decided to use the Natural Language Toolkit (NLTK) for this, but I'm open to suggestions if there is a better way to do it.

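Here is an example:

I need a flight from New York NY to San Francisco CA

From this text, I want to get the city and state of both the origin and the destination.

This is what I have so far: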

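from nltk.corpus import PlaintextCorpusReader
from nltk.text import Text

def readfiles():
    # Raw string so the backslashes in the Windows path are taken literally.
    corpus_root = r'C:\prototype\emails'
    w = PlaintextCorpusReader(corpus_root, '.*')
    t = Text(w.words())
    # concordance() prints its matches itself and returns None,
    # so it is called directly rather than wrapped in print.
    print("--- to ----")
    t.concordance("to")

    print("--- from ----")
    t.concordance("from")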

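I can read the text from my input files and then use the concordance method to find every occurrence of these words. What I want to extract is the city and state information that follows 'to' and 'from'.

So the question is: what is the best way to look at the text that comes after each instance of "to" and "from"?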

Source

2011-12-28 dev.e.loper

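Picking places like these out of text is called "named entity recognition". NLTK can do this, although you may want to tailor your own version based on a gazetteer (GeoNames.org is a possible source of lookup data). – winwaed 2011-12-29 00:33:06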
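
The gazetteer idea in the comment above can be sketched without any NLP library. This is only a minimal illustration: `GAZETTEER`, `STATES`, and `find_places` are hypothetical names, and the tiny hand-written tables stand in for data that would in practice be loaded from a source such as GeoNames.

```python
# Hypothetical miniature gazetteer (an assumption for illustration);
# a real one could be built from GeoNames data.
GAZETTEER = {
    ("new", "york"): "New York",
    ("san", "francisco"): "San Francisco",
    ("boston",): "Boston",
}
STATES = {"NY", "CA", "MA"}  # illustrative subset of state abbreviations
MAX_WORDS = max(len(key) for key in GAZETTEER)


def find_places(text):
    """Return (city, state) pairs in order of appearance; state may be None."""
    tokens = text.replace(",", " ").split()
    places = []
    i = 0
    while i < len(tokens):
        match = None
        # Prefer the longest city name that starts at position i.
        for n in range(MAX_WORDS, 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in GAZETTEER:
                match = (GAZETTEER[key], len(key))
                break
        if match is None:
            i += 1
            continue
        city, used = match
        i += used
        state = None
        # Accept a known state abbreviation directly after the city name.
        if i < len(tokens) and tokens[i].upper() in STATES:
            state = tokens[i].upper()
            i += 1
        places.append((city, state))
    return places


print(find_places("I need a flight from New York NY to San Francisco CA"))
```

A lookup like this only finds the places; it does not say which one is the origin and which the destination, so the word in front of each match ("from" or "to") would still need to be checked.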

Maybe you'd be better off reading the file line by line?
Then something simple like:

cityState = dataAfterTo.split(",") 
city = cityState[0] 
state = cityState[1].split()[0]

Unless you're dealing with user-generated content, that is.
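
The snippet above leaves out how `dataAfterTo` is obtained. One possible way to fill that in, as a rough sketch: split each line into words and cut it at "from" and "to". The helper names `split_route` and `city_and_state` are made up for this example, and the code assumes each line holds at most one "from ... to ..." request.

```python
def split_route(line):
    """Return (text_after_from, text_after_to) for one line, or None."""
    words = line.split()
    lowered = [w.lower() for w in words]
    if "from" not in lowered:
        return None
    start = lowered.index("from")
    # Look for "to" only after "from", so an earlier "to" is not picked up.
    try:
        end = lowered.index("to", start + 1)
    except ValueError:
        return None
    data_after_from = " ".join(words[start + 1:end])
    data_after_to = " ".join(words[end + 1:])
    return data_after_from, data_after_to


def city_and_state(chunk):
    """Split 'City, ST' on the comma; state is None if there is no comma."""
    city, _, rest = chunk.partition(",")
    parts = rest.split()
    return city.strip(), (parts[0] if parts else None)


origin, destination = split_route(
    "I need a flight from New York, NY to San Francisco, CA"
)
print(city_and_state(origin))
print(city_and_state(destination))
```

As noted, this depends on the comma: for "San Francisco CA" the whole chunk comes back as the city and the state is `None`.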

Source

2011-12-28 16:39:14 Brian

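Yes, it is user-generated, so there may or may not be a ',' separating the city and state. I was hoping to find a more elegant solution using Python itself or a library. – 2011-12-28 21:08:27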
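
For input where the comma may or may not be present, a regular expression with an optional comma is one option. The sketch below is not from the thread; the pattern `ROUTE` and the function `parse_route` are illustrative names, and it assumes the state is always written as a two-letter uppercase abbreviation, which real user input may not respect.

```python
import re

# Assumption: each place is "<city>[,] <two uppercase letters>".
ROUTE = re.compile(
    r"\b[Ff]rom\s+(?P<from_city>.+?),?\s+(?P<from_state>[A-Z]{2})\s+"
    r"[Tt]o\s+(?P<to_city>.+?),?\s+(?P<to_state>[A-Z]{2})\b"
)


def parse_route(text):
    """Return {'from': (city, state), 'to': (city, state)} or None."""
    m = ROUTE.search(text)
    if m is None:
        return None
    return {
        "from": (m.group("from_city"), m.group("from_state")),
        "to": (m.group("to_city"), m.group("to_state")),
    }


# Both spellings, with and without commas, are handled the same way.
print(parse_route("I need a flight from New York NY to San Francisco CA"))
print(parse_route("I need a flight from New York, NY to San Francisco, CA"))
```

The city groups are matched lazily, so each one stops at the first point where an optional comma, whitespace, and two capital letters follow. Messier input (lowercase states, spelled-out state names) would need a lookup table or a named entity recognizer instead.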
