2015-02-06 32 views
4

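I have this Python script in which I use the nltk library to parse, tokenize, tag and chunk some text, let's say random text from the web. How do I output NLTK chunks to a file?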

I need to format the output of chunked1, chunked2 and chunked3 and write it to a file. These have the type class 'nltk.tree.Tree'.

More specifically, I only need to write the lines that match the regular expressions chunkGram1, chunkGram2 and chunkGram3.

How can I do that?

#! /usr/bin/python2.7 

import nltk 
import re 
import codecs 

xstring = ["An electronic library (also referred to as digital library or digital repository) is a focused collection of digital objects that can include text, visual material, audio material, video material, stored as electronic media formats (as opposed to print, micro form, or other media), along with means for organizing, storing, and retrieving the files and media contained in the library collection. Digital libraries can vary immensely in size and scope, and can be maintained by individuals, organizations, or affiliated with established physical library buildings or institutions, or with academic institutions.[1] The electronic content may be stored locally, or accessed remotely via computer networks. An electronic library is a type of information retrieval system."] 


def processLanguage():
    for item in xstring:
        tokenized = nltk.word_tokenize(item)
        tagged = nltk.pos_tag(tokenized)
        #print tokenized
        #print tagged

        chunkGram1 = r"""Chunk: {<JJ\w?>*<NN>}"""
        chunkGram2 = r"""Chunk: {<JJ\w?>*<NNS>}"""
        chunkGram3 = r"""Chunk: {<NNP\w?>*<NNS>}"""

        chunkParser1 = nltk.RegexpParser(chunkGram1)
        chunked1 = chunkParser1.parse(tagged)

        chunkParser2 = nltk.RegexpParser(chunkGram2)
        chunked2 = chunkParser2.parse(tagged)

        chunkParser3 = nltk.RegexpParser(chunkGram3)
        chunked3 = chunkParser3.parse(tagged)

        #print chunked1
        #print chunked2
        #print chunked3

        # with codecs.open('path\to\file\output.txt', 'w', encoding='utf8') as outfile:
        #     for i, line in enumerate(chunked1):
        #         if "JJ" in line:
        #             outfile.write(line)
        #         elif "NNP" in line:
        #             outfile.write(line)


processLanguage()

When I try to run it, this is the error I get:

`Traceback (most recent call last): 
    File "sentdex.py", line 47, in <module> 
    processLanguage() 
    File "sentdex.py", line 40, in processLanguage 
    outfile.write(line) 
    File "C:\Python27\lib\codecs.py", line 688, in write 
    return self.writer.write(data) 
    File "C:\Python27\lib\codecs.py", line 351, in write 
    data, consumed = self.encode(object, self.errors) 
TypeError: coercing to Unicode: need string or buffer, tuple found` 

Edit: After @Alvas's answer I was able to do what I wanted. But now I would like to know how to strip all non-ASCII characters from a text corpus. For example:

#store cleaned file into variable 
with open('path\to\file.txt', 'r') as infile: 
    xstring = infile.readlines() 
infile.close 

def remove_non_ascii(line):
    return ''.join([i if ord(i) < 128 else ' ' for i in line])

for i, line in enumerate(xstring):
    line = remove_non_ascii(line)

#tokenize and tag text
def processLanguage():
    for item in xstring:
        tokenized = nltk.word_tokenize(item)
        tagged = nltk.pos_tag(tokenized)
        print tokenized
        print tagged
processLanguage() 

The above was taken from another answer on S/O, but it doesn't seem to work. What could be wrong? The error I get is:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 
not in range(128) 
+1

An error traceback with line numbers would help identify what in the code causes the 'TypeError'. – 2015-02-06 12:22:30

+1

Your 'line' contains a 'Tree', not a 'string'. Try iterating over the strings it contains. – Selcuk 2015-02-06 12:26:48

+0

@Selcuk Would you care to elaborate? – kapelnick 2015-02-06 12:39:37

Answers

6

Your code has several problems, though the main culprit is that your for loop does not modify the contents of xstring.

I will address all of the problems in your code here:

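You cannot write paths like this with a single \, because \t will be interpreted as a tab character and \f as a form feed character. You have to double the backslashes. I know this is only an example here, but this kind of confusion comes up often: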

with open('path\\to\\file.txt', 'r') as infile: 
    xstring = infile.readlines() 
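As a quick check of the escaping issue, here is a minimal sketch (Python 3) comparing the three spellings. The path is only the placeholder from the question, and the variable names are made up for illustration.

```python
# Placeholder path from the question; nothing is opened here.
plain = 'path\to\file.txt'      # \t and \f are treated as escape sequences
doubled = 'path\\to\\file.txt'  # each backslash is escaped explicitly
raw = r'path\to\file.txt'       # raw string: backslashes are kept literally

print('\t' in plain, '\f' in plain)  # True True
print(doubled == raw)                # True
print(len(plain), len(raw))          # 14 16
```

A raw string is often the least error-prone choice for Windows paths, although forward slashes generally work there as well.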

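The infile.close line that comes next in your code is wrong. It does not call the close method; it does nothing at all. Besides, your file has already been closed by the with clause. If you see this line in any answer anywhere, please just downvote that answer right away, with a comment saying that file.close is wrong and should be file.close().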
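To illustrate both points, here is a small self-contained sketch (Python 3). It writes a throwaway temporary file, so the file contents and variable names are assumptions made only for the demonstration.

```python
import os
import tempfile

# Create a small throwaway file so the example runs anywhere.
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'w') as tmp:
    tmp.write('first line\nsecond line\n')

with open(path, 'r') as infile:
    lines = infile.readlines()

# The with block has already closed the file at this point.
closed_after_with = infile.closed

# Without parentheses the method is only looked up, never called.
method_object = infile.close

print(closed_after_with)        # True
print(callable(method_object))  # True

os.remove(path)
```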

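The following should work, but you need to be aware that replacing every non-ASCII character with ' ' will break words such as naïve and café: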

def remove_non_ascii(line): 
    return ''.join([i if ord(i) < 128 else ' ' for i in line]) 

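But here is the reason why your code fails with a Unicode exception: you are not modifying the elements of xstring at all. You do compute the line with the non-ASCII characters removed, but that is a new value which is never stored back into the list: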

for i, line in enumerate(xstring): 
    line = remove_non_ascii(line) 

Instead, it should be:

for i, line in enumerate(xstring): 
    xstring[i] = remove_non_ascii(line) 

Or, my preferred and very Pythonic version:

xstring = [ remove_non_ascii(line) for line in xstring ] 
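To make the difference concrete, here is a minimal sketch (Python 3) using two made-up sample lines. It also shows one possible way to keep words such as naïve readable: fold_to_ascii is a hypothetical helper built on unicodedata.normalize, which is a different technique from the replacement above and will not suit every corpus.

```python
import unicodedata

def remove_non_ascii(line):
    # Same helper as above: each non-ASCII character becomes a space.
    return ''.join([ch if ord(ch) < 128 else ' ' for ch in line])

def fold_to_ascii(line):
    # Hypothetical alternative: decompose accented letters, then drop the
    # combining marks (and anything else that has no ASCII form).
    decomposed = unicodedata.normalize('NFKD', line)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

lines = [u'na\u00efve caf\u00e9', u'plain text']
original = list(lines)

# Rebinding the loop variable leaves the list untouched.
for i, line in enumerate(lines):
    line = remove_non_ascii(line)
loop_changed_list = lines != original

# Building a new list keeps the cleaned values.
cleaned = [remove_non_ascii(line) for line in lines]
folded = [fold_to_ascii(line) for line in lines]

print(loop_changed_list)  # False
print(cleaned)            # ['na ve caf ', 'plain text']
print(folded)             # ['naive cafe', 'plain text']
```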

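That said, these Unicode errors mostly occur only because you are using Python 2.7 to process pure Unicode text, which is something recent Python 3 versions handle far better. So if you are only just starting on this task, I recommend that you upgrade to Python 3.4+ soon.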
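Whichever version you use, decoding the file explicitly when reading it usually avoids the implicit 'ascii' decode. The sketch below assumes the corpus is UTF-8 encoded, which the question does not state (a 0xe2 byte is merely consistent with UTF-8 punctuation such as curly quotes). It writes its own throwaway file, and io.open behaves the same way on Python 2.7 and Python 3.

```python
import io
import os
import tempfile

# Throwaway UTF-8 file; U+2019 (right single quote) encodes to bytes
# that start with 0xe2, the byte named in the error message.
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
with io.open(path, 'w', encoding='utf8') as out:
    out.write(u'An author\u2019s library\n')

# Reading raw bytes shows where 0xe2 comes from.
with io.open(path, 'rb') as raw_file:
    data = raw_file.read()

# Decoding on read yields text, so nothing needs to be guessed later.
with io.open(path, 'r', encoding='utf8') as infile:
    xstring = infile.readlines()

os.remove(path)

print(b'\xe2' in data)                             # True
print(xstring == [u'An author\u2019s library\n'])  # True
```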

+0

Thanks for your answer. I'll take a closer look at it once I have some time. – kapelnick 2015-02-14 14:31:51

7

First, watch this video: https://www.youtube.com/watch?v=0Ef9GudbxXY


Now for the proper answer:

import re 
import io 

from nltk import pos_tag, word_tokenize, sent_tokenize, RegexpParser 


xstring = u"An electronic library (also referred to as digital library or digital repository) is a focused collection of digital objects that can include text, visual material, audio material, video material, stored as electronic media formats (as opposed to print, micro form, or other media), along with means for organizing, storing, and retrieving the files and media contained in the library collection. Digital libraries can vary immensely in size and scope, and can be maintained by individuals, organizations, or affiliated with established physical library buildings or institutions, or with academic institutions.[1] The electronic content may be stored locally, or accessed remotely via computer networks. An electronic library is a type of information retrieval system." 


chunkGram1 = r"""Chunk: {<JJ\w?>*<NN>}""" 
chunkParser1 = RegexpParser(chunkGram1) 

chunked = [chunkParser1.parse(pos_tag(word_tokenize(sent)))
           for sent in sent_tokenize(xstring)]

with io.open('outfile', 'w', encoding='utf8') as fout:
    for chunk in chunked:
        fout.write(str(chunk)+'\n\n')

[out]:

$ python test2.py
Traceback (most recent call last): 
    File "test2.py", line 18, in <module> 
    fout.write(str(chunk)+'\n\n') 
TypeError: must be unicode, not str 
$ python3 test2.py
$ head outfile
(S 
    An/DT 
    (Chunk electronic/JJ library/NN) 
    (/: 
    also/RB 
    referred/VBD 
    to/TO 
    as/IN 
    (Chunk digital/JJ library/NN) 
    or/CC 

If you have to stick with Python 2.7:

with io.open('outfile', 'w', encoding='utf8') as fout:
    for chunk in chunked:
        fout.write(unicode(chunk)+'\n\n')

[out]:

$ python test2.py
$ head outfile
(S 
    An/DT 
    (Chunk electronic/JJ library/NN) 
    (/: 
    also/RB 
    referred/VBD 
    to/TO 
    as/IN 
    (Chunk digital/JJ library/NN) 
    or/CC 
$ python3 test2.py
Traceback (most recent call last): 
    File "test2.py", line 18, in <module> 
    fout.write(unicode(chunk)+'\n\n') 
NameError: name 'unicode' is not defined 

And this is strongly recommended if you must stick with py2.7:

from six import text_type
with io.open('outfile', 'w', encoding='utf8') as fout:
    for chunk in chunked:
        fout.write(text_type(chunk)+'\n\n')

[out]:

$ python test2.py
$ head outfile
(S 
    An/DT 
    (Chunk electronic/JJ library/NN) 
    (/: 
    also/RB 
    referred/VBD 
    to/TO 
    as/IN 
    (Chunk digital/JJ library/NN) 
    or/CC 
$ python3 test2.py
$ head outfile
(S 
    An/DT 
    (Chunk electronic/JJ library/NN) 
    (/: 
    also/RB 
    referred/VBD 
    to/TO 
    as/IN 
    (Chunk digital/JJ library/NN) 
    or/CC 
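If only the matched chunks should end up in the file, rather than the whole tree, one possible approach is to walk the subtrees labelled Chunk. Iterating over a Tree directly yields (word, tag) tuples and nested subtrees rather than strings, which is what the TypeError in the question was complaining about. The sketch below assumes NLTK 3 on Python 3; chunk_phrases and the chunks.txt file name are hypothetical, and the tokens are tagged by hand so that no tagger model has to be downloaded.

```python
import io
import os
import tempfile

from nltk import RegexpParser

# Hand-tagged tokens stand in for pos_tag(word_tokenize(sent)).
tagged = [("An", "DT"), ("electronic", "JJ"), ("library", "NN"),
          ("is", "VBZ"), ("a", "DT"), ("focused", "JJ"), ("collection", "NN")]

tree = RegexpParser(r"""Chunk: {<JJ\w?>*<NN>}""").parse(tagged)

def chunk_phrases(tree, label="Chunk"):
    # Hypothetical helper: one space-joined string per subtree with this label.
    return [" ".join(word for word, tag in subtree.leaves())
            for subtree in tree.subtrees(filter=lambda t: t.label() == label)]

phrases = chunk_phrases(tree)

# Hypothetical output location, one matched chunk per line.
out_path = os.path.join(tempfile.gettempdir(), "chunks.txt")
with io.open(out_path, "w", encoding="utf8") as fout:
    for phrase in phrases:
        fout.write(phrase + u"\n")

print(phrases)  # ['electronic library', 'focused collection']
```

The same helper could be run once per grammar (chunkGram1, chunkGram2, chunkGram3) if each parser's matches need to be written out separately.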
+0

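I'll accept your answer because I value the feedback you provided. Maybe you can help me with one more small thing. Take a look at the edited part of the question. – kapelnick 2015-02-08 12:02:02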

+2

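I would answer your edit, but I think it is a separate question in its own right. It's better to ask another question before the SO moderators show up and delete your question for some reason. hahahaaa =) – alvas 2015-02-08 12:56:05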

+0

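Can you upload your file somewhere and then ask another question about data cleaning? If I don't know what the file looks like or what it is, I can't help. Depending on the file and its contents, there can be 101 ways to clean the data. – alvas 2015-02-08 12:59:20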