I have this Python script in which I use the nltk library to parse, tokenize, tag and chunk some, let's say, random text from the web. How do I output the NLTK chunks to a file?

I need to format and write to a file the output of `chunked1`, `chunked2`, `chunked3`. These are of type `class 'nltk.tree.Tree'`. More specifically, I only need to write the lines that match the regular expressions `chunkGram1`, `chunkGram2`, `chunkGram3`.

How can I do that?
#! /usr/bin/python2.7
import nltk
import re
import codecs

xstring = ["An electronic library (also referred to as digital library or digital repository) is a focused collection of digital objects that can include text, visual material, audio material, video material, stored as electronic media formats (as opposed to print, micro form, or other media), along with means for organizing, storing, and retrieving the files and media contained in the library collection. Digital libraries can vary immensely in size and scope, and can be maintained by individuals, organizations, or affiliated with established physical library buildings or institutions, or with academic institutions.[1] The electronic content may be stored locally, or accessed remotely via computer networks. An electronic library is a type of information retrieval system."]

def processLanguage():
    for item in xstring:
        tokenized = nltk.word_tokenize(item)
        tagged = nltk.pos_tag(tokenized)
        #print tokenized
        #print tagged

        chunkGram1 = r"""Chunk: {<JJ\w?>*<NN>}"""
        chunkGram2 = r"""Chunk: {<JJ\w?>*<NNS>}"""
        chunkGram3 = r"""Chunk: {<NNP\w?>*<NNS>}"""

        chunkParser1 = nltk.RegexpParser(chunkGram1)
        chunked1 = chunkParser1.parse(tagged)
        chunkParser2 = nltk.RegexpParser(chunkGram2)
        chunked2 = chunkParser2.parse(tagged)
        chunkParser3 = nltk.RegexpParser(chunkGram3)
        chunked3 = chunkParser3.parse(tagged)
        #print chunked1
        #print chunked2
        #print chunked3

        # with codecs.open(r'path\to\file\output.txt', 'w', encoding='utf8') as outfile:
        #     for i, line in enumerate(chunked1):
        #         if "JJ" in line:
        #             outfile.write(line)
        #         elif "NNP" in line:
        #             outfile.write(line)

processLanguage()
When I try to run it, this is the error I get:
`Traceback (most recent call last):
File "sentdex.py", line 47, in <module>
processLanguage()
File "sentdex.py", line 40, in processLanguage
outfile.write(line)
File "C:\Python27\lib\codecs.py", line 688, in write
return self.writer.write(data)
File "C:\Python27\lib\codecs.py", line 351, in write
data, consumed = self.encode(object, self.errors)
TypeError: coercing to Unicode: need string or buffer, tuple found`
EDIT: After @Alvas's answer, I was able to do what I wanted. Now, however, I would like to know how to strip all non-ASCII characters from the text corpus. For example:
#store cleaned file into variable
with open('path\to\file.txt', 'r') as infile:
    xstring = infile.readlines()
infile.close

def remove_non_ascii(line):
    return ''.join([i if ord(i) < 128 else ' ' for i in line])

for i, line in enumerate(xstring):
    line = remove_non_ascii(line)

#tokenize and tag text
def processLanguage():
    for item in xstring:
        tokenized = nltk.word_tokenize(item)
        tagged = nltk.pos_tag(tokenized)
        print tokenized
        print tagged

processLanguage()
The snippet above was taken from another S/O answer, but it doesn't seem to work. What could be wrong? The error I get is:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position
not in range(128)
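Setting the decode error aside, one likely culprit is that the `for i, line in enumerate(xstring)` loop rebinds the local name `line` but never writes the cleaned text back into `xstring`, so the original non-ASCII content is what later reaches `word_tokenize`. A minimal sketch of the fix (the sample string here is invented for illustration):

```python
# Sketch: rebuild the list so the cleaned lines are actually kept,
# instead of discarding each remove_non_ascii() result as the loop above does.

def remove_non_ascii(line):
    # replace every character outside the 7-bit ASCII range with a space
    return "".join(ch if ord(ch) < 128 else " " for ch in line)

sample = [u"A caf\xe9 is nice.\n"]   # \xe9 stands in for any non-ASCII character
cleaned = [remove_non_ascii(line) for line in sample]

# Reading the file as unicode in the first place also sidesteps the
# 'ascii' codec error, e.g. io.open('file.txt', 'r', encoding='utf8')
```

This keeps the cleaned lines in a list that can then be fed to `processLanguage()` in place of `xstring`.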
An error traceback with line numbers would help identify what in the code causes the `TypeError`. – 2015-02-06 12:22:30

Your `line` contains a `Tree`, not a `string`. Try iterating over the contained strings. – Selcuk 2015-02-06 12:26:48

@Selcuk could you elaborate a bit? – kapelnick 2015-02-06 12:39:37
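Selcuk's suggestion can be sketched as follows: iterating a `Tree` yields subtrees and `(word, tag)` tuples, which is why `outfile.write(line)` raised the `TypeError`; each matched chunk has to be converted to a string first. The tree below is hand-built to stand in for `chunked1`/`chunked2`/`chunked3`, the output file name is a placeholder, and NLTK 3's `Tree.label()`/`leaves()` API is assumed:

```python
# Sketch: write only the subtrees produced by the chunk grammars to a file.
import io
from nltk.tree import Tree  # assumes NLTK 3's Tree API

def chunks_to_lines(tree, label="Chunk"):
    """Return one text line per subtree whose label matches `label`."""
    lines = []
    for subtree in tree.subtrees(filter=lambda t: t.label() == label):
        # each leaf is a (word, tag) tuple; keep only the words
        lines.append(" ".join(word for word, tag in subtree.leaves()))
    return lines

# tiny hand-built example standing in for the real parser output
chunked = Tree("S", [
    Tree("Chunk", [("digital", "JJ"), ("library", "NN")]),
    ("is", "VBZ"),
    Tree("Chunk", [("focused", "JJ"), ("collection", "NN")]),
])

with io.open("output.txt", "w", encoding="utf8") as outfile:
    for line in chunks_to_lines(chunked):
        outfile.write(line + u"\n")
```

The same helper can be called once per chunked tree to append all three grammars' matches to the same file.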