Python: creating a filtered list from a text document

Whenever I try to run this program, Python IDLE tells me it is not responding and has to close. Any suggestions on how to improve this code so that it works the way I want?

#open text document
#filter out words in the document by appending to an empty list
#get rid of words that show up more than once
#get rid of words that aren't all lowercase
#get rid of words that end in substring 'xx'
#get rid of words that are less than 5 characters
#print list

fin = open('example.txt')
L = []
for word in fin:
    if len(word) >= 5:
        L.append(word)
    if word != word:
        L.append(word)
    if word[-2:-1] != 'xx':
        L.append(word)
    if word == word.lower():
        L.append(word)
print(L)

2011-09-30 SWebbz

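How big is example.txt? – geoffspear

(Also, your logic isn't finished: you aren't getting rid of words that fail one of the criteria; you are keeping one copy of each word for every criterion it doesn't violate.) – geoffspear

The text document is very large, several chapters of a book. And yes, I need a lot of help :( – SWebbz

I did your homework, I was bored. There might be a bug.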

homework_a_plus = []
#open text document
with open('example.txt', 'r') as fin:
    for word in fin:
        #iterating over a file yields lines, so strip the trailing newline
        word = word.strip()
        #get rid of words that show up more than once
        if word in homework_a_plus:
            continue
        #get rid of words that aren't all lowercase
        #(a continue inside a nested "for c in word" loop would only skip
        #that character, not the word, so compare the whole word instead)
        if word != word.lower():
            continue
        #get rid of words that end in substring 'xx'
        if word[-2:] == 'xx':
            continue
        #get rid of words that are less than 5 characters
        if len(word) < 5:
            continue
        homework_a_plus.append(word)
print(homework_a_plus)

Edit: As Wooble said, the logic in the code you posted is off. Compare your code with mine and I think you'll see why you had problems.

2011-09-30 17:59:16 Jake

I see some of the mistakes I made, thanks! – SWebbz

Some general help:

Instead of

fin = open('example.txt')

you should use

with open('example.txt', 'r') as fin:

and then indent the rest of the code, although your version will work.

L = [] 
for word in fin:

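This does not iterate over words, but over lines. If there is one word per line, each one will still have a newline at the end, so you should do

word = word.rstrip()

to strip any whitespace from the end of the word. If you really want to handle the text one word at a time, you need nested for loops, like:

for line in fin:
    for word in line.split():

and then put the logic inside the inner loop.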
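To make the line-versus-word difference concrete, here is a small sketch. It uses io.StringIO with invented sample text as a stand-in for the open file; that stand-in and the sample text are assumptions for illustration, not part of the original program.

```python
import io

# Stand-in for open('example.txt'); the sample text is invented.
fin = io.StringIO("apple banana\ncherry\n")

# Iterating over the file object yields whole lines, newline included.
lines = [line for line in fin]

fin.seek(0)  # rewind so the same text can be read again

# Splitting each line on whitespace yields individual words, no newline.
words = []
for line in fin:
    for word in line.split():
        words.append(word)

print(lines)  # ['apple banana\n', 'cherry\n']
print(words)  # ['apple', 'banana', 'cherry']
```

Note that len('cherry\n') is 7, not 6, which is why a length test on unstripped lines is off by one.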

if len(word) >= 5: 
    L.append(word)

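With the whitespace stripped, this will add any word of five letters or more to the list.

if word != word:
    L.append(word)

word will always equal word, so this does nothing. If you want to eliminate duplicates, make L a set() and use L.add(word) instead of L.append(word) for the words you want to add (assuming order doesn't matter).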
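A minimal sketch of the set idea, using an invented sample list in place of the words read from the file:

```python
# Invented sample words; in the real program these would come from the file.
sample = ['hello', 'world', 'hello', 'again']

L = set()
for word in sample:
    L.add(word)  # adding a word that is already present has no effect

# A set has no defined order, so sort it for a stable display.
print(sorted(L))  # ['again', 'hello', 'world']
```

This keeps one copy of every word. If the goal is to drop every word that occurs more than once, a set alone is not enough and the occurrences have to be counted.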

if word[-2:-1] != 'xx': 
    L.append(word)

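If you want to see whether it ends with 'xx', use

if not word.endswith('xx'):

instead, or word[-2:] without the -1; otherwise you are only comparing the next-to-last letter instead of the whole ending.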
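The slicing difference is easy to check on a single word. The sample word below is just an illustration:

```python
word = 'approxx'

next_to_last = word[-2:-1]  # stops before the last character
last_two = word[-2:]        # runs to the end of the string

print(next_to_last)               # x
print(last_two)                   # xx
# A one-character slice can never equal 'xx', so this test is always True:
print(word[-2:-1] != 'xx')        # True
print(word.endswith('xx'))        # True
```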

if word == word.lower(): 
    L.append(word)

If the word is all lowercase, this adds the word to the list.

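Keep in mind that all of these if tests are applied to every word, so you add the word to the list once for each test it passes. If you only want to add it once, you can use elif instead of if for every test except the first.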
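A small sketch of the difference, using one invented word that passes two of the tests:

```python
word = 'hello'  # passes both the length test and the lowercase test

# Independent if tests: the word is appended once per test it passes.
with_if = []
if len(word) >= 5:
    with_if.append(word)
if word == word.lower():
    with_if.append(word)

# if/elif chain: only the first matching branch runs.
with_elif = []
if len(word) >= 5:
    with_elif.append(word)
elif word == word.lower():
    with_elif.append(word)

print(with_if)    # ['hello', 'hello']
print(with_elif)  # ['hello']
```

Note that the elif chain still keeps a word that passes any one test. To keep only words that pass all of them, the conditions would have to be joined with and.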

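Your comments also imply that you are somehow "getting rid of" words by adding them to the list. You aren't. You are keeping the ones you add to the list, and the rest simply go away; you are not changing the file in any way.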
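Putting these points together, one possible sketch looks like this. The function name filter_words and the sample lines are hypothetical, and the sketch assumes that "get rid of words that show up more than once" means keeping a single copy of each word, which is only one reading of the requirement.

```python
def filter_words(lines):
    """Return the words that pass every test, once each, in first-seen order."""
    kept = []
    seen = set()  # fast membership test for words already kept
    for line in lines:
        for word in line.split():
            if len(word) < 5:
                continue  # too short
            if word != word.lower():
                continue  # contains an uppercase letter
            if word.endswith('xx'):
                continue  # ends in 'xx'
            if word in seen:
                continue  # already kept once
            seen.add(word)
            kept.append(word)
    return kept

# Invented sample lines standing in for the lines of example.txt.
sample_lines = ["bubble Spain approxx\n", "moon bubble orbital\n"]
result = filter_words(sample_lines)
print(result)  # ['bubble', 'orbital']
```

Because a file object yields lines, the same function could be called on fin inside a with open('example.txt') as fin: block.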

2011-09-30 18:01:16 agf

Thanks for breaking it down for me! :) – SWebbz

words = [inner for outer in [line.split() for line in open('example.txt')] for inner in outer]

for word in words[:]:
    if words.count(word) > 1 or word.lower() != word or word[-2:] == 'xx' or len(word) < 5:
        words.remove(word)
print(words)

2011-09-30 18:04:34 infrared

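I tried something like this and it didn't do anything, but your approach makes sense. – SWebbz

If you want to write more filters... I would take a slightly different approach.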

fin = open('example.txt', 'r')
seenList = []
for line in fin:
    for word in line.split():
        if word in seenList: continue
        if word[-2:] == 'xx': continue
        if word.lower() != word: continue
        if len(word) < 5: continue
        seenList.append(word)
        print(word)

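This has the advantage of showing you each word as it is output. If you want the output to go to a file, modify the print line accordingly or use shell redirection.

Edit: If you really don't want to print any repeated words at all (the version above only skips every instance after the first), then something like this works...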

fin = open('example.txt', 'r')
seenList = []
for line in fin:
    for word in line.split():
        if word in seenList:
            seenList.remove(word)
            continue
        if word[-2:] == 'xx': continue
        if word.lower() != word: continue
        if len(word) < 5: continue
        seenList.append(word)

print(seenList)

2011-09-30 18:05:46 jkerian

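Thanks, this is very helpful! – SWebbz

Make 'seenList' a 'defaultdict(int)' and do 'seenList[word] += 1'. The current version will still print words that occur multiple times. If the OP really wants to show each word (even the repeated ones) once, make 'seenList' a 'set' and add the words without testing whether they are already there. – patrys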
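The counting idea from this comment can be sketched as follows. The function name words_seen_once and the sample lines are hypothetical; the sketch assumes the goal is to keep only words that occur exactly once among those passing the other filters.

```python
from collections import defaultdict

def words_seen_once(lines):
    """Return filtered words that occur exactly once, in first-seen order."""
    counts = defaultdict(int)  # missing words start at 0
    order = []                 # remembers the order of first appearance
    for line in lines:
        for word in line.split():
            if word[-2:] == 'xx': continue
            if word.lower() != word: continue
            if len(word) < 5: continue
            if counts[word] == 0:
                order.append(word)
            counts[word] += 1
    return [word for word in order if counts[word] == 1]

# Invented sample: 'bubble' occurs three times, so it must be dropped.
sample_lines = ["bubble chime bubble\n", "orbital bubble suxx\n"]
once = words_seen_once(sample_lines)
print(once)  # ['chime', 'orbital']
```

A word that occurs three times, like 'bubble' here, is removed on its second appearance and added back on its third by the remove-on-repeat version, whereas counting drops it regardless of how often it repeats.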

import re

def by_words(it):
    # raw string so the backslash in \w reaches the regex engine unchanged
    pat = re.compile(r'\w+')
    for line in it:
        for word in pat.findall(line):
            yield word

def keepers(it):
    words = set()
    for s in it:
        if len(s) >= 5 and s == s.lower() and not s.endswith('xx'):
            words.add(s)
    return list(words)

To get 5 words from War and Peace:

from urllib import urlopen 
source = urlopen('http://www.gutenberg.org/ebooks/2600.txt.utf8') 
print keepers(by_words(source))[:5]

prints

['raining', 'divinely', 'hordes', 'nunnery', 'parallelogram']

This doesn't take much memory. War and Peace has only 14,361 words that meet your criteria. The iterators work on very small chunks.

2011-09-30 18:34:02

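This is a nice solution, but you should note that it returns one instance of each word rather than only the words that appear exactly once. "Get rid of words that show up more than once" is perhaps ambiguous, but your interpretation differs from mine. – Gabe

The simplest way to do it is with a regular expression: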

import re 

li = ['bubble', 'iridescent', 'approxx', 'chime', 
     'Azerbaidjan', 'moon', 'astronomer', 'glue', 'bird', 
     'plan_ary', 'suxx', 'moon', 'iridescent', 'magnitude', 
     'Spain', 'through', 'macGregor', 'iridescent', 'ben', 
     'glomoxx', 'iridescent', 'orbital'] 

# raw strings keep the regex backslashes (\S, \w, \Z) intact
reg1 = re.compile(r'(?!\S*?[A-Z_]\S*(?=\Z))'
                  r'\w{5,}'
                  r'(?<!xx)\Z')

print(set(filter(reg1.match, li)))

# result: 

set(['orbital', 'astronomer', 'magnitude', 'through', 'iridescent', 'chime', 'bubble'])

If the data is not in a list but in a string:

ss = '''bubble iridescent approxx chime 
Azerbaidjan moon astronomer glue bird 
plan_ary suxx moon iridescent magnitude 
Spain through macGregor iridescent ben 
glomoxx iridescent orbital''' 

print(set(filter(reg1.match, ss.split())))

or

# raw strings keep the regex backslashes (\s, \A, \S, \w, \Z) intact
reg2 = re.compile(r'(?:(?<=\s)|(?<=\A))'
                  r'(?!\S*?[A-Z_]\S*(?=\s|\Z))'
                  r'\w{5,}'
                  r'(?<!xx)'
                  r'(?=\s|\Z)')

print(set(reg2.findall(ss)))
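For readers who find the compact pattern hard to follow, here is a sketch of the first pattern written with re.VERBOSE so that each part can carry a comment. The name reg1_verbose and the shortened sample list are introduced here for illustration only; the pattern itself is the same as reg1.

```python
import re

reg1_verbose = re.compile(r"""
    (?!\S*?[A-Z_]\S*(?=\Z))  # fail if an uppercase letter or underscore occurs anywhere
    \w{5,}                   # at least five word characters
    (?<!xx)                  # the text matched so far must not end in xx
    \Z                       # and the match must reach the end of the string
""", re.VERBOSE)

# Shortened, invented sample taken from the kinds of words discussed above.
sample = ['bubble', 'approxx', 'chime', 'Spain', 'moon',
          'plan_ary', 'macGregor', 'bubble', 'orbital']

kept = set(filter(reg1_verbose.match, sample))
print(sorted(kept))  # ['bubble', 'chime', 'orbital']
```

As with the original pattern, collecting the matches in a set keeps one copy of each repeated word rather than dropping repeated words entirely.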

2011-10-01 11:20:09 eyquem
