2012-06-07 137 views
0

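Removing punctuation from a Python list

I know there are plenty of examples of removing punctuation, but I'd like to know the most efficient way to do it. I have a list of words that I read from a txt file and split: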

wordlist = open('Tyger.txt', 'r').read().split() 

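What is the fastest way to check each word and remove any punctuation? I could do it with a pile of code, but I know that isn't the simplest way.

Thanks!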

+0

Could you provide sample input and output (or describe what counts as punctuation for you)? – Levon

+0

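Sure, no problem. The text file is a poem. The first two lines are: "Tyger! Tyger! burning bright / In the forests of the night," and I'd like the list to end up without the commas or exclamation marks. The set of punctuation I need to remove is "-,!?. Thanks! –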

+0

Looks like a duplicate of http://stackoverflow.com/questions/265960/best-way-to-strip-punctuation-from-a-string-in-python –

Answers

2

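I think the simplest approach is to extract just the words (runs of letters, digits and underscores) in the first place: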

import re 

with open("Tyger.txt") as f: 
    words = re.findall(r"\w+", f.read())
+0

How would this handle special characters that aren't punctuation? – luke14free

+0

This is useful, thank you very much. I really appreciate your help –

+1

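@EnglishGrad: Note that Sven uses the 'with' keyword to open the input file. Using a 'with' block is better than 'f = open() ... f.close()' and _much_ preferred over 'stuff = open().read()...'. In that last form, you cannot explicitly close the file after reading/writing –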
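To make the comment above concrete, here is a minimal sketch of the two styles. The helper names `read_words_with` and `read_words_inline` are made up for this illustration, and the sample file is written to a temporary directory rather than assuming `Tyger.txt` exists.

```python
import os
import tempfile


def read_words_with(path):
    # The with block closes the file as soon as the block exits,
    # even if read() or split() raises an exception.
    with open(path) as f:
        return f.read().split()


def read_words_inline(path):
    # Works, but the file object is never closed explicitly; when it
    # gets closed is left to the interpreter.
    return open(path).read().split()


# Hypothetical sample file standing in for Tyger.txt.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.txt")
with open(sample_path, "w") as f:
    f.write("Tyger! Tyger! burning bright\n")

words = read_words_with(sample_path)
print(words)
```

Both helpers return the same list; the difference is only in when the file handle is released.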

1

For example:

text = """ 
Tyger! Tyger! burning bright 
In the forests of the night, 
What immortal hand or eye 
Could frame thy fearful symmetry? 
""" 
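# First approach: collect runs of word characters with a regular expression.
import re
words = re.findall(r'\w+', text)

# Second approach: map every punctuation character to a space, then split.
# (str.maketrans is the Python 3 spelling; Python 2 used string.maketrans.)
import string
ps = string.punctuation
words = text.translate(str.maketrans(ps, ' ' * len(ps))).split()

The second one is much faster.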

+0

Note that your two solutions do different things. "fearful,symmetry" would end up as a single word with the second approach. –

+1

@SvenMarnach: Yes, correct. Still, translate is 4 times faster than re. – georg
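For anyone who wants to check both points (different results, different speed) on their own machine, here is a rough Python 3 sketch. The sample string is made up to expose the differences, and no particular speed ratio is assumed; the timings depend on the interpreter version and the input.

```python
import re
import string
import timeit

# Made-up sample: an underscore and some non-ASCII quotation marks.
text = "Could frame thy fearful_symmetry? «Tyger»"

# Approach 1: keep runs of word characters (letters, digits, underscore).
regex_words = re.findall(r"\w+", text)

# Approach 2: turn every ASCII punctuation character into a space, then split.
table = str.maketrans(string.punctuation, " " * len(string.punctuation))
translate_words = text.translate(table).split()

print(regex_words)      # underscore kept, non-ASCII quotes dropped
print(translate_words)  # underscore split on, non-ASCII quotes kept

# Rough timing of each approach over the same input.
regex_time = timeit.timeit(lambda: re.findall(r"\w+", text), number=10000)
translate_time = timeit.timeit(lambda: text.translate(table).split(), number=10000)
print(regex_time, translate_time)
```

The two lists differ because `\w` treats `_` as a word character while `string.punctuation` lists it as punctuation, and because `string.punctuation` only covers ASCII characters.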

1

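I would go with something like this:

import re

with open("Tyger.txt") as f:
    # Split on the unwanted characters and rejoin the pieces with spaces.
    print(" ".join(re.split(r"[\-,!?.]", f.read())))

It removes only what really needs removing and avoids the overhead of over-matching.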
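If only the asker's specific set has to go, another option is a translation table that deletes those characters. This is a swapped-in technique, not what the answer above uses, and it is a Python 3 sketch: the variable names are made up, and the set is the one quoted in the question's comments.

```python
# The punctuation set named in the question's comments.
unwanted = '"-,!?.'

# With three arguments, str.maketrans maps each character of the third
# argument to None, which makes translate() delete it.
delete_table = str.maketrans("", "", unwanted)

line = "Tyger! Tyger! burning bright"
words = [word.translate(delete_table) for word in line.split()]
print(words)
```

Since the characters are deleted rather than replaced by spaces, words are never split in the middle.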

1
>>> import re 

>>> the_tyger 
'\n Tyger! Tyger! burning bright \n In the forests of the night, \n What immortal hand or eye \n Could frame thy fearful symmetry? \n \n In what distant deeps or skies \n Burnt the fire of thine eyes? \n On what wings dare he aspire? \n What the hand dare sieze the fire? \n \n And what shoulder, & what art. \n Could twist the sinews of thy heart? \n And when thy heart began to beat, \n What dread hand? & what dread feet? \n \n What the hammer? what the chain? \n In what furnace was thy brain? \n What the anvil? what dread grasp \n Dare its deadly terrors clasp? \n \n When the stars threw down their spears, \n And watered heaven with their tears, \n Did he smile his work to see? \n Did he who made the Lamb make thee? \n \n Tyger! Tyger! burning bright \n In the forests of the night, \n What immortal hand or eye \n Dare frame thy fearful symmetry? \n ' 

>>> print(re.sub(r'["\-,!?.]', '', the_tyger))

Prints:

Tyger Tyger burning bright
In the forests of the night
What immortal hand or eye
Could frame thy fearful symmetry

In what distant deeps or skies
Burnt the fire of thine eyes
On what wings dare he aspire
What the hand dare sieze the fire

And what shoulder & what art
Could twist the sinews of thy heart
And when thy heart began to beat
What dread hand & what dread feet

What the hammer what the chain 
In what furnace was thy brain 
What the anvil what dread grasp 
Dare its deadly terrors clasp 

When the stars threw down their spears 
And watered heaven with their tears 
Did he smile his work to see 
Did he who made the Lamb make thee 

Tyger Tyger burning bright 
In the forests of the night 
What immortal hand or eye 
Dare frame thy fearful symmetry 
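A note on hyphens inside a character class: between two characters, a bare `-` is read as a range, so it has to be escaped (or placed first or last) to match a literal hyphen. A small sketch with a made-up sample string:

```python
import re

sample = 'hand? & "well-known" art.'

# Here "-, is read as the range from '"' to ',', which happens to
# include '&' but not '-' itself.
as_range = re.sub(r'["-,!?.]', "", sample)

# Escaping the hyphen makes it a literal member of the class.
as_literal = re.sub(r'["\-,!?.]', "", sample)

print(repr(as_range))
print(repr(as_literal))
```

With the range form the ampersand disappears and the hyphen survives; with the escaped form it is the other way round.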

Or with a file:

>>> with open('tyger.txt', 'r') as WmBlake:
...     print(re.sub(r'["\-,!?.]', '', WmBlake.read()))

If you want to create a list of lines:

>>> lines = []
>>> with open('tyger.txt', 'r') as WmBlake:
...     for line in WmBlake:
...         lines.append(re.sub(r'["\-,!?.]', '', line))
+1

+1 for posting the whole poem ;) although now it looks more like Bukowski than Blake. – georg