2014-02-24 37 views
1

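Python - remove everything from a string except certain characters

I don't know whether this question has already been asked, but I couldn't find it, so here it is: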

randomList = ["ACGT","A#$..G","..,/\]AGC]]]T"] 
randomList2 = [] 
for i in randomList: 
    if i <contains any characters other than "A","C","G", or "T">: 
        <add a string without junk to randomList2> 

How would I do everything inside the <>? Thanks,

+0

http://stackoverflow.com/a/10017169/2282538 – Tyler

Answers

0

You can use a regular expression:

import re 
randomList = ["ACGT","A#$..G","..,/\]AGC]]]T"] 
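# Matches any single character that is not A, C, G, or T
nonACGT = re.compile('[^ACGT]') 
# Replace each string in place with its cleaned version
for i in range(len(randomList)): 
    randomList[i] = nonACGT.sub('', randomList[i]) 
print(randomList) 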
4
>>> randomList = ["ACGT","A#$..G","..,/\]AGC]]]T"] 
>>> import re 
>>> [re.sub("[^ACGT]+", "", s) for s in randomList] 
['ACGT', 'AG', 'AGCT'] 

`[^ACGT]+` matches one or more (`+`) characters other than A, C, G, and T.
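
As a minimal Python 3 sketch of the same idea, the pattern can be precompiled once and wrapped in a helper. The name `strip_non_acgt` is hypothetical and not part of the answer; the raw string literal is used only so that `\]` in the sample data does not trigger an invalid-escape warning.

```python
import re

# One or more consecutive characters that are not A, C, G, or T.
NON_ACGT = re.compile(r"[^ACGT]+")


def strip_non_acgt(s):
    """Return s with every character other than A, C, G, T removed."""
    return NON_ACGT.sub("", s)


random_list = ["ACGT", "A#$..G", r"..,/\]AGC]]]T"]
cleaned = [strip_non_acgt(s) for s in random_list]
print(cleaned)  # ['ACGT', 'AG', 'AGCT']
```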

Some timings:

>>> import timeit 
>>> setup = '''randomList = ["ACGT","A#$..G","..,/\]AGC]]]T"] 
... import re''' 
>>> timeit.timeit(setup=setup, stmt='[re.sub("[^ACGT]+", "", s) for s in randomList]') 
8.197133132976195 
>>> timeit.timeit(setup=setup, stmt='[re.sub("[^ACGT]", "", s) for s in randomList]') 
9.395620040786165 

Without `re`, it is faster (see @CMD's answer):

>>> timeit.timeit(setup=setup, stmt="[''.join(c for c in s if c in 'ACGT') for s in randomList]") 
6.874829817476666 

Even faster (see @JonClements's comment):

>>> setup='''randomList = ["ACGT","A#$..G","..,/\]AGC]]]T"]\nascii_exclude = ''.join(set('ACGT').symmetric_difference(map(chr, range(256))))''' 
>>> timeit.timeit(setup=setup, stmt="""[item.translate(None, ascii_exclude) for item in randomList]""") 
2.814761871275735 
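
The two-argument `translate(None, chars)` call above only exists on Python 2 `str`. As a rough Python 3 sketch, a translation table that returns `None` for every unlisted code point has the same effect, because `str.translate` deletes characters mapped to `None`. The `KeepOnly` class is a hypothetical helper, not something from the answer, and no claim is made here about its speed relative to the timings shown.

```python
class KeepOnly(dict):
    """Translation table that deletes any code point not explicitly listed."""

    def __missing__(self, code_point):
        # str.translate removes characters whose mapping is None.
        return None


# Map each allowed character to itself; everything else falls to __missing__.
keep_acgt = KeepOnly((ord(c), c) for c in "ACGT")

random_list = ["ACGT", "A#$..G", r"..,/\]AGC]]]T"]
cleaned = [item.translate(keep_acgt) for item in random_list]
print(cleaned)
```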

Another possibility:

>>> setup='randomList = ["ACGT","A#$..G","..,/\]AGC]]]T"]' 
>>> timeit.timeit(setup=setup, stmt="[filter(set('ACGT').__contains__, item) for item in randomList]") 
4.341086316883207 
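
Note that the `filter` line relies on Python 2, where filtering a `str` returns a `str`. In Python 3 `filter` returns a lazy iterator, so a sketch of the equivalent would join the result explicitly (the variable names here are illustrative):

```python
# Build the set once rather than on every call.
valid = set("ACGT")

random_list = ["ACGT", "A#$..G", r"..,/\]AGC]]]T"]

# filter() yields the characters that are in `valid`; join() rebuilds the string.
cleaned = ["".join(filter(valid.__contains__, item)) for item in random_list]
print(cleaned)  # ['ACGT', 'AG', 'AGCT']
```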
+0

Don't think the `+` needs to be there... –

+0

@JonClements: It speeds up the matching, because the characters don't have to be replaced one at a time. Will add some timings. –

+0

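That does make sense, but for simple character replacement I wouldn't have expected that much of a difference. Thanks for taking the time to post the `timeit` results. –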

4

`re` is overkill for this:

randomList2 = [''.join(c for c in s if c in 'ACGT') for s in randomList] 

Or, if you don't want to include the strings that had no junk to begin with:

valid = set("ACGT") 
randomList2 = [''.join(c for c in s if c in valid) for s in randomList if any(c2 not in valid for c2 in s)] 
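
To make the difference between the two variants concrete, here is a small sketch run on the question's sample data (the names `all_cleaned` and `only_dirty` are illustrative). The first keeps every string after cleaning; the second skips `"ACGT"` because it contained no junk in the first place.

```python
random_list = ["ACGT", "A#$..G", r"..,/\]AGC]]]T"]
valid = set("ACGT")

# Variant 1: clean every string, keeping already-clean ones too.
all_cleaned = ["".join(c for c in s if c in valid) for s in random_list]

# Variant 2: only emit cleaned versions of strings that had junk.
only_dirty = [
    "".join(c for c in s if c in valid)
    for s in random_list
    if any(c not in valid for c in s)
]

print(all_cleaned)  # ['ACGT', 'AG', 'AGCT']
print(only_dirty)   # ['AG', 'AGCT']
```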
+1

Good point, very elegant. Also faster (see my edited answer). –
