Is there a way to remove duplicate strings within a string based on a pattern?

I am working with a file in this format:

=Cluster= 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22491.xml;spectrum=1074 true 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=2950 true 


=Cluster= 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=1876 true 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3479 true 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3785 true 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3785 true 

=Cluster= 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22493.xml;spectrum=473 true 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22493.xml;spectrum=473 true

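As you can see, every SPEC line is different, except in two places where the spectrum number is repeated. What I want to do is take each block of information between the =Cluster= patterns and check whether any lines in it repeat a spectrum value. If a line is duplicated, the extra copy should be removed.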

The output file should look like this:

=Cluster= 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22491.xml;spectrum=1074 true 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=2950 true 


=Cluster= 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=1876 true 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3479 true 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3785 true 

=Cluster= 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22493.xml;spectrum=473 true

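I am using groupby from the itertools module. I assume my input file is called f_input.txt and the output file is called new_file.txt, but this script also removes the word SPEC... and I don't know what to change so that it stops doing that.

EDIT: a new condition. Sometimes part of the line may change, for example: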

=Cluster= 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=1876 true 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3479 true 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3785 true 
SPEC PRD000682;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3785 true

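As you can see, the PRD number in the last line has changed. One solution would be to check the spectrum number and remove lines based on repeated spectra.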

This would be the result:

=Cluster= 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=1876 true 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3479 true 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3785 true


2017-02-24 Enrique

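Are you asking why your code doesn't work, or for any code that works? –

You could try iterating over the whole file and checking it line by line: i = file.read().split('\n'). Then, whenever i[1] also appears in another line such as i[2] or i[3], delete it, and repeat this for every element of the split string. But that would be a lot of code. I bet there is a nicer solution! –

Your code works fine; I don't see any problem. –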

The shortest solution in Python :P

import os 
os.system("""awk 'line != $0; { line = $0 }' originalfile.txt > dedup.txt""")

Output:

=Cluster= 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22491.xml;spectrum=1074 true 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=2950 true 

=Cluster= 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=1876 true 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3479 true 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3785 true 

=Cluster= 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22493.xml;spectrum=473 true

(If you are on Windows, AWK can easily be installed with Gow.)
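
For comparison, here is a rough pure-Python counterpart of the awk one-liner, offered as a sketch rather than a drop-in replacement. The function name drop_consecutive_duplicates and the sample list are made up for illustration, and unlike awk it compares lines after stripping surrounding whitespace. Like the awk version, it only removes a duplicate that directly follows its twin:

```python
from itertools import groupby

def drop_consecutive_duplicates(lines):
    """Yield the first line of every run of identical consecutive lines."""
    # groupby() only groups *adjacent* equal items, so a duplicate that is
    # separated from its twin by a different line is kept.
    for _, run in groupby(lines, key=str.strip):
        yield next(run)

# Hypothetical sample: the last line repeats an earlier spectrum,
# but not consecutively.
sample = [
    "=Cluster=",
    "SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3785 true",
    "SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3785 true",
    "SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=1876 true",
    "SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3785 true",
]
deduped = list(drop_consecutive_duplicates(sample))
for line in deduped:
    print(line)
```

The adjacent pair of spectrum=3785 lines collapses to one, while the final spectrum=3785 line survives because another line sits in between.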


2017-02-24 16:21:31

A very easy solution. Thanks! – Enrique

Note that this trick only works when the duplicates are consecutive. –

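This opens the file containing the original data, along with a new file, and writes out the unique lines of each group.

seen is a set, which is well suited to checking whether something already exists.

data is a list, and keeps track of the "=Cluster=" groups being iterated over.

Then you just look at each line of each group (each group is named i inside data).

If the line is not in seen, it gets added.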

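with open("input file", 'r') as in_file, open("output file", 'w') as out_file:
    # Split the whole file into groups at each "=Cluster=" header.
    data = in_file.read().split("=Cluster=")
    for n, i in enumerate(data):
        if n > 0:
            out_file.write("=Cluster=")  # restore the header removed by split()
        seen = set()  # reset for every group
        for line in i.splitlines(keepends=True):
            key = line.strip()  # ignore trailing spaces when comparing
            # Blank lines are always kept; repeated lines in a group are skipped.
            if key and key in seen:
                continue
            seen.add(key)
            out_file.write(line)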

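Edit: moved seen = set() inside for i in data so that the set is reset for every group; otherwise "=Cluster=" would always be present in it and not every group in data would be printed.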


2017-02-24 15:13:52 pstatix

Yeah, it looks cool, but have you tried the code? –

You have to reset the 'seen' set. –

@Ev. Kounis I was updating it just as you posted this. Realized I was wrong! – pstatix

This is how I would do it.

file_in = r'someFile.txt'
file_out = r'someOtherFile.txt'
with open(file_in, 'r') as f_in, open(file_out, 'w') as f_out:
    seen_spectra = set()
    for line in f_in:
        if '=Cluster=' in line or line.strip() == '':
            # A header or blank line starts a new block: reset and copy it through.
            seen_spectra = set()
            f_out.write(line)
        else:
            # The spectrum number is the token right after the last '='.
            new_spectrum = line.rstrip().split('=')[-1].split()[0]
            if new_spectrum in seen_spectra:
                continue
            else:
                f_out.write(line)
                seen_spectra.add(new_spectrum)

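This is not a groupby solution, but it is one you can easily follow and debug. As you mentioned in the comments, your file is 16 GB, so loading it into memory is probably not the best idea.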
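
To make the key extraction easier to follow, here is a small sketch of what the expression line.rstrip().split('=')[-1].split()[0] from the code above returns on lines from the question. The helper name spectrum_key is hypothetical and only wraps that expression:

```python
def spectrum_key(line):
    """Return the token between the last '=' and the next whitespace."""
    return line.rstrip().split('=')[-1].split()[0]

# Two lines from the question that differ only in the PRD number.
a = "SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3785 true \n"
b = "SPEC PRD000682;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3785 true\n"

print(spectrum_key(a))                     # 3785
print(spectrum_key(a) == spectrum_key(b))  # True
```

Because only the spectrum number is compared, the two lines count as duplicates even though their PRD numbers differ. The expression assumes the line has something after its last '='; on a bare "=Cluster=" header it would raise an IndexError, which is why headers and blank lines are handled before it is applied.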

EDIT: "Each cluster has a specific spectrum. It is not possible to have one spec in one cluster and the same in another"

file_in = r'someFile.txt'
file_out = r'someOtherFile.txt'
with open(file_in, 'r') as f_in, open(file_out, 'w') as f_out:
    seen_spectra = set()  # shared across all clusters
    for line in f_in:
        if line.startswith('SPEC'):
            new_spectrum = line.rstrip().split('=')[-1].split()[0]
            if new_spectrum in seen_spectra:
                continue
            else:
                seen_spectra.add(new_spectrum)
                f_out.write(line)
        else:
            # Headers and blank lines are copied through unchanged.
            f_out.write(line)


2017-02-24 15:17:36

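Yes. Your code works perfectly. Thanks! – Enrique

Hi Ev. Kounis. I just talked with my supervisor, and he said that the pattern I match inside =Cluster= should be spectrum=number, because the other numbers (for example PRD0013 and PRD0014) can change while the spectrum number does not, so the script would not treat those lines as duplicates. How can I change your script to consider only the spectrum part? – Enrique

@Enrique I'm afraid I don't understand... –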

A solution using the re.search() function and a custom spectrums set object that keeps only the unique spectrum numbers:

import re

with open('f_input.txt') as oldfile, open('new_file.txt', 'w') as newfile:
    spectrums = set()  # spectrum numbers already written
    for line in oldfile:
        if '=Cluster=' in line or not line.strip():
            newfile.write(line)
        else:
            m = re.search(r'spectrum=(\d+)', line)
            if m is None:
                # No spectrum field on this line: copy it through unchanged.
                newfile.write(line)
                continue
            spectrum = m.group(1)
            if spectrum not in spectrums:
                spectrums.add(spectrum)
                newfile.write(line)
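
One thing to keep in mind with this approach: re.search() returns None when the pattern is not found, so calling .group(1) on the result fails for any line that has no spectrum=<digits> part. A minimal sketch of a guard, using the hypothetical names SPECTRUM_RE and spectrum_number:

```python
import re

# Same pattern as above, compiled once for reuse.
SPECTRUM_RE = re.compile(r'spectrum=(\d+)')

def spectrum_number(line):
    """Return the spectrum number as a string, or None if the line has none."""
    m = SPECTRUM_RE.search(line)
    return m.group(1) if m else None

print(spectrum_number("SPEC PRD000681;PRIDE_Exp_Complete_Ac_22493.xml;spectrum=473 true"))  # 473
print(spectrum_number("a line without a spectrum field"))  # None
```

A caller can then decide what to do with lines that return None, for example writing them to the output unchanged.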


2017-02-24 15:33:11 RomanPerekhrest

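I got this error: AttributeError: 'NoneType' object has no attribute 'group' – Enrique

@Enrique, what's the point? You have already accepted his answer – RomanPerekhrest

I am comparing several solutions to see which one is the most efficient. – Enrique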
