2015-10-18 22 views
2

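I have a set of data, and in order to analyze it I first have to organize it. How can I prevent a for loop from overwriting data in a dictionary?

Data (sample):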

3.30.67.10 [ '2i69A', '1sfkA', '1sfkB', '1sfkH', '2hcnA', '2hcsA', '2hfzA', '2of6A' ,2qeqA','2qeqB','2wa1A','2wa1B','2wa2A','2wa2B','4r05A','4r8rA','4r8sA','1pjwA','2m0sA','4uifA' '3c1A','3j4A','3u1jB','3u1jA','3u1jB','3u1jA','3u1jB','1ztxE','3c5xC','4ctjA','4ctkA',' ,'3l6pA','3lkwA','4al8C','4gsxA','4gt0A','4oigA','3j8dB','1p58A','2m9pA','2m9qA','3uzvA','1uzgA',' 3p8A',3p97A,3uzeC,3uzeD,3vttA,2bhrA,2bmfA,2fomA,2fomB,4m9fA,4m9iA,4m9kA,4m9mA, ,'2jlqA','2jlrA','2jlsA','2jluA','2jlvA','2jlwA','2jlxA','2jlyA','2jlzA','2whxA','2whxC' 2wzqA','2wzqC','3uypB','1yzoA','1k4rA ','4alaC','1z66A','2gg1A','3ixyA','2hg0A','2v6iA','2v6jA','4r8tA','1yksA','1ymfA','3evaA','3evbA', '3evcA','3evdA','3eveA','3evfA','1p58D','2b6bA','1s6nA','119kA','1oamA','1anA','1ok8A','1okeA','1r6aA '','1r6rA','1thdA','2p1dA','3j8dG','3j8dH','3zkoA','4tplA','4uifB','4ut6A','4ut9A','4utaA','4utbA', '4utcA','3g7tA','3j05A','4cauA','4cbfA','4cbfB','3j35A','1tc7C','2fp7A','2fp7B','2g05D','2ggvA','2ggvB '','2joA','2ijoB','2p5pA','2yolA','3e90A','3e90B','3e90C','4r8tB','4c2iA','2oxtA','3egpA','3ircA', '4ffyA','4ffzA','4l5fE','2jqmA','2jv6A','4am0A','4am0Q','4am0R','4fg0A','4bz1A','4bz2A','2p3qA','2xbmA '','3c5xA','2oy0A','3i50E','3iywA','3j0bA','3lkzA','4o6cA','4o6dA','4oieA','4oiiA','3c6dA','2r6pA', '2p3A','2p40A','2p41A','1urzA','2px2A','2px4A','2px5A','2px8A','2pxaA','2pxcA','2v8oA','2wv9A '','3c6dD','3c6eC','3j42D','2z83A','3p54A','4hdgA','4hdhA','4k6mA','4mtpA','4mtpD','2jsfA' ,'2r69A','3c6rA','3c6rD','3ixyD','3iyaA','3iyaD','3ixxA','2r29A','3evgA','1df9A','2qidA','3j27A',' 3j2bB','4uihA','3uzqB','1befA','1tg8A','tgeA','3ixxD','5a1zA','1n6gA','1na4A','1svbA' '4azxA','4azxD','4b03A','4b03D','4c2iB','4cctA','4cctD','2h0pA','3uajB','3uc0A','3uc0B','3we1A',' 2j7uA,2j7wA,3j6sA,3j6sB,3j6tA,3j6tB,3j6uA,3j6uB,3vwsA,4c11A,4hhjA,4v0qA,4v0rA, ]

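As you can see, some entries share the same first 4 characters, such as "1sfk". Entries that share the first 4 characters belong to the same structure, and for each full protein code (5 characters, such as 1sfkA or 1sfkB) I need to store the unique UniProt codes, which are found in the PDBSum database.

For that I wrote this piece of code: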

for domain in dDomainSeqSum.keys():  # CHANGE TO COMPRESS FILE
    dDomainSeqSumSWS[domain] = {}
    for pdb in dDomainSeqSum[domain]:  # add sws of a pdb in a variable and later add that variable to the domain thing
        pdb1 = list(pdb)  # split is not working
        pdb2 = pdb1[0] + pdb1[1] + pdb1[2] + pdb1[3]
        dDomainSeqSumSWS[domain][pdb2] = []
        for i in range(len(PDBSum)):  # make pdb3 search and then compare to the pdb stored
            if pdb in PDBSum[i]:
                if "SWS_ID" in PDBSum[i]:
                    line = PDBSum[i].split()
                    if pdb2 not in dDomainSeqSumSWS:
                        dDomainSeqSumSWS[domain][pdb2] = [line[2]]
                    else:
                        dDomainSeqSumSWS[domain][pdb2].append(line[2])

After running the code, this is the result I get:

{'3.30.67.10': {'4c2i': ['G3F5K5'], '2p3l': ['Q9WLZ5'], '4uta': ['Q68Y26'], '4utc': ['Q68Y26'], '4utb': ['Q68Y26'], '1urz': ['Q80E47'], '3l6p': ['P17763'], '1tge': ['P27914'], '3evb': ['P03314'], '2vbc': ['Q2TN89'], '3eva': ['P03314'], '3evf': ['P03314'], '3evg': ['P29991'], '3evd': ['P03314'], '3eve': ['P03314'], '2p1d': ['P12823'], '3j42': ['Q3BCY5'], '2jlx': ['Q2YHF0'], '2jly': ['Q2YHF0'], '2jlz': ['Q2YHF0'], '2jlu': ['Q2YHF0'], '2jlv': ['Q2YHF0'], '2jlw': ['Q2YHF0'], '1oke': ['P12823'], '2jlq': ['Q2YHF0'], '2jlr': ['Q2YHF0'], '2jls': ['Q2YHF0'], '2wv9': ['P05769'], '2z83': ['P27395'], '4hdh': ['P27395'], '2hcn': ['P14335'], '2oxt': ['A0EKU1'], '1tg8': ['P27914'], '4hdg': ['P27395'], '4ut9': ['Q68Y26'], '3e90': ['P06935'], '4am0': ['Q58HT7'], '4ut6': ['Q68Y26'], '1ok8': ['P12823'], '4ffy': ['Q9J7C6'], '4ffz': ['Q88640'], '4b03': ['G3F5K5'], '2m9p': ['P14337'], '2m9q': ['P14337'], '4fg0': ['P09732'], '4azx': ['G3F5K5'], '2hcs': ['P14335'], '4hhj': ['Q6DLV0'], '4mtp': ['P27395'], '3j8d': ['P12823'], '3uc0': ['P09866'], '4l5f': ['Q8BE40'], '4m9t': ['Q91H74'], '4m9k': ['Q91H74'], '4m9i': ['Q91H74'], '2of6': ['P14335'], '2px5': ['P05769'], '4m9m': ['Q91H74'], '4m9f': ['Q91H74'], '3j0b': ['Q9Q6P4'], '5a1z': ['G9FRP5'], '4r8r': ['C1KBQ3'], '4r8s': ['C1KBQ3'], '1l9k': ['P12823'], '1svb': ['P14336'], '4r8t': ['O90417'], '2hfz': ['P14335'], '2v6j': ['Q32ZD5'], '3zko': ['P12823'], '2ggv': ['P06935'], '2v6i': ['Q32ZD5'], '3u1j': ['Q5UB51'], '3u1i': ['Q5UB51'], '4oig': ['P17763'], '4ala': ['Q7TGD1'], '3p97': ['P27915'], '3p8z': ['P27915'], '2pxc': ['P05769'], '4gsx': ['P17763'], '2pxa': ['P05769'], '4oii': ['Q9Q6P4'], '1bef': ['Q9Q4T1'], '3evc': ['P03314'], '3j05': ['Q689G3'], '3egp': ['Q9J7C6'], '2yol': ['P06935'], '2v8o': ['P05769'], '4r05': ['C1KBQ3'], '1n6g': ['P14336'], '3lkz': ['Q9Q6P4'],
'4cau': ['Q689G3'], '2px2': ['P05769'], '2gg1': ['P29837'], '4al8': ['P17763'], '2px4': ['P05769'], '3lkw': ['P17763'], '2r69': ['P18356'], '2r6p': ['Q66394'], '3j6s': ['Q6DLV0'], '3j6u': ['Q6DLV0'], '1sfk': ['P14335'], '1z66': ['P29837'], '3uaj': ['P09866'], '3iyw': ['Q9Q6P4'], '3j35': ['E7FLK7'], '4k6m': ['P27395'], '2fom': ['Q91H74'], '3vws': ['Q6DLV0'], '3vtt': ['P27915'], '3iya': ['P18356'], '2p5p': ['P06935'], '2hg0': ['Q91R00'], '2jqm': ['Q6DV88'], '2p41': ['Q9WLZ5'], '4v0r': ['Q6DLV0'], '4tpl': ['Q5SBG8'], '1yks': ['P03314'], '4bz1': ['Q7TGC7'], '4bz2': ['Q7TGC7'], '1thd': ['P12823'], '2m0s': ['Q9YKL3'], '4cbf': ['E0WXI2'], '3ixx': ['Q3I100'], '3ixy': ['P18356'], '2px8': ['P05769'], '1ztx': ['Q91KZ4'], '2fp7': ['P06935'], '4uif': ['E0WXJ3'], '4uih': ['P14340'], '3uzq': ['P27909'], '4c11': ['Q6DLV0'], '1p58': ['Q9WDA7'], '4cct': ['G3F5K5'], '2r29': ['P29991'], '2p40': ['Q9WLZ5'], '1na4': ['P14336'], '1ymf': ['P03314'], '3uzv': ['P07564'], '1r6r': ['P12823'], '3c5x': ['Q6H1E5'], '2xbm': ['C0LMU5'], '3g7t': ['Q689G3'], '2g05': ['P06935'], '1r6a': ['P12823'], '3uze': ['P27915'], '2whx': ['Q2YHF0'], '3p54': ['P27395'], '1k4r': ['C3V005'], '3i50': ['Q9Q6P4'], '3c6d': ['Q3BCY5'], '3c6e': ['Q3BCY5'], '4o6c': ['Q9Q6P4'], '4o6b': ['P29990'], '4o6d': ['Q9Q6P4'], '2ijo': ['P06935'], '2wa2': ['Q8QL64'], '1tc7': ['P06935'], '3j27': ['P14340'], '2wa1': ['Q8QL64'], '3gcz': ['Q7T918'], '2p3q': ['Q20IJ2'], '2jsf': ['P18356'], '3we1': ['P09866'], '1df9': ['P14340'], '4gt0': ['P17763'], '3c6r': ['P18356'], '3j2p': ['P14340'], '3irc': ['Q9J7C6'], '2oy0': ['Q9Q6P4'], '3uyp': ['Q2YHF0'], '2qeq': ['P14335'], '2jv6': ['Q6DV88'], '2qid': ['P14340'], '1oan': ['P12823'], '1oam': ['P12823'], '2b6b': ['Q9WDA7'], '2bmf': ['Q91H74'], '2i69': ['Q80QJ9'], '2j7w': ['Q6DLV0'], '4v0q': ['Q6DLV0'], '1yzo': ['P29838'],
'1s6n': ['Q913C7'], '4oie': ['Q9Q6P4'], '2bhr': ['Q91H74'], '3j6t': ['Q6DLV0'], '2p3o': ['Q9WLZ5'], '4ctk': ['A9LIE0'], '4ctj': ['A9LIE0'], '2j7u': ['Q6DLV0'], '1pjw': ['Q9J0X3'], '1uzg': ['P27915'], '2h0p': ['P09866'], '2wzq': ['Q2YHF0']}}

As you can see, 1sfk was overwritten; it should have 3 individual UniProt codes.

+0

Please create a **minimal** working example that shows your problem. – Jasper

Answers

3

You have problems in two places (as the other answer also points out):

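  1. The first is that you set dDomainSeqSumSWS[domain][pdb2] to an empty list with dDomainSeqSumSWS[domain][pdb2]=[]. This runs for every pdb, so any list already built for the same 4-character key is thrown away.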

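  2. The second is the condition if pdb2 not in dDomainSeqSumSWS: which will always be True, because pdb2 is a key of the dDomainSeqSumSWS[domain] dictionary, not of the dDomainSeqSumSWS dictionary. The append branch is therefore never reached, and every match replaces the list with a new one-element list.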
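
Both points can be reproduced on a tiny made-up input. The sketch below is only an illustration, not the asker's real data: the three chain codes are taken from the sample, but the IDs (`ID_A`, `ID_B`, `ID_H`) and the `fake_ids` lookup that stands in for the PDBSum search are hypothetical.

```python
# Hypothetical stand-ins for the real data and the PDBSum lookup.
domain = "3.30.67.10"
chains = ["1sfkA", "1sfkB", "1sfkH"]
fake_ids = {"1sfkA": "ID_A", "1sfkB": "ID_B", "1sfkH": "ID_H"}

result = {domain: {}}
checks = []
for pdb in chains:
    pdb2 = pdb[:4]
    result[domain][pdb2] = []          # problem 1: resets the list for every chain
    outer_check = pdb2 not in result   # problem 2: looks in the outer dictionary
    checks.append(outer_check)
    if outer_check:
        result[domain][pdb2] = [fake_ids[pdb]]      # always taken: overwrites
    else:
        result[domain][pdb2].append(fake_ids[pdb])  # never reached

print(checks)  # [True, True, True]
print(result)  # {'3.30.67.10': {'1sfk': ['ID_H']}}
```

Only the last chain's ID survives, which matches the single code reported for 1sfk in the question.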

You don't actually need either of those two pieces; instead you should look at dict.setdefault, which is made for exactly this. Example:

for domain in dDomainSeqSum.keys():  # CHANGE TO COMPRESS FILE
    dDomainSeqSumSWS[domain] = {}
    for pdb in dDomainSeqSum[domain]:  # add sws of a pdb in a variable and later add that variable to the domain thing
        pdb2 = pdb[:4]  # no need to convert to a list for indexing; slice off the first four characters
        # no "= []" reset here: it would discard the codes collected for earlier chains
        for i in range(len(PDBSum)):  # make pdb3 search and then compare to the pdb stored
            if pdb in PDBSum[i]:
                if "SWS_ID" in PDBSum[i]:
                    line = PDBSum[i].split()
                    # create the list on first use, then append to it
                    dDomainSeqSumSWS[domain].setdefault(pdb2, []).append(line[2])

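dict.setdefault takes the key as its first argument and a default value as its second. If the key is not in the dictionary, it stores the default under that key and returns it; if the key already exists, it simply returns the existing value.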
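
A minimal sketch of that behavior in isolation, using a throwaway dictionary and hypothetical IDs (`ID_A`, `ID_B`) rather than real UniProt codes:

```python
codes = {}

first = codes.setdefault("1sfk", [])   # key missing: stores [] and returns it
first.append("ID_A")

second = codes.setdefault("1sfk", [])  # key present: returns the existing list
second.append("ID_B")

print(first is second)  # True, both names refer to the same stored list
print(codes)            # {'1sfk': ['ID_A', 'ID_B']}
```

`collections.defaultdict(list)` from the standard library is a common alternative when every missing key should start as an empty list.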

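Also, I changed the line where you converted pdb with list(). That is not needed for indexing (you can index strings directly), and you can use slicing to take the first four characters of the string.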
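
As a quick illustration, the two approaches give the same result for a 5-character code such as 1sfkA; the variable names here are only for the example:

```python
pdb = "1sfkA"

# Original approach: convert to a list and join four items by hand.
as_list = list(pdb)
joined = as_list[0] + as_list[1] + as_list[2] + as_list[3]

# Slicing does the same in one step and also gives the chain letter.
sliced = pdb[:4]
chain = pdb[4:]

print(joined == sliced)  # True
print(sliced, chain)     # 1sfk A
```

Slicing is also more forgiving: on a string shorter than four characters it returns what is there, whereas indexing position 3 would raise an IndexError.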

+0

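Thanks, I had tried using dict.setdefault, but I couldn't get it to work properly. Your insight helped me a lot and made my code simpler. I really appreciate your help in correcting and improving my program. Thank you –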

3
dDomainSeqSumSWS[domain][pdb2]=[] 

Here you overwrite the previous list. You should check whether the pdb2 key already exists in the dDomainSeqSumSWS[domain] dictionary.
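
One way to write that check is sketched below. The helper `add_code` and the IDs `ID_A` and `ID_B` are hypothetical names for illustration. The duplicate filter is an assumption based on the question asking for unique codes.

```python
def add_code(store, domain, pdb, code):
    """Record code under the 4-character PDB ID without resetting earlier entries."""
    pdb2 = pdb[:4]
    if domain not in store:
        store[domain] = {}
    if pdb2 not in store[domain]:        # check the inner dictionary, not the outer one
        store[domain][pdb2] = []
    if code not in store[domain][pdb2]:  # assumption: keep each code only once
        store[domain][pdb2].append(code)


store = {}
add_code(store, "3.30.67.10", "1sfkA", "ID_A")
add_code(store, "3.30.67.10", "1sfkB", "ID_B")
add_code(store, "3.30.67.10", "1sfkB", "ID_B")  # duplicate, ignored
print(store)  # {'3.30.67.10': {'1sfk': ['ID_A', 'ID_B']}}
```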

+0

Thanks, I completely forgot about that. Your suggestion helps me a lot! –