2017-02-15

Nested dictionary and sum of values, as you can see in the file below:

LOC_Os06g07630.1 cyto 8, chlo 2, extr 2, nucl 1, cysk 1, chlo_mito 1, cysk_nucl 1 
LOC_Os06g12160.1 chlo 7, nucl 3, mito 2.5, cyto_mito 2 
LOC_Os06g39870.1 chlo 7, cyto 4, nucl 1, E.R. 1, pero 1 
LOC_Os06g48240.1 chlo 9, mito 4 
LOC_Os06g48250.1 cyto 5, chlo 4, mito 2, pero 2 

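I care about "chlo", "chlo_mito" and "mito", and about the sum of the values in each line.

For example, in the line LOC_Os06g07630.1 I would use chlo 2 and chlo_mito 1, so the partial sum is 3 = (chlo) 2 + (chlo_mito) 1.

The sum of all values in that line is

(cyto) 8 + (chlo) 2 + (extr) 2 + (nucl) 1 + (cysk) 1 + (chlo_mito) 1 + (cysk_nucl) 1 = 16, so I then print 3/16.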
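The arithmetic above can be checked with a minimal Python sketch. The dictionary literal only restates the first line of the file by hand, and the variable names (`values`, `wanted`, `part`, `total`) are illustrative, not taken from the code further down:

```python
# Values from the line LOC_Os06g07630.1, copied by hand for illustration.
values = {"cyto": 8, "chlo": 2, "extr": 2, "nucl": 1,
          "cysk": 1, "chlo_mito": 1, "cysk_nucl": 1}

wanted = ("chlo", "mito", "chlo_mito")            # the keys of interest
part = sum(values.get(key, 0) for key in wanted)  # 2 + 0 + 1
total = sum(values.values())                      # all seven values

result = "{}/{}".format(part, total)
print(result)  # 3/16
```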

I want the following output:

LOC_Os06g07630.1 chlo 2 chlo_mito 1 3/16 
LOC_Os06g12160.1 chlo 7 mito 2.5 9.5/14.5 
LOC_Os06g39870.1 chlo 7 7/14 
LOC_Os06g48240.1 chlo 9 mito 4 13/13 
LOC_Os06g48250.1 chlo 4 mito 2 6/13 

My code is:

import re
dic={}
b=re.compile("chlo|mito|chlo_mito")
with open("~/A","r") as f1:
    for i in f1:
        if i.startswith("#"):continue
        a=i.replace(',',"").replace(" ","/")
        m=b.search(a)
        if m is not None:
            dic[a.strip().split("/")[0]]={}
            temp=a.strip().split("/")[1:]
            c=range(1,len(temp),2)
            for x in c:
                dic[a.strip().split("/")[0]][temp[x-1]]=temp[x]
                #print dic
lis=["chlo","mito","chlo_mito"]
for k in dic:
    sum_value=0
    sum_values=0
    for x in dic[k]:
        sum_value=sum_value+float(dic[k][x])
    for i in lis:
        #sum_values=0
        if i in dic[k]:
            #print i,dic[k][i]
            sum_values=sum_value+float(dic[k][i])
            print k,dic[k],i,sum_values
    #print k,dic[k]

Answers


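You are not very clear about what your problem actually is. But here is what I would do: write a function that takes one line of the file as input and returns a dictionary with the keys "chlo", "chlo_mito", "mito" and "total sum". That should make your life easier.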
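A minimal sketch of such a function, assuming the "id name value, name value, ..." layout shown in the question. The names `summarize_line` and `WANTED` are made up for this illustration:

```python
WANTED = ("chlo", "chlo_mito", "mito")  # hypothetical name for the keys of interest

def summarize_line(line):
    """Return a dict with the keys "chlo", "chlo_mito", "mito" and "total sum".

    Assumes the layout "<id> name value, name value, ..." from the question.
    """
    # Split off the identifier; everything after it is "name value" pairs.
    _gene_id, remainder = line.strip().split(None, 1)
    summary = dict.fromkeys(WANTED, 0.0)
    summary["total sum"] = 0.0
    for pair in remainder.split(","):
        name, value = pair.split()
        number = float(value)
        summary["total sum"] += number  # every value counts towards the total
        if name in WANTED:
            summary[name] += number     # only the wanted keys are kept by name
    return summary

example = summarize_line("LOC_Os06g12160.1 chlo 7, nucl 3, mito 2.5, cyto_mito 2")
print(example)
# {'chlo': 7.0, 'chlo_mito': 0.0, 'mito': 2.5, 'total sum': 14.5}
```

Names such as "nucl" or "cyto_mito" only contribute to "total sum", so it does not matter how many of them a line contains.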


But every line also has other entries such as "nucl", and their number differs from line to line – zychen


Something like this code may help you:

I assume your input file is called f_input.txt

from ast import literal_eval

# Read every line, drop the commas and split it into whitespace-separated tokens:
# [gene_id, name_1, value_1, name_2, value_2, ...]
with open("f_input.txt", 'r') as f:
    data = [line.rstrip().replace(',', '').split() for line in f]

for k in data:
    if not k:
        continue  # skip blank lines
    # The value of a compartment is the token that follows its name.
    chlo = sum(literal_eval(k[j+1]) for j in range(len(k)-1) if k[j] == 'chlo')
    mito = sum(literal_eval(k[j+1]) for j in range(len(k)-1) if k[j] == 'mito')
    chlo_mito = sum(literal_eval(k[j+1]) for j in range(len(k)-1) if k[j] == 'chlo_mito')
    # All values sit at the even positions 2, 4, 6, ...
    total = sum(literal_eval(k[j]) for j in range(2, len(k), 2))
    if mito == 0 and chlo_mito != 0:
        print("{0} chlo {1} chlo_mito {2} {3}/{4}".format(k[0], chlo, chlo_mito, chlo + chlo_mito, total))
    elif mito != 0 and chlo_mito == 0:
        print("{0} chlo {1} mito {2} {3}/{4}".format(k[0], chlo, mito, chlo + mito, total))
    elif mito != 0 and chlo_mito != 0:
        print("{0} chlo {1} mito {2} chlo_mito {3} {4}/{5}".format(k[0], chlo, mito, chlo_mito, chlo + mito + chlo_mito, total))
    elif mito == 0 and chlo_mito == 0:
        print("{0} chlo {1} {2}/{3}".format(k[0], chlo, chlo, total))

Output:

LOC_Os06g07630.1 chlo 2 chlo_mito 1 3/16 
LOC_Os06g12160.1 chlo 7 mito 2.5 9.5/14.5 
LOC_Os06g39870.1 chlo 7 7/14 
LOC_Os06g48240.1 chlo 9 mito 4 13/13 
LOC_Os06g48250.1 chlo 4 mito 2 6/13 

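I don't know how much of a concern speed is for you, but in genomics it usually is. If you can avoid it, you should not use too many string operations, and you should use regular expressions as little as possible.

Here is a version that does not use regexes and tries not to spend time constructing temporary objects. I chose an output format different from yours because your output format is hard to parse again. You can easily change it by modifying the .format string.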

Test_data = """ 
LOC_Os06g07630.1 cyto 8, chlo 2, extr 2, nucl 1, cysk 1, chlo_mito 1, cysk_nucl 1 
LOC_Os06g12160.1 chlo 7, nucl 3, mito 2.5, cyto_mito 2 
LOC_Os06g39870.1 chlo 7, cyto 4, nucl 1, E.R. 1, pero 1 
LOC_Os06g48240.1 chlo 9, mito 4 
LOC_Os06g48250.1 cyto 5, chlo 4, mito 2, pero 2 
""" 

def open_input():
    """
    Return a file-like object as input stream. In this case,
    it is a StringIO based on your test data. If you have a file
    name, use that instead.
    """

    if False:
        return open('inputfile.txt', 'r')
    else:
        import io
        return io.StringIO(Test_data)

SUM_FIELDS = set("chlo mito chlo_mito".split())

with open_input() as infile:

    for line in infile:

        line = line.strip()
        if not line: continue

        cols = line.split(maxsplit=1)
        if len(cols) != 2: continue

        test_id,remainder = cols
        out_fields = []

        fld_sum = tot_sum = 0.0

        for pair in remainder.split(', '):
            k,v = pair.rsplit(maxsplit=1)
            vf = float(v)
            tot_sum += vf

            if k in SUM_FIELDS:
                fld_sum += vf
                out_fields.append(pair)

        print("{0} {2}/{3} ({4:.0%}) {1}".format(test_id, ', '.join(out_fields), fld_sum, tot_sum, fld_sum/tot_sum))
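As an illustration of changing the .format string, here is a sketch that reuses the same parsing steps but emits the layout requested in the question. The function name `format_like_question` is made up for this example, and the `g` presentation type is my own choice for dropping the trailing ".0" of whole-number floats:

```python
SUM_FIELDS = {"chlo", "mito", "chlo_mito"}

def format_like_question(line):
    """Format one input line as '<id> <wanted pairs> <partial>/<total>'."""
    test_id, remainder = line.strip().split(maxsplit=1)
    out_fields = []
    fld_sum = tot_sum = 0.0
    for pair in remainder.split(', '):
        k, v = pair.rsplit(maxsplit=1)
        vf = float(v)
        tot_sum += vf
        if k in SUM_FIELDS:
            fld_sum += vf
            out_fields.append(pair)
    # "{:g}" prints 3.0 as "3" but keeps 9.5 as "9.5".
    return "{0} {1} {2:g}/{3:g}".format(
        test_id, ' '.join(out_fields), fld_sum, tot_sum)

print(format_like_question(
    "LOC_Os06g07630.1 cyto 8, chlo 2, extr 2, nucl 1, cysk 1, chlo_mito 1, cysk_nucl 1"))
# LOC_Os06g07630.1 chlo 2 chlo_mito 1 3/16
```

Like the loop above, this assumes that the pairs are separated by a comma followed by exactly one space.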