2011-11-23 29 views
1

Using Python to manipulate a txt file of key-value groupings

I am trying to use Python to convert a text file from format A:

Key1 
Key1value1 
Key1value2 
Key1value3 
Key2 
Key2value1 
Key2value2 
Key2value3 
Key3... 

to format B:

Key1 Key1value1 
Key1 Key1value2 
Key1 Key1value3 
Key2 Key2value1 
Key2 Key2value2 
Key2 Key2value3 
Key3 Key3value1... 

Specifically, here is a short sample of the file itself (only one key is shown; the full file has thousands):

chr22:16287243: PASS 
patientID1 G/G 
patientID2 G/G 
patientID3 G/G 

And the desired output:

chr22:16287243: PASS patientID1 G/G 
chr22:16287243: PASS patientID2 G/G 
chr22:16287243: PASS patientID3 G/G 

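I wrote the following code to detect and display the keys, but I am having trouble writing code that stores the values associated with each key and then prints the key-value pairs. Could anyone help me with this task?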

import sys 
import re 

records = [] 

with open('filepath', 'r') as infile: 
    for line in infile: 
        # all variants start with "chr" followed by a digit 
        variant = re.search(r"\Achr\d", line, re.I) 
        if variant: 
            records.append(line.rstrip("\n")) 
            # parse lines until a new variant is encountered 

for r in records: 
    print(r) 

Answers

5

Do it in a single pass, without storing the lines:

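with open("input") as infile, open("output", "w") as outfile: 
    for line in infile: 
        if line.startswith("chr"): 
            # key line: remember it for the value lines that follow 
            key = line.strip() 
        else: 
            # value line: write it out prefixed with the current key 
            print(key, line.rstrip("\n"), file=outfile) 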

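This code assumes that the first line contains a key; otherwise it fails, because key is not yet defined when the first value line is written.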
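If the input cannot be trusted to start with a key, the failure can be made explicit. The following is only a sketch of that idea, not part of the original answer: flatten is a hypothetical helper name, and skipping blank lines is an added assumption.

```python
def flatten(lines):
    """Yield 'key value' strings from key/value grouped lines.

    `flatten` is a hypothetical helper, not part of the original answer.
    """
    key = None
    for line in lines:
        text = line.strip()
        if not text:
            continue  # assumption: blank lines carry no data and are skipped
        if text.startswith("chr"):
            key = text  # key line: remember it
        elif key is None:
            # a value line appeared before any key line
            raise ValueError("value line found before any key: %r" % text)
        else:
            yield "%s %s" % (key, text)


sample = ["chr22:16287243: PASS \n", "patientID1 G/G \n", "patientID2 G/G \n"]
for row in flatten(sample):
    print(row)
```

Because the helper accepts any iterable of lines, an open file object can be passed in place of the sample list.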

+0

I had to change how the print statement was formatted a bit, but now it works great! I also didn't know about "startswith", so thanks for that too :) – alexhli

0

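First, if you only need to check whether a string starts with a given sequence of characters, don't use a regular expression. This is simpler and easier to read: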

if line.startswith("chr"): 
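One caveat: the two tests are not strictly equivalent. The regular expression in the question is case-insensitive and requires a digit after "chr", while startswith("chr") is case-sensitive and accepts any following character. A small comparison, where the "Chr1" and "chrX" lines are made-up inputs used only to show the difference:

```python
import re

# The first and last lines come from the question; the middle two are invented.
lines = ["chr22:16287243: PASS", "Chr1:100: PASS", "chrX:5: PASS", "patientID1 G/G"]

results = []
for line in lines:
    by_regex = bool(re.search(r"\Achr\d", line, re.I))   # the question's test
    by_prefix = line.startswith("chr")                   # case-sensitive prefix
    by_prefix_nocase = line.lower().startswith("chr")    # case-insensitive prefix
    results.append((line, by_regex, by_prefix, by_prefix_nocase))
    print(line, by_regex, by_prefix, by_prefix_nocase)
```

For the sample shown in the question all three tests agree, so the simpler prefix check is enough there.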

The next step is to use a very simple state machine. Like this:

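current_key = "" 

# "input" is a placeholder path for the data file 
with open("input") as infile: 
    for line in infile: 
        if line.startswith("chr"): 
            # key line: update the current state 
            current_key = line.strip() 
        else: 
            # value line: print it prefixed with the current key 
            print(" ".join([current_key, line.strip()])) 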
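If the values also need to be stored per key, as the question asks, the same state machine can fill a dictionary instead of printing right away. This is a sketch under assumptions: group_by_key is a hypothetical name, the second key in the sample is invented for illustration, and value lines that appear before any key are silently ignored.

```python
def group_by_key(lines):
    """Map each key line to the list of value lines that follow it.

    `group_by_key` is a hypothetical helper name.
    """
    groups = {}          # plain dicts keep insertion order in Python 3.7+
    current_key = None
    for line in lines:
        text = line.strip()
        if text.startswith("chr"):
            current_key = text
            groups.setdefault(current_key, [])
        elif current_key is not None and text:
            # value line belonging to the most recent key
            groups[current_key].append(text)
    return groups


sample = [
    "chr22:16287243: PASS \n",
    "patientID1 G/G \n",
    "patientID2 G/G \n",
    "chr22:16287300: PASS \n",   # invented second key, for illustration only
    "patientID1 G/G \n",
]
groups = group_by_key(sample)
for key, values in groups.items():
    for value in values:
        print(key, value)
```

Holding every group in memory is only worthwhile when the values are needed again later; for a straight conversion, the streaming version uses less memory.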
+0

The tip about the character sequence is useful, thanks – alexhli

0

If every key always has the same number of values, islice is useful:

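from itertools import islice 

with open('input.txt') as fin, open('output.txt', 'w') as fout: 
    for k in fin: 
        # k is a key line; the next 3 lines of the same file are its values 
        for v in islice(fin, 3): 
            # v keeps its trailing newline, which ends the output line 
            fout.write(' '.join((k.strip(), v))) 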
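The same idea can be tried on in-memory lines with the group size as a parameter. This is a sketch only: flatten_fixed and n are hypothetical names, and the sample reuses the placeholder keys from format A in the question.

```python
from itertools import islice


def flatten_fixed(lines, n):
    """Pair each key line with the n value lines that follow it.

    `flatten_fixed` is a hypothetical helper name.
    """
    it = iter(lines)               # one shared iterator for keys and values
    for key in it:
        for value in islice(it, n):
            yield " ".join((key.strip(), value.strip()))


sample = [
    "Key1\n", "Key1value1\n", "Key1value2\n", "Key1value3\n",
    "Key2\n", "Key2value1\n", "Key2value2\n", "Key2value3\n",
]
rows = list(flatten_fixed(sample, 3))
for row in rows:
    print(row)
```

Note that this approach never inspects the line contents. If any key has fewer than n values, the next key line is consumed as a value and every later row is misaligned, so it is only safe when the fixed count is guaranteed.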