2016-02-06 54 views
0

基本上,问题是要求找出DNA字符串集合中不超过d个不匹配的所有可能的基序(k-mers long)。我可以编写下面的代码来查找一个字符串DNA的所有基序(k,d)。当它出现多行字符串DNA时,我不知道如何修改我的代码。查找DNA字符串集合中的所有(k,d) - 基元

样品输入:

K = 3,d = 1

ATTTGGC

TGCCTTA

CGGTATC

GAAAATT

样本输出:

ATA

ATT

GTT

TTT

import collections 

    kmer = 5; 
    in_genome = "GGGGCTTCACAGCGCCCCTACAATACAATAGCCCTCGAATACCTACTTGCCACTATGTTCGGCGTCATTACATACGACCCGCATGCTCGGCAGTATGTCTCTACTCAGGATCCCTCAATATTACTTACGCCAATATGTCTAAGGTTTAGA"; 
    in_mistake = 1; 
    out_result = []; 
    mismatch_list = [] 

    def hamming_distance(s1, s2): 
     # Return the Hamming distance between equal-length sequences 
     if len(s1) != len(s2): 
      raise ValueError("Undefined for sequences of unequal length") 
     else: 
      return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2)) 

    for i in xrange(len(in_genome)-kmer + 1): 
     v = in_genome[i:i + kmer] 
     out_result.append(v) 

    for t_kmer in set(out_result): 
     for s_kmer in out_result: 
      if hamming_distance(t_kmer, s_kmer) <= in_mistake: 
       mismatch_list.append(t_kmer) 

    mismatch_count = collections.Counter(mismatch_list) 

    print mismatch_count 
+0

什么问题PLZ? – Aprillion

+0

能否详细说明'd'的含义?定义一个不匹配 – Pynchia

+0

你可以将所有这些行连接到字符串in_genome –

回答

0

问题似乎是将代码从使用内部变量切换到从文件读取输入。您不能只将文件的DNA链连接在一起并像以前一样运行,因为这会改变链的末端相遇处的结果。你还必须处理输入的第一行不同于其他,因为它包含程序参数,其余的是原始数据:

import re 
import sys 
import collections 

mismatch_list = [] 

def hamming_distance(s1, s2): 
    """ Returns the Hamming distance between equal-length sequences """ 
    if len(s1) != len(s2): 
     raise ValueError("Undefined for sequences of unequal length") 
    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2)) 

with open(sys.argv[1]) as file: 

    kmer = None 
    in_mistake = None 

    parameters = file.readline().rstrip() # first line of file has parameters 

    matchobj = re.search(r"k\s*=\s*(\d+)", parameters) 
    if matchobj: 
     kmer = int(matchobj.group(1)) 

    matchobj = re.search(r"d\s*=\s*(\d+)", parameters) 
    if matchobj: 
     in_mistake = int(matchobj.group(1)) 

    assert kmer is not None and in_mistake is not None, "file parameters misread" 

    for sequence in file: # subsequent lines of file are DNA strands 
     sequence = sequence.rstrip() 
     if not sequence: 
      continue # ignore blank lines 

     result = [] 

     for i in range(len(sequence) - kmer + 1): 
      v = sequence[i:i + kmer] 
      result.append(v) 

     for t_kmer in set(result): 
      for s_kmer in result: 
       if hamming_distance(t_kmer, s_kmer) <= in_mistake: 
        mismatch_list.append(t_kmer) 

mismatch_count = collections.Counter(mismatch_list) 

print(mismatch_count) 
相关问题