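Find all (k,d)-motifs in a collection of DNA strings
Basically, the problem asks for all possible motifs (k-mers, i.e. strings of length k) that occur with at most d mismatches in a collection of DNA strings. I was able to write the code below, which finds all (k,d)-motifs in a single DNA string. I don't know how to modify my code when the DNA is given as multiple strings.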
Sample input:
k = 3, d = 1
ATTTGGC
TGCCTTA
CGGTATC
GAAAATT
Sample output:
ATA
ATT
GTT
TTT
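To make the definition concrete, here is a small sketch that checks one entry of the sample output, ATA, against each input string. It assumes that a "mismatch" means a position where two equal-length strings differ (Hamming distance), and that a (k,d)-motif must occur in every string with at most d mismatches. The helper name `best_window` is made up for this illustration.

```python
def hamming_distance(s1, s2):
    # Number of positions at which two equal-length strings differ
    return sum(a != b for a, b in zip(s1, s2))

def best_window(pattern, text):
    # Hypothetical helper: return (distance, window) for the window of
    # `text` that is closest to `pattern`; ties are broken alphabetically
    k = len(pattern)
    windows = [text[i:i + k] for i in range(len(text) - k + 1)]
    return min((hamming_distance(pattern, w), w) for w in windows)

dna = ["ATTTGGC", "TGCCTTA", "CGGTATC", "GAAAATT"]
for text in dna:
    # ATA qualifies for d = 1 if the distance printed for every string is <= 1
    print(text, best_window("ATA", text))
```

Under these assumptions, ATA never occurs exactly in any of the four strings, yet it still counts as a motif because each string contains a 3-mer that differs from it in a single position.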
import collections

kmer = 5
in_genome = "GGGGCTTCACAGCGCCCCTACAATACAATAGCCCTCGAATACCTACTTGCCACTATGTTCGGCGTCATTACATACGACCCGCATGCTCGGCAGTATGTCTCTACTCAGGATCCCTCAATATTACTTACGCCAATATGTCTAAGGTTTAGA"
in_mistake = 1
out_result = []
mismatch_list = []

def hamming_distance(s1, s2):
    # Return the Hamming distance between equal-length sequences
    if len(s1) != len(s2):
        raise ValueError("Undefined for sequences of unequal length")
    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))

# Collect every k-mer window of the genome
for i in range(len(in_genome) - kmer + 1):
    v = in_genome[i:i + kmer]
    out_result.append(v)

# For each distinct k-mer, count the windows within in_mistake mismatches of it
for t_kmer in set(out_result):
    for s_kmer in out_result:
        if hamming_distance(t_kmer, s_kmer) <= in_mistake:
            mismatch_list.append(t_kmer)

mismatch_count = collections.Counter(mismatch_list)
print(mismatch_count)
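One possible way to extend the single-string code to a collection is sketched below. It is a plain brute-force search, not necessarily the approach the exercise has in mind: it tries every one of the 4^k possible k-mers and keeps those that occur in every string with at most d mismatches. The function names `appears_within` and `motif_enumeration` are invented for this sketch, and it assumes the alphabet is exactly A, C, G, T.

```python
from itertools import product

def hamming_distance(s1, s2):
    # Number of positions at which two equal-length strings differ
    return sum(a != b for a, b in zip(s1, s2))

def appears_within(pattern, text, d):
    # True if some window of `text` is within d mismatches of `pattern`
    k = len(pattern)
    return any(hamming_distance(pattern, text[i:i + k]) <= d
               for i in range(len(text) - k + 1))

def motif_enumeration(dna, k, d):
    # Brute force: test each of the 4**k candidate k-mers against every string
    motifs = []
    for letters in product("ACGT", repeat=k):
        candidate = "".join(letters)
        if all(appears_within(candidate, text, d) for text in dna):
            motifs.append(candidate)
    return motifs  # alphabetical, because product() iterates "ACGT" in order

dna = ["ATTTGGC", "TGCCTTA", "CGGTATC", "GAAAATT"]
print(" ".join(motif_enumeration(dna, 3, 1)))
```

Because the number of candidates grows as 4^k, this is only practical for small k. For longer motifs, generating the d-neighbourhood of each k-mer that actually occurs in the input would probably scale better.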
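What is the question, please? – Aprillion
Could you elaborate on what 'd' means? Please define a mismatch. – Pynchia
You could concatenate all of those lines into the string in_genome –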