2017-10-20 55 views
1

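Using Biopython to find and extract FASTA records matching an exact DNA sequence

I'm trying to use Biopython to extract all DNA sequences from a FASTA file that contain the following short DNA sequence: "GGCTCAACCCTGGA".

Here is what I have so far: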

from Bio import SeqIO 

source = "rep_set_no_spaces.fasta" 
outfile = "rep_set_PNA_matches.fasta" 
seq1 = "GGCTCAACCCTGGA" 

# basically a function to check whether seq contains sub1 
def seq_check(seq, seq1): 
    return seq.find(seq1) 

seqs = SeqIO.parse(source, 'fasta') 
filtered = (seq for seq in seqs if seq_check(seq.seq, seq1)) 
SeqIO.write(filtered, outfile, 'fasta') 

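I tried to adapt the code from this post (Filtering a FASTA file based on sequence with BioPython), but the sequence I'm interested in is neither at the beginning nor at the end of the sequences...

For example, here are some of my sequences. The first and fourth sequences match, but the second and third do not. I want to pull out only the sequences that contain "GGCTCAACCCTGGA" and write them to a new FASTA file: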

>110148arco.1D_184193 
TACGGAGGGGGTTAGCGTTGTTCGGAATTACTGGGCGTAAAGCGCACGTAGGTGGATTGGAAAGTATGGGGTGAAATCCCAGGGCTCAACCCTGGAACTGCCTCATAAACTATCAGTCTAGAGTTCGAGAGAGGTGAGTGGAATTCCGAGTGTAGAGGTGAAATTCGTAGATATTCGGAGGAACACCAGTGGCGAAGGCGGCTCACTGGCTCGATACTGACACTGAGGTGCGAAAGTGTGGGGAGCAAACAGG 
>110475arco.1D_40770 
TACGGAGGGTGCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGCGGTTTGTTAAGTCAGCTGTGAAAGCCCTGGGCTCAACCTGGGAATTGCAGTTGATACTGGCAAGCTGGAGTACGAGAGAGGGAGGTAGAATTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAGGAATACCAGTGGCGAAGGCGGCCTCCTGGCTCGATACTGACGCTGAGGTGCGAAAGCGTGGGGAGCAAACAGG 
>110484arco.1D_190999 
TACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGCGGTTTGTTAAGTCAGCTGTGAAAGCCCTGGGCTCAACCTGGGAATTGCAGTTGATACTGATCGACTAGAGTACGAGAGAGGGAGGTAGAATTCCACGTGTAGCGGTGAAATGCGTAGATATGTGGAGGAATACCGGTGGCGAAGGCGGCCTCCTGGCTCGATACTGACGCTGAGGTGCGAAAGCGTGGGGAGCAAACAGG 
>110525amin.3D_40107 
TACGGAGGGGGCTAGCGTTGTTCGGAATTACTGGGCGTAAAGCGTACGTAGGCGGATTAGTAAGTAAGATGTGAAATCCCAGGGCTCAACCCTGGAACTGCATTTTAAACTGCTAGTCTAGAGTTATGGAGAGGTAAGTGGAATTCCTAGTGTAGAGGTGAAATTCGTAGATATTAGGAGGAACACCAGAGGCGAAGGCGACTTACTGGACATATACTGACGCTGAGGTACGAAAGTGTGGGTAGCAAACAGG 

Thanks!

Answers

1

Actually, this is a Python question rather than a Biopython one:

def seq_check(seq, seq1):
    if seq1 in seq:
        return True
    else:
        return False
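
To see why the original seq.find(seq1) filter let every record through, it may help to compare find() with the in operator directly. The sketch below uses plain Python strings as stand-ins for Biopython Seq objects (Seq.find follows the same return convention as str.find); the three sample strings are made up for illustration and are not taken from the question's file.

```python
# Plain strings stand in for Biopython Seq objects; the sample values
# below are hypothetical and only illustrate the three possible cases.
motif = "GGCTCAACCCTGGA"
with_motif = "CCCAG" + motif + "ACTGC"      # motif in the middle
without_motif = "CCCTGGGCTCAACCTGGGAATT"    # no exact match
starts_with_motif = motif + "ACTGC"         # motif at index 0

samples = (with_motif, without_motif, starts_with_motif)

# find() returns the index of the first match, or -1 when there is none.
find_results = [s.find(motif) for s in samples]

# Used as a condition, -1 is truthy and 0 is falsy, so the test is wrong
# for "no match" and for "match at the very start".
find_as_bool = [bool(r) for r in find_results]

# The in operator returns a real boolean.
in_results = [motif in s for s in samples]

print(find_results)   # [5, -1, 0]
print(find_as_bool)   # [True, True, False]
print(in_results)     # [True, False, True]
```

If find() is preferred for some reason, comparing its result explicitly (seq.find(seq1) != -1) gives the same answer as the in test.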

You can also put the test directly into your generator expression:

filtered = (seq for seq in seqs if seq1 in seq) 
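
Putting the pieces together, a complete filter could look roughly like the sketch below. It assumes Biopython is installed, and it reads from and writes to in-memory StringIO handles so that it runs without any files; the two records in FASTA_TEXT are shortened, hypothetical examples, and filter_by_motif is a helper name introduced here for illustration.

```python
from io import StringIO

from Bio import SeqIO

# Hypothetical, shortened records used only for illustration.
FASTA_TEXT = """>rec_match
AAATCCCAGGGCTCAACCCTGGAACTGCC
>rec_no_match
AAAGCCCTGGGCTCAACCTGGGAATTGCA
"""

MOTIF = "GGCTCAACCCTGGA"


def filter_by_motif(records, motif):
    """Yield only the records whose sequence contains motif exactly."""
    for record in records:
        # The membership test is done on record.seq, the Seq object.
        if motif in record.seq:
            yield record


records = SeqIO.parse(StringIO(FASTA_TEXT), "fasta")
out_handle = StringIO()

# SeqIO.write returns the number of records it wrote.
n_written = SeqIO.write(filter_by_motif(records, MOTIF), out_handle, "fasta")

print(n_written)
print(out_handle.getvalue())
```

With real data, the two StringIO handles would be replaced by the file names from the question ("rep_set_no_spaces.fasta" and "rep_set_PNA_matches.fasta"). Note that this is an exact, case-sensitive match on the given strand only; it does not look for the reverse complement.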
+1

More succinct: 'def seq_check(seq, seq1): return seq1 in seq' – BioGeek

+0

Thanks for the answer! This worked perfectly :) –

+0

@Brooke_W If the answer solved your problem, you should accept it – Markus