2012-06-20 51 views
1

我必须搜索图案wTTTAYRTTTW,其中W = ATY = CTR = AR,在基因组序列的一个FASTA file。应该存在一些不匹配的情况,即完全匹配字符串及其位置。我的做法是:如何获得FASTA精确匹配的正确计数?

#!user/bin/perl -w 
use strict; 
my @name = qw(NC_004314.2_10); 
for (my $i = 0 ; $i< scalar(@name);$i++) 
{ 
my $fname = $name[$i]; 
print "$fname\n"; 
my $read_pat= "WTTTAYRTTTW"; 
#print"\nPlease enter how many mismatch is allowd : "; 
my $m =<STDIN>; 
chomp $m; 
unless(open(fh1, "$fname")){ 
    print "Cannot open file \"$fname\"\n\n"; 
    exit; 
       } 
my @fh=<fh1>; 
close fh1; 
if ($fh[0] !~ /^>/) 
    { 
     print "not fasta file\n"; 
     exit; 
    } 
my $seq=''; 
foreach my $line(@fh) 
    { 
     if($line =~ /^>/) 
     { 
     next; 
     } 
     else 
     { 
     $seq=$seq.$line; 
     } 
    } 
sub trans_pat 
    { 
    my $pat=shift; 
    $pat=~s/R/\[CG\]/g; 
     $pat=~s/W/\[AT\]/g; 
     $pat=~s/Y/\[AG\]/g; 
    return $pat; 
    } 
open(FH1,">$fname.csv"); 
sub find_pat 
{ 
my ($pat,$seq) = (@_); 
#print FH1 "Looking for pattern $pat\n"; 
} 

find_pat (trans_pat($read_pat),$seq); 

# Allowing for a single mismatch 

my $pat=trans_pat($read_pat); 
print FH1 "Looking for pattern $pat\n"; 
while ($seq=~m/(?=$pat)/g) 
{ 
print FH1"match at\t$-[0]\t$&\n" 
} 
foreach my $i (1..(length $read_pat)-($m-1)) 
{ 
my $mis_pat = $read_pat; 
substr($mis_pat,$i-1,$m)=".{$m}"; 
my $pat1=trans_pat($mis_pat); 
print FH1 "Looking for pattern $pat1\n"; 
while ($seq=~m/(=?$pat1)/g) 

{ 
print FH1 "match at\t$-[0]\t$&\n"; 
} 
#print FH1"$& \n"; 
} 
close FH1; 

结果这段代码发现是错误的在完全匹配的FASTA文件中给定的顺序NC_004314.2,总比赛数应该是829任何一个可以更正此代码?

回答