2013-05-14 45 views
1

我是perl的新手。仍在学习。根据位置提取fasta序列

我有一个fasta格式的文件。我想提取跨越特定位置的序列。例如,从位置200至300

>Contig[0001] 
TGCATCAAAAGCTGAAAATATGTAGTCGAGAAGTCATTTCGAGAAATTGACGTTTTAAGT 
TTCGGTTTCCAAATTCAACCGGATGTATCTTCGCCAATAATTGTCAGCAGTTAGAATTTC 
TTTCAACATTATGAAGCCCTTTTTATATATTTTGATTCTGCATCAAAAGCTGAAAATATG 
TAGTCTTGAAGTCATTTCGAGAAATCGACGTTTTAAGTTTCTGTTTCCAAATTCAAACGG 
ATGTATCTTCGCCAATAATTGTCAGAAGTTAGAATTTCTTTCAACATTATGAAGCCCTTT 
TTATATATTTTGATTCTGCATCAAAAGCTGAAAATGTGTAGTCTCGAAGTCATTTCGAGA 
AATTGACGTTTTAAGTTTCTGTTTCCAAATTCAAACGGATGTATCTTCGCCAATAATTGT 
CAGAAGTTAGAATTTCTTTCAACATTATGAAGCCCTTTTTACATATTTTGACCCTGCATC 
AAAAGCTGAAAATATGTAGTCTCGAAGTCATTTTGAGAAGTTAGAATTTCTTTCAACATT 
ATGAAGCCCTTTTTATATATTTTGATTCTGCATCAAAAGCTGAAAATATGTAGTCTCGAA 
GTCWTTTCRAGAAATTGACGTTTTAAGTTTCTGTTTCCAAATTCAAACGGATGTATCTTC 
GCCAATAATTGTCAGAAGTTAGAATTTCTTTCAACATTATGAAGCCCTTTTTATATATTT 
TGACTCTGCATCAAAAGCTGAAAATATGTAGTCTCGAAGTCATTTCGAGAAATTGACGTT 

我想从序列Contig[0001]提取200-300位置的序列。输出将是:

>Contig[0001]_200-300 
AGAAATCGACGTTTTAAGTTTCTGTTTCCAAATTCAAACGGATGTATCTTCGCCAATAATT 
GTCAGAAGTTAGAATTTCTTTCAACATTATGAAGCCCTTT 

我在FASTA文件近500序列和我有包含的ID开始结束制表符分隔的文件所需要的志愿服务岗位。

如果有人能帮助我,这将是一件好事。

非常感谢您的帮助。我不确定我可以提供包含职位信息的文件。

新手

+0

欢迎SO。这是一个关于编程的问答网站。请看看[faq#howtoask]。你应该总是提供一些代码给你的问题,并告诉我们你已经尝试了什么,或者你做了什么努力。我已经回答了这个问题,因为我觉得它很有趣。 – simbabque

+0

另请参阅类似问题** [这里](http://stackoverflow.com/questions/16520781/select-bases-between-100-200-and-printing-them-along-with-header)** –

回答

2

我假设你知道如何将它加入到你的程序中。看看substr函数。它做你想做的事。

substr EXPR,OFFSET,LENGTH 

所以基本上你需要做的是:

my $snippet = substr($sequence, 200, 100); 

退一步,看看CPAN:有一个名为Bio::SeqReader::Fasta,您可以使用模块读取文件并获取序列。这在我看来是很好的记录,我对此很感兴趣。所以这里有一个解决方案。

use strict; 
use warnings; 
use feature qw(say); 
use Bio::SeqReader::Fasta; 

# Put your positions here (start, end) 
my @positions = (
    [ 200, 300 ], 
    [ 50, 180 ], 
); 

open my $fh, '<', '/path/to/file.fasta' or die $!; 
my $in = new Bio::SeqReader::Fasta(fh => $fh); # create the B::SR::F object 

my $i = 0; 
while (my $so = $in->next()) { 
    # $so represents one dataset and is a Bio::SeqReader::FastaRecord 

    say substr($so->seq(),    # we want a part of the sequence string 
    $positions[$i]->[0],    # starting position 
    $positions[$i]->[1] - $positions[$i]->[0]); # end - start --> length 
} 

close $fh; 
3

单程。的script.pl内容:

#!/usr/bin/env perl 

use warnings; 
use strict; 

my ($adn, $l, $header); 

while (<>) { 
    chomp; 

    ## First line is known, a header, so print it and process next one. 
    if ($. == 1) { 
     printf qq|%s_%s\n|, $_, q|200-300|; 
     next; 
    } 

    ## Concat adn while not found a header. 
    if ('>' ne substr $_, 0, 1) { 
     if (! $l) { $l = length } 
     $adn .= $_; 
     if (! eof) { next } 
    } 
    else { 
     $header = sprintf qq|%s_%s\n|, $_, q|200-300|; 
    } 

    ## Extract range 200-300 and insert newlines to set same length of 
    ## line than before. 
    my $s = substr $adn, 199, 100; 
    $s =~ s/(.{$l})/$1\n/g; 
    printf qq|%s\n|, $s; 
    undef $adn; 

    ## If not end of file, print the header of the following adn. 
    if (! eof) { print $header } 
} 

运行它想:

perl script.pl infile 

能产生:使用Perl和生物::用法类似于SeqIO模块

>Contig[0001]_200-300 
AGAAATCGACGTTTTAAGTTTCTGTTTCCAAATTCAAACGGATGTATCTTCGCCAATAAT 
TGTCAGAAGTTAGAATTTCTTTCAACATTATGAAGCCCTT 
2

的一种方式。运行像:的./process_fasta.pl

./process_fasta.pl file.fa 200 300 

内容:

#!/usr/bin/perl 

use strict; 
use warnings; 
use Bio::SeqIO; 

my $in_file = $ARGV[0]; 
my $start_pos = $ARGV[1]; 
my $end_pos = $ARGV[2]; 

my $in = Bio::SeqIO->new (-file => $in_file, -format => 'fasta'); 
my $out = Bio::SeqIO->new(-file => ">$in_file.out", -format => 'fasta'); 


while (my $seq = $in->next_seq()) { 

    $seq->display_id($seq->display_id() . "_$start_pos-$end_pos"); 
    $out->write_seq($seq->trunc($start_pos, $end_pos)); 
} 

结果:

>Contig[0001]_200-300 
AGAAATCGACGTTTTAAGTTTCTGTTTCCAAATTCAAACGGATGTATCTTCGCCAATAAT 
TGTCAGAAGTTAGAATTTCTTTCAACATTATGAAGCCCTTT 
1

我赞赏那些谁使用Bio::模块,我喜欢他们在写新的东西。不过,也许下面会有所帮助:

use strict; 
use warnings; 

my $end = pop; 
my $start = pop; 
local $/ = '>'; 

while (<>) { 
    chomp; 
    next unless /(Contig.+)/; 
    my ($header) = "$/$1_$start-$end\n"; 
    my $seq = ${^POSTMATCH}; 
    $seq =~ s/\s//g; 
    print $header; 
    print +(substr $seq, $start - 1, $end - $start) . "\n"; 
} 

用法:perl script.pl inFile start end [>outFile]

最后,可选参数输出定向到一个文件中。

实施例:perl script.pl data.fasta 200 300

输出上的数据集:

>Contig[0001]_200-300 
AGAAATCGACGTTTTAAGTTTCTGTTTCCAAATTCAAACGGATGTATCTTCGCCAATAATTGTCAGAAGTTAGAATTTCTTTCAACATTATGAAGCCCTT 

的开始和结束参数pop PED关闭@ARGV,然后将记录分隔符被设置为 “>”。由于文件的读取 - 一次fasta记录 - 头部信息被捕获,将序列保留在${^POSTMATCH}。从序列中删除所有空白区域。最后,重新格式化的标题为printed,正如序列中的字符范围一样。

希望这会有所帮助!

1

完整的工作轻量级脚本:

#!/usr/bin/env perl 

use strict; 
use warnings; 

my $first=1; 
if ($ARGV[0] eq '-0') 
{ 
    shift @ARGV; 
    $first=0; 
} 

die("Usage: cat <coords> | get_subseqs.pl (-0) <fasta files> > out; where coords is id, start, end. Use -0 if coords start with 0 instead of 1.\n") unless @ARGV; 

# read coords 
my %contigs =(); # id => [ start, end ] 
while (<STDIN>) { 
    chomp; 
    my @row = split(/\t/); 
    die("Require tab-delim: id, start, end\n") unless @row == 3; 
    $contigs{$row[0]} = [ $row[1], $row[2] ]; 
} 

# get subseqs 
my ($id,$seq,$start,$end); 
foreach my $infile (@ARGV) { 
    open(IN, '<', $infile) or die($!); 
    while (<IN>) { 
     if (/>(\S+)/) { 
      my $id2 = $1; 
      print ">${id}_$start-$end\n", reformat(substr($seq, $start-$first, $end-$start+1)) if $id; 
      if (exists($contigs{$id2})) { 
       ($id,$seq,$start,$end) = ($id2,'',@{$contigs{$id2}}); 
       delete($contigs{$id2}); 
      } else { $id=undef } 
     } elsif($id) { $seq .= $_ } 
    } 
    close(IN); 
    print ">${id}_$start-$end\n", reformat(substr($seq, $start-$first, $end-$start+1)) if $id; 
    $id = undef; 
} 

sub reformat { # add newline every 60 bases 
    my $seq = shift; 
    my $seq2 = ''; 
    while (length($seq) > 60) { 
     $seq2 .= substr($seq,0,60)."\n"; 
     $seq = substr($seq,60); 
    } 
    $seq2 .= $seq."\n"; 
    return $seq2; 
}