2013-07-13 40 views

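Perl: output protein sequence entries with ID, SQ and "//"

The new task is to download a file from a website (http://ceres.primus-fatum.de/~fate/scriptsprachen/uniprotDB_part.txt). I then need a subroutine that reads it line by line and searches for the ID and SQ lines, and everything should be saved to a new text file:

1. The ID line should come first.
2. The SQ line should come last.
3. Everything else should sit between ID and SQ, and the entry should end with a slash line ("//").

Here is one example entry; the real file contains about 1000 of them.

Expected output for one entry: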

ID 001R_FRG3G    Reviewed;   256 AA. -> ID First place *****
AC Q6GZX4;
DT 28-JUN-2011, integrated into UniProtKB/Swiss-Prot.
DT 19-JUL-2004, sequence version 1.
DT 18-APR-2012, entry version 24.
DE RecName: Full=Putative transcription factor 001R;
GN ORFNames=FV3-001R;
OS Frog virus 3 (isolate Goorha) (FV-3).
OC Viruses; dsDNA viruses, no RNA stage; Iridoviridae; Ranavirus.
OX NCBI_TaxID=654924;
OH NCBI_TaxID=8295; Ambystoma (mole salamanders).
OH NCBI_TaxID=30343; Hyla versicolor (chameleon treefrog).
OH NCBI_TaxID=8316; Notophthalmus viridescens (Eastern newt) (Triturus viridescens).
OH NCBI_TaxID=8404; Rana pipiens (Northern leopard frog).
OH NCBI_TaxID=45438; Rana sylvatica (Wood frog).
RN [1]
RP NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
RX PubMed=15165820; DOI=10.1016/j.virol.2004.02.019;
RA Tan W.G., Barkman T.J., Gregory Chinchar V., Essani K.;
RT "Comparative genomic analyses of frog virus 3, type species of the
RT genus Ranavirus (family Iridoviridae).";
RL Virology 323:70-84(2004).
CC -!- FUNCTION: Transcription activation (Potential).
CC -----------------------------------------------------------------------
CC Copyrighted by the UniProt Consortium, see http://www.uniprot.org/terms
CC Distributed under the Creative Commons Attribution-NoDerivs License
CC -----------------------------------------------------------------------
DR EMBL; AY548484; AAT09660.1; -; Genomic_DNA.
DR RefSeq; YP_031579.1; NC_005946.1.
DR ProteinModelPortal; Q6GZX4; -.
DR GeneID; 2947773; -.
DR ProtClustDB; CLSP2511514; -.
DR GO; GO:0006355; P:regulation of transcription, DNA-dependent; IEA:UniProtKB-KW.
DR GO; GO:0046782; P:regulation of viral transcription; IEA:InterPro.
DR GO; GO:0006351; P:transcription, DNA-dependent; IEA:UniProtKB-KW.
DR InterPro; IPR007031; Poxvirus_VLTF3.
DR Pfam; PF04947; Pox_VLTF3; 1.
PE 4: Predicted;
KW Activator; Complete proteome; Reference proteome; Transcription;
KW Transcription regulation.
FT CHAIN   1 256  Putative transcription factor 001R.
FT        /FTId=PRO_0000410512.
FT COMPBIAS  14  17  Poly-Arg.
SQ SEQUENCE 256 AA; 29735 MW; B4840739BF7D4121 CRC64; -> SQ at LAST and then "//"
    MAFSAEDVLK EYDRRRRMEA LLLSLYYPND RKLLDYKEWS PPRVQVECPK APVEWNNPPS
    EKGLIVGHFS GIKYKGEKAQ ASEVDVNKMC CWVSKFKDAM RRYQGIQTCK IPGKVLSDLD
    AKIKAYNLTV EGVEGFVRYS RVTKQHVAAF LKELRHSKQY ENVNLIHYIL TDKRVDIQHL
    EKDLVKDFKA LVESAHRMRQ GHMINVKYIL YQLLKKHGHG PDGPDILTVK TGSKGVLYDD
    SFRKIYTDLG WKFTPL
//

This is what I have tried so far:

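use strict;
use warnings;

sub main {
    my @file_data   = ();
    my $protein_seq = '';
    # Line codes to search for: "ID" opens an entry, "SQ" opens the sequence block.
    my $id_regexp = '^ID\s';
    my $sq_regexp = '^SQ\s';

    my @locations = ();

    @file_data   = get_file_data("seq.txt");
    $protein_seq = extract_sequence(@file_data);

    # Test each line itself against both patterns.
    foreach my $line (@file_data) {
        if ($line =~ /$id_regexp/ or $line =~ /$sq_regexp/) {
            print "found motif \n\n";
        } else {
            print "not found \n\n";
        }
    }

The positions of the matched line codes are then printed:

    # match_positions works on one string, so join the lines first.
    @locations = match_positions("$id_regexp|$sq_regexp", join('', @file_data));
    if (@locations) {
        print "Searching for motifs $id_regexp and $sq_regexp \n";
        print "Line codes found at offsets: @locations\n";
    }
    else {
        print "motif not found \n\n";
    }
    print "Sequence length: ", length($protein_seq), "\n";
    exit;
}

# Read the whole file and return its lines.
sub get_file_data {
    my ($filename) = @_;
    open(my $fh, '<', $filename) or die "Cannot open $filename: $!";
    my @lines = <$fh>;
    close($fh);
    return @lines;
}

# Concatenate the sequence data lines (the indented lines after SQ).
sub extract_sequence {
    my (@file_data) = @_;
    my $sequence = '';

    foreach my $line (@file_data) {
        if ($line =~ /^\s*$/) {                   # blank line
            next;
        }
        elsif ($line =~ /^\s*#/) {                # comment line
            next;
        }
        elsif ($line =~ /^>/) {                   # FASTA header
            next;
        }
        elsif ($line =~ m{^(?:[A-Z]{2}\s|//)}) {  # annotation line or terminator
            next;
        }
        else {
            $sequence .= $line;
        }
    }
    $sequence =~ s/\s//g;
    return $sequence;
}

# Return the start offset of every match; /m lets ^ match at each line start.
sub match_positions {
    my ($regexp, $sequence) = @_;
    my @position = ();
    while ($sequence =~ /$regexp/mg) {
        push(@position, $-[0]);
    }
    return @position;
}

main();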
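I am a beginner in my first semester of bioinformatics. The thing is that we are not allowed to use LWP, so I don't know how to download the content of the website. Any idea how I should start? And how can I save it afterwards?

Why can't you use LWP? It is the best way to download files in Perl. – user1937198

I asked my teaching assistant, and he said I don't need to do that. He said I only have to download the file by hand, then open it in the script and continue working from there.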
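The assistant's suggestion (download the file by hand, then open it and carry on) could be sketched roughly as follows. This is only an illustration, not the asker's or the course's solution: the subroutine names `read_lines` and `lines_with_code` and the file name `sample_entry.txt` are made up here, and a tiny hand-written sample stands in for the real download.

```perl
use strict;
use warnings;

# Read a file line by line and return the lines without trailing newlines.
sub read_lines {
    my ($filename) = @_;
    open(my $fh, '<', $filename) or die "Cannot open $filename: $!";
    my @lines;
    while (my $line = <$fh>) {
        chomp $line;
        push @lines, $line;
    }
    close($fh);
    return @lines;
}

# Return the lines that start with the given two-letter line code.
sub lines_with_code {
    my ($code, @lines) = @_;
    return grep { /^\Q$code\E\s/ } @lines;
}

# Demo on a small hand-made sample (hypothetical file name) instead of
# the real uniprotDB_part.txt.
my $sample_file = 'sample_entry.txt';
open(my $out, '>', $sample_file) or die "Cannot write $sample_file: $!";
print {$out} "ID   001R_FRG3G  Reviewed;  256 AA.\n",
             "AC   Q6GZX4;\n",
             "SQ   SEQUENCE 256 AA; 29735 MW; B4840739BF7D4121 CRC64;\n",
             "     MAFSAEDVLK EYDRRRRMEA\n",
             "//\n";
close($out);

my @lines = read_lines($sample_file);
print "ID line: $_\n" for lines_with_code('ID', @lines);
print "SQ line: $_\n" for lines_with_code('SQ', @lines);
```

With the real file, `read_lines('uniprotDB_part.txt')` would take the place of the sample, assuming the download was saved under that name in the current directory.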

Answer

Try this.

Make a workspace:

mkdir protein 
cd protein 

Download the file (either command works):

curl http://ceres.primus-fatum.de/~fate/scriptsprachen/uniprotDB_part.txt > uniprotDB_part.txt 

wget http://ceres.primus-fatum.de/~fate/scriptsprachen/uniprotDB_part.txt 

Split it into one file per entry:

csplit -n 6 -k uniprotDB_part.txt '/^ID /' '{100000}' 

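You will get a zillion files named xx000000 through xxnnnnnn, and each one will contain one entry.

All of this uses simple Unix tools. If you know nothing at all about Perl, there is little point in asking.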
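The same split can also be done inside Perl, which additionally covers the ordering the question asks for (ID first, SQ block last, then "//"). The following is a minimal sketch under assumptions: the lines are already in an array, an entry is taken to be everything up to a "//" line, and the names `split_entries`, `format_entry` and `entries_sorted.txt` are invented for illustration.

```perl
use strict;
use warnings;

# Group lines into entries; an entry is everything up to a "//" line.
sub split_entries {
    my (@lines) = @_;
    my (@entries, @current);
    foreach my $line (@lines) {
        next if $line =~ /^\s*$/;              # ignore blank lines
        if ($line =~ m{^//}) {                 # terminator closes the entry
            push @entries, [@current] if @current;
            @current = ();
            next;
        }
        push @current, $line;
    }
    push @entries, [@current] if @current;     # entry without a final "//"
    return @entries;
}

# Return one entry as text: ID first, SQ block last, then "//".
sub format_entry {
    my (@entry) = @_;
    my (@id, @other, @sq);
    my $in_sequence = 0;
    foreach my $line (@entry) {
        if    ($line =~ /^ID\s/) { push @id, $line; $in_sequence = 0; }
        elsif ($line =~ /^SQ\s/) { push @sq, $line; $in_sequence = 1; }
        elsif ($in_sequence && $line =~ /^\s/) { push @sq, $line; }  # sequence data
        else  { push @other, $line; $in_sequence = 0; }
    }
    return join("\n", @id, @other, @sq, '//') . "\n";
}

# Demo: a deliberately shuffled mini entry, written to a new text file.
my @lines = (
    'AC   Q6GZX4;',
    'SQ   SEQUENCE 256 AA; 29735 MW; B4840739BF7D4121 CRC64;',
    '     MAFSAEDVLK EYDRRRRMEA',
    'ID   001R_FRG3G  Reviewed;  256 AA.',
    '//',
);
my $outfile = 'entries_sorted.txt';            # hypothetical output name
open(my $out, '>', $outfile) or die "Cannot write $outfile: $!";
print {$out} format_entry(@$_) for split_entries(@lines);
close($out);
```

Keeping the indented sequence lines attached to their SQ line is a design choice made here so that the residues stay directly in front of the "//" terminator.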

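Some help please, guys, I am stuck. It is not that I know nothing about Perl, but I am a beginner and I don't know how to get this task done. I have been trying since this morning and I cannot solve it!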
