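Perl: output a protein sequence record with ID, SQ and "//"

My new task is to download a file from a website (http://ceres.primus-fatum.de/~fate/scriptsprachen/uniprotDB_part.txt). I then need a subroutine that reads it line by line and searches for the ID and SQ lines. Everything should be saved to a new text file in this order:
1. The ID line comes first.
2. The SQ line comes last.
3. Everything else goes between ID and SQ, and the record ends with a slash line ("//").
Here is an example of one record; the file contains 1000 of them.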
Expected example:
ID 001R_FRG3G Reviewed; 256 AA. -> ID First place *****
AC Q6GZX4;
DT 28-JUN-2011, integrated into UniProtKB/Swiss-Prot.
DT 19-JUL-2004, sequence version 1.
DT 18-APR-2012, entry version 24.
DE RecName: Full=Putative transcription factor 001R;
GN ORFNames=FV3-001R;
OS Frog virus 3 (isolate Goorha) (FV-3).
OC Viruses; dsDNA viruses, no RNA stage; Iridoviridae; Ranavirus.
OX NCBI_TaxID=654924;
OH NCBI_TaxID=8295; Ambystoma (mole salamanders).
OH NCBI_TaxID=30343; Hyla versicolor (chameleon treefrog).
OH NCBI_TaxID=8316; Notophthalmus viridescens (Eastern newt) (Triturus viridescens).
OH NCBI_TaxID=8404; Rana pipiens (Northern leopard frog).
OH NCBI_TaxID=45438; Rana sylvatica (Wood frog).
RN [1]
RP NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
RX PubMed=15165820; DOI=10.1016/j.virol.2004.02.019;
RA Tan W.G., Barkman T.J., Gregory Chinchar V., Essani K.;
RT "Comparative genomic analyses of frog virus 3, type species of the
RT genus Ranavirus (family Iridoviridae).";
RL Virology 323:70-84(2004).
CC -!- FUNCTION: Transcription activation (Potential).
CC -----------------------------------------------------------------------
CC Copyrighted by the UniProt Consortium, see http://www.uniprot.org/terms
CC Distributed under the Creative Commons Attribution-NoDerivs License
CC -----------------------------------------------------------------------
DR EMBL; AY548484; AAT09660.1; -; Genomic_DNA.
DR RefSeq; YP_031579.1; NC_005946.1.
DR ProteinModelPortal; Q6GZX4; -.
DR GeneID; 2947773; -.
DR ProtClustDB; CLSP2511514; -.
DR GO; GO:0006355; P:regulation of transcription, DNA-dependent; IEA:UniProtKB-KW.
DR GO; GO:0046782; P:regulation of viral transcription; IEA:InterPro.
DR GO; GO:0006351; P:transcription, DNA-dependent; IEA:UniProtKB-KW.
DR InterPro; IPR007031; Poxvirus_VLTF3.
DR Pfam; PF04947; Pox_VLTF3; 1.
PE 4: Predicted;
KW Activator; Complete proteome; Reference proteome; Transcription;
KW Transcription regulation.
FT CHAIN 1 256 Putative transcription factor 001R.
FT /FTId=PRO_0000410512.
FT COMPBIAS 14 17 Poly-Arg.
SQ SEQUENCE 256 AA; 29735 MW; B4840739BF7D4121 CRC64; -> SQ at LAST and then "//"
MAFSAEDVLK EYDRRRRMEA LLLSLYYPND RKLLDYKEWS PPRVQVECPK APVEWNNPPS
EKGLIVGHFS GIKYKGEKAQ ASEVDVNKMC CWVSKFKDAM RRYQGIQTCK IPGKVLSDLD
AKIKAYNLTV EGVEGFVRYS RVTKQHVAAF LKELRHSKQY ENVNLIHYIL TDKRVDIQHL
EKDLVKDFKA LVESAHRMRQ GHMINVKYIL YQLLKKHGHG PDGPDILTVK TGSKGVLYDD
SFRKIYTDLG WKFTPL
//
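The ordering described above (ID first, everything else in the middle, the SQ block last, then "//") could be sketched roughly as follows. This is only a minimal sketch, not a definitive solution: the names `split_records` and `reorder_record` are hypothetical, and it assumes that the sequence lines directly follow the SQ line and that every record is terminated by a "//" line.

```perl
use strict;
use warnings;

# Split the lines of the flat file into records; a "//" line ends a record.
sub split_records {
    my (@lines) = @_;
    my (@records, @current);
    for my $line (@lines) {
        if ($line =~ m{^//}) {
            push @records, [@current] if @current;
            @current = ();
        }
        else {
            push @current, $line;
        }
    }
    push @records, [@current] if @current;    # record without a final "//"
    return @records;
}

# Return one record as: ID line, all other lines, SQ block, "//".
# Assumption: every line after SQ (up to the end of the record) is sequence data.
sub reorder_record {
    my (@lines) = @_;
    my (@id, @other, @sq);
    my $in_sq = 0;
    for my $line (@lines) {
        if    ($line =~ /^ID\b/) { push @id, $line; $in_sq = 0 }
        elsif ($line =~ /^SQ\b/) { push @sq, $line; $in_sq = 1 }
        elsif ($in_sq)           { push @sq, $line }
        else                     { push @other, $line }
    }
    return (@id, @other, @sq, "//\n");
}

# Small made-up input in which the ID line is not first.
my @sample = (
    "AC   Q6GZX4;\n",
    "ID   001R_FRG3G   Reviewed;   256 AA.\n",
    "SQ   SEQUENCE   256 AA;  29735 MW;  B4840739BF7D4121 CRC64;\n",
    "     MAFSAEDVLK EYDRRRRMEA\n",
    "//\n",
);
print reorder_record(@$_) for split_records(@sample);
```

Collecting the lines into three buckets keeps the logic independent of the order in which ID and the other line codes appear in the input.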
This is what I have tried so far:
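use strict;
use warnings;

sub main {
    my @file_data   = ();
    my $protein_seq = '';
    my @locations   = ();

    # Residue classes from the original attempt; not used by the search below.
    my $h = '[VLIM]';
    my $s = '[AG]';
    my $x = '[ARNDCEQGHILKMFPSTWYV]';

    # Motifs to be searched: the two-letter line codes ID and SQ.
    my $regexp = 'ID|SQ';

    @file_data   = get_file_data("seq.txt");
    $protein_seq = extract_sequence(@file_data);

    # Report every line that starts with one of the line codes.
    my $found = 0;
    foreach my $line (@file_data) {
        if ($line =~ /^(?:$regexp)\b/) {
            print "found motif: $line";
            $found++;
        }
    }
    print "not found \n\n" unless $found;

    # Output the positions of the motif in the concatenated text.
    @locations = match_positions($regexp, $protein_seq);
    if (@locations) {
        print "Searching for motifs $regexp \n";
        print "Motif found at offsets: @locations\n";
    }
    else {
        print "motif not found \n\n";
    }
    return;
}

# Read a file and return its lines (newlines are kept).
sub get_file_data {
    my ($filename) = @_;
    open(my $fh, '<', $filename) or die "Cannot open $filename: $!";
    my @lines = <$fh>;
    close($fh);
    return @lines;
}

# Join the lines into one string, skipping blank lines, '#' comments and
# FASTA '>' headers, then remove all whitespace. Note that for a UniProt
# flat file this concatenates every line, not only the sequence lines.
sub extract_sequence {
    my (@file_data) = @_;
    my $sequence = '';
    foreach my $line (@file_data) {
        if ($line =~ /^\s*$/) {
            next;
        }
        elsif ($line =~ /^\s*#/) {
            next;
        }
        elsif ($line =~ /^>/) {
            next;
        }
        else {
            $sequence .= $line;
        }
    }
    $sequence =~ s/\s//g;
    return $sequence;
}

# Return the 0-based start offset of every match of $regexp in $sequence.
sub match_positions {
    my ($regexp, $sequence) = @_;
    my @positions = ();
    while ($sequence =~ /$regexp/ig) {
        push(@positions, $-[0]);
    }
    return @positions;
}

main();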
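I am a beginner, in my first semester of bioinformatics. The thing is that we are not allowed to use LWP, so I don't know how to download the content of the website. Any idea how I should start? And how can I save it afterwards?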
Why can't you use LWP? It is the best way to download files in Perl. – user1937198
So I asked my assistant, and he said that I don't need to do that. He said you just download the file by hand, then open it and continue working from there.
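If the file is saved by hand first (for example with a browser), the remaining steps need only core Perl and no LWP. The following is a minimal sketch under that assumption: the sub names `read_lines` and `write_lines` and the output file name `uniprot_out.txt` are made up for illustration, and the input is assumed to sit in the current directory as `uniprotDB_part.txt`.

```perl
use strict;
use warnings;

# Read a file line by line and return the lines (newlines are kept).
sub read_lines {
    my ($filename) = @_;
    open(my $in, '<', $filename) or die "Cannot open $filename: $!";
    my @lines;
    while (my $line = <$in>) {
        push @lines, $line;
    }
    close($in);
    return @lines;
}

# Write a list of lines to a new text file; returns the number of lines written.
sub write_lines {
    my ($filename, @lines) = @_;
    open(my $out, '>', $filename) or die "Cannot write $filename: $!";
    print {$out} @lines;
    close($out) or die "Cannot close $filename: $!";
    return scalar @lines;
}

my $input  = 'uniprotDB_part.txt';    # assumed local copy of the downloaded file
my $output = 'uniprot_out.txt';       # hypothetical output file name

# Only run the copy when the input file is actually present.
if (-e $input) {
    my @lines = read_lines($input);
    my $count = write_lines($output, @lines);
    print "Copied $count lines from $input to $output\n";
}
```

Any reordering of the records would happen between the two calls, on the list of lines returned by `read_lines`.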