2014-02-19

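Basically, a GenBank file consists of gene entries (announced by 'gene'), each followed by its corresponding 'CDS' entry (only one per gene), like the two shown below. I would like to get locus_tag vs. product in a tab-delimited two-column file. 'gene' and 'CDS' are always preceded and followed by spaces. If this GenBank parsing task can be done easily with an existing tool, please let me know.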

Input file:

gene   complement(8972..9094) 
       /locus_tag="HAPS_0004" 
       /db_xref="GeneID:7278619" 
CDS    complement(8972..9094) 
       /locus_tag="HAPS_0004" 
       /codon_start=1 
       /transl_table=11 
       /product="hypothetical protein" 
       /protein_id="YP_002474657.1" 
       /db_xref="GI:219870282" 
       /db_xref="GeneID:7278619" 
       /translation="MYYKALAHFLPTLSTMQNILSKSPLSLDFRLLFLAFIDKR" 
gene   9632..11416 
       /gene="frdA" 
       /locus_tag="HAPS_0005" 
       /db_xref="GeneID:7278620" 
CDS    9632..11416 
       /gene="frdA" 
       /locus_tag="HAPS_0005" 
       /note="part of four member fumarate reductase enzyme 
       complex FrdABCD which catalyzes the reduction of fumarate 
       to succinate during anaerobic respiration; FrdAB are the 
       catalytic subcomplex consisting of a flavoprotein subunit 
       and an iron-sulfur subunit, respectively; FrdCD are the 
       membrane components which interact with quinone and are 
       involved in electron transfer; the catalytic subunits are 
       similar to succinate dehydrogenase SdhAB" 
       /codon_start=1 
       /transl_table=11 
       /product="fumarate reductase flavoprotein subunit" 
       /protein_id="YP_002474658.1" 
       /db_xref="GI:219870283" 
       /db_xref="GeneID:7278620" 
       /translation="MQTVNVDVAIVGAGGGGLRAAIAAAEANPNLKIALISKVYPMRS 
       HTVAAEGGAAAVAKEEDSYDKHFHDTVAGGDWLCEQDVVEYFVEHSPVEMTQLERWGC 
       PWSRKADGDVNVRRFGGMKIERTWFAADKTGFHLLHTLFQTSIKYPQIIRFDEHFVVD 
       ILVDDGQVRGCVAMNMMEGTFVQINANAVVIATGGGCRAYRFNTNGGIVTGDGLSMAY 
       RHGVPLRDMEFVQYHPTGLPNTGILMTEGCRGEGGILVNKDGYRYLQDYGLGPETPVG 
       KPENKYMELGPRDKVSQAFWQEWRKGNTLKTAKGVDVVHLDLRHLGEKYLHERLPFIC 
       ELAQAYEGVDPAKAPIPVRPVVHYTMGGIEVDQHAETCIKGLFAVGECASSGLHGANR 
       LGSNSLAELVVFGKVAGEMAAKRAVEATARNQAVIDAQAKDVLERVYALARQEGEESW 
       SQIRNEMGDSMEEGCGIYRTQESMEKTVAKIAELKERYKRIKVKDSSSVFNTDLLYKI 
       ELGYILDVAQSISSSAVERKESRGAHQRLDYVERDDVNYLKHTLAFYNADGTPTIKYS 
       DVKITKSQPAKRVYGAEAEAQEAAAKKE" 

Desired output (locus_tag vs. product in a tab-delimited two-column file):

HAPS_0004 hypothetical protein 
HAPS_0005 fumarate reductase flavoprotein subunit 

In fact, it would be ideal to have the following output, with one line per gene (shown here for only one gene):

locus_tag="HAPS_0004" db_xref="GeneID:7278619" complement(8972..9094) codon_start=1 transl_table=11 product="hypothetical protein" protein_id="YP_002474657.1" db_xref="GI:219870282" db_xref="GeneID:7278619" translation="MYYKALAHFLPTLSTMQNILSKSPLSLDFRLLFLAFIDKR" 

Try https://metacpan.org/pod/Bio::GenBankParser – frezik

Answer

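perl -nE'
    # Treat the text up to each CDS keyword as one record; join list items with a tab
    BEGIN{ ($/, $") = ("CDS", "\t") }
    # Capture locus_tag and product values; print the first two when both exist
    say "@r[0,1]" if @r= m!/(?:locus_tag|product)="(.+?)"!g and @r>1
' file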

Output

HAPS_0004  hypothetical protein 
HAPS_0005  fumarate reductase flavoprotein subunit
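
A note on robustness: the one-liner splits records on the literal string CDS, so it may misfire if those three letters turn up inside a /translation (C, D and S are all valid amino-acid codes) or a /note. Also, `.+?` will not match a /product value that wraps onto a second line. As an alternative, here is a minimal Python sketch (not the method used in the answer) that walks the feature lines one at a time. It assumes the input looks like the excerpt in the question: only feature lines and qualifier lines, locations that fit on one line, and no escaped quotes inside values. The names `parse_features`, `locus_tag_products` and `SAMPLE` are made up for illustration.

```python
import re

# A qualifier line looks like /name=value or a bare /name flag.
QUALIFIER = re.compile(r'^/(\w+)(?:=(.*))?$')

def parse_features(text):
    """Return a list of [key, location, [[name, value], ...]] in file order."""
    features = []
    in_quote = False  # True while inside a quoted value that spans lines
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if in_quote:
            qual = features[-1][2][-1]
            # Sequences are re-joined without spaces, free text with one space.
            sep = "" if qual[0] == "translation" else " "
            qual[1] += sep + line
            in_quote = not line.endswith('"')
            continue
        m = QUALIFIER.match(line)
        if m and features:
            value = m.group(2) or ""
            features[-1][2].append([m.group(1), value])
            closed = len(value) > 1 and value.endswith('"')
            in_quote = value.startswith('"') and not closed
            continue
        parts = line.split(None, 1)
        if len(parts) == 2:  # assumed to be "<feature key> <location>"
            features.append([parts[0], parts[1], []])
    # Drop the surrounding double quotes from every value.
    for _, _, quals in features:
        for qual in quals:
            qual[1] = qual[1].strip('"')
    return features

def locus_tag_products(features):
    """Return (locus_tag, product) pairs, one per CDS that has both."""
    rows = []
    for key, _location, quals in features:
        if key != "CDS":
            continue
        d = dict(quals)  # repeated names such as db_xref keep the last value
        if "locus_tag" in d and "product" in d:
            rows.append((d["locus_tag"], d["product"]))
    return rows

# Abbreviated from the input in the question (the note is shortened).
SAMPLE = '''\
gene   complement(8972..9094)
       /locus_tag="HAPS_0004"
       /db_xref="GeneID:7278619"
CDS    complement(8972..9094)
       /locus_tag="HAPS_0004"
       /codon_start=1
       /product="hypothetical protein"
       /translation="MYYKALAHFLPTLSTMQNILSKSPLSLDFRLLFLAFIDKR"
gene   9632..11416
       /gene="frdA"
       /locus_tag="HAPS_0005"
CDS    9632..11416
       /gene="frdA"
       /locus_tag="HAPS_0005"
       /note="part of four member fumarate reductase enzyme
       complex FrdABCD"
       /product="fumarate reductase flavoprotein subunit"
'''

if __name__ == "__main__":
    for row in locus_tag_products(parse_features(SAMPLE)):
        print("\t".join(row))
```

Because `parse_features` keeps every qualifier in file order together with the location, the same list could probably serve as a starting point for the one-line-per-gene format asked for in the question. For complete GenBank records with headers and sequence data, a dedicated parser such as the one suggested in the comment is likely the safer choice.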