2014-06-18

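Counting the number of residues in a file

I have a file like the one below. I want to count the number of characters in each entry, that is, the number of residues in each sequence.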

>1DMLA 
MTDSPGGVAPASPVEDASDASLGQPEEGAPCQVVLQGAELNGILQAFAPLRTSLLDSLLVMGDRGILIHNTIFGEQVFLP 
LEHSQFSRYRWRGPTAAFLSLVDQKRSLLSVFRANQYPDLRRVELAITGQAPFRTLVQRIWTTTSDGEAVELASETLMKR 
ELTSFVVLVPQGTPDVQLRLTRPQLTKVLNATGADSATPTTFELGVNGKFSVFTTSTCVTFAAREEGVSSSTSTQVQILS 
NALTKAGQAAANAKTVYGENTHRTFSVVVDDCSMRAVLRRLQVGGGTLKFFLTTPVPSLCVTATGPNAVSAVFLLKPQK 
>1DMLB 
DDVAARLRAAGFGAVGAGATAEETRRMLHRAFDTLA 
>2BHDC 
MTDSPGGVAPASPVEDASDASLGQPEEGAPCQVVLQGAELNGILQAFAPLRTSLLDSLLVMGDRGILIHNTIFGEQVFLP 
LEHSQFSRYRWRGPTAAFLSLVDQKRSLLSVFRANQYPDLRRVELAITGQAPFRTLVQRIWTTTSDGEAVELASETLMKR 
ELTSFVVLVPQGTPDVQLRLTRPQLTKVLNATGADSATPTTFELGVNGKFSVFTTSTCVTFAAREEGVSSSTSTQVQILS 

I tried the following code.

awk '/^>/ { res=substr($0, 2); } /^[^>]/ { print res " - " length($0); }' <file 

The output of the above code is

1DMLA - 80 
1DMLA - 80 
1DMLA - 80 
1DMLA - 79 
1DMLB - 36 
2BHDC - 80 
2BHDC - 80 
2BHDC - 80 

The output I expect is

1DMLA - 319 
1DMLB - 36 
2BHDC - 240 

How can I change the code above to get my expected output?

Best to avoid ' – Steve

Have you tested all of the solutions? – klashxx

Answers

Here is one way using awk:

awk '/^>/ && r { print r, "-", s; r=s="" } /^>/ { r = substr($0, 2); next } { s += length } END { print r, "-", s }' file 

Result:

1DMLA - 319 
1DMLB - 36 
2BHDC - 240 
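
The accumulate-then-print idea behind the one-liner can be sketched in a longer, commented form. This is only an illustration under a few assumptions: the function name `fasta_lengths` and the sample records `seqA` and `seqB` are hypothetical, and the sketch strips spaces, tabs and carriage returns before counting, which the one-liner above does not do.

```shell
# Hypothetical helper: prints "name - length" for every FASTA record.
fasta_lengths() {
    awk '
        /^>/ {                              # a header line starts a new record
            if (name != "") print name, "-", total   # flush the previous record
            name = substr($0, 2)            # drop the leading ">"
            sub(/[ \t\r]+$/, "", name)      # assumption: trailing whitespace is noise
            total = 0
            next
        }
        {
            line = $0
            gsub(/[ \t\r]/, "", line)       # assumption: whitespace is not a residue
            total += length(line)           # add this line to the running total
        }
        END { if (name != "") print name, "-", total }   # flush the last record
    ' "$@"
}

# Made-up sample input, shortened for readability.
printf '>seqA \nMTDSP \nGGV \n>seqB \nDDVAAR \n' | fasta_lengths
```

With the made-up input above this prints `seqA - 8` and `seqB - 6`. Lines that appear before the first header are counted but never printed, because no record name has been seen yet.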

Like this:

awk -F\> '/^>/ {if (seqlen != ""){print seqlen}printf("%s - ",$2);seqlen=0;next}seqlen != ""{seqlen +=length($0)}END{print seqlen}' infile 

Or, formatted:

awk -F\> '/^>/ { if (seqlen != "")
                     print seqlen
                 printf("%s - ", $2)
                 seqlen = 0
                 next }
          seqlen != "" { seqlen += length($0) }
          END { print seqlen }' infile

See: Sequence length of FASTA file

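Besides producing the expected results, this also handles unexpected file formats such as the one below, where sequence lines appear before the first header and one header has no sequence at all.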

$ cat infile 
MTDSPGGVAPASPVEDASDASLGQPEEGAPCQVVLQGAELNGILQAFAPLRTSLLDSLLVMGDRGILIHNTIFGEQVFLP 
LEHSQFSRYRWRGPTAAFLSLVDQKRSLLSVFRANQYPDLRRVELAITGQAPFRTLVQRIWTTTSDGEAVELASETLMKR 
ELTSFVVLVPQGTPDVQLRLTRPQLTKVLNATGADSATPTTFELGVNGKFSVFTTSTCVTFAAREEGVSSSTSTQVQILS 
NALTKAGQAAANAKTVYGENTHRTFSVVVDDCSMRAVLRRLQVGGGTLKFFLTTPVPSLCVTATGPNAVSAVFLLKPQK 
>1DMLB 
>2BHDC 
MTDSPGGVAPASPVEDASDASLGQPEEGAPCQVVLQGAELNGILQAFAPLRTSLLDSLLVMGDRGILIHNTIFGEQVFLP 
LEHSQFSRYRWRGPTAAFLSLVDQKRSLLSVFRANQYPDLRRVELAITGQAPFRTLVQRIWTTTSDGEAVELASETLMKR 
ELTSFVVLVPQGTPDVQLRLTRPQLTKVLNATGADSATPTTFELGVNGKFSVFTTSTCVTFAAREEGVSSSTSTQVQILS 


$ awk -F\> '/^>/ {if (seqlen != ""){print seqlen}printf("%s - ",$2);seqlen=0;next}seqlen != ""{seqlen +=length($0)}END{print seqlen}' infile 
1DMLB - 0 
2BHDC - 240 

awk -v RS='>' '$1{gsub("[\r]", "",$1); 
       printf "%s - %d\n", $1, length($0) - length($1) - NF + 1}' input 
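
The record-separator idea can also be written out more explicitly. The following is a sketch of a related variant, not the exact command above: it assumes `>` never occurs inside a sequence, splits each record into lines with `FS = "\n"`, and sums the line lengths in a loop. The function name `fasta_lengths_rs` and the sample records are hypothetical.

```shell
# Hypothetical helper: treats ">" as the record separator, so each awk
# record is one FASTA entry (header line followed by its sequence lines).
fasta_lengths_rs() {
    awk 'BEGIN { RS = ">"; FS = "\n" }      # one record per entry, one field per line
        NR > 1 {                            # record 1 is whatever precedes the first ">"
            name = $1                       # first line of the record is the header
            sub(/[ \t\r]+$/, "", name)      # assumption: trailing whitespace is noise
            total = 0
            for (i = 2; i <= NF; i++) {     # remaining fields are sequence lines
                line = $i
                gsub(/[ \t\r]/, "", line)   # assumption: whitespace is not a residue
                total += length(line)
            }
            print name, "-", total
        }
    ' "$@"
}

# Made-up sample input, shortened for readability.
printf '>seqA \nMTDSP \nGGV \n>seqB \nDDVAAR \n' | fasta_lengths_rs
```

With the made-up input this prints `seqA - 8` and `seqB - 6`. Summing the fields in a loop avoids relying on how `$0` is rebuilt after a field is modified, at the cost of a few more lines.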

Could you elaborate on what changes you made and what those changes do, for future reference? –