2017-06-08 28 views
Remove strings and add sequential numbers to the headers of a FASTA file with awk or sed

I have the following input:

>Thimo_0001|ID:40710520| hypothetical protein [Thioflavicoccus mobilis 8321] 
LIAPTMILRIRLTEFCPMRTEGFEE 
TGIGPLDSRMPRYDDVVHHREIIT 
YPPEALSNDPFDPTSIDGSPSAFF* 
>ThimoAM_0002|ID:40707134| protein of unknown function [Thioflavicoccus mobilis 8321] 
VRKAERDSPCKRRGADRSFP 
KSARLISSKAFRDVFAESITNSDPFFVVR 
ARPNLAETARLGIAVSKKCARRSVDRSRIKRII 
RESFRWVRNDLPVMDYVVIARHAAVKRTNPRLFESLRSHWTKFSEPDA* 
>Thimo_0002|ID:40710524| ribonuclease P protein component [Thioflavicoccus mobilis 8321] 
MILLIRLRSTDRRAHFFDTAIPNLAVSARLGRAR 
TTKNGSEFVMDSAKTSRNAFEEISLADFGKERSAP 
RRLQGESLSAFRTTRGQDEPATFRCPTRPKPMCMRAL* 

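And I want to:

  1. Remove the line breaks from the lines that do not start with > (join each sequence onto one line)
  2. Remove the asterisks
  3. Change the FASTA headers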

I can do steps 1 and 2:

awk '!/^>/ { printf "%s", $0; n = "\n" } /^>/ { print n $0; n = "" } END { printf "%s", n }' 
sed "s/\*//g" 

and I can also add a sequential number to the end of each header line:

awk '/^>/{$0=$0"_"(++i)}1' 

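But I am failing at the last step: combining the replacement/deletion in the header with adding the sequential number.

Desired output: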

>TM0001|hypothetical_protein 
LIAPTMILRIRLTEFCPMRTEGFEETGIGPLDSRMPRYDDVVHHREIITYPPEALSNDPFDPTSIDGSPSAFF 
>TM0002|protein_of_unknown_function 
VRKAERDSPCKRRGADRSFPKSARLISSKAFRDVFAESITNSDPFFVVRARPNLAETARLGIAVSKKCARRSVDRSRIKRIIRESFRWVRNDLPVMDYVVIARHAAVKRTNPRLFESLRSHWTKFSEPDA 
>TM0003|ribonuclease_P_protein_component 
MILLIRLRSTDRRAHFFDTAIPNLAVSARLGRARTTKNGSEFVMDSAKTSRNAFEEISLADFGKERSAPRRLQGESLSAFRTTRGQDEPATFRCPTRPKPMCMRAL 

Answer

Based on your "desired" output, a GAWK solution:

awk 'BEGIN{ RS=">"; FS="[|\\]\\[]" }!$0{ next } 
    { gsub(/^ */,"",$3); gsub(/[*[:space:]]/,"",$5); printf(">TM%04d|%s\n%s\n",++c,$3,$5) 
}' yourfile 

Output:

>TM0001|hypothetical protein 
LIAPTMILRIRLTEFCPMRTEGFEETGIGPLDSRMPRYDDVVHHREIITYPPEALSNDPFDPTSIDGSPSAFF 
>TM0002|protein of unknown function 
VRKAERDSPCKRRGADRSFPKSARLISSKAFRDVFAESITNSDPFFVVRARPNLAETARLGIAVSKKCARRSVDRSRIKRIIRESFRWVRNDLPVMDYVVIARHAAVKRTNPRLFESLRSHWTKFSEPDA 
>TM0003|ribonuclease P protein component 
MILLIRLRSTDRRAHFFDTAIPNLAVSARLGRARTTKNGSEFVMDSAKTSRNAFEEISLADFGKERSAPRRLQGESLSAFRTTRGQDEPATFRCPTRPKPMCMRAL 
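
Note that this output keeps the spaces inside the description, whereas the desired output in the question uses underscores. The following is a minimal sketch of one way to get underscores, not the answerer's code. It assumes every header follows the `name|ID:...| description [organism]` layout shown in the question. The function name `rename_fasta` is hypothetical, and the field separator is the same set of characters as in the answer, written in the portable bracket form `[][|]`.

```shell
#!/bin/sh
# Hypothetical helper: rewrite FASTA headers as >TMnnnn|description_with_underscores
# and join each sequence onto one line without the trailing asterisk.
# Assumption: every header looks like  >name|ID:...| description [organism]
rename_fasta() {
    awk 'BEGIN { RS = ">"; FS = "[][|]" }     # split fields on any of ] [ |
    !$0 { next }                              # skip the empty record before the first ">"
    {
        desc = $3; seq = $5
        gsub(/^[ \t\r\n]+|[ \t\r\n]+$/, "", desc)   # trim both ends of the description
        gsub(/[ \t\r\n]+/, "_", desc)         # remaining runs of whitespace become "_"
        gsub(/[* \t\r\n]/, "", seq)           # drop asterisks, spaces and line breaks
        printf(">TM%04d|%s\n%s\n", ++c, desc, seq)
    }' "$@"
}
```

It can be called as `rename_fasta yourfile > renamed.fasta`, or fed from a pipe. Working on copies (`desc`, `seq`) rather than on `$3` and `$5` is a design choice here; it avoids rebuilding `$0` on every substitution.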

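Details:

  • RS=">" - treat > as the record separator, so each record is one header plus its sequence lines

  • FS="[|\\]\\[]" - the field separator is any one of the characters |, [ or ]

  • !$0{ next } - skip empty records (the text before the first > is an empty record)

  • gsub(/^ */,"",$3) - remove leading spaces from the third field

  • gsub(/[*[:space:]]/,"",$5) - remove the asterisk * and all whitespace characters, including line breaks, from the fifth field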
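
To see how these separators split a record, the sketch below prints every field of each record in square brackets, with line breaks shown as `<NL>`. It is only an illustration: the function name `show_fields` is made up, and the field separator is the same character set as above written in the portable form `[][|]`.

```shell
#!/bin/sh
# Hypothetical helper: print each field of every non-empty record, one per line.
# Records are separated by ">", fields by any of ] [ |
show_fields() {
    awk 'BEGIN { RS = ">"; FS = "[][|]" }
    !$0 { next }                          # skip the empty record before the first ">"
    {
        for (i = 1; i <= NF; i++) {
            f = $i
            gsub(/\n/, "<NL>", f)         # make embedded line breaks visible
            printf("$%d=[%s]\n", i, f)
        }
    }' "$@"
}
```

For a record from the question's input this shows five fields: the sequence name, the ID, the description (still with its surrounding spaces), the organism name, and finally all sequence lines together with their line breaks. That is why the answer cleans up `$3` and `$5`.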