1
我有以下input
:删除字符串,并添加序列号,文件用awk或sed的头
>Thimo_0001|ID:40710520| hypothetical protein [Thioflavicoccus mobilis 8321]
LIAPTMILRIRLTEFCPMRTEGFEE
TGIGPLDSRMPRYDDVVHHREIIT
YPPEALSNDPFDPTSIDGSPSAFF*
>ThimoAM_0002|ID:40707134| protein of unknown function [Thioflavicoccus mobilis 8321]
VRKAERDSPCKRRGADRSFP
KSARLISSKAFRDVFAESITNSDPFFVVR
ARPNLAETARLGIAVSKKCARRSVDRSRIKRII
RESFRWVRNDLPVMDYVVIARHAAVKRTNPRLFESLRSHWTKFSEPDA*
>Thimo_0002|ID:40710524| ribonuclease P protein component [Thioflavicoccus mobilis 8321]
MILLIRLRSTDRRAHFFDTAIPNLAVSARLGRAR
TTKNGSEFVMDSAKTSRNAFEEISLADFGKERSAP
RRLQGESLSAFRTTRGQDEPATFRCPTRPKPMCMRAL*
而且我想
- 删除行的换行符的头开始
>
后 - 删除星号
- 更改fasta标头
我可以做1.
和2.
awk '!/^>/ { printf "%s", $0; n = "\n" } /^>/ { print n $0; n = "" } END { printf "%s", n }'
sed "s/\*//g"
,我还可以添加一个序列号,标题行的末尾:
awk '/^>/{$0=$0"_"(++i)}1'
但我在与最后一步失败替换/删除和添加序号:
想要的output
>TM0001|hypothetical_protein
LIAPTMILRIRLTEFCPMRTEGFEETGIGPLDSRMPRYDDVVHHREIITYPPEALSNDPFDPTSIDGSPSAFF
>TM0002|protein_of_unknown_function
VRKAERDSPCKRRGADRSFPKSARLISSKAFRDVFAESITNSDPFFVVRARPNLAETARLGIAVSKKCARRSVDRSRIKRIIRESFRWVRNDLPVMDYVVIARHAAVKRTNPRLFESLRSHWTKFSEPDA
>TM0003|ribonuclease_P_protein_component
MILLIRLRSTDRRAHFFDTAIPNLAVSARLGRARTTKNGSEFVMDSAKTSRNAFEEISLADFGKERSAPRRLQGESLSAFRTTRGQDEPATFRCPTRPKPMCMRAL