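I have a FASTA file (testfile.fa) that contains header lines (starting with >) and lines of characters representing nucleotides (A, C, G, T, a, g, c, t, N). How can I read the file line by line and change characters depending on whether the line contains a specific value?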
>chr1
cccccccccttttttttaaaa
AAAACCCCTTCCCCCCCCGGG
GGGGGGGGGGGGGGGGGGGGG
TTTTTTTTTTTTTTTTTTTTT
>chr1_alt
TCTCTCTCTCTCTCTCTCTCT
gggtttccccccccccccccc
CGCGCGCGCGCGCGCGCGCGC
CCCCCAAAAAAAAAAAAAAAA
>chr2
CCCCCCCCCCCCCCCCCCCCC
TTTTTTTTTTTTATTTTTTTT
>chr3
AAAAAAAAAAAAAAAAAAAAA
GGGGGGGGGGGGGGGGGGGGG
TTTTTTTTTTTTTTTTTTTTT
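I need to read this file line by line and change the lowercase characters (a, c, t, g) to N in every sequence, but not in the header lines, which contain >. So I used the following code: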
#!/bin/bash
while read line
do
    if [[ $line =~ ">" ]]
    then
        echo $line
    else
        tr 'c' 'N'
        echo $line
    fi
done < testfile.fa
But the result is confusing:
>chr1
# the first line was missed
AAAACCCCTTCCCCCCCCGGG
GGGGGGGGGGGGGGGGGGGGG
TTTTTTTTTTTTTTTTTTTTT
>Nhr1_alt #the character was changed but the line contains >
TCTCTCTCTCTCTCTCTCTCT
gggtttNNNNNNNNNNNNNNN
CGCGCGCGCGCGCGCGCGCGC
CCCCCAAAAAAAAAAAAAAAA
>Nhr2 #the character was changed but the line contains >
CCCCCCCCCCCCCCCCCCCCC
TTTTTTTTTTTTATTTTTTTT
>Nhr3 #the character was changed but the line contains >
AAAAAAAAAAAAAAAAAAAAA
GGGGGGGGGGGGGGGGGGGGG
TTTTTTTTTTTTTTTTTTTTTcccccccccttttttttaaaa #the first line from the first sequence comes here
What could be causing these problems, and how can I fix them? Thank you in advance!
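In the loop above, `tr 'c' 'N'` is given no input of its own, so it reads from the loop's standard input, which is the rest of testfile.fa. That would explain both the altered headers and the first sequence line showing up at the end. The following is a minimal sketch of one possible fix, assuming the goal is to mask a, c, g and t on sequence lines only. The function name `mask_fasta` is hypothetical and not from the original post.

```shell
#!/bin/bash
# Sketch: replace lowercase a, c, g, t with N on sequence lines only.
# mask_fasta is a hypothetical helper name used for illustration.
mask_fasta() {
    # IFS= and -r keep each line verbatim; the || test also processes
    # a final line that has no trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        case $line in
            '>'*)
                # Header line: print it unchanged.
                printf '%s\n' "$line"
                ;;
            *)
                # Sequence line: pipe only this line into tr, so tr
                # cannot consume the rest of the input file.
                printf '%s\n' "$line" | tr 'acgt' 'NNNN'
                ;;
        esac
    done < "$1"
}
```

It could be called as `mask_fasta testfile.fa > masked.fa`, where `masked.fa` is an arbitrary output name. Starting one `tr` process per line is slow on large genomes, so this is meant to show the logic rather than to be the fastest approach.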
Thanks! But then the next problem came up: when I try this code, I lose the last line of my sequence, and for ">chr3" I get only this: AAAAAAAAAAAAAAAAAAAAA GGGGGGGGGGGGGGGGGGGGG. What could be the explanation?
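@N.Kn Hmm... I just used the same example and got the correct result. If you are sure you really are losing the last line, you now need to debug. For example, you could delete all the lines after '>chr3' and run the script to see what happens. Or you could delete the missing line and run the script... or you could modify the last line and run the script... Enjoy your biology stuff :D Btw, note that the file name in the script is 't.file', while yours is 'testfile.fa'. – Yves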
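Thanks again :) Interestingly, the last line is always lost in all of these debugging variants... Hmm, I think I'll keep going with these experiments :)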
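On the lost last line discussed in the comments above: one common cause in a `while read` loop is a file whose final line has no trailing newline. In that case `read` returns a non-zero status for that line and the loop body is skipped for it. The output in the question, where the displaced line is glued directly onto the last T line, is consistent with such a file. The following sketch assumes that this is the cause and that the code being tried uses a `while read` loop. The function names are hypothetical.

```shell
#!/bin/bash
# Succeeds if the last byte of the file is a newline (or the file is empty).
# Command substitution strips a trailing newline, leaving an empty string.
ends_with_newline() {
    [ -z "$(tail -c 1 "$1")" ]
}

# Plain loop: a final line without a trailing newline is never counted.
count_lines_plain() {
    n=0
    while IFS= read -r line; do
        n=$((n + 1))
    done < "$1"
    echo "$n"
}

# Guarded loop: the extra test keeps the loop going when read fails
# but has still stored a partial last line in $line.
count_lines_guarded() {
    n=0
    while IFS= read -r line || [ -n "$line" ]; do
        n=$((n + 1))
    done < "$1"
    echo "$n"
}
```

If `ends_with_newline testfile.fa` fails, either append a newline to the file or use the guarded form of the loop condition.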