2017-07-05 29 views
0

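I have a FASTA file (testfile.fa) that contains header lines (starting with >) and lines of characters representing nucleotides (A, C, G, T, a, g, c, t, N). How do I read the file line by line and change characters depending on whether the line contains a specific value?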

>chr1 
cccccccccttttttttaaaa 
AAAACCCCTTCCCCCCCCGGG 
GGGGGGGGGGGGGGGGGGGGG 
TTTTTTTTTTTTTTTTTTTTT 
>chr1_alt 
TCTCTCTCTCTCTCTCTCTCT 
gggtttccccccccccccccc 
CGCGCGCGCGCGCGCGCGCGC 
CCCCCAAAAAAAAAAAAAAAA 
>chr2 
CCCCCCCCCCCCCCCCCCCCC 
TTTTTTTTTTTTATTTTTTTT 
>chr3 
AAAAAAAAAAAAAAAAAAAAA 
GGGGGGGGGGGGGGGGGGGGG 
TTTTTTTTTTTTTTTTTTTTT 

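I need to read this file line by line and change the lowercase characters (a, c, t, g) to N in every sequence line, leaving the header lines (the ones containing >) untouched. So I used the following code: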

#!/bin/bash 
while read line 
do 
    if [[ $line =~ ">" ]] 
    then 
     echo $line 
    else 
     tr 'c' 'N' 
     echo $line 
    fi 
done < testfile.fa 

But the result is confusing:

>chr1 
# the first line was missed 
AAAACCCCTTCCCCCCCCGGG 
GGGGGGGGGGGGGGGGGGGGG 
TTTTTTTTTTTTTTTTTTTTT 
>Nhr1_alt #the character was changed but the line contains > 
TCTCTCTCTCTCTCTCTCTCT 
gggtttNNNNNNNNNNNNNNN 
CGCGCGCGCGCGCGCGCGCGC 
CCCCCAAAAAAAAAAAAAAAA 
>Nhr2 #the character was changed but the line contains > 
CCCCCCCCCCCCCCCCCCCCC 
TTTTTTTTTTTTATTTTTTTT 
>Nhr3 #the character was changed but the line contains > 
AAAAAAAAAAAAAAAAAAAAA 
GGGGGGGGGGGGGGGGGGGGG 
TTTTTTTTTTTTTTTTTTTTTcccccccccttttttttaaaa #the first line from the first sequence comes here 

What could be causing these problems, and how can I fix them? Thank you in advance!

Answers

0

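You are using tr the wrong way. Called on its own, tr reads from standard input, which inside the loop is the redirected file itself, so it consumes every remaining line in one go and the loop then only echoes the unchanged line it had already read. Pipe each line into tr instead.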

Here is my script:

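#!/bin/bash
# [[ ... =~ ... ]] is a bash feature, so run this with bash rather than sh.
while IFS= read -r line
do
    if [[ $line =~ ">" ]]
    then
     echo "$line"
    else
     # Feed only the current line to tr.
     echo "$line" | tr 'c' 'N'
    fi
done < t.file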
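
The difference between a bare tr and a piped tr can be reproduced on a small file. The following is a minimal sketch, assuming a hypothetical demo file created with mktemp (it does not touch testfile.fa); the variable names bare and piped are illustrative only.

```bash
#!/bin/bash
# Hypothetical demo input (not the real testfile.fa).
demo=$(mktemp)
printf '>chr1\nccc\nGGG\n>chr2\nccc\n' > "$demo"

# Bare tr: it inherits the loop's stdin, so it swallows the rest of the file.
bare=$(while IFS= read -r line; do
    case "$line" in
        '>'*) echo "$line" ;;
        *)    tr 'c' 'N'; echo "$line" ;;
    esac
done < "$demo")

# Piped tr: it only sees the line that echo hands to it.
piped=$(while IFS= read -r line; do
    case "$line" in
        '>'*) echo "$line" ;;
        *)    echo "$line" | tr 'c' 'N' ;;
    esac
done < "$demo")

printf 'bare tr:\n%s\n\npiped tr:\n%s\n' "$bare" "$piped"
rm -f "$demo"
```

With the bare call, the header >chr2 comes out as >Nhr2 and the first sequence line is printed last, which mirrors the symptoms described in the question.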
+0

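Thank you! But then the next problem appeared: when I try this code, I lose the last line of my sequence, and for ">chr3" I get only this: AAAAAAAAAAAAAAAAAAAAA GGGGGGGGGGGGGGGGGGGGG. What could be the explanation? –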

+0

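@N.Kn Hmm... I just used the same example and got the correct result. If you can confirm that you really are losing the last line, you now need to debug. For example, you could delete all the lines after '>chr3' and run the script to see what happens. Or you could delete the missing line and run the script... or you could modify the last line and run the script... Enjoy your biology stuff :D Btw, note that the file name in the script is 't.file', while yours is 'testfile.fa'. – Yves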

+0

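Thank you again :) Interestingly, the last line is always lost in all of these debugging variants... Hmm, I think I'd like to continue these experiments :) –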
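
One common cause of a vanishing last line is a file whose final line has no trailing newline: read still fills the variable but returns a non-zero status, so the loop body never runs for that line. Whether this applies to the file discussed here is an assumption, since the thread never confirms it. A minimal sketch with a hypothetical demo file:

```bash
#!/bin/bash
# Hypothetical demo file: printf writes NO newline after the last line.
demo=$(mktemp)
printf '>chr3\nAAAA\nggTT' > "$demo"

# Plain loop: read fails on the unterminated last line, so that line is skipped.
plain=$(while IFS= read -r line; do
    echo "$line"
done < "$demo")

# Guarded loop: '|| [ -n "$line" ]' still processes a non-empty final fragment.
guarded=$(while IFS= read -r line || [ -n "$line" ]; do
    case "$line" in
        '>'*) echo "$line" ;;
        *)    echo "$line" | tr 'actg' 'NNNN' ;;
    esac
done < "$demo")

printf 'plain loop:\n%s\n\nguarded loop:\n%s\n' "$plain" "$guarded"
rm -f "$demo"
```

To check a real file, `tail -c 1 testfile.fa | od -c` shows whether the last byte is a newline.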

0

To change all lowercase characters in a single awk statement, we can use:

awk '{ if (substr($0,1,1) != ">") { stat="";for (i=1;i<=length($0);i++) { if (substr($0,i,1) ~ /[[:lower:]]/) { stat=stat"N" } else stat=stat substr($0,i,1) } print stat } else { print $0 } }' testfile.fa 

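We use awk's substr function and simply print any line that has > as its first character. For every other line we build up a variable stat character by character, replacing each lowercase letter with N, and then print the final stat.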

+0

If it works, please upvote. Thanks –

1

Using awk:

$ awk '/^[^>]/{gsub(/[actg]/,"N")}1' file 
>chr1 
NNNNNNNNNNNNNNNNNNNNN 
AAAACCCCTTCCCCCCCCGGG 
GGGGGGGGGGGGGGGGGGGGG 
TTTTTTTTTTTTTTTTTTTTT 
>chr1_alt 
TCTCTCTCTCTCTCTCTCTCT 
NNNNNNNNNNNNNNNNNNNNN 
CGCGCGCGCGCGCGCGCGCGC 
CCCCCAAAAAAAAAAAAAAAA 
>chr2 
CCCCCCCCCCCCCCCCCCCCC 
TTTTTTTTTTTTATTTTTTTT 
>chr3 
AAAAAAAAAAAAAAAAAAAAA 
GGGGGGGGGGGGGGGGGGGGG 
TTTTTTTTTTTTTTTTTTTTT 

Explanation:

/^[^>]/ {    # if the record starts with anything but > 
    gsub(/[actg]/,"N") # replace all actg with N 
}1      # output 
+0

Thank you very much, it works perfectly! –

0

sed is a simple way to achieve this:

sed -i '/^>/ !s/[actg]/N/g' testfile.fa 

The [] contains the characters that will be changed to N, and the /^>/ ! part skips the lines that start with >.

-i overwrites the current file; without it, the result is written to standard output instead.
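
Because -i replaces the original data, it can be worth trying the command on a copy first. The sketch below is one cautious way to do that, assuming a hypothetical scratch directory and demo file; the names demo.fa and demo.masked.fa are illustrative only. It shows the run without -i (output redirected to a new file) and the -i.bak form, which edits in place but keeps a backup.

```bash
#!/bin/bash
# Hypothetical scratch directory so the real testfile.fa is not touched.
work=$(mktemp -d)
printf '>chr1\nccttAA\nGGGG\n' > "$work/demo.fa"

# Without -i: the input stays unchanged and the result goes to a new file.
sed '/^>/!s/[actg]/N/g' "$work/demo.fa" > "$work/demo.masked.fa"

# With -i.bak: edit in place, keeping the original as demo.fa.bak.
sed -i.bak '/^>/!s/[actg]/N/g' "$work/demo.fa"

masked=$(cat "$work/demo.masked.fa")
inplace=$(cat "$work/demo.fa")
backup=$(cat "$work/demo.fa.bak")

printf 'masked copy:\n%s\n\nbackup of original:\n%s\n' "$masked" "$backup"
rm -rf "$work"
```

The suffix is attached directly to -i here because that spelling is accepted by both GNU and BSD sed.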

+0

Thank you for this elegant solution! I'm afraid I know very little about sed, but I really like this variant –
