2017-09-14 62 views
0

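awk: add text to each pattern in a field

In the awk below I am trying to append :p.= to each entry in $7, but only if the field matches the pattern /NM/. This seems to work when $7 contains a single NM entry, as in line 2. However, when $7 contains multiple NM entries, as in line 3, :p.= is only added at the very end. A ; separates multiple NM entries within the field. I added comments, but I am not sure what I am missing. Thank you :).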

Input (tab-delimited)

R_Index Chr Start End Ref Alt Detail.refGene Gene.refGene 
1 chr1 948846 948846 - A dist=1 ISG15 
2 chr1 948870 948870 C G NM_005101:c.-84C>G ISG15 
3 chr1 948921 948921 T C NM_005101:c.-33T>C;NM_005101:c.-84C>G ISG15 
4 chr1 949654 949654 A G . ISG15 

AWK

awk ' 
    BEGIN { FS=OFS="\t" } # define FS and OFS as tab and start processing 
    $7 ~ /NM/ {   # look for pattern NM in $7 
     # split $7 by ";" and cycle through them 
      i=split($7,NM,";") 
      for (n=1; n<=i; n++) { 
       sub("$", ":p=", $7) # add :p. to end off each $7 before the ; 
    }  # close block 
}1' input # define input file 

Current output (tab-delimited)

R_Index Chr Start End Ref Alt Detail.refGene Gene.refGene 
1 chr1 948846 948846 - A dist=1 ISG15 
2 chr1 948870 948870 C G NM_005101:c.-84C>G:p.= ISG15 
3 chr1 948921 948921 T C NM_005101:c.-33T>C;NM_005101:c.-84C>G:p.=p.= ISG15 
4 chr1 949654 949654 A G . ISG15 

Desired output (tab-delimited)

R_Index Chr Start End Ref Alt Detail.refGene Gene.refGene 
1 chr1 948846 948846 - A dist=1 ISG15 
2 chr1 948870 948870 C G NM_005101:c.-84C>G:p.= ISG15 
3 chr1 948921 948921 T C NM_005101:c.-33T>C:p.=;NM_005101:c.-84C>G:p.= ISG15 
4 chr1 949654 949654 A G . ISG15 
+3

Who comes up with these horrible formats? It's neither machine nor human friendly. – karakfa

+0

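Sorry, I tried to indent the code to make it more readable, but unfortunately this is how the file comes from the instrument.... I thought viewing it in Excel might help. Thank you :). – Chris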

+2

':p.=' drooling on a tie? :D –

Answers

2

Replace this:

 i=split($7,NM,";") 
     for (n=1; n<=i; n++) { 
      sub("$", ":p=", $7) # add :p. to end off each $7 before the ; 
     } 

with this:

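     out="" 
     i=split($7,NM,/;/)   # i is the number of ;-separated entries 
     for (n=1; n<=i; n++) { 
      sub(/$/, ":p.=", NM[n]) # add :p.= to the end of each NM[n], indexed by the loop variable n 
      out = (out=="" ? "" : out";") NM[n]  # rejoin the entries with ; 
     } 
     $7 = out      # assign the rebuilt string back to $7 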
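
For reference, here is a rough sketch of how the whole thing might look as a complete script. The original loop fails because sub("$", ..., $7) always operates on the whole field, so each pass appends to the end of $7 rather than to an individual entry. The function names append_p_split and append_p_gsub are made up for illustration, and the sketch assumes tab-delimited input with the NM entries in column 7, as in the sample above. The second variant is an alternative not taken from the answer: it lets gsub() do the looping by matching each run of characters that starts with NM and stops before the next ;.

```shell
#!/bin/sh
# Sketch only: the function names are hypothetical, chosen for illustration.

# Variant 1: split $7 on ";" and rebuild it, suffixing each NM entry.
append_p_split() {
    awk '
    BEGIN { FS = OFS = "\t" }
    $7 ~ /NM/ {
        count = split($7, parts, ";")        # count = number of entries
        out = ""
        for (k = 1; k <= count; k++) {
            if (parts[k] ~ /NM/)             # only suffix entries containing NM
                parts[k] = parts[k] ":p.="
            out = (k == 1 ? "" : out ";") parts[k]
        }
        $7 = out                             # assigning $7 rebuilds $0 using OFS
    }
    1' "$@"
}

# Variant 2: gsub() replaces every NM entry with itself ("&") plus the suffix.
append_p_gsub() {
    awk '
    BEGIN { FS = OFS = "\t" }
    $7 ~ /NM/ { gsub(/NM[^;]*/, "&:p.=", $7) }
    1' "$@"
}

# Demo on row 3 of the sample input.
printf '3\tchr1\t948921\t948921\tT\tC\tNM_005101:c.-33T>C;NM_005101:c.-84C>G\tISG15\n' | append_p_split
```

Either function can also be given a file name, e.g. append_p_split input, since the arguments are passed straight through to awk. Rows whose $7 does not contain NM are printed unchanged.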