awk: print a field when a match meets a condition, or a default value when there is no match between two files

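I am trying to use awk to match each line of file against $2 in list. Both files are tab-delimited, and the name to be matched in list may contain spaces or special characters: for example, the name in file is BRCA1 but the name in list is BRCA 1, or the name in file is BCR but the name in list is BCR/ABL.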

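If there is a match and $4 of list contains full gene sequence, the matching name and its code ($1 of list) are printed, separated by a tab. If no match is found, the unmatched name and 14 are printed, separated by a tab. The awk below runs but produces no output. Thank you :).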

file

BRCA1 
BCR 
SCN1A 
fbn1

list

List code gene gene name methodology 
81 DMD dystrophin deletion analysis and duplication analysis 
811 BRCA 1 BRCA2 full gene sequence and full deletion/duplication analysis 
70 ABL1 ABL1 gene analysis variants in the kinse domane 
71 BCR/ABL t(9;22) full gene sequence

awk

awk -F'\t' -v OFS="\t" 'FNR==NR{A[$1]=$0;next} ($2 in A){if($4=="full gene sequence"){print A[$2],$1}} ELSE {print A[$2],"14"}' file list

Desired output

BRCA1 811 
BCR 71 
SCN1A 14 
fbn1  85

Edit

List code gene gene name methodology 
85 fbn1 Fibrillin full gene sequencing 
95 FBN1 fibrillin del/dup

Result

85 fbn1 Fibrillin full gene sequencing

Because this is the only line that contains full gene sequencing, it is the only one printed.

Source

2017-03-17 Chris

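Define 'match': string or regexp? Partial or full? Case-sensitive or case-insensitive? Without that information you are likely to get a solution that works for one particular set of test input but fails on your real data six months from now. You now have two different solutions, each making very different assumptions about what 'match' means, and each will behave differently across different inputs, even though they produce the same output from the sample input you provided.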
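To make the distinction in this comment concrete, here is a minimal sketch comparing three notions of "match" on two name pairs taken from the question. The helper names norm and report are hypothetical, and "normalized prefix" (lowercase, spaces removed, name at the start of the field) is only one possible reading of the requirement, not necessarily the intended one.

```shell
demo=$(awk '
# norm() is a hypothetical helper: lowercase the string and strip spaces.
function norm(s) { s = tolower(s); gsub(/ /, "", s); return s }

# report() prints 1 or 0 for each interpretation of "match".
function report(name, field) {
    printf "%s vs %s: regexp=%d exact=%d normalized_prefix=%d\n", name, field,
        (field ~ name), (field == name), (index(norm(field), norm(name)) == 1)
}

BEGIN {
    report("BCR", "BCR/ABL")
    report("brca1", "BRCA 1")
}')
printf '%s\n' "$demo"
```

The same pair can count as a match under one interpretation and not under another, which is why the definition needs to be settled before choosing a solution.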

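The match is a full, case-insensitive string match, i.e. "BRCA1" is a match, but it could also be "brca1" or "brca 1". Also, I just noticed that '$4', i.e. 'full gene sequence', was not taken into account; since there can be multiple entries for the same match, that is what makes the match unique. I have also added an example to the post. Thank you :). – Chris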

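A name in 'file' will match a string in '$2' of 'list'. In 'list' the matching name may be only part of the string, but it is always the complete name from 'file'. That is why the name 'BCR' matches the '$2' string 'BCR/ABL' in 'list'. Thank you :). – Chris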

You can try:

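awk 'BEGIN{FS=OFS="\t"}
FNR==NR{                      # first file: list
    if(NR>1){                 # skip the header row
     gsub(" ","",$2)          # remove spaces, e.g. "BRCA 1" -> "BRCA1"
     split($2,v,"/")          # "BCR/ABL" -> v[1]="BCR"
     d[v[1]] = $1             # first element of the split as key, code as value
    }
    next
}{print $1, ($1 in d?d[$1]:14)}' list file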

You get:

 
BRCA1 811 
BCR 71 
SCN1A 14

Source

2017-03-17 13:21:13

Thank you both very much :). – Chris

awk 'FNR==NR{ 
      a[$2]=$1; 
      next 
     } 
    { 
     for(i in a){ 
      if($1 ~ i || i ~ $1){ print $1, a[i] ; next } 
     } 
     print $1,14 
    }' list file

Input

$ cat list 
List code gene gene name methodology 
81 DMD dystrophin deletion analysis and duplication analysis 
811 BRCA 1 BRCA2 full gene sequence and full deletion/duplication analysis 
70 ABL1 ABL1 gene analysis variants in the kinse domane 
71 BCR/ABL t(9;22) full gene sequence 

$ cat file 
BRCA1 
BCR 
SCN1A

Output

$ awk 'FNR==NR{ 
      a[$2]=$1; 
      next 
     } 
    { 
     for(i in a){ 
      if($1 ~ i || i ~ $1){ print $1, a[i] ; next } 
     } 
     print $1,14 
    }' list file 
BRCA1 811 
BCR 71 
SCN1A 14
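One thing to be aware of with this approach: the test `$1 ~ i || i ~ $1` is a two-way regexp match, and the loop stops at the first key that satisfies it. Because awk does not specify the order of `for (i in a)`, a name that matches several keys can yield different codes on different runs or implementations. The sketch below counts such matches for a hypothetical name, ABL, which is not in the sample file. The temporary list file is illustrative and uses the default whitespace field splitting, as in the answer.

```shell
tmp=$(mktemp -d)
printf '%s\t%s\t%s\t%s\n' \
    'List code' gene 'gene name' methodology \
    81 DMD dystrophin 'deletion analysis and duplication analysis' \
    811 'BRCA 1' BRCA2 'full gene sequence and full deletion/duplication analysis' \
    70 ABL1 ABL1 'gene analysis variants in the kinse domane' \
    71 BCR/ABL 't(9;22)' 'full gene sequence' > "$tmp/list"

matches=$(awk -v name="ABL" '
{ a[$2] = $1 }                    # same keys as the answer (default FS)
END {
    # Count every key that satisfies the two-way regexp test.
    for (i in a)
        if (name ~ i || i ~ name) n++
    printf "%s matches %d list entries\n", name, n
}' "$tmp/list")

rm -rf "$tmp"
printf '%s\n' "$matches"
```

Here ABL matches both ABL1 and BCR/ABL, so which code the original loop would print for it depends on traversal order.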

Source

2017-03-17 13:20:47
