2015-06-23 34 views
1

Join two TSV files with an inner join

I have 2 TSV files:

TSV file 1: 
    A  B 
    hello 0.5 
    bye  0.4 

TSV file 2: 
C  D 
hello  1 
country 5 

I want to join the 2 TSV files on file1.A = file2.C.

How can I do this with the join command in Linux?

I hope to get something like this:

Text  B D 
hello 0.5 1 
bye  0.4 
country  5 

I am not getting any output with this:

join -j 1 <(sort -k1 file1.tsv) <(sort -k1 file2.tsv) 
+0

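Does your sample file 1 really look like that? Where are the tabs? Why are you sorting on '-k2' but joining with '-j 1'? Also note that the '-e' option in 'man join' may help with finding unmatched items. Good luck. – shellter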

+0

This worked for me: 'join -t $'\t' -1 1 -2 1 <(sort -k1 file1.tsv) <(sort -k1 file2.tsv) > join_test.tsv'. The main problem I had was defining the tab delimiter. – jxn

+0

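Good catch, and sorry I missed that key point. I'm glad you have a solution. It never hurts to acknowledge those who have posted usable solutions; it gives people an incentive to share what they know. Good luck to all. – shellter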
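
For reference, the join command in the comment above prints only the keys present in both files (an inner join), while the desired output in the question also lists bye and country. Below is a minimal sketch of one way to get those rows as well. It assumes a POSIX-style join (GNU coreutils behaves this way), real tab separators, and exactly one header line per file; the scratch directory and the joined.tsv name are made up for the demo.

```shell
# Scratch directory and tab-separated sample data mirroring the question
workdir=$(mktemp -d)
printf 'A\tB\nhello\t0.5\nbye\t0.4\n' > "$workdir/file1.tsv"
printf 'C\tD\nhello\t1\ncountry\t5\n' > "$workdir/file2.tsv"

# A literal tab character, usable as the field separator for sort and join
TAB=$(printf '\t')

# Drop each header line, then sort the data rows on the first field.
# LC_ALL=C keeps the collation of sort and join consistent.
tail -n +2 "$workdir/file1.tsv" | LC_ALL=C sort -t "$TAB" -k1,1 > "$workdir/file1.sorted"
tail -n +2 "$workdir/file2.tsv" | LC_ALL=C sort -t "$TAB" -k1,1 > "$workdir/file2.sorted"

{
  # Write the combined header by hand
  printf 'Text\tB\tD\n'
  # -a 1 -a 2: also print unpairable lines from both files
  # -o 0,1.2,2.2: join key, then column 2 of each file (empty when missing)
  LC_ALL=C join -t "$TAB" -a 1 -a 2 -o 0,1.2,2.2 \
      "$workdir/file1.sorted" "$workdir/file2.sorted"
} > "$workdir/joined.tsv"

cat "$workdir/joined.tsv"
```

The rows come out sorted by key rather than in input order, because join requires sorted input.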

Answers

1

It's a bit hairy, but here is a solution using awk and associative arrays.

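awk 'FNR == 1 {h[++n] = $2}    # first line of each file: keep the column-2 header
    FILENAME ~ /test1\.tsv/ && FNR > 1 {t1[$1]=$2}    # key -> value from test1.tsv
    FILENAME ~ /test2\.tsv/ && FNR > 1 {t2[$1]=$2}    # key -> value from test2.tsv
    END{print "Text\t"h[1]"\t"h[2];
     # Keys present in both files are printed twice; sort | uniq removes the repeats.
     # Note: the header line passes through sort too, so its position depends on collation.
     for(x in t1){print x"\t"t1[x]"\t"t2[x]}
     for(x in t2){print x"\t"t1[x]"\t"t2[x]}}' test1.tsv test2.tsv |
    sort | uniq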
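
One thing to watch with the pipeline above: the header line is sorted along with the data rows, so where it lands depends on the locale's collation. Below is a minimal sketch of a variant that writes the header outside the sort and collects the union of keys in one pass, so no uniq is needed. It assumes tab-separated input with one header line per file and a non-empty first file; the scratch directory, the merged.tsv name, and the array names b, d and seen are made up for illustration.

```shell
# Scratch directory and tab-separated sample data mirroring the question
workdir=$(mktemp -d)
printf 'A\tB\nhello\t0.5\nbye\t0.4\n' > "$workdir/test1.tsv"
printf 'C\tD\nhello\t1\ncountry\t5\n' > "$workdir/test2.tsv"

{
  # The header is written first and never goes through sort
  printf 'Text\tB\tD\n'
  awk -F'\t' -v OFS='\t' '
    FNR == 1  { next }                            # skip the header of each file
    FNR == NR { b[$1] = $2; seen[$1]; next }      # first file: key -> B
              { d[$1] = $2; seen[$1] }            # second file: key -> D
    END       { for (k in seen) print k, b[k], d[k] }   # one row per distinct key
  ' "$workdir/test1.tsv" "$workdir/test2.tsv" | LC_ALL=C sort
} > "$workdir/merged.tsv"

cat "$workdir/merged.tsv"
```

A key that is missing from one file simply yields an empty field, because an unset awk array element evaluates to the empty string.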
1

File 1

$ cat file1 
A  B 
hello 0.5 
bye  0.4 

File 2

$ cat file2 
C  D 
hello  1 
country 5 

Output

$ awk 'NR==1{print "Text","B","D"}FNR==1{next}FNR==NR{A[$1]=$2;next}{f=($1 in A);print $0,(f ? A[$1] : "");if(f)delete A[$1]}END{for(i in A)print i,"",A[i]}' OFS='\t' file2 file1 
Text B D 
hello 0.5 1 
bye  0.4 
country  5 

A more readable version

awk ' 
    # Print the output header once, when awk reads the very first line (NR==1)
    NR==1{print "Text","B","D"} 

    # NR counts records across all input files;
    # FNR restarts at 1 for each input file.
    # FNR==1 is therefore the first line of each file:
    # skip it so that neither input header reaches the output
    FNR==1{ 
      next 
      } 

    # FNR==NR is only true while reading first file (file2) 
    FNR==NR{ 
       # Build an associative array keyed on the first column,
       # with the second column as the value
       A[$1]=$2 

       # Skip the remaining blocks and move on to the next line
       next 
      } 
      { 
       # Second file on the command line (file1):
       # f is 1 if the key in column 1 also exists in array A, otherwise 0
       f=($1 in A) 

       # Print the current line followed by the matching value from A,
       # or an empty field when there is no match
       print $0,(f ? A[$1] : "") 

       # Delete the matched element so the END block only sees unmatched keys
       if(f)delete A[$1] 
      } 

     # Finally, in the END block, print the remaining array elements:
     # the file2 rows whose key does not exist in file1
     END{ 
       for(i in A) 
        print i,"",A[i] 
      } 
    ' OFS='\t' file2 file1
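
The script relies on the difference between NR and FNR, which can be seen directly with a small experiment. This is only a sketch on sample data like that in the question; the scratch directory and the counters variable are made up for the demo.

```shell
# Scratch directory and sample data, in the same order as the command above
workdir=$(mktemp -d)
printf 'C\tD\nhello\t1\ncountry\t5\n' > "$workdir/file2"
printf 'A\tB\nhello\t0.5\nbye\t0.4\n' > "$workdir/file1"

# NR keeps counting across files, FNR restarts at 1 for each file,
# so FNR == NR holds only while the first file is being read
counters=$(awk '{ print NR, FNR, (FNR == NR ? "first file" : "later file") }' \
    "$workdir/file2" "$workdir/file1")

printf '%s\n' "$counters"
```

With three lines in each file, the first three records report "first file" and the remaining three report "later file", with FNR back at 1 on the fourth record.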