使用AWK合并两个数据集

我有两个数据文件：file01和file02。在这两个数据集中，字段是：（i）识别器; （ii）数字参考; （iii）经度;和（四）纬度。对于file01中的每一行，我想使用相同的数字参考搜索file02中的数据，然后在file02中找到最接近file01中的标识符的标识符。使用AWK合并两个数据集

我能得到这个，如果我使用下面的代码从file01通过手动值awk程序：

awk 'function acos(x) { return atan2(sqrt(1-x*x), x) } 
BEGIN {pi=3.14159; 
     ndist=999999999.1; 
     date=1001; 
     lo1=-1.20; lg1=lo1*(pi/180); 
     la1=30.31; lt1=la1*(pi/180) 
      } 
{if($2==date) {ws=$1; 
       lg2=$3*(pi/180); 
       lt2=$4*(pi/180); 
       dist= 6378.7 * acos(sin(lt1)*sin(lt2) + cos(lt1)*cos(lt2)*cos(lg2-lg1)); 
       if(dist < ndist) {ndist=dist; ws0=ws}}} 
END {print(ws0,ndist)}' file02

正如你看到的，date，lo1并在BEGIN声明la1是美国的价值观第一行file01（数据文件见下文）。我的问题是，如果我可以一次做到这一点，所以每次我读file01中的一行时，我会得到最近的标识符和距离，并追加到file01的行数据中。我不知道一些shell命令是否可以通过更简单的方式实现这一点，也许可以使用管道。

这两个数据文件和所希望的输出的一个例子是：

=== file01 ===

A 1001 -1.2 30.31 
A 1002 -1.2 30.31 
B 1002 -1.8 30.82 
B 1003 -1.8 30.82 
C 1001 -2.1 28.55

=== file02 ===

ws1 1000 -1.3 29.01 
ws1 1001 -1.3 29.01 
ws1 1002 -1.3 29.01 
ws1 1003 -1.3 29.01 
ws1 1004 -1.3 29.01 
ws1 1005 -1.3 29.01 
ws2 1000 -1.5 30.12 
ws2 1002 -1.5 30.12 
ws2 1003 -1.5 30.12 
ws2 1004 -1.5 30.12 
ws2 1005 -1.5 30.12 
ws3 1000 -1.7 29.55 
ws3 1001 -1.7 29.55 
ws3 1002 -1.7 29.55 
ws3 1003 -1.7 29.55 
ws3 1004 -1.7 29.55 
ws3 1005 -1.7 29.55 
ws4 1000 -1.9 30.33 
ws4 1001 -1.9 30.33 
ws4 1002 -1.9 30.33 
ws4 1003 -1.9 30.33 
ws4 1004 -1.9 30.33 
ws4 1005 -1.9 30.33

= ==输出文件===

A 1001 -1.2 30.31 ws4 67.308 
A 1002 -1.2 30.31 ws2 35.783 
B 1002 -1.8 30.82 ws4 55.387 
B 1003 -1.8 30.82 ws4 55.387 
C 1001 -2.1 28.55 ws1 85.369

编辑＃1：考虑通过@Eran的建议，我写了下面的代码：

join -j 2 < (sort -k 2,2 file01) < (sort -k 2,2 file02) | 
awk 'function acos(x) { return atan2(sqrt(1-x*x), x) } 
    BEGIN {pi=3.14159} 

    {if (last != $1 $2) 
     {print NR, id,r,lon,lat,ws0,ndist; 
      last = $1 $2; 
      ndist=999999999.1 

     } else { 

      lg1=$3*(pi/180); 
      lt1=$4*(pi/180); 
      lg2=$6*(pi/180); 
      lt2=$7*(pi/180); 
      dist= 6378.7 * acos(sin(lt1)*sin(lt2) + cos(lt1)*cos(lt2)*cos(lg2-lg1)); 
      if(dist< ndist) {ndist=dist; ws0=$5} 
      id=$2;r=$1;lon=$3;lat=$4 

      } 
    }'

从这个脚本的输出是：

1  
4 A 1001 -1.2 30.31 ws4 67.3078 
7 C 1001 -2.0 28.55 ws3 115.094 
11 A 1002 -1.2 30.31 ws2 35.7827 
15 B 1002 -1.8 30.82 ws4 55.387

编辑＃2：使用的athe建议@丹尼斯（有一些修改）我已经得到了期望的输出。 awk脚本如下：


awk 'function acos(x) { return atan2(sqrt(1-x*x), x) } 
    BEGIN {pi=3.14159} 
    NR==FNR {c++; a1[c]=$1;a2[c]=$2;a3[c]=$3;a4[c]=$4; next} 
      {d++; b1[d]=$1;b2[d]=$2;b3[d]=$3;b4[d]=$4} 

    END { 
    for(k=1;k<=c;k++) { 
     lg1=a3[k]*(pi/180); 
     lt1=a4[k]*(pi/180); 
     ndist=999999999.1; 
     for(l=1;l<=d;l++) { 
      if(b2[l]==a2[k]) {kk=b2[l]; 
       lg2=b3[l]*(pi/180); 
       lt2=b4[l]*(pi/180); 
       dist= 6378.7 * acos(sin(lt1)*sin(lt2) + cos(lt1)*cos(lt2)*cos(lg2-lg1)); 
       if(dist&ltndist) {ndist=dist; ws0=b1[l]} 
      } 
     } 
     print a1[k],a2[k],a3[k],a4[k],ws0,ndist 
    } 
    }' file01 file02

来源

2012-05-09 Manuel Ramón

将您的值从file01读入一个或多个数组。您可以在BEGIN块中使用getline，或者规范的方式是使用FNR == NR循环作为主要块之一。

FNR == NR {array[$1] = $1; ...; next } # read file01 into some arrays 
{ for (item in array) { ... }  # process each item in the array(s) against each line in file02

你的脚本将被调用为awk '...' file01 file02

而是通过字段值索引的数组，你可以用计数器array1[c] = $1; array2[c] = $2; c++对其进行索引以及与计数器而不是使用迭代in：for (i=0; i<c; i++)。

当然，你应该选择有意义的数组名。

来源

2012-05-09 16:04:23

+1谢谢@Dennis，你的回答非常有帮助！ –

有趣的挑战。既然你必须首先在file02中读取数据，并将数据存储在数据结构中，那么我首先会倾向于使用Perl。

#!/usr/bin/perl 
use strict; 
use warnings; 

# see http://perldoc.perl.org/Math/Trig.html 
use Math::Trig qw(great_circle_distance deg2rad); 
sub NESW {deg2rad($_[0]), deg2rad(90-$_[1])} 

# read file02 
my %data; 
my $file2 = 'file02'; 
open my $fid, '<', $file2 or die "can't read $file2: $!\n"; 
while (<$fid>) { 
    my ($id, $ref, $long, $lat) = split; 
    push @{$data{$ref}}, [$id, $long, $lat]; 
} 
close $fid; 

$, = " "; 

# process file01 
my $file1 = 'file01'; 
open $fid, '<', $file1 or die "can't read $file1: $!\n"; 
while (<$fid>) { 
    my ($id, $ref, $long, $lat) = split; 
    my @here = NESW($long, $lat); 
    my $min = 99_999_999; 
    my (@min_id, $dist); 

    while (my ($key, $listref) = each %data) { 
     next unless $key == $ref; 

     foreach my $trioref (@$listref) { 
      my ($other_id, $other_long, $other_lat) = @$trioref; 
      my @there = NESW($other_long, $other_lat); 
      $dist = great_circle_distance(@here, @there, 6378.7); 
      if ($dist < $min) { 
       $min = $dist; 
       @min_id = @$trioref; 
      } 
     } 
    } 

    printf "%s %d %s %s %s %6.3f\n", $id, $ref, $long, $lat, $min_id, $min; 
} 
close $fid;

此输出

A 1001 -1.2 30.31 ws4 67.308 
A 1002 -1.2 30.31 ws2 35.783 
B 1002 -1.8 30.82 ws4 55.387 
B 1003 -1.8 30.82 ws4 55.387 
C 1001 -2.1 28.55 ws1 93.361

我注意到 “C” 距离是你有什么建议它应该是不同的。

来源

2012-05-09 16:47:06

谢谢@glenn的帮助。我对Perl不熟悉，所以我需要更多时间来检查你的建议，但它似乎是一个很好的帮助。 –

要马上做，运行

join -j 2 <(sort -k 2,2 file01) <(sort -k 2,2 file02)

它管的AWK其在参考每一个变化做了计算：

gawk '{if (last != $1 $2) {calc_nearest_on_array; last=$1 $2; add_point_to_array} else {add_point_to_array}}'

来源

2012-05-10 05:13:14

你的尝试看起来很有希望。然而，当我尝试这样做时，我遇到了一个问题：由于我每次都计算最近的'last！= $ 1 $ 2'，并且在读取第一行'last'与'$ 1'和'$ 2'不同时，I不要在第一行获取最近的点，并且在'file01'中没有打印最后一行。请查看我编辑的问题以查看我的代码。 –

你是对的。我写了一个'概念'awk来展示这个想法。为了打印最后一行，你需要使用END子句。 –

TXR：

@(do 
    (defvar pi 3.1415926535) 
    (defvar earth-radius 6378.7) 
    (defun rad (deg) (/ (* deg pi) 180)) 
    (defun sphere-distance (lat0 lon0 lat1 lon1) 
    (let ((lat0 (rad lat0)) (lat1 (rad lat1)) 
      (lon0 (rad lon0)) (lon1 (rad lon1))) 
     (* earth-radius (acos (+ (* (sin lat0) (sin lat1)) 
           (* (cos lat0) (cos lat1) (cos (- lon1 lon0))))))))) 
@(next "file01") 
@(collect) 
@id @ref @lon0 @lat0 
@ (filter :tonumber lon0 lat0) 
@ (next "file02") 
@ (bind (min-dist ws) (1e99 nil)) 
@ (collect) 
@ws1 @ref @lon1 @lat1 
@ (filter :tonumber lon1 lat1) 
@ (do (let ((d (sphere-distance lat0 lon0 lat1 lon1))) 
      (cond ((< d min-dist) 
        (set min-dist d) 
        (set ws ws1))))) 
@ (end) 
@ (do (format t "~a ~a ~0,2f ~0,2f ~a ~0,3f\n" id ref lon0 lat0 ws min-dist)) 
@(end)

运行：

$ txr dist.txr 
A 1001 -1.20 30.31 ws4 67.308 
A 1002 -1.20 30.31 ws2 35.783 
B 1002 -1.80 30.82 ws4 55.387 
B 1003 -1.80 30.82 ws4 55.387 
C 1001 -2.10 28.55 ws1 93.361

来源

2012-05-11 07:09:27 Kaz

什么是TXR脚本？我的debian发行版没有TXR。 –

TxR主页是：http://www.nongnu.org/txr。没有Debian软件包维护者。 – Kaz

使用AWK合并两个数据集

回答

相关问题