2014-05-15 152 views

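I want to print the missing sequence gaps in the first column, together with the minimum and maximum sequence of that first column, for each combination of the fields $2, substr($3,4,6), substr($4,4,6), $6, $8, $10. The input file is not sorted by the first column. How can awk print the missing sequence gaps and the min/max values?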

Input.csv

21,abc,22-JUN-12.08:06:03,22-JUN-12.08:06:03,19-Apr-16,1,INR,RO0412,RC03,L7,,31 
22,abc,22-JUN-12.08:06:03,22-JUN-12.08:06:03,19-Apr-16,1,INR,RO0412,RC03,L7,,31 
23,abc,22-JUN-12.08:06:03,22-JUN-12.08:06:03,19-Apr-16,1,INR,RO0412,RC03,L7,,31 
24,abc,30-JUN-12.01:06:49,30-JUN-12.01:06:49,19-Apr-16,1,INR,RO0412,RC03,L7,,29 
28,abc,30-JUN-12.01:06:49,30-JUN-12.01:06:49,19-Apr-16,1,INR,RO0412,RC03,L7,,29 
32,abc,29-MAY-13.12:05:11,29-MAY-13.12:05:11,15-Feb-17,1350,INR,RO0213,CD,K1,,30 
38,abc,29-MAY-13.12:05:11,29-MAY-13.12:05:11,15-Feb-17,1350,INR,RO0213,CD,K1,,30 
41,abc,20-FEB-14.11:02:37,20-FEB-14.11:02:37,31-Dec-20,650,INR,EN1113,ch650,S317,,28 
46,abc,20-FEB-14.11:02:37,20-FEB-14.11:02:37,31-Dec-20,650,INR,EN1113,ch650,S317,,28 
51,abc,20-FEB-14.11:02:37,20-FEB-14.11:02:37,31-Dec-20,650,INR,EN1113,ch650,S317,,28 
52,abc,20-FEB-14.11:02:37,20-FEB-14.11:02:37,31-Dec-20,650,INR,EN1113,ch650,S317,,28 

I tried this command and got partial output:

cat Input.csv | \
awk -F, '{OFS=","; print $1,$2,substr($3,4,6),substr($4,4,6),$6,$8,$10}' | \
sort -k1 -t, | \
awk -F, 'BEGIN {OFS=","} (($1!=p+1) && ($7==p7)) {print p,p2,p3,p4,p5,p6,p7,p+1 "," $1-1,$1} {p=$1;p2=$2;p3=$3;p4=$4;p5=$5;p6=$6;p7=$7}'

The column headings for the output of the above command are:

Minimum Seq ($1),$2,substr($3,4,6),substr($4,4,6),$6,$8,$10,start Missing Seq ($1),End Missing Seq ($1),Maximum Seq ($1) 

24,abc,JUN-12,JUN-12,1,RO0412,L7,25,27,28 
32,abc,MAY-13,MAY-13,1350,RO0213,K1,33,37,38 
41,abc,FEB-14,FEB-14,650,EN1113,S317,42,45,46 
46,abc,FEB-14,FEB-14,650,EN1113,S317,47,50,51 

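In the output above, the Minimum Seq ($1) and Maximum Seq ($1) values do not match what I expect; please help. For example, in the first output line the minimum seq should be 21, not 24, and in the third line the maximum seq should be 52, not 46.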

Desired output:

## $2,$3,$4,$6,$8,$10,"start Missing Seq ($1), ",End Missing Seq ($1) ,Minimum Seq ($1),Maximum Seq ($1) ## 

abc,JUN-12,JUN-12,1,ROTN0412,L7,25,27,21,28 
abc,MAY-13,MAY-13,1350,ROTN0213,K1,33,37,32,38 
abc,FEB-14,FEB-14,650,CHEN1113,S317,42,45,41,52 
abc,FEB-14,FEB-14,650,CHEN1113,S317,47,50,41,52 

Try using the "edit" button and format the question a bit. As it stands it is impossible to read. – fedorqui


Thanks to Timothy Brown for the edit.. – VNA


Hakon, thanks a lot for this lengthy script and your efforts. When running it I get this error: -bash-3.2$ perl Min_Max_MissingGap.pl Can't locate File/Slurp.pm in @INC (@INC contains: /usr/lib64/perl5/site_perl/5.8.8/x86_64-linux-thread-multi /usr/lib64/perl5/site_perl/5.8.7/x86_64-linux-thread-multi /usr/lib64/perl5/site_perl/5.8.6/x86_64-linux-thread-multi multi /usr/lib64/perl5/vendor_perl/5.8.5/ multi /usr/lib/perl5/5.8.8 .) at Min_Max_MissingGap.pl line 5. BEGIN failed--compilation aborted at Min_Max_MissingGap.pl line 5. – VNA

Answer


You can try the following Perl script:

#! /usr/bin/perl 

use warnings; 
use strict; 
use File::Slurp qw(read_file); 
use List::Util qw(min max); 

# Read all lines of the input file (the question names it Input.csv).
my @lines=read_file('Input.csv');

my $ll=sortLines(\@lines); 

$ll=reduceFields($ll); 

my $rr=findRanges($ll); 

printMissingSeqs($rr,$ll); 


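sub printMissingSeqs {
    my ($rr,$ll) = @_;

    # Walk the sorted lines; report a gap when two consecutive lines share
    # the same key ($10) and their sequence numbers differ by more than 1.
    my $pkey=""; my $pss; my $i=0;
    for (@$ll) {
        my @f=split(/,/);
        my $key=$f[6];
        my $ss=$f[0];
        $pss=$ss if $i==0;
        if (($key eq $pkey) && ($ss-$pss)>1) {
            # reduced fields, gap start, gap end, then the key's min and max
            print join(",",(@f[1..6], $pss+1,$ss-1,@{$rr->{$key}}))."\n";
        }
        $pkey=$key; $pss=$ss;
        $i++;
    }
}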

sub findRanges {
    my ($ll) = @_;

    my %temp;
    my %rr;

    # Collect every sequence number seen for each key ($10).
    for (@$ll) {
        my @f=split(/,/);
        push (@{$temp{$f[6]}},$f[0]);
    }

    # Reduce each list to its [min, max] pair.
    for (keys %temp) {
        my $min=min(@{$temp{$_}});
        my $max=max(@{$temp{$_}});
        $rr{$_}=[$min, $max];
    }
    return \%rr;
}

sub reduceFields {
    my ($ll) = @_;

    # Keep $1, $2, the MON-YY part of $3 and $4, then $6, $8 and $10.
    my @a;
    for (@$ll) {
        my @f=split(/,/);
        my $line=join(",",($f[0],$f[1],substr($f[2],3,6),substr($f[3],3,6),$f[5],$f[7],$f[9]));
        push (@a,$line);
    }
    return \@a;
}


sub sortLines {
    my ($lines) = @_;

    # Sort numerically on the first comma-separated field.
    my @a=sort { my ($keyA)=$a=~/(.*?),/; my ($keyB)=$b=~/(.*?),/; $keyA<=>$keyB} @$lines;

    return \@a;
}

Output:

abc,JUN-12,JUN-12,1,RO0412,L7,25,27,21,28 
abc,MAY-13,MAY-13,1350,RO0213,K1,33,37,32,38 
abc,FEB-14,FEB-14,650,EN1113,S317,42,45,41,52 
abc,FEB-14,FEB-14,650,EN1113,S317,47,50,41,52
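
Since the question asked for awk, here is a minimal sketch of the same idea in sort plus awk, in case File::Slurp is not available. It is not the script above: the function name missing_gaps and the array names are made up for illustration, and it assumes a group is identified by the full combination $2, substr($3,4,6), substr($4,4,6), $6, $8, $10 (the Perl script keys on $10 only). Gaps are remembered while reading and printed in END, because a group's maximum is only known once all of its lines have been seen.

```shell
#!/bin/sh
# missing_gaps (hypothetical name): reads the CSV on stdin and prints one
# line per gap in column 1, followed by the min and max of the same group.
missing_gaps() {
    # Numeric sort on column 1, because the input is not ordered.
    sort -t, -k1,1n |
    awk -F, -v OFS=, '
    {
        # Group key: the field combination named in the question.
        key = $2 OFS substr($3, 4, 6) OFS substr($4, 4, 6) OFS $6 OFS $8 OFS $10
        seq = $1 + 0
        if (!(key in min)) {
            min[key] = seq              # first value of a group is its minimum
        } else if (seq - max[key] > 1) {
            # Remember the gap; the group maximum is not known yet.
            n++
            gap_key[n]   = key
            gap_start[n] = max[key] + 1
            gap_end[n]   = seq - 1
        }
        max[key] = seq                  # input is sorted, so this is the largest so far
    }
    END {
        for (i = 1; i <= n; i++) {
            k = gap_key[i]
            print k, gap_start[i], gap_end[i], min[k], max[k]
        }
    }'
}
```

It can be called as missing_gaps < Input.csv. Unlike the Perl script, this tracks the previous sequence per group rather than per adjacent line, so the two can differ if lines of different groups interleave in the sorted order.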