2013-11-25 192 views
4

我有一个逗号分隔的文件,包含12列。从逗号分隔的文件中删除额外的逗号

第5列和第6列有问题(第5列和第6列的文字完全相同,但它们之间可能有额外的逗号),其中包含额外的逗号。

2011,123456,1234567,12345678,Hey There,How are you,Hey There,How are you,882864309037,ABC ABCD,LABACD,1.00000000,80.2500000,One Two 

所以在上面的例子“嘿,你好,”不应该有一个逗号。

我需要删除第5和第6列中的额外逗号。

+0

第4列和第7列总是包含数字? –

+2

如果可能的话,最好在包含逗号的列上使用封装来正确地重新请求或重新生成csv文件。 例如'2011,123456,1234567,12345678,“你好,你好吗”,“你好,你好吗”,882864309037,ABC ABCD,LABACD,1.00000000,80.2500000,One Two' – AeroX

回答

4

如果你总是想删除第五逗号,尽量

sed 's/,//5' input.txt 

但你说,它可能额外的逗号。你必须提供一个逻辑如何找出额外的逗号是否存在。

如果您知道逗号的数量,则可以使用。这已被证明是相当的锻炼,我相信别人会拿出一个更优雅的解决方案,但无论如何,我会分享我:

awk -f script.awk input.txt 

与script.awk:

BEGIN{ 
    FS="," 
} 
NF<=12{ 
    print $0 
} 
NF>12{ 
    for (i=1; i<=4; i++) printf $i FS 
    for (j=0; j<2; j++){ 
     for (i=0; i<=(NF-12)/2; i++){ 
      printf $(i+5) 
      if (i<(NF-12)/2) printf "_" 
      else printf FS 
     } 
    } 
    for (i=NF-5; i<=NF; i++) printf $i FS 
    printf "n" 
} 

首先,我们将字段分隔符设置为,。如果我们计数小于或等于12字段,一切都很好,我们只需打印整行。如果有超过12个字段,我们首先打印前4个字段(再次使用字段分隔符),然后打印两次字段5(和字段6),但不打印,,我们将其与_进行交换。最后我们打印剩下的字段。

正如我所说,这可能是一个更优雅的解决方案。我想知道其他人出现了什么。

+0

每行应该有11个逗号(有12列),第5和第6列有额外的逗号。 – Stu

2

如果所有其他字段都是数字字段,则可以尝试按照该条件保存有用的逗号。

sed -r 's/(,)[0-9]/;/g' a | sed -r 's/[0-9](,)/;/g' | sed -r 's/,//g' | awk -F\; '{ print $1 "," $2 "," $3 "," $4 "," substr($5, 0, length($5)/2) "," substr($5, length($5)/2 +1, length($5)/2) "," $6 "," $7}' 
2011,23456,234567,234567,Hey ThereHow are you,Hey ThereHow are you,8286430903, 
1

可以与及其Text::CSV_XS模块尝试:

#!/usr/bin/env perl 

use warnings; 
use strict; 
use Text::CSV_XS; 

my (@columns); 

open my $fh, '<', shift or die; 

my $csv = Text::CSV_XS->new or die; 
while (my $row = $csv->getline($fh)) { 
    undef @columns; 
    if (@$row <= 12) { 
     @columns = @$row; 
     next; 
    } 

    my $extra_columns = (@$row - 12)/2; 
    my $post_columns_index = 4 + 2 * $extra_columns * 2; 
    @columns = ( 
     @$row[0..3], 
     (join('', @$row[4..(4+$extra_columns)])) x 2, 
     @$row[$post_columns_index..$#$row] 
    ); 
} 
continue { 
    $csv->print(\*STDOUT, \@columns); 
    printf "\n"; 
} 

假设与三根线,其中所述第一个具有一个额外的逗号一个输入文件(infile),第二个具有两个附加逗号,第三个是正确的:

2011,123456,1234567,12345678,Hey There,How are you,Hey There,How are you,882864309037,ABC ABCD,LABACD,1.00000000,80.2500000,One Two 
2011,123456,1234567,12345678,Hey There,How are you,now,Hey There,How are you,now,882864309037,ABC ABCD,LABACD,1.00000000,80.2500000,One Two 
2011,123456,1234567,12345678,Hey There:How are you,Hey There:How are you,882864309037,ABC ABCD,LABACD,1.00000000,80.2500000,One Two 

运行脚本,如:

perl script.pl infile 

国债收益率:

2011,123456,1234567,12345678,"Hey ThereHow are you","Hey ThereHow are you",882864309037,"ABC ABCD",LABACD,1.00000000,80.2500000,"One Two" 
2011,123456,1234567,12345678,"Hey ThereHow are younow","Hey ThereHow are younow",LABACD,1.00000000,80.2500000,"One Two" 
2011,123456,1234567,12345678,"Hey There:How are you","Hey There:How are you",882864309037,"ABC ABCD",LABACD,1.00000000,80.2500000,"One Two" 

需要注意的是它增加了一些报价,但它是正确的总部设在csv规范,更容易处理了以前的状态。