Extracting from a CSV file


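I have alphanumeric text for locations on a grid (A-I and 1-9). The locations are referenced in a flat file (*.csv) in various forms, sometimes including spaces and in random case, such as: 9-H, @ b 3, e-4, d4, c6, 5h, C2, i9, ... In other words, any combination of a to i and 1 to 9, possibly including whitespace and ~.

What is a good way to extract such alphanumeric characters? Ideally the output would go into another column in front of "Notes", or into a separate text file. I can read scripts and figure out what they do, but I am not yet confident enough to write them myself.

Sample input file: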

Record Notes 
46651 Adrian reported green-pylons are in central rack. (e-4) 
46652 Jose enetered location of triangles in the uppur corner. (b/c6) 
46207 [Location: 5h] Gabe located the long pipes in the near the far corner. 
46205 Committee-reports are in boxes in holding area, @ b 3). 
45164 Caller-nu,mbers @ 1A 
45165 All carbon rod tackles 3 F and short (top rack) 
45166 USB(3 Port) in C2 
45167 Full tackle in b2. 
45168 5b; USB(4 port) 
45073 SHOVELs+ KIPER ON PET-FOOD (@g6), ALSO ATTEMPT-STALL AND DRAWCORD. 
45169 Persistent CORDS ~i9 
45170 Deliverate handball moved to D-2 on instructions from Pete 
45440 slides and overheads + contact-sheets to 9-H (top bin). 
45441 d7-slides and negatives (black and white) 
<eof>

Desired output (in alphanumeric format, either in the same file or in a new file):

Record Location Notes 
46651 E4 
46652 C6 
46205 A1 
... 
46169 I9

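That is, always extract the latter set of characters.

OK guys, after repeatedly getting the error "Use of uninitialized value $note in pattern match (m//)", I just gave it a try myself and had partial success.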

# # starts with anything then space or punctuation then letter then number 
if ($note =~ /.*[\s\~\p{Punct}]([a-iA-I])[\s\p{Punct}]*([0-9]).*/) { 
    $note =~ s/.*[\s\~\p{Punct}]([a-iA-I])[\s\p{Punct}]*([0-9]).*/$1$2/x; 

# # starts line with letter then number 
} elsif ($note =~ /^([a-iA-I])[\s\p{Punct}]*([0-9]).*/) { 
    $note =~ s/^([a-iA-I])[\s\p{Punct}]*([0-9]).*/$1$2/x; 

# # after punctuation then number 
} elsif ($note =~ /.*[\s\p{Punct}]([0-9])[\s\p{Punct}]*([a-iA-I]).*/) { 
    $note =~ s/.*[\s\p{Punct}]([0-9])[\s\p{Punct}]*([a-iA-I]).*/$2$1/x; 

# # beginning of line with number 
} elsif ($note =~ /^([0-9])[\s\p{Punct}]*([a-iA-I]).*/) { 
    $note =~ s/^([0-9])[\s\p{Punct}]*([a-iA-I]).*/$2$1/x; 

# # empty line or no record of any grid location except "#7 asdfg" format 
} elsif ($note=~ "") { 
    $note = "##"; 

}

Where the script is not very successful is when it encounters records such as 99994 and 99993.

99999 norecordofgridhere -
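99997 Box #7 entered into the array without invoice.
99996 Dropped in the 7th hour, and when I found it off-site, the coach was in the 8th hour.
99994 Boxes after the 4 buckets on top of the office filing cabinet
99993 6 boxes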

The output is now:

99999 ## norecordofgridhere -
99998 ##
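99997 E7 Box #7 entered into the array without invoice.
99996 E8 Dropped in the 7th hour, and when I found it off-site, the coach was in the 8th hour.
99994 B 4 Cartons after the 4 buckets
99993 b 6 Assigned 6 boxes on top of the office filing cabinet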

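Records 99994 and 99993 should each show ## instead. Where did I go wrong, and how should I fix it?

I think there is a cleaner way, such as using Text::CSV_XS. However, I ran into glitches with Strawberry Perl, even after testing that the module had been installed correctly, so I went back to ActiveState.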

Source

2013-03-18 Solutions

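Can you give the desired output for this example input? – azhrei 2013-03-19 00:04:26

Just to be clear: the things you want to grab are 'e-4', 'b/c6', '5h', 'b 3', '1A', '3 F', 'C2', 'b2', '5b', 'g6', 'i9', 'D-2', '9-H' and 'd7'? – Dougal 2013-03-19 00:04:41

Not only grab them, but also list them as the alphanumeric for each record in the file, i.e. E4, C6, B3, A1, and so on. – Solutions 2013-03-19 00:23:09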

... 

my $coord; 
if ($note =~/
    (?&DEL) 

    ((?&ROW) (?&SEP)?+ (?&COL) 
    | (?&COL) (?&SEP)?+ (?&ROW) 
    ) 

    (?&DEL) 

    (?(DEFINE) 
     (?<ROW> [a-iA-I] )   # rows run A through I, as stated in the question
     (?<COL> [1-9]  ) 
     (?<SEP> [\s~\@\-]++) 
     (?<DEL>^| \W | \z) 
    ) 
/x) { 
    $coord = $1; 
    (my $row = uc($coord)) =~ s/[^A-I]//g; 
    (my $col = uc($coord)) =~ s/[^1-9]//g; 
    $coord = "$row$col"; 
} 

...
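
To make the fragment above easier to try out, here is a minimal self-contained sketch. It assumes the grid rows run A through I (as the question states), and it wraps the pattern in a helper named extract_coord, which is a hypothetical name not taken from the answer. The early return for an undefined note is one way to avoid the "Use of uninitialized value" warning mentioned in the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (name chosen for illustration): returns a normalized
# coordinate such as "E4", or nothing when the note has no grid reference.
sub extract_coord {
    my ($note) = @_;
    return unless defined $note;    # guard against undefined notes

    if ($note =~ /
        (?&DEL)
        ( (?&ROW) (?&SEP)?+ (?&COL)
        | (?&COL) (?&SEP)?+ (?&ROW)
        )
        (?&DEL)

        (?(DEFINE)
          (?<ROW> [a-iA-I] )
          (?<COL> [1-9] )
          (?<SEP> [\s~\@\-]++ )
          (?<DEL> ^ | \W | \z )
        )
    /x) {
        my $coord = uc $1;
        (my $row = $coord) =~ s/[^A-I]//g;    # keep only the row letter
        (my $col = $coord) =~ s/[^1-9]//g;    # keep only the column digit
        return "$row$col";
    }
    return;
}

# A few notes from the question, plus one with no location and an undef.
for my $note ('rack. (e-4)', 'corner. (b/c6)', 'Persistent CORDS ~i9',
              '6 boxes', undef) {
    my $coord = extract_coord($note);
    print defined $coord ? $coord : '##', "\n";
}
```

Because the pattern requires a delimiter after the pair, a note such as "6 boxes" yields no coordinate: the "b" is followed by another word character, so the loop prints the ## placeholder for it.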

Source

2013-03-19 00:19:51 ikegami

Use Text::CSV_XS to parse the CSV file; it is fast and accurate.

Then build a regular expression to match the IDs.

Finally, normalize each ID.

#!/usr/bin/perl 

use v5.10; 
use strict; 
use warnings; 
use autodie; 

use Text::CSV_XS; 

# Build up the regular expression to look for IDs 
my $Separator_Set = qr{ [- ] }x; 
my $ID_Letters_Set = qr{ [a-i] }xi; 
my $ID_Numbers_Set = qr{ [1-9] }x; 
my $Location_Re = qr{ 
    \b 
    (?: # group the alternatives so both \b anchors apply to each one 
        $ID_Letters_Set $Separator_Set? $ID_Numbers_Set | 
        $ID_Numbers_Set $Separator_Set? $ID_Letters_Set 
    ) 
    \b 
}x; 

# Initialize Text::CSV_XS and tell it this is a tab separated CSV 
my $csv = Text::CSV_XS->new({ 
    sep_char => "\t", # tab separated fields 
}) or die "Cannot use CSV: ".Text::CSV_XS->error_diag(); 

# Read in and discard the CSV header line. 
my $headers = $csv->getline(*DATA); 

# Output our own header line  
say "Record\tLocation\tNotes"; 

# Read each CSV row, extract and normalize the ID, and output a new row. 
while(my $row = $csv->getline(*DATA)) { 
    my($record, $notes) = @$row; 

    # Extract and normalize the ID 
    my($id) = $notes =~ /($Location_Re)/; 
    $id = normalize_id($id); 

    # Output a new row 
    printf "%d\t%s\t%s\n", $record, $id, $notes; 
} 


sub normalize_id { 
    my $id = shift; 

    # Return empty string if we were passed in a blank 
    return '' if !defined $id or !length $id or $id !~ /\S/; 

    my($letter) = $id =~ /($ID_Letters_Set)/; 
    my($number) = $id =~ /($ID_Numbers_Set)/; 

    return uc($letter).$number; 
} 

__END__ 
Record Notes 
46651 Adrian reported green-pylons are in central rack. (e-4) 
46652 Jose enetered location of triangles in the uppur corner. (b/c6) 
46207 [Location: 5h] Gabe located the long pipes in the near the far corner. 
46205 Committee-reports are in boxes in holding area, @ b 3). 
45164 Caller-nu,mbers @ 1A 
45165 All carbon rod tackles 3 F and short (top rack) 
45166 USB(3 Port) in C2 
45167 Full tackle in b2. 
45168 5b; USB(4 port) 
45073 SHOVELs+ KIPER ON PET-FOOD (@g6), ALSO ATTEMPT-STALL AND DRAWCORD. 
45169 Persistent CORDS ~i9 
45170 Deliverate handball moved to D-2 on instructions from Pete 
45440 slides and overheads + contact-sheets to 9-H (top bin). 
45441 d7-slides and negatives (black and white)
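
As a rough check against the records that tripped up the script in the question, the sketch below applies a word-boundary pattern of the same shape to a few of those notes. It is an illustration under assumptions: the pattern is written inline rather than built from the variables above, and the helper name location_or_marker and the ## placeholder for "no location" are hypothetical, taken from the question's wording rather than from this answer. It uses only core Perl, so it runs without Text::CSV_XS.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed pattern: a letter/number pair in either order, an optional "-" or
# space between them, and a word boundary on BOTH sides of the pair.
my $location_re = qr{
    \b
    (?: [a-i] [- ]? [1-9]
      | [1-9] [- ]? [a-i]
    )
    \b
}xi;

# Hypothetical helper: returns e.g. "H9", or "##" when nothing matches.
sub location_or_marker {
    my ($notes) = @_;
    return '##' unless defined $notes && $notes =~ /($location_re)/;
    my $id = uc $1;
    my ($letter) = $id =~ /([A-I])/;
    my ($number) = $id =~ /([1-9])/;
    return $letter . $number;
}

for my $notes ('6 boxes', 'after the 4 buckets', 'Box #7', 'to 9-H (top bin).') {
    printf "%-22s => %s\n", $notes, location_or_marker($notes);
}
```

The first three notes print ## because the trailing \b rejects "6 b" and "4 b" when the letter is the start of a longer word. The last note prints H9.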

Source

2013-03-19 01:06:24 Schwern
