2013-03-18

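Extracting from a CSV file

I have alphanumeric text for a grid of locations (A-I and 1-9) that is referenced in a flat file (*.csv) in various forms, sometimes with spaces and in random case, for example: 9-H, @ b 3, e-4, d4, c6, 5h, C2, i9, ... That is, any combination of a to i and 1 to 9, possibly with whitespace or punctuation such as ~, @, and - around it.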

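What is a good way to extract these alphanumeric characters? Ideally, the output would go in another column in front of "Notes", or in a separate text file. I can read scripts and figure out what they do, but I am not yet confident enough to write them myself.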

Sample input file:

Record Notes 
46651 Adrian reported green-pylons are in central rack. (e-4) 
46652 Jose enetered location of triangles in the uppur corner. (b/c6) 
46207 [Location: 5h] Gabe located the long pipes in the near the far corner. 
46205 Committee-reports are in boxes in holding area, @ b 3). 
45164 Caller-nu,mbers @ 1A 
45165 All carbon rod tackles 3 F and short (top rack) 
45166 USB(3 Port) in C2 
45167 Full tackle in b2. 
45168 5b; USB(4 port) 
45073 SHOVELs+ KIPER ON PET-FOOD (@g6), ALSO ATTEMPT-STALL AND DRAWCORD. 
45169 Persistent CORDS ~i9 
45170 Deliverate handball moved to D-2 on instructions from Pete 
45440 slides and overheads + contact-sheets to 9-H (top bin). 
45441 d7-slides and negatives (black and white) 
<eof> 

Desired output (in alphanumeric format, either in the same file or in a new file):

Record Location Notes 
46651 E4 
46652 C6 
46205 B3 
... 
45169 I9 

That is, always extract the latter set of characters (for example, b/c6 yields C6).

OK guys, after getting "Use of uninitialized value $note in pattern match (m//)" errors, I just made an attempt of my own and had partial success.

# # starts with anything then space or punctuation then letter then number 
if ($note =~ /.*[\s\~\p{Punct}]([a-iA-I])[\s\p{Punct}]*([0-9]).*/) { 
    $note =~ s/.*[\s\~\p{Punct}]([a-iA-I])[\s\p{Punct}]*([0-9]).*/$1$2/x; 

# # starts line with letter then number 
} elsif ($note =~ /^([a-iA-I])[\s\p{Punct}]*([0-9]).*/) { 
    $note =~ s/^([a-iA-I])[\s\p{Punct}]*([0-9]).*/$1$2/x; 

# # after punctuation then number 
} elsif ($note =~ /.*[\s\p{Punct}]([0-9])[\s\p{Punct}]*([a-iA-I]).*/) { 
    $note =~ s/.*[\s\p{Punct}]([0-9])[\s\p{Punct}]*([a-iA-I]).*/$2$1/x; 

# # beginning of line with number 
} elsif ($note =~ /^([0-9])[\s\p{Punct}]*([a-iA-I]).*/) { 
    $note =~ s/^([0-9])[\s\p{Punct}]*([a-iA-I]).*/$2$1/x; 

# # empty line or no record of any grid location except "#7 asdfg" format 
} elsif ($note=~ "") { 
    $note = "##"; 

} 

Where the script is not very successful is when it encounters records such as 99994 and 99993.

99999 norecordofgridhere -
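99997 Box #7 went into the array without an invoice.
99996 Dropped in the 7th hour, and coach in the 8th hour when I was found off-site.
99994 Boxes in post after 4 buckets on top of the office filing cabinet
99993 6 boxes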

The output is now:

99999 ## norecordofgridhere -
99998 ##
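99997 E7 Box #7 went into the array without an invoice.
99996 E8 Dropped in the 7th hour, and coach in the 8th hour when I was found off-site.
99994 B4 Boxes in post after 4 buckets on top of the office filing cabinet
99993 b6 6 boxes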

Records 99994 and 99993 should each show ## instead. Where did I go wrong, and how should I fix this?

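I think there is a cleaner way, such as using Text::CSV_XS, but I ran into glitches with Strawberry Perl, even after the tests showed the module had installed correctly. So I went back to ActiveState.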

Can you give the desired output for this example input? – azhrei 2013-03-19 00:04:26

Just to be clear: the things you want to grab are 'e-4', 'b/c6', '5h', 'b 3', '1A', '3 F', 'C2', 'b2', '5b', 'g6', 'i9', 'D-2', '9-H' and 'd7'? – Dougal 2013-03-19 00:04:41

Not only grab them, but also list them in alphanumeric form for each record in the file, i.e. E4, C6, B3, A1 and so on. – Solutions 2013-03-19 00:23:09

Answers

0
... 

my $coord; 
if ($note =~/
    (?&DEL) 

    ((?&ROW) (?&SEP)?+ (?&COL) 
    | (?&COL) (?&SEP)?+ (?&ROW) 
    ) 

    (?&DEL) 

    (?(DEFINE) 
     (?<ROW> [a-iA-I] ) 
     (?<COL> [1-9]  ) 
     (?<SEP> [\s~\@\-]++) 
     (?<DEL>^| \W | \z) 
    ) 
/x) { 
    # $1 holds the raw match, e.g. "e-4" or "9-H" 
    $coord = $1; 
    (my $row = uc($coord)) =~ s/[^A-I]//g;  # keep only the letter 
    (my $col = uc($coord)) =~ s/[^1-9]//g;  # keep only the digit 
    $coord = "$row$col";                    # letter first, e.g. "H9" 
} 

... 
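To try the pattern on a few notes from the question, here is a minimal, self-contained sketch. The helper name `extract_coord` is hypothetical and not part of the answer; the letter range is taken as a to i, as the question's grid describes, and an undefined note is returned early to avoid the "uninitialized value" warning mentioned in the question.

```perl
use v5.10;
use strict;
use warnings;

# Hypothetical wrapper: returns a normalized location such as "E4",
# or undef when the note is undefined or contains no location.
sub extract_coord {
    my ($note) = @_;
    return unless defined $note;
    if ($note =~ /
        (?&DEL)
        ( (?&ROW) (?&SEP)?+ (?&COL)
        | (?&COL) (?&SEP)?+ (?&ROW)
        )
        (?&DEL)
        (?(DEFINE)
            (?<ROW> [a-iA-I] )
            (?<COL> [1-9] )
            (?<SEP> [\s~\@\-]++ )
            (?<DEL> ^ | \W | \z )
        )
    /x) {
        my $coord = uc $1;                    # raw match, e.g. "9-H"
        (my $row = $coord) =~ s/[^A-I]//g;    # keep only the letter
        (my $col = $coord) =~ s/[^1-9]//g;    # keep only the digit
        return "$row$col";
    }
    return;
}

for my $note ('rack. (e-4)', 'corner. (b/c6)', '~i9', '6 boxes') {
    printf "%-16s => %s\n", $note, extract_coord($note) // '##';
}
```

Because a delimiter is required on both sides of the match, a digit followed by an ordinary word such as "6 boxes" is not treated as a location, so the loop should print E4, C6, I9 and ##.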
0

Use Text::CSV_XS to parse the CSV file; it is fast and accurate.

Then build a regular expression to match the IDs.

Finally, normalize each ID.

#!/usr/bin/perl 

use v5.10; 
use strict; 
use warnings; 
use autodie; 

use Text::CSV_XS; 

# Build up the regular expression to look for IDs 
my $Separator_Set = qr{ [- ] }x; 
my $ID_Letters_Set = qr{ [a-i] }xi; 
my $ID_Numbers_Set = qr{ [1-9] }x; 
my $Location_Re = qr{ 
    \b 
    (?: 
        $ID_Letters_Set $Separator_Set? $ID_Numbers_Set | 
        $ID_Numbers_Set $Separator_Set? $ID_Letters_Set 
    ) 
    \b 
}x; 

# Initialize Text::CSV_XS and tell it this is a tab separated CSV 
my $csv = Text::CSV_XS->new({ 
    sep_char => "\t", # tab separated fields 
}) or die "Cannot use CSV: ".Text::CSV_XS->error_diag(); 

# Read in and discard the CSV header line. 
my $headers = $csv->getline(*DATA); 

# Output our own header line  
say "Record\tLocation\tNotes"; 

# Read each CSV row, extract and normalize the ID, and output a new row. 
while(my $row = $csv->getline(*DATA)) { 
    my($record, $notes) = @$row; 

    # Extract and normalize the ID 
    my($id) = $notes =~ /($Location_Re)/; 
    $id = normalize_id($id); 

    # Output a new row 
    printf "%d\t%s\t%s\n", $record, $id, $notes; 
} 


sub normalize_id { 
    my $id = shift; 

    # Return empty string if we were passed in a blank 
    return '' if !defined $id or !length $id or $id !~ /\S/; 

    my($letter) = $id =~ /($ID_Letters_Set)/; 
    my($number) = $id =~ /($ID_Numbers_Set)/; 

    return uc($letter).$number; 
} 

__END__ 
Record Notes 
46651 Adrian reported green-pylons are in central rack. (e-4) 
46652 Jose enetered location of triangles in the uppur corner. (b/c6) 
46207 [Location: 5h] Gabe located the long pipes in the near the far corner. 
46205 Committee-reports are in boxes in holding area, @ b 3). 
45164 Caller-nu,mbers @ 1A 
45165 All carbon rod tackles 3 F and short (top rack) 
45166 USB(3 Port) in C2 
45167 Full tackle in b2. 
45168 5b; USB(4 port) 
45073 SHOVELs+ KIPER ON PET-FOOD (@g6), ALSO ATTEMPT-STALL AND DRAWCORD. 
45169 Persistent CORDS ~i9 
45170 Deliverate handball moved to D-2 on instructions from Pete 
45440 slides and overheads + contact-sheets to 9-H (top bin). 
45441 d7-slides and negatives (black and white)
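
The question's 99994 and 99993 records seem to fail because its last two patterns accept a digit followed by any word that starts with a to i ("4 buckets", "6 boxes"). A word-boundary pattern like the one in the script avoids that. The sketch below is an assumption-laden variation, not part of the answer: it groups the alternation so that both `\b` anchors apply to either order, collects every match with `/g`, and keeps the last one, since the question asks for the latter set of characters. The helper name `last_location` and the two-location test note are hypothetical.

```perl
use strict;
use warnings;

# Letter then digit, or digit then letter, with an optional "-" or " "
# between them; \b on both sides rejects things like "6 boxes".
my $Location_Re = qr{
    \b
    (?: [a-i] [- ]? [1-9]
      | [1-9] [- ]? [a-i] )
    \b
}xi;

# Hypothetical helper: last location in a note, or '##' if there is none.
sub last_location {
    my ($notes) = @_;
    return '##' unless defined $notes;

    my @found = $notes =~ /($Location_Re)/g;   # all matches, left to right
    return '##' unless @found;

    my ($letter) = $found[-1] =~ /([a-i])/i;
    my ($number) = $found[-1] =~ /([1-9])/;
    return uc($letter) . $number;
}

print last_location('moved from b2 to D-2'), "\n";   # hypothetical note
print last_location('6 boxes'), "\n";
```

Here the first call should print D2 and the second ##. Whether the last match is really the right one to keep depends on the data; the script above keeps the first.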