2011-09-15 191 views
2

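I need to extract lines of text between marker lines and populate them into an Excel file. The number of lines in between varies, but every block begins with the same text: Comment for the record "IDNO" ... and so on.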

__DATA__ (This is what my .txt file looks like) 
Comment for the record "id1" 
Attempt1 made on [time] outcome [outcome] 
note 1 

Comment for the record "id2" 
Attempt1 made on [time] outcome [outcome] 
note 1 
Attempt2 made on [time] outcome [outcome] 
note 2 

Comment for the record "id3" 
Attempt1 made on [time] outcome [outcome] 
note 1 
Attempt2 made on [time] outcome [outcome] 
note 2 
Attempt3 made on [time] outcome [outcome] 
note 3 
Attempt4 made on [time] outcome [outcome] 
note 4 

I would like it displayed like this:

id1  Attempt1 Note1 [outcome] 
id2  Attempt1 Note1 [outcome] 
id2  Attempt2 Note2 [outcome] 
id3  Attempt1 Note1 [outcome] 
id3  Attempt2 Note2 [outcome] 
id3  Attempt3 Note3 [outcome] 
id3  Attempt4 Note4 [outcome] 

The outcome value will vary and will be a 2-3 digit numeric code.

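Any help would be greatly appreciated. I have been browsing this site for the last day or two, but with my limited experience I could not find anything relevant, and since I am fairly new to Perl and shell I thought it would be better to post this as a question.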

Kind regards, Ace

Answers

1

I think you are looking for something like this. It prints CSV, which can be opened with Excel:

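use strict;

# Slurp mode: <DATA> returns everything after __DATA__ as one string.
local $/;

# Split the input into blank-line-separated blocks and hand each block,
# together with the id found in it, to block().
block(/(id\d+)/, $_) for split /\n\n/, <DATA>;

sub block {
    my ($id, $block) = @_;

    # Drop everything before the first Attempt line.
    $block =~ s/.*?(?=Attempt)//s;

    # One CSV row per attempt: id, attempt name, note (the last line of
    # the chunk), and the numeric outcome code.
    print join(',', $id, /(Attempt\d+)/, /([^\n]+)$/, /outcome (\d+)/)."\n"
        for split /(?=Attempt)/, $block;
}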
+0

CPAN also has a simple Excel module that might be useful for this. – Sorpigal

2

Using GNU awk (needed for the regex capture groups):

gawk ' 
    /^$/ {next} 
    match($0, /Comment for the record "([^"]*)/, a) {id = a[1]; next} 
    match($0, /(.+) made on .* outcome (.+)/, a) {att = a[1]; out = a[2]; next} 
    {printf("%s\t%s\t%s\t%s\n", id, att, $0, out)} 
' 

Or, translated to Perl:

perl -lne ' 
    chomp; 
    next if /^$/; 
    if (/Comment for the record "([^"]*)/) {$id = $1; next;} 
    if (/(.+) made on .* outcome (.+)/) {$att = $1; $out = $2; next;} 
    print join("\t", $id, $att, $_, $out); 
' 
1

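Unless I'm missing something, it looks pretty straightforward:

  • You look for a line that starts with Comment. This will contain your ID.
  • Once you have an ID, you will have an attempt line followed by a note line. Read the attempt line; the line after it contains the note.
  • When you get to the next Comment, you rinse and repeat.

We have a specific structure: each ID has an array of attempts, and each attempt contains an outcome and a note.

I'm going to use object-oriented Perl here. I'll put all the record IDs into a list called @dataList, and each item in this list will be of type Id.

Each Id will consist of an array of Attempts, and each Attempt will have an Id, a Time, an Outcome, and a Note.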

#! /usr/bin/perl 
# test.pl 

use strict; 
use warnings; 
use feature qw(say); 

######################################################################## 
# READ IN AND PARSE YOUR DATA 
# 

my @dataList; 

my $record; 
while (my $line = <DATA>) {
    chomp $line;
    if ($line =~ /^Comment for the record "(.*)"/) {
        my $id = $1;
        $record = Id->new($id);
        push @dataList, $record;
    }
    elsif ($line =~ /^(\S+)\s+made on\s(\S+)\soutcome\s(.*)/) {
        my $attemptId = $1;
        my $time = $2;
        my $outcome = $3;

        # Next line is the note

        chomp (my $note = <DATA>);
        my $attempt = Attempt->new($attemptId, $time, $outcome, $note);
        $record->PushAttempt($attempt);
    }
}

foreach my $id (@dataList) {
    foreach my $attempt ($id->Attempt) {
        print $id->Id . "\t";
        print $attempt->Id . "\t";
        print $attempt->Note . "\t";
        print $attempt->Outcome . "\n";
    }
}
# 
######################################################################## 


######################################################################## 
# PACKAGE Id; 
# 
package Id; 
use Carp; 

sub new { 
    my $class  = shift; 
    my $id = shift; 

    my $self = {}; 

    bless $self, $class; 

    $self->Id($id); 

    return $self; 
} 

sub Id { 
    my $self = shift; 
    my $id = shift; 

    if (defined $id) {
        $self->{ID} = $id;
    }

    return $self->{ID}; 
} 

sub PushAttempt { 
    my $self  = shift; 
    my $attempt = shift; 

    if (not defined $attempt) {
        croak qq(Missing Attempt in call to Id->PushAttempt);
    }
    if (not exists ${$self}{ATTEMPT}) {
        $self->{ATTEMPT} = [];
    }
    push @{$self->{ATTEMPT}}, $attempt; 

    return $attempt; 
} 

sub PopAttempt { 
    my $self = shift; 

    return pop @{$self->{ATTEMPT}}; 
} 

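sub Attempt {
    my $self = shift;
    # Return an empty list (rather than dying under strict refs)
    # when no attempt has been pushed yet.
    return @{ $self->{ATTEMPT} || [] };
}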


# 
######################################################################## 

######################################################################## 
# PACKAGE Attempt 
# 
package Attempt; 

sub new { 
    my $class  = shift; 
    my $id = shift; 
    my $time  = shift; 
    # Argument order must match the call Attempt->new($attemptId, $time, $outcome, $note).
    my $outcome = shift;
    my $note  = shift;

    my $self = {}; 
    bless $self, $class; 

    $self->Id($id); 
    $self->Time($time); 
    $self->Note($note); 
    $self->Outcome($outcome); 

    return $self; 
} 

sub Id { 
    my $self = shift; 
    my $id = shift; 


    if (defined $id) {
        $self->{ID} = $id;
    }

    return $self->{ID}; 
} 

sub Time { 
    my $self = shift; 
    my $time = shift; 

    if (defined $time) {
        $self->{TIME} = $time;
    }

    return $self->{TIME}; 
} 

sub Note { 
    my $self = shift; 
    my $note = shift; 

    if (defined $note) {
        $self->{NOTE} = $note;
    }

    return $self->{NOTE}; 
} 

sub Outcome { 
    my $self  = shift; 
    my $outcome = shift; 

    if (defined $outcome) {
        $self->{OUTCOME} = $outcome;
    }

    return $self->{OUTCOME}; 
} 
# 
######################################################################## 

package main; 

__DATA__ 
Comment for the record "id1" 
Attempt1 made on [time] outcome [outcome11] 
note 11 

Comment for the record "id2" 
Attempt21 made on [time] outcome [outcome21] 
note 21 
Attempt22 made on [time] outcome [outcome22] 
note 22 

Comment for the record "id3" 
Attempt31 made on [time] outcome [outcome31] 
note 31 
Attempt32 made on [time] outcome [outcome32] 
note 32 
Attempt33 made on [time] outcome [outcome33] 
note 33 
Attempt34 made on [time] outcome [outcome34] 
note 34 
0

This may not be very robust, but here is a fun attempt with sed:

sed -r -n 's/Comment for the record "([^"]+)"$/\1/;tgo;bnormal;:go {h;n;};:normal /^Attempt[0-9]/{s/(.+) made on .* outcome (.+)$/\1 \2/;G;s/\n/ /;s/(.+) (.+) (.+)/\3\t\1\t\2/;N;s/\t([^\t]+)\n(.+)/\t\2\t\1/;p;d;}' data.txt 

Note: GNU sed only. Portability is easy to achieve if needed.

2

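Your data fits nicely with a paragraph-oriented parsing strategy. Because your specification is vague, it is hard to know exactly which regexes are needed, but this should illustrate the general approach, based on your example: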

use strict; 
use warnings; 

# Paragraph mode: read the input file a paragraph/block at a time. 
local $/ = ""; 

while (my $block = <>){ 
    # Convert the block to lines. 
    my @lines = grep /\S/, split("\n", $block); 

    # Parse the text, capturing the needed items from @lines as we consume it.
    # Note also the technique of assigning regex captures directly to variables.
    my ($id) = shift(@lines) =~ /"(.+)"/;
    while (@lines){
        my ($attempt, $outcome) = shift(@lines) =~ /(Attempt\d+).+outcome (\d+)/;
        my $note = shift @lines;
        print join("\t", $id, $attempt, $note, $outcome), "\n";
    }
} 
+1

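Setting '$/ = "\n\n"' means exactly two newlines, whereas the recommended setting '$/ = ""' means **two or more newlines**, so it is forgiving about how many blank lines there are, and each record always starts with real data. – tchrist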

+0

@tchrist Didn't know that. Thanks for the tip. – FMc

0

An awk one-liner:

kent$ awk 'NF==5{gsub(/\"/,"",$5);id=$5;next;} /^Attempt/{n=$1;gsub(/Attempt/,"Note",n);print id,$1,n,$6}' input      
id1 Attempt1 Note1 [outcome] 
id2 Attempt1 Note1 [outcome] 
id2 Attempt2 Note2 [outcome] 
id3 Attempt1 Note1 [outcome] 
id3 Attempt2 Note2 [outcome] 
id3 Attempt3 Note3 [outcome] 
id3 Attempt4 Note4 [outcome] 