2014-08-27 21 views
0

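Mailgun report to CSV format with Perl

I have a question. I want to write a Perl script that parses Mailgun output into CSV format. I assume the `split` and `join` functions would work well for this. Here is some sample data:

Sample data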

{
    "geolocation": {
        "city": "Random City",
        "region": "State",
        "country": "US"
    },
    "url": "https://www4.website.com/register/1234567",
    "timestamp": "1237854980723.0239847"
}

{
    "geolocation": {
        "city": "Random City2",
        "region": "State2",
        "country": "mEXICO"
    },
    "url": "https://www4.website2.com/register/ABCDE567",
    "timestamp": "1237854980723.0239847"
}

Desired output

"city","region","country","url","timestamp"

"Random City","State","US","https://www4.website.com/register/1234567","1237854980723.0239847"

"Random City2","State2","mEXICO","https://www4.website2.com/register/ABCDE567","1237854980723.0239847"

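My goal is to turn my sample data into a comma-separated CSV file. I'm not sure how to approach this. Normally I would hack my way through it with a series of one-liners in a batch file, but I would prefer a Perl script. The real data will contain more fields, but once I work out how to parse the general structure I'll be fine.

Here is what I have in a batch file.

Code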

perl -p -i.bak -e "s/(,$|,+ +$|^.*?{$|^.*?}.*?$|^.*?],.*?$)//gi" file.txt
rem Removes all unnecessary characters and lines with { and }.^

perl -p -i.bak -e "s/(^ +| +$)//gi" file.txt
perl -p -i.bak -e "s/^\n$//gi" file.txt
rem Removes all blank lines in initial file. Next one-liner takes care of trailing and beginning
rem whitespace. The file is nice and clean now.

perl -p -e "s/(^\".*?\"):.*?$/$1/gi" file.txt > header.txt
rem retains only header info and puts into 'header.txt'^

perl -p -e "s/^\".*?\": +(\".*?\"$)/$1/gi" file.txt > data.txt
rem retains only data that is associated with each field.

perl -p -i.bak -e "s/\n/,/gi" data.txt
rem replaces new line character with ',' delimiter.

perl -p -i.bak -e "s/^/\n/gi" data.txt
rem drops data down a line

perl -p -i.bak -e "s/\n/,/gi" header.txt
rem replaces new line character with ',' delimiter.

copy header.txt+data.txt report.txt
rem copies both files together. Since there is the same amount of fields as there are data
rem delimiters, the columns and headers match.

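My output

"city","region","country","url","timestamp"

"Random City","State","US","https://www4.website.com/register/1234567","1237854980723.0239847"

This does the trick, but a more condensed script would be better. Variations in the input would break this batch script, and I need something more solid. Any suggestions?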

+2

Use [JSON](https://metacpan.org/pod/JSON). – jm666 2014-08-27 21:38:48

Answers

1

You can do this with a single Perl script and one regular expression:

#!/usr/bin/env perl
use v5.10;
use strict;
use warnings;

$_ = <<TXT;
{
    "geolocation": {
        "city": "Random City",
        "region": "State",
        "country": "US"
    },
    "url": "https://www4.website.com/register/1234567",
    "timestamp": "1237854980723.0239847"
}
TXT

# Capture every "key": "value" pair; the quotes are kept in both captures
my @matches = /("[^"]+")\s*:\s*("[^"]+")/g;
my %hash = @matches;

# Hash order is not predictable, so sort the keys and print the values in the same order
my @keys = sort keys %hash;
say join(",", @keys);
say join(",", @hash{@keys});

Which outputs this:

"city","country","region","timestamp","url" 
"Random City","US","State","1237854980723.0239847","https://www4.website.com/register/1234567" 

Of course, if you want to read from standard input instead, replace the string definition with:

local $/ = undef; 
$_ = <>; 

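If you want more robust code, I suggest first matching each block of data enclosed in braces, and then searching for key: value pairs inside each block.

I would write this program.pl file: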

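#!/usr/bin/env perl
use v5.10;
use strict;
use warnings;

# Slurp the whole file named on the command line
my $path = shift @ARGV or die "Usage: $0 FILE\n";
open my $fh, '<', $path or die "Cannot open $path: $!";
local $/ = undef;
$_ = <$fh>;
close $fh;

# Match all groups { ... }, including nested braces
my @groups = /((?&BRACKETED))
(?(DEFINE)
    (?<WORD>      [^\{\}]+)
    (?<BRACKETED> \s* \{ (?&TEXT)? \s* \})
    (?<TEXT>      (?: (?&WORD) | (?&BRACKETED))+)
)/gmx;

# Match any key:value pairs inside each group.
# The named groups also come back in the list (undefined), so skip them.
my @results;
for my $group (grep { defined && length } @groups) {
    my %pairs = $group =~ /"([^"]+)"\s*:\s*("[^"]+")/g;
    push @results, \%pairs;
}

# For each result, print the keys we want, in a fixed order
for my $row (@results) {
    say join ",", @{$row}{qw/city region country url timestamp/};
}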

Then a batch file to call the script:

rem How to call it... 
@perl program.pl text.txt > report.txt 
+0

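I like your answer. It works the way I wanted, but please look at the edit I just made to my question: see the desired output and the re-edited sample data. What if there are two sets of data? The CSV would contain the extracted header, and below it data row 1, data row 2, and so on. @coin – JDE876 2014-09-02 17:53:18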

+0

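@JDE876 The second version of the script will output what you expect: one line per city, so two lines. However, rather than using regular expressions to parse your data, I would suggest using a JSON parser. – nowox 2014-09-02 18:27:57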

+0

Is there any way you could provide an example that replaces the regex with a JSON parser? @coin – JDE876 2014-09-02 19:17:25

0

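@coin's regex-fu is nothing to sneeze at, but the advantages of using CPAN modules include getting a more flexible solution and benefiting from edge-case handling that others have already worked out.

This solution uses the JSON module to parse your incoming data (I'm assuming it continues to look like JSON), and a CSV module to produce high-quality CSV that handles things like embedded quotes and commas in your data.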

use warnings;
use strict;

use JSON qw/decode_json/;
use Text::CSV_XS;

my $json_data_as_string = <<EOL;
{
    "geolocation": {
        "city": "Random City",
        "region": "State",
        "country": "US"
    },
    "url": "https://www4.website.com/register/1234567",
    "timestamp": "1237854980723.0239847"
}
EOL

my $s = decode_json($json_data_as_string);

my $csv = Text::CSV_XS->new({ binary => 1 });

$csv->combine(
    $s->{geolocation}{city},
    $s->{geolocation}{region},
    $s->{geolocation}{country},
    $s->{url},
    $s->{timestamp},
) or die $csv->error_diag;

print $csv->string, "\n";

To read the data from a file into $json_data_as_string, you can use the code from @coin's solution.
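The edited question has two objects one after another, which `decode_json` will not accept as a single document. As a rough sketch of one way to handle that, the incremental parser in the core JSON::PP module can pull out each object in turn. The names `csv_line`, `parse_records` and `report` are hypothetical helpers made up for this illustration, and `csv_line` is only a stand-in for Text::CSV_XS so that the sketch runs with core modules alone. It also assumes every record has the same five fields as the sample.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use JSON::PP;

# Column order for the CSV, taken from the desired output in the question.
my @columns = qw(city region country url timestamp);

# Hypothetical stand-in for Text::CSV_XS: quote every field and
# double any embedded quotes.
sub csv_line {
    my @fields = @_;
    return join ',', map { my $f = $_ // ''; $f =~ s/"/""/g; qq("$f") } @fields;
}

# Decode every top-level JSON object in $text, even though the objects
# are simply concatenated rather than wrapped in an array.
sub parse_records {
    my ($text) = @_;
    my $json = JSON::PP->new;
    $json->incr_parse($text);    # void context: only buffers the text
    my @records;
    while (my $obj = $json->incr_parse) {
        # Flatten the nested "geolocation" object into the top level.
        my %geo = %{ $obj->{geolocation} || {} };
        push @records, { %geo, url => $obj->{url}, timestamp => $obj->{timestamp} };
    }
    return @records;
}

# Build the report: one header line, then one line per record.
sub report {
    my ($text) = @_;
    return join "\n", csv_line(@columns),
        map { csv_line(@{$_}{@columns}) } parse_records($text);
}

# Sample data from the question; in practice this would be slurped from a file.
my $sample = <<'JSON';
{
    "geolocation": { "city": "Random City", "region": "State", "country": "US" },
    "url": "https://www4.website.com/register/1234567",
    "timestamp": "1237854980723.0239847"
}
{
    "geolocation": { "city": "Random City2", "region": "State2", "country": "mEXICO" },
    "url": "https://www4.website2.com/register/ABCDE567",
    "timestamp": "1237854980723.0239847"
}
JSON

print report($sample), "\n";
```

For real data, swapping `csv_line` for `Text::CSV_XS->combine` as shown above is probably the safer choice, since it already deals with embedded newlines and encoding issues that this helper ignores.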