2010-05-13 76 views
1

PowerShell - convert log file to CSV

I have files with records that look like this...

2009-12-18T08:25:22.983Z  1   174 dns:0-apr-credit-cards-uk.pedez.co.uk P http://0-apr-credit-cards-uk.pedez.co.uk/ text/dns #170 20091218082522021+89 sha1:AIDBQOKOYI7OPLVSWEBTIAFVV7SRMLMF - - 
2009-12-18T08:25:22.984Z  1   5 dns:0-60racing.co.uk P http://0-60racing.co.uk/ text/dns #116 20091218082522037+52 sha1:WMII7OOKYQ42G6XPITMHJSMLQFLGCGMG - - 
2009-12-18T08:25:23.066Z  1   79 dns:0-addiction.metapress.com.wam.leeds.ac.uk P http://0-addiction.metapress.com.wam.leeds.ac.uk/ text/dns #042 20091218082522076+20 sha1:NSUQN6TBIECAP5VG6TZJ5AVY34ANIC7R - - 
...plus millions of other records 

I need to convert these into a CSV file...

"2009-12-18T08:25:22.983Z","1","174","dns:0-apr-credit-cards-uk.pedez.co.uk","P","http://0-apr-credit-cards-uk.pedez.co.uk/","text/dns","#170","20091218082522021+89","sha1:AIDBQOKOYI7OPLVSWEBTIAFVV7SRMLMF","-","-" 
"2009-12-18T08:25:22.984Z","1","5","dns:0-60racing.co.uk","P","http://0-60racing.co.uk/","text/dns","#116","20091218082522037+52","sha1:WMII7OOKYQ42G6XPITMHJSMLQFLGCGMG","-","-" 
"2009-12-18T08:25:23.066Z","1","79","dns:0-addiction.metapress.com.wam.leeds.ac.uk","P","http://0-addiction.metapress.com.wam.leeds.ac.uk/","text/dns","#042","20091218082522076+20","sha1:NSUQN6TBIECAP5VG6TZJ5AVY34ANIC7R","-","-" 

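The field separator can be one or more space characters, and the records mix fixed-width and variable-width fields. This tends to confuse most of the CSV parsers I have found.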
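To make the separator rule concrete, here is a minimal sketch that splits one of the sample records on runs of spaces with PowerShell's `-split` operator. It assumes no field contains an embedded space, which holds for the sample lines but is not guaranteed for the whole log.

```powershell
# One sample record from the question; fields are separated by one or more spaces.
$line = '2009-12-18T08:25:22.984Z  1   5 dns:0-60racing.co.uk P http://0-60racing.co.uk/ text/dns #116 20091218082522037+52 sha1:WMII7OOKYQ42G6XPITMHJSMLQFLGCGMG - - '

# Trim first so the trailing space does not produce an empty last field,
# then split on any run of spaces.
$fields = $line.Trim() -split ' +'

$fields.Count   # 12
$fields[2]      # 5
```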

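Ultimately I want to load these files into SQL Server, but the import only lets me specify a single character as the field delimiter (i.e. ' '), and that breaks the fixed-length fields.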

So far I am using PowerShell:

gc -ReadCount 10 -TotalCount 200 .\crawl_sample.log | foreach { ([regex]'([\S]*)\s+').matches($_) } | foreach {$_.Groups[1].Value} 

This returns a stream of fields:

2009-12-18T08:25:22.983Z 
1 
74 
dns:0-apr-credit-cards-uk.pedez.co.uk 
P 
http://0-apr-credit-cards-uk.pedez.co.uk/ 
text/dns 
#170 
20091218082522021+89 
sha1:AIDBQOKOYI7OPLVSWEBTIAFVV7SRMLMF 
- 
- 
2009-12-18T08:25:22.984Z 
1 
55 
dns:0-60racing.co.uk 
P 
http://0-60racing.co.uk/ 
text/dns 
#116 
20091218082522037+52 
sha1:WMII7OOKYQ42G6XPITMHJSMLQFLGCGMG 
- 

But how do I convert that output into CSV format?

+0

You might want to look at my FOSS CSV munging tool, http://code.google.com/p/csvfix; it can do what you want, but only as a multi-stage process. – 2010-05-13 12:30:09

Answers

2

Answering my own question again...

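measure-command { 
    # Match runs of spaces only, so line breaks are left alone
    $q = [regex]" +" 
    # Join all lines into one string, replace each run of spaces with a comma, write the result
    $q.Replace(([string]::join([environment]::newline, (Get-Content -ReadCount 1 .\crawl_sample2.log))), ",") > crawl_sample2.csv 
} 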

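...and it's fast!

Observations:

  • I had been using \s+ as the regex delimiter, and that was breaking the newlines, because \s also matches line breaks
  • Get-Content -ReadCount 1 streams the file as single-line arrays for the regex to work on
  • The resulting string is then redirected to the new file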
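The first observation can be checked with a small sketch. The two-record string below is made up for illustration; it only shows why `\s+` merges records once the lines have been joined, while ` +` leaves the line break in place.

```powershell
# Two made-up records joined by a newline, as they would be after [string]::Join.
$text = "a  b`nc   d"

# \s matches the newline as well, so both records collapse into one line.
$collapsed = $text -replace '\s+', ','   # a,b,c,d

# A literal space followed by + matches runs of spaces only; the newline survives.
$kept = $text -replace ' +', ','         # a,b and c,d stay on separate lines
```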

UPDATE

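This script works, but it uses a huge amount of memory when processing large files. So how can I do the same thing without needing 8GB of RAM and swap space?

I think this is caused by the join buffering all the data again... any ideas?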

UPDATE 2

OK - here is a better solution...

Get-Content -readcount 100 -totalcount 100000 .\crawl.log | 
    ForEach-Object { $_ } | 
    foreach { $_ -replace " +", "," } > .\crawl.csv 

A very handy guide for PowerShell: Powershell regular expressions
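The `-replace` pipeline above writes unquoted fields, and any trailing space on a line becomes a trailing comma. If the quoted layout shown in the question is needed, one possible sketch is below. `ConvertTo-QuotedCsvLine` is a hypothetical helper name, not part of the original answer, and it assumes fields never contain spaces or double quotes.

```powershell
# Hypothetical helper: turns one space-separated record into a quoted CSV line.
function ConvertTo-QuotedCsvLine {
    param([string]$Line)
    # Drop leading/trailing spaces, then split on runs of spaces.
    $fields = $Line.Trim() -split ' +'
    # Wrap every field in double quotes and separate the fields with commas.
    '"' + ($fields -join '","') + '"'
}

# Shortened sample record, for illustration only.
$result = ConvertTo-QuotedCsvLine '2009-12-18T08:25:22.984Z  1   5 dns:0-60racing.co.uk P - - '
$result   # "2009-12-18T08:25:22.984Z","1","5","dns:0-60racing.co.uk","P","-","-"
```

It could replace the last stage of the streaming pipeline, for example `foreach { ConvertTo-QuotedCsvLine $_ }`, so the file is still processed line by line and is not held in memory as one string.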

+0

...any better solutions or improvements to the script are welcome! – Guy 2010-05-13 12:44:29

+1

You can simplify this by getting rid of the intermediate Foreach-Object, because -replace operates on string arrays, e.g. `'a b','c d','e f' -replace ' +',','`. Try this: `gc crawl.log -read 100 -total 100000 | %{$_ -replace ' +',','} > crawl.csv` – 2010-05-14 00:05:21

+0

Considering `-replace`, it can be even simpler: `(gc crawl.log ...) -replace ' +',',' > crawl.csv` (see my post on *chain of operators*: http://www.leporelo.eu/blog.aspx?id=powershell-tips-and-tricks-3-chain-of-operators) – stej 2010-05-14 08:07:34