2012-10-27

What is the best way to parse a log file like this one in R? How do I parse web server logs using R?

- - - [20/Nov/2011:01:16:29 +0100] "POST /csw/servlet/cswservlet HTTP/1.1" 200 279 
- - - [20/Nov/2011:01:16:29 +0100] "GET /DescargaFenomenos/index.jsp HTTP/1.1" 200 11769 
- - - [20/Nov/2011:01:16:29 +0100] "GET /IDEE-ServicesSearch/ServicesSearch.html?locale=es HTTP/1.1" 200 1665 
- - - [20/Nov/2011:01:16:29 +0100] "GET /search/indexLayout.jsp?PAGELANGUAGE=es HTTP/1.1" 200 9874 
- - - [20/Nov/2011:01:16:29 +0100] "GET /clientesIGN/wmsGenericClient/index.html?lang=ES HTTP/1.1" 200 12058 
- - - [20/Nov/2011:01:16:30 +0100] "POST /csw/servlet/cswservlet HTTP/1.1" 200 258038 
- - - [20/Nov/2011:01:17:09 +0100] "GET //DescargaFenomenos/index.jsp HTTP/1.1" 200 11769 
- - - [20/Nov/2011:01:17:33 +0100] "GET //DescargaFenomenos/index.jsp HTTP/1.1" 200 11769 
- - - [20/Nov/2011:01:17:33 +0100] "GET //show.do?to=pideep_pidee.ES HTTP/1.1" 200 26647 
192.168.69.10, 62.97.81.202 - - [20/Nov/2011:01:17:34 +0100] "POST /csw/?locale=es HTTP/1.0" 200 2536 
192.168.69.10, 62.97.81.202 - - [20/Nov/2011:01:17:34 +0100] "GET /DescargaFenomenos/index.jsp HTTP/1.0" 200 11769 
192.168.69.10, 62.97.81.202 - - [20/Nov/2011:01:17:34 +0100] "GET /clientesIGN/wmsGenericClient/index.html?lang=ES HTTP/1.0" 200 12058 
- - - [20/Nov/2011:01:17:39 +0100] "GET //csw/servlet/cswservlet?request=GetCapabilities&service=CSW&version=2.0.2 HTTP/1.1" 200 8867 
- - - [20/Nov/2011:01:17:46 +0100] "GET //csw/servlet/cswservlet?request=GetCapabilities&service=CSW&version=2.0.2 HTTP/1.1" 200 8867 
- - - [20/Nov/2011:01:18:10 +0100] "GET //show.do?to=pideep_pidee.ES HTTP/1.1" 200 26647 
- - - [20/Nov/2011:01:19:01 +0100] "GET //DescargaFenomenos/index.jsp HTTP/1.1" 200 11769 

I have to take edge cases into account, such as lines that contain two IPs (internal and external).

Thanks!


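Regular expressions could probably be used to parse this, much as you would in Perl. My question is: what do you want the data to look like in the end? –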


Get ready to write bitchy regular expressions. Tagged expressions and 'gsub' are your friends. – aL3xa


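If you want to make your life easier, Apache has a very flexible way of specifying what the log file looks like. This "common log" format is a pain because half of it is space-separated, another half is delimited by square brackets, another half is wrapped in quotes, and another half is comma-separated... it just doesn't add up. See http://httpd.apache.org/docs/1.3/logs.html for how to reconfigure the logs and make them sane (assuming you have access to the web server). – Spacedman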

Answers


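For this example, it is enough to replace the leading dash with two NAs separated by a space, so that every line has the same number of leading fields as the lines with two IPs. You can then parse the result with read.table():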

datlog <- readLines(textConnection('- - - [20/Nov/2011:01:16:29 +0100] "POST /csw/servlet/cswservlet HTTP/1.1" 200 279 
- - - [20/Nov/2011:01:16:29 +0100] "GET /DescargaFenomenos/index.jsp HTTP/1.1" 200 11769 
- - - [20/Nov/2011:01:16:29 +0100] "GET /IDEE-ServicesSearch/ServicesSearch.html?locale=es HTTP/1.1" 200 1665 
- - - [20/Nov/2011:01:16:29 +0100] "GET /search/indexLayout.jsp?PAGELANGUAGE=es HTTP/1.1" 200 9874 
- - - [20/Nov/2011:01:16:29 +0100] "GET /clientesIGN/wmsGenericClient/index.html?lang=ES HTTP/1.1" 200 12058 
- - - [20/Nov/2011:01:16:30 +0100] "POST /csw/servlet/cswservlet HTTP/1.1" 200 258038 
- - - [20/Nov/2011:01:17:09 +0100] "GET //DescargaFenomenos/index.jsp HTTP/1.1" 200 11769 
- - - [20/Nov/2011:01:17:33 +0100] "GET //DescargaFenomenos/index.jsp HTTP/1.1" 200 11769 
- - - [20/Nov/2011:01:17:33 +0100] "GET //show.do?to=pideep_pidee.ES HTTP/1.1" 200 26647 
192.168.69.10, 62.97.81.202 - - [20/Nov/2011:01:17:34 +0100] "POST /csw/?locale=es HTTP/1.0" 200 2536 
192.168.69.10, 62.97.81.202 - - [20/Nov/2011:01:17:34 +0100] "GET /DescargaFenomenos/index.jsp HTTP/1.0" 200 11769 
192.168.69.10, 62.97.81.202 - - [20/Nov/2011:01:17:34 +0100] "GET /clientesIGN/wmsGenericClient/index.html?lang=ES HTTP/1.0" 200 12058 
- - - [20/Nov/2011:01:17:39 +0100] "GET //csw/servlet/cswservlet?request=GetCapabilities&service=CSW&version=2.0.2 HTTP/1.1" 200 8867 
- - - [20/Nov/2011:01:17:46 +0100] "GET //csw/servlet/cswservlet?request=GetCapabilities&service=CSW&version=2.0.2 HTTP/1.1" 200 8867 
- - - [20/Nov/2011:01:18:10 +0100] "GET //show.do?to=pideep_pidee.ES HTTP/1.1" 200 26647 
- - - [20/Nov/2011:01:19:01 +0100] "GET //DescargaFenomenos/index.jsp HTTP/1.1" 200 11769')) 
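# Pad lines that start with a dash so they have two leading fields, like the two-IP lines
datlog <- sub("^-", "NA NA", datlog)
# Replace the first comma on each line (the one between the two IPs) with a space
datlog <- sub(",", " ", datlog)
# Fields are whitespace-separated; the quoted request stays in a single column
datlog <- read.table(text = datlog, fill = TRUE)
datlog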
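One caveat with the substitution above: sub() replaces the first comma anywhere on the line, so on a line without IPs it would hit a comma inside the request, if there is one. Below is a minimal sketch of a variant that only touches a comma directly after a leading IP address. The second sample line (with a comma in the path) and the column names are made up for illustration and are not part of the original log.

```r
# Two illustrative lines; the second has a made-up path containing a comma.
raw <- c(
  '192.168.69.10, 62.97.81.202 - - [20/Nov/2011:01:17:34 +0100] "POST /csw/?locale=es HTTP/1.0" 200 2536',
  '- - - [20/Nov/2011:01:18:10 +0100] "GET /show.do?ids=1,2 HTTP/1.1" 200 26647'
)

# Pad lines that start with a dash so every line has two leading fields.
padded <- sub("^-", "NA NA", raw)

# Only drop a comma that directly follows a leading dotted address.
padded <- sub("^([0-9.]+), ", "\\1 ", padded)

log_df <- read.table(text = padded, fill = TRUE, stringsAsFactors = FALSE)

# Hypothetical column names, chosen here for readability.
names(log_df) <- c("ip_first", "ip_second", "ident", "user",
                   "time", "tz", "request", "status", "size")
log_df$request
```

The request column keeps its comma because the pattern is anchored to the start of the line. This still assumes at most two leading addresses per line, as in the sample.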

Spacedman asked about date-time parsing:

datlog[['dtime']] <- as.POSIXct(paste(sub("\\[", "", datlog[[5]]), 
             sub("\\]", "", datlog[[6]])), 
           format="%d/%b/%Y:%H:%M:%S %z") 
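To check the format string in isolation, the same conversion can be tried on a single timestamp. This is a minimal sketch with illustrative variable names; it assumes an English locale, because %b matches month abbreviations in the current locale.

```r
# Timestamp copied from the sample log, brackets included.
stamp <- "[20/Nov/2011:01:16:29 +0100]"

# Strip the surrounding square brackets.
clean <- gsub("^\\[|\\]$", "", stamp)

# %z consumes the "+0100" offset on input; tz = "UTC" sets the zone
# the resulting time is expressed in.
dtime <- as.POSIXct(clean, format = "%d/%b/%Y:%H:%M:%S %z", tz = "UTC")

format(dtime, "%Y-%m-%d %H:%M:%S", tz = "UTC")
```

Passing an explicit tz keeps the printed result independent of the machine's local time zone. If the result is NA, the locale is a likely culprit.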

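Will this fail if there is a comma in the query? I don't think they are required to be escaped. Obviously there is also still some work to do to parse the dates. – Spacedman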


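If you mean the regex pattern, then I see that the escape isn't necessary, but unlike many other unnecessary escapes it doesn't throw an error. The request to "parse" is a bit vague. One could also imagine wanting to extract information from the HTML request. –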


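No, I mean a comma in the GET path. The log format is: a dash or one or more comma-space-separated IP addresses, repeated three times, then the date in square brackets, the quoted request, the status code, and the size. Tedious! – Spacedman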
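The layout described in the comment above can also be matched with a single regular expression, which avoids editing the lines before parsing. This is a rough sketch under some assumptions: the request contains no double quote, and the ident and user fields contain no spaces. The third sample line (comma in the path) and all variable and column names are made up for illustration.

```r
# Illustrative lines: no client IP, two IPs, and a made-up path containing a comma.
log_lines <- c(
  '- - - [20/Nov/2011:01:16:29 +0100] "POST /csw/servlet/cswservlet HTTP/1.1" 200 279',
  '192.168.69.10, 62.97.81.202 - - [20/Nov/2011:01:17:34 +0100] "POST /csw/?locale=es HTTP/1.0" 200 2536',
  '- - - [20/Nov/2011:01:18:10 +0100] "GET /show.do?ids=1,2 HTTP/1.1" 200 26647 '
)

# Capture groups: client(s), ident, user, timestamp, request, status, size.
pattern <- '^([^[]+) (\\S+) (\\S+) \\[([^]]+)\\] "([^"]*)" (\\d{3}) (\\d+|-)\\s*$'

matches <- regmatches(log_lines, regexec(pattern, log_lines))

# A successful match has 8 elements: the full match plus 7 groups.
matched <- sapply(matches, length) == 8

# Drop the full match and bind the groups into a character matrix.
fields <- do.call(rbind, lapply(matches[matched], function(m) m[-1]))
parsed <- as.data.frame(fields, stringsAsFactors = FALSE)
names(parsed) <- c("client", "ident", "user", "time", "request", "status", "size")

parsed$status <- as.integer(parsed$status)
parsed$size <- as.integer(ifelse(parsed$size == "-", NA, parsed$size))

# The client field is either "-" or one or more addresses separated by ", ".
ips <- strsplit(parsed$client, ", ", fixed = TRUE)
```

Lines that do not fit the pattern are flagged in matched rather than silently misaligned, so they can be inspected separately. Commas inside the request are left alone because the request is captured as one quoted group.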