2012-09-22 220 views
24

I'm just starting to learn Python and would like to read an Apache log file and put the parts of each line into different lists. Parsing an Apache log file

A line from the file:

172.16.0.3 - - [25/Sep/2002:14:04:19 +0200] "GET/HTTP/1.1" 401 - "" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827"

According to the Apache website, the format is

%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"

I'm able to open the file and just read it, but I don't know how to read it according to that format, so that I can put each part into a list.

+0

Which elements of the line are you interested in? (All of them?) –

+0

The lines vary slightly, but I'd like them read in exactly this format :) – ogward

+0

You misunderstood me. What do you want to extract from each line? The date? The IP? All of it? –

Answers

33

This is a job for regular expressions.

For example:

line = '172.16.0.3 - - [25/Sep/2002:14:04:19 +0200] "GET/HTTP/1.1" 401 - "" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827"'
regex = r'([(\d\.)]+) - - \[(.*?)\] "(.*?)" (\d+) - "(.*?)" "(.*?)"'

import re
print(re.match(regex, line).groups())

The output will be a tuple with the 6 pieces of information from the line (specifically, the groups inside parentheses in that pattern):

('172.16.0.3', '25/Sep/2002:14:04:19 +0200', 'GET/HTTP/1.1', '401', '', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827') 
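If each part should be addressable by name rather than position, the same pattern can be written with named groups. A variant sketch of the answer's regex (the group names `ip`, `date`, `request`, `status`, `referer`, and `agent` are my own choices, not part of the original answer):

```python
import re

# Same fields as the pattern above, but with named groups so each part
# can be retrieved by name instead of by position.
pattern = re.compile(
    r'(?P<ip>[\d.]+) - - '
    r'\[(?P<date>.*?)\] '
    r'"(?P<request>.*?)" '
    r'(?P<status>\d+) - '
    r'"(?P<referer>.*?)" "(?P<agent>.*?)"'
)
line = ('172.16.0.3 - - [25/Sep/2002:14:04:19 +0200] "GET/HTTP/1.1" 401 - '
        '"" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827"')
m = pattern.match(line)
print(m.group('ip'), m.group('status'))  # each field by name
```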
+0

This looks good, but could I somehow have a separate "regex" for each part? Something like ip = '([(\d\.)]+)' date = '.........' req = '...........' – ogward

+0

I think I figured it out. Thanks a lot! – ogward

+1

It doesn't work when I try other lines from the file, e.g. **127.0.0.1 - stefan [01/Apr/2002:12:17:21 +0200] "GET /sit3-shine.7.gif HTTP/1.1" 200 15811 "http://localhost/" "Mozilla/5.0 (compatible; Konqueror/2.2.2-2; Linux)"** – ogward
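The failure on that line happens because the answer's pattern hard-codes literal `-` characters for the remote user and response size. A less hard-coded sketch (not the original answer's regex) that accepts any value in those positions:

```python
import re

# Like the answer's pattern, but (\S+) replaces the hard-coded "-" in the
# identity, user, and response-size positions, so lines with a remote user
# and a byte count also match.
general = r'(\S+) (\S+) (\S+) \[(.*?)\] "(.*?)" (\d+) (\S+) "(.*?)" "(.*?)"'
line = ('127.0.0.1 - stefan [01/Apr/2002:12:17:21 +0200] '
        '"GET /sit3-shine.7.gif HTTP/1.1" 200 15811 "http://localhost/" '
        '"Mozilla/5.0 (compatible; Konqueror/2.2.2-2; Linux)"')
m = re.match(general, line)
print(m.groups())
```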

12

Use a regular expression to split the line into separate "tokens":

>>> row = """172.16.0.3 - - [25/Sep/2002:14:04:19 +0200] "GET/HTTP/1.1" 401 - "" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827" """
>>> import re
>>> [''.join(groups) for groups in re.findall(r'\"(.*?)\"|\[(.*?)\]|(\S+)', row)]
['172.16.0.3', '-', '-', '25/Sep/2002:14:04:19 +0200', 'GET/HTTP/1.1', '401', '-', '', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827']

Another solution is to use a dedicated tool, for example http://pypi.python.org/pypi/pylogsparser/0.4

9

I have created a Python library that does just this: apache-log-parser

>>> import apache_log_parser
>>> from pprint import pprint
>>> line_parser = apache_log_parser.make_parser("%h <<%P>> %t %Dus \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %l %u")
>>> log_line_data = line_parser('127.0.0.1 <<6113>> [16/Aug/2013:15:45:34 +0000] 1966093us "GET/HTTP/1.1" 200 3478 "https://example.com/" "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.18)" - -')
>>> pprint(log_line_data)
{'pid': '6113', 
'remote_host': '127.0.0.1', 
'remote_logname': '-', 
'remote_user': '', 
'request_first_line': 'GET/HTTP/1.1', 
'request_header_referer': 'https://example.com/', 
'request_header_user_agent': 'Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.18)', 
'response_bytes_clf': '3478', 
'status': '200', 
'time_received': '[16/Aug/2013:15:45:34 +0000]', 
'time_us': '1966093'} 
7

Regular expressions seemed extreme and problematic considering the simplicity of the format, so I wrote this little splitter, which may be useful to others:

def apache2_logrow(s):
    ''' Fast split on Apache2 log lines

    http://httpd.apache.org/docs/trunk/logs.html
    '''
    row = []
    qe = qp = None  # quote end character (qe) and quote parts (qp)
    for s in s.replace('\r', '').replace('\n', '').split(' '):
        if qp:
            qp.append(s)
        elif '' == s:  # blanks
            row.append('')
        elif '"' == s[0]:  # begin " quote "
            qp = [s]
            qe = '"'
        elif '[' == s[0]:  # begin [ quote ]
            qp = [s]
            qe = ']'
        else:
            row.append(s)

        l = len(s)
        if l and qe == s[-1]:  # end quote
            if l == 1 or s[-2] != '\\':  # don't end on escaped quotes
                row.append(' '.join(qp)[1:-1].replace('\\' + qe, qe))
                qp = qe = None
    return row
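For comparison, a similar quote-aware split is available in the standard library via `shlex`, if the `[...]` timestamp is first rewritten as a quoted field. A sketch, not part of the original answer; it assumes no literal brackets occur inside the quoted fields:

```python
import shlex

line = ('172.16.0.3 - - [25/Sep/2002:14:04:19 +0200] "GET/HTTP/1.1" 401 - '
        '"" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827"')
# Treat [...] like "..." so shlex keeps the timestamp as a single token.
fields = shlex.split(line.replace('[', '"').replace(']', '"'))
print(fields)
```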
+0

This works well for me in Python 3, unlike the regex examples. Thanks. –

1

Add this to your Apache httpd.conf to convert the logs to JSON.

LogFormat "{\"time\":\"%t\", \"remoteIP\" :\"%a\", \"host\": \"%V\", \"request_id\": \"%L\", \"request\":\"%U\", \"query\" : \"%q\", \"method\":\"%m\", \"status\":\"%>s\", \"userAgent\":\"%{User-agent}i\", \"referer\":\"%{Referer}i\" }" json_log 

CustomLog /var/log/apache_access_log json_log 
CustomLog "|/usr/bin/python -u apacheLogHandler.py" json_log 

Now you will see your access logs in JSON format. Use the Python code below to parse the continuously updated JSON log.

apacheLogHandler.py

import time

f = open('apache_access_log.log', 'r')
for line in f:  # read all lines already in the file
    print(line.strip())

# keep waiting forever for more lines.
while True:
    line = f.readline()  # just read more
    if line:  # if you got something...
        print('got data:', line.strip())
    time.sleep(1)
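Each line written by that LogFormat is a JSON object, so the actual parsing can be done with the stdlib `json` module. A sketch with a made-up sample line in the shape the LogFormat above produces (the field values are invented for illustration):

```python
import json

# A sample line matching the shape of the LogFormat above
# (values are invented for illustration).
sample = ('{"time":"[24/Mar/2017:19:37:57 +0000]", "remoteIP" :"180.76.15.30", '
          '"host": "example.com", "request":"/shop/", "query" : "", '
          '"method":"GET", "status":"404", "userAgent":"Mozilla/5.0", '
          '"referer":"-" }')
entry = json.loads(sample)
print(entry["method"], entry["status"])  # access fields by key
```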
0
import re


HOST = r'^(?P<host>.*?)'
SPACE = r'\s'
IDENTITY = r'\S+'
USER = r'\S+'
TIME = r'(?P<time>\[.*?\])'
REQUEST = r'\"(?P<request>.*?)\"'
STATUS = r'(?P<status>\d{3})'
SIZE = r'(?P<size>\S+)'

REGEX = HOST+SPACE+IDENTITY+SPACE+USER+SPACE+TIME+SPACE+REQUEST+SPACE+STATUS+SPACE+SIZE+SPACE

def parser(log_line):
    match = re.search(REGEX, log_line)
    return (match.group('host'),
            match.group('time'),
            match.group('request'),
            match.group('status'),
            match.group('size'))


logLine = """180.76.15.30 - - [24/Mar/2017:19:37:57 +0000] "GET /shop/page/32/?count=15&orderby=title&add_to_wishlist=4846 HTTP/1.1" 404 10202 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"""
result = parser(logLine)
print(result)