2017-10-18

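I am writing a script that parses the request parameters from Apache log files into a pandas DataFrame. How can I combine multiple rows (a MultiIndex?) into a single row in pandas, or how else can I extract the parameters from an Apache log line (a string)?

Some example requests look like this: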

GET /v1/board?id=8504178&limit=1&to=8504177 HTTP/1.1 
GET /v1/connections?from=850417&to=8504177 HTTP/1.1 
GET /v1/location?query=850417 

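There are many parameters and they appear in no fixed order, so I don't think the pandas method extract() will work. That is why I tried extractall() instead. My first regex and the extraction look like this: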

# Raw strings keep the regex escapes (\/, \d, \w, \s) away from Python's own string escaping
req_patt = (r"(?P<request>GET) \/v1\/(?P<resource>connections|stationboard|locations)|"
            r"from=(?P<from>.*?)&|"
            r"to=(?P<to>\d*|\w*)(?P<to_del>&|\s)"
)


df_temp = df['fullrequest'].str.extractall(req_patt) 

This gives me the following DataFrame:

actual output:
index   requests  resources     from     to
(0, 0)  GET       connections   nan      nan
(0, 1)  nan       nan           8504178  nan
(0, 2)  nan       nan           nan      8504177
(1, 0)  GET       stationboard  nan      nan
(1, 1)  nan       nan           nan      8504177

But in the end I would like to have something like this:

expected output:
index   requests  resources     from     to
0       GET       connections   8504178  8504177
1       GET       stationboard  nan      8504177

So my final question is: how do I join these separate rows into a single row?


'(?P<request>GET(?=\s))|\/v1\/(?P<resource>[^?\s]*)|from=(?P<from>[^&\s]*)|to=(?P<to>[^&\s]*)'? – ctwheels
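A pattern in the spirit of the comment above can also be applied line by line with Python's re module, which avoids the MultiIndex altogether. The following is only a sketch: the group names request, resource, from and to are assumptions taken from the question's column names, and parse_request is a hypothetical helper, not part of pandas.

```python
import re
import pandas as pd

# Alternation of named groups; the group names are assumptions based on the question
PATTERN = re.compile(
    r"(?P<request>GET(?=\s))"
    r"|/v1/(?P<resource>[^?\s]*)"
    r"|from=(?P<from>[^&\s]*)"
    r"|to=(?P<to>[^&\s]*)"
)

def parse_request(line):
    """Merge every match found in one log line into a single dict (hypothetical helper)."""
    row = {}
    for match in PATTERN.finditer(line):
        # Each match fills exactly one named group; keep only the groups that took part
        row.update({k: v for k, v in match.groupdict().items() if v is not None})
    return row

# The example requests from the question
lines = [
    "GET /v1/board?id=8504178&limit=1&to=8504177 HTTP/1.1",
    "GET /v1/connections?from=850417&to=8504177 HTTP/1.1",
    "GET /v1/location?query=850417",
]

# One dict per line becomes one row; parameters missing from a line end up as NaN
parsed = pd.DataFrame([parse_request(line) for line in lines])
print(parsed)
```

Because each log line produces exactly one dict, the parameter order within the request does not matter and no rows need to be joined afterwards.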

Answer


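You can use

# Sort every column so that its non-null values come first, then index by the 'index' column
ndf = df.apply(sorted, key=pd.isnull).set_index('index')

# Drop the rows that are now entirely empty
ndf = ndf[~ndf.isnull().all(axis=1)]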

Output:

 
       requests     resources       from         to
index
(0,0)       GET   connections  8504178.0  8504177.0
(0,1)       GET  stationboard        NaN  8504177.0

To get the desired index you can use ndf.index = ndf.index.str[-2]

 
       requests     resources       from         to
index
0           GET   connections  8504178.0  8504177.0
1           GET  stationboard        NaN  8504177.0
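A different technique that may be worth trying is to group the extractall() result by its first index level and take the first non-null value per column. Sorting each column independently can move a value into another request's row when an earlier request lacks that parameter, whereas grouping keeps every value tied to the line it came from. The sketch below uses two made-up request lines chosen to match the expected output in the question, and a slightly simplified pattern; both are assumptions.

```python
import pandas as pd

# Hypothetical sample data, chosen to reproduce the question's expected output
df = pd.DataFrame({"fullrequest": [
    "GET /v1/connections?from=8504178&to=8504177 HTTP/1.1",
    "GET /v1/stationboard?to=8504177 HTTP/1.1",
]})

# Simplified variant of the question's pattern (an assumption, not the original regex)
req_patt = (
    r"(?P<request>GET) /v1/(?P<resource>connections|stationboard|locations)|"
    r"from=(?P<from>[^&\s]*)|"
    r"to=(?P<to>[^&\s]*)"
)

# extractall returns one row per match, indexed by (original row, match number)
matches = df["fullrequest"].str.extractall(req_patt)

# Collapse the match level: first() keeps the first non-null value in each column
collapsed = matches.groupby(level=0).first()
print(collapsed)
```

Lines without any match are dropped by extractall(); if they should stay as all-NaN rows, collapsed.reindex(df.index) brings them back.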