2016-07-25 107 views

Parsing data

I have a file, a.dat, whose data looks like this:

01/Jul/2016 00:05:09  8438.2 
01/Jul/2016 00:05:19  8422.4 g 

I want to parse it into three columns: a timestamp, a float, and a string (either empty or g).

I have tried:

df=pd.read_csv('a.dat',sep='  | ',engine='python') 

which ends up with four columns: date, time, float, and g. I also tried:

df=pd.read_csv('a.dat',sep='  | (g)',engine='python') 

which gives five columns, with columns 1 and 4 containing only NaN.

Is there a better way to create the DataFrame without any post-processing?

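Answer

You can use read_csv with a whitespace separator, name the four fields, and let parse_dates merge the date and time columns: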

import pandas as pd 
import io 

temp=u'''01/Jul/2016 00:05:09  8438.2 
01/Jul/2016 00:05:19  8422.4 g''' 
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), 
       sep=r'\s+',  # raw string avoids an invalid-escape warning
       names=['date','time','float','string'], 
       parse_dates=[['date','time']]) 
print(df)
            date_time   float string
0 2016-07-01 00:05:09  8438.2    NaN
1 2016-07-01 00:05:19  8422.4      g

Or, with delim_whitespace instead of sep:

import pandas as pd 
import io 

temp=u'''01/Jul/2016 00:05:09  8438.2 
01/Jul/2016 00:05:19  8422.4 g''' 
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), 
       delim_whitespace=True, 
       names=['date','time','float','string'], 
       parse_dates=[['date','time']]) 
print(df)
            date_time   float string
0 2016-07-01 00:05:09  8438.2    NaN
1 2016-07-01 00:05:19  8422.4      g
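
In recent pandas releases (2.2 and later, as far as I can tell), both delim_whitespace and the nested-list form of parse_dates are deprecated, so the two variants above may warn or fail there. A minimal sketch of an alternative, assuming one small post-processing step is acceptable: read the four fields as-is, then combine date and time with pd.to_datetime. The intermediate name raw and the explicit format string are my own choices, not part of the original answer.

```python
import io
import pandas as pd

temp = u'''01/Jul/2016 00:05:09  8438.2 
01/Jul/2016 00:05:19  8422.4 g'''

# Read four whitespace-separated fields; rows without the trailing
# flag get NaN in the 'string' column.
raw = pd.read_csv(io.StringIO(temp),
                  sep=r'\s+',
                  names=['date', 'time', 'float', 'string'])

# Combine date and time explicitly; the format matches '01/Jul/2016 00:05:09'.
raw['date_time'] = pd.to_datetime(raw['date'] + ' ' + raw['time'],
                                  format='%d/%b/%Y %H:%M:%S')

# Keep the three columns asked for in the question.
df = raw[['date_time', 'float', 'string']]
print(df)
```

Passing an explicit format also avoids any ambiguity about day-first versus month-first parsing.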

A solution with read_fwf, which infers the fixed-width columns from the data:

import pandas as pd 
import io 

temp=u'''01/Jul/2016 00:05:09  8438.2 
01/Jul/2016 00:05:19  8422.4 g''' 
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_fwf(io.StringIO(temp), 
       names=['date','time','float','string'], 
       parse_dates=[['date','time']]) 
print(df)
            date_time   float string
0 2016-07-01 00:05:09  8438.2    NaN
1 2016-07-01 00:05:19  8422.4      g

You can also specify the column widths explicitly:

df = pd.read_fwf(io.StringIO(temp), 
       widths=[11, 9, 8, 2],  # one width per name, counted from the sample lines
       names=['date','time','float','string'], 
       parse_dates=[['date','time']]) 
print(df)
            date_time   float string
0 2016-07-01 00:05:09  8438.2    NaN
1 2016-07-01 00:05:19  8422.4      g
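
To make the idea of fixed column widths concrete, here is a plain-Python sketch that slices each sample line at cumulative offsets, which is roughly what a fixed-width reader does. The widths [11, 9, 8, 2] are an assumption obtained by counting characters in the two sample lines, and split_fixed is a hypothetical helper, not a pandas function.

```python
from itertools import accumulate

lines = ['01/Jul/2016 00:05:09  8438.2 ',
         '01/Jul/2016 00:05:19  8422.4 g']

# Assumed widths for date, time, float, and the optional flag.
widths = [11, 9, 8, 2]
ends = list(accumulate(widths))   # end offset of each field
starts = [0] + ends[:-1]          # start offset of each field


def split_fixed(line):
    """Cut a line into fixed-width fields and strip the padding."""
    return [line[s:e].strip() for s, e in zip(starts, ends)]


rows = [split_fixed(line) for line in lines]
for row in rows:
    print(row)
```

The first row ends with an empty string where the flag is missing; a reader such as read_fwf would report that field as NaN.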