2016-10-31 169 views

Conditional backfill of a pandas column

I have the following DataFrame:

         DATE  ID      STATUS
0  2014-01-01   1  INPROGRESS
1  2013-03-01   1       ENDED
2  2015-05-01   2  INPROGRESS
3  2012-05-01   1     STARTED
4  2011-05-01   2     STARTED
5  2011-03-01   3     STARTED
6  2011-04-01   3       ENDED
7  2011-06-01   3  INPROGRESS
8  2011-09-01   3     STARTED

Here is the code to build it:

>>> import pandas as pd
>>> df1 = pd.DataFrame(columns=["DATE", "ID", "STATUS"])
>>> df1["DATE"] = ['2014-01-01', '2013-03-01', '2015-05-01', '2012-05-01', '2011-05-01', '2011-03-01', '2011-04-01', '2011-06-01', '2011-09-01'] 
>>> df1["ID"] = [1,1,2,1,2,3,3,3,3] 
>>> df1["STATUS"] = ['INPROGRESS', 'ENDED', 'INPROGRESS', 'STARTED', 'STARTED', 'STARTED','ENDED', 'INPROGRESS', 'STARTED'] 

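Within each ID group, the STATUS column describes the state of a task, which can be one of:

STARTED, INPROGRESS or ENDED

in exactly this chronological order (STARTED should not appear after ENDED, and so on).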

Selecting one ID and sorting by date, I get the following for ID 3:

df1[df1['ID'] == 3].sort_values('DATE')

    DATE  ID  STATUS 
5 2011-03-01 3  STARTED 
6 2011-04-01 3  ENDED 
7 2011-06-01 3 INPROGRESS 
8 2011-09-01 3  STARTED 

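Now I need to "fix" the STATUS column so that it follows the order defined above, based on the last status. For ID 3 the last status is STARTED, so every earlier row should be backfilled with STARTED: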

 DATE  ID  STATUS 
5 2011-03-01 3  STARTED 
6 2011-04-01 3  STARTED 
7 2011-06-01 3  STARTED 
8 2011-09-01 3  STARTED 

For ID 1:

df1[df1['ID'] == 1].sort_values('DATE')
    DATE ID  STATUS 
3 2012-05-01 1  STARTED 
1 2013-03-01 1  ENDED 
0 2014-01-01 1 INPROGRESS 

I would want to end up with the last two statuses as INPROGRESS and the first one as STARTED:

df1[df1['ID'] == 1].sort_values('DATE')
    DATE ID  STATUS 
3 2012-05-01 1  STARTED 
1 2013-03-01 1 INPROGRESS 
0 2014-01-01 1 INPROGRESS 

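The order for ID 2 is already correct.

Any ideas on how to do this with pandas? I tried grouping by ID, and I was thinking about a backfill based on the last status, but I don't know how to stop the backfill at the right point.

Thanks!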

Answer


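A classic approach is to forget that your statuses are labels: treat them instead as strictly increasing numbers, such as 0 for STARTED, 1 for INPROGRESS and 2 for ENDED. With such a column you can check the monotonicity of these numbers within each group and backfill wherever it breaks.

Prepare your DataFrame:

df = df1.copy()  # df1 is the frame built in the question
keymapping = {'STARTED': 0, 'INPROGRESS': 1, 'ENDED': 2}
df['STATUS_ID'] = df.STATUS.map(keymapping)
df.set_index(['ID', 'DATE'], inplace=True)
df.sort_index(inplace=True)

Now group by ID and use transform to propagate the last value of each group across that group's whole index, so that you can assign it to your DataFrame as a new column: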

df['STATUS_LAST'] = df.groupby(level='ID').STATUS_ID.transform('last')

df 
Out[63]: 
                   STATUS  STATUS_ID  STATUS_LAST
ID DATE
1  2012-05-01     STARTED          0            1
   2013-03-01       ENDED          2            1
   2014-01-01  INPROGRESS          1            1
2  2011-05-01     STARTED          0            1
   2015-05-01  INPROGRESS          1            1
3  2011-03-01     STARTED          0            0
   2011-04-01       ENDED          2            0
   2011-06-01  INPROGRESS          1            0
   2011-09-01     STARTED          0            0
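
Before overwriting anything, it can help to see which IDs actually break the order. The following is only a sketch of the "check the monotonicity per group" idea mentioned above, rebuilt from the question's data; the name `in_order` is made up for illustration, and it assumes the dates are ISO-formatted strings so that plain string sorting is chronological.

```python
import pandas as pd

# Data copied from the question.
df1 = pd.DataFrame({
    "DATE": ["2014-01-01", "2013-03-01", "2015-05-01", "2012-05-01", "2011-05-01",
             "2011-03-01", "2011-04-01", "2011-06-01", "2011-09-01"],
    "ID": [1, 1, 2, 1, 2, 3, 3, 3, 3],
    "STATUS": ["INPROGRESS", "ENDED", "INPROGRESS", "STARTED", "STARTED",
               "STARTED", "ENDED", "INPROGRESS", "STARTED"],
})
keymapping = {"STARTED": 0, "INPROGRESS": 1, "ENDED": 2}

# Sort chronologically, then encode the labels as increasing numbers.
ordered = df1.sort_values("DATE")
status_id = ordered["STATUS"].map(keymapping)

# One boolean per ID: True when the encoded statuses never decrease over time.
in_order = status_id.groupby(ordered["ID"]).apply(lambda s: s.is_monotonic_increasing)
print(in_order)
```

With the question's data, only ID 2 is reported as being in order.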

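Finally, apply the backfill by comparing STATUS_ID against STATUS_LAST: each STATUS_ID value is valid if it is lower than or equal to STATUS_LAST, and is replaced by STATUS_LAST otherwise: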

df.STATUS_ID = df.STATUS_ID.where(df.STATUS_ID <= df.STATUS_LAST, df.STATUS_LAST) 
df.STATUS_ID 
Out[65]: 
ID DATE  
1 2012-05-01 0 
    2013-03-01 1 
    2014-01-01 1 
2 2011-05-01 0 
    2015-05-01 1 
3 2011-03-01 0 
    2011-04-01 0 
    2011-06-01 0 
    2011-09-01 0 
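
Note that the `where` step caps every value at the group's last status. That is enough for this data, but on its own it does not guarantee a non-decreasing sequence: a group reading INPROGRESS, STARTED, ENDED would pass through unchanged. One possible alternative, which is not part of the answer above, is a reverse cumulative minimum per group. The sketch below assumes the same `keymapping` and ISO-formatted date strings; `fixed`, `labels` and `backfilled` are illustrative names.

```python
import pandas as pd

keymapping = {"STARTED": 0, "INPROGRESS": 1, "ENDED": 2}
labels = {v: k for k, v in keymapping.items()}

# Data copied from the question.
df1 = pd.DataFrame({
    "DATE": ["2014-01-01", "2013-03-01", "2015-05-01", "2012-05-01", "2011-05-01",
             "2011-03-01", "2011-04-01", "2011-06-01", "2011-09-01"],
    "ID": [1, 1, 2, 1, 2, 3, 3, 3, 3],
    "STATUS": ["INPROGRESS", "ENDED", "INPROGRESS", "STARTED", "STARTED",
               "STARTED", "ENDED", "INPROGRESS", "STARTED"],
})

# Order rows chronologically within each ID.
fixed = df1.sort_values(["ID", "DATE"]).reset_index(drop=True)
status_id = fixed["STATUS"].map(keymapping)

# Walk each group from its last row backwards, carrying the smallest value
# seen so far, then flip the result back into chronological order.
reversed_ids = fixed["ID"].iloc[::-1]
backfilled = status_id.iloc[::-1].groupby(reversed_ids).cummin().iloc[::-1]

fixed["STATUS"] = backfilled.map(labels)
print(fixed)
```

On the question's data this gives the same statuses as the `where` approach; the two only differ for groups that dip in the middle while ending on a high status.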

Reverse the mapping to get the labels back, which you can then assign to STATUS:

df.STATUS_ID.map({v:k for k,v in keymapping.items()}) 
Out[66]: 
ID DATE  
1 2012-05-01  STARTED 
    2013-03-01 INPROGRESS 
    2014-01-01 INPROGRESS 
2 2011-05-01  STARTED 
    2015-05-01 INPROGRESS 
3 2011-03-01  STARTED 
    2011-04-01  STARTED 
    2011-06-01  STARTED 
    2011-09-01  STARTED 
Name: STATUS_ID, dtype: object
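
For convenience, the steps above can be collected into one function that also writes the labels back into STATUS. This is a minimal sketch rather than a definitive implementation: the function name `clip_to_last_status` is hypothetical, it works on a sorted copy with a plain integer index instead of setting a MultiIndex in place, and it assumes every status appears in the mapping.

```python
import pandas as pd

KEYMAPPING = {"STARTED": 0, "INPROGRESS": 1, "ENDED": 2}


def clip_to_last_status(frame):
    """Return a copy sorted by ID and DATE with STATUS capped at each ID's last status."""
    out = frame.sort_values(["ID", "DATE"]).reset_index(drop=True)
    status_id = out["STATUS"].map(KEYMAPPING)
    # Broadcast the chronologically last status of each ID to all of its rows.
    status_last = status_id.groupby(out["ID"]).transform("last")
    # Keep values that do not exceed the last status, replace the others.
    clipped = status_id.where(status_id <= status_last, status_last)
    # Translate the numbers back into labels.
    out["STATUS"] = clipped.map({v: k for k, v in KEYMAPPING.items()})
    return out


# Data copied from the question.
df1 = pd.DataFrame({
    "DATE": ["2014-01-01", "2013-03-01", "2015-05-01", "2012-05-01", "2011-05-01",
             "2011-03-01", "2011-04-01", "2011-06-01", "2011-09-01"],
    "ID": [1, 1, 2, 1, 2, 3, 3, 3, 3],
    "STATUS": ["INPROGRESS", "ENDED", "INPROGRESS", "STARTED", "STARTED",
               "STARTED", "ENDED", "INPROGRESS", "STARTED"],
})

result = clip_to_last_status(df1)
print(result)
```

Returning a new frame leaves `df1` untouched, so the original and the repaired statuses can be compared side by side.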