2017-03-22 61 views

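I am transforming some applicant transaction data, and I need to create a new flag column (labeled "DESIRED FLAG" in my example). However, I can't figure out the right loop/apply approach, because there can be many different variations of the logic below. What is the best pandas apply/loop approach for this situation?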

In a perfect world, the history of a sequential application process looks like this, with every "Event Status" set to "Completed":

  • On-Site Interview Kick Off -> Schedule Interviews -> Decision; OR
  • Phone Interview Kick Off -> Schedule Interviews -> Decision

Of course, an applicant can go through many phone and on-site interviews during their application process.

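As the example below shows, a "Schedule Interviews" event is sometimes canceled. In those cases, I need to remove that step and the subsequent steps associated with it. These include "Schedule Interviews", "Decision", and "On-Site Interview Kick Off" or "Phone Interview Kick Off". In addition, other "Event Status" values sometimes appear, such as the "Manually Skipped" seen in the example.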

I also have other types of cases that I need to flag, so I need to keep the original dataframe and only add a new column.

import pandas as pd 

data = {'Employee ID': ["100","100", "100", "100","100","100","100","100","100","100","200", "200", "200","200","200","200","200","300","300", "300", "300","300","300","300"], 
     'Completed On Date': ["2009-01-01","2010-01-01","2011-06-05","2012-07-01","2013-01-01","2014-01-01","2015-01-01","2016-01-01","2017-01-01","2018-01-01","2010-01-01","2011-06-05","2012-07-01","2012-08-15","2013-01-01","2014-01-01","2015-01-01","2009-01-01","2010-01-01","2011-06-05","2012-07-01","2013-01-01","2014-01-01","2015-01-01"], 
     'Event': ["Decision","On-Site Interview Kick Off","Schedule Interviews","Decision","On-Site Interview Kick Off","Schedule Interviews","Decision","Phone Interview Kick Off","Schedule Interviews","Decision","On-Site Interview Kick Off","Schedule Interviews","Decision","Decision","Phone Interview Kick Off","Schedule Interviews","Decision","Job Apply","Phone Interview Kick Off","Schedule Interviews","Decision","On-Site Interview Kick Off","Schedule Interviews","Decision"], 
     'Event Status': ["Completed","Completed","CANCELED","Completed","Completed","Completed","Completed","Completed","Completed","Completed","Completed","CANCELED","Manually Skipped","Completed","Completed","Completed","Completed","Completed","Completed","CANCELED","Completed","Completed","Completed","Completed"], 
     'DESIRED FLAG': ["Keep","Keep","Remove","Remove","Remove","Keep","Keep","Keep","Keep","Keep","Keep","Remove","Remove","Remove","Remove","Keep","Keep","Keep","Keep","Remove","Remove","Remove","Keep","Keep"]} 
df = pd.DataFrame(data, columns=['Employee ID','Completed On Date','Event','Event Status','DESIRED FLAG']) 
df = df.sort_values(by=(['Employee ID','Completed On Date'])) 

df 

It would be very helpful if you could post what the desired output looks like. – pshep123

See the 'DESIRED FLAG' column. That is what the output should look like. Thanks! – Christopher

Got it. It would help to have it presented as a dataframe, but maybe that's just me. – pshep123

Answers

I think the following code solves your problem:

import pandas as pd 

data = {'Employee ID': ["100","100", "100", "100","100","100","100","100","100","100","200", "200", "200","200","200","200","200","300","300", "300", "300","300","300","300"], 
     'Completed On Date': ["2009-01-01","2010-01-01","2011-06-05","2012-07-01","2013-01-01","2014-01-01","2015-01-01","2016-01-01","2017-01-01","2018-01-01","2010-01-01","2011-06-05","2012-07-01","2012-08-15","2013-01-01","2014-01-01","2015-01-01","2009-01-01","2010-01-01","2011-06-05","2012-07-01","2013-01-01","2014-01-01","2015-01-01"], 
     'Event': ["Decision","On-Site Interview Kick Off","Schedule Interviews","Decision","On-Site Interview Kick Off","Schedule Interviews","Decision","Phone Interview Kick Off","Schedule Interviews","Decision","On-Site Interview Kick Off","Schedule Interviews","Decision","Decision","Phone Interview Kick Off","Schedule Interviews","Decision","Job Apply","Phone Interview Kick Off","Schedule Interviews","Decision","On-Site Interview Kick Off","Schedule Interviews","Decision"], 
     'Event Status': ["Completed","Completed","CANCELED","Completed","Completed","Completed","Completed","Completed","Completed","Completed","Completed","CANCELED","Manually Skipped","Completed","Completed","Completed","Completed","Completed","Completed","CANCELED","Completed","Completed","Completed","Completed"], 
     'DESIRED FLAG': ["Keep","Keep","Remove","Remove","Remove","Keep","Keep","Keep","Keep","Keep","Keep","Remove","Remove","Remove","Remove","Keep","Keep","Keep","Keep","Remove","Remove","Remove","Keep","Keep"]} 
df = pd.DataFrame(data, columns=['Employee ID','Completed On Date','Event','Event Status','DESIRED FLAG']) 
df = df.sort_values(by=(['Employee ID','Completed On Date'])) 


index_list_delete = []
start_deleting = False
for i in range(len(df)):
    if not start_deleting:
        # Whenever I see a "CANCELED", I know some following rows need to be deleted
        if df.iloc[i]['Event Status'] == 'CANCELED':
            index_list_delete.append(i)
            start_deleting = True
    else:
        # Whenever I see a "Schedule Interviews", I need to stop deleting.
        # Otherwise, keep track of the rows that need to be deleted
        if df.iloc[i]['Event'] == 'Schedule Interviews':
            start_deleting = False
        else:
            index_list_delete.append(i)

# Delete the collected rows (positions refer to the sorted frame)
df = df.drop(df.index[index_list_delete])
# Reset the index
df = df.reset_index(drop=True)

You will get the following result:

    Employee ID  Completed On Date                       Event  Event Status  DESIRED FLAG
0           100         2009-01-01                    Decision     Completed          Keep
1           100         2010-01-01  On-Site Interview Kick Off     Completed          Keep
2           100         2014-01-01         Schedule Interviews     Completed          Keep
3           100         2015-01-01                    Decision     Completed          Keep
4           100         2016-01-01    Phone Interview Kick Off     Completed          Keep
5           100         2017-01-01         Schedule Interviews     Completed          Keep
6           100         2018-01-01                    Decision     Completed          Keep
7           200         2010-01-01  On-Site Interview Kick Off     Completed          Keep
8           200         2014-01-01         Schedule Interviews     Completed          Keep
9           200         2015-01-01                    Decision     Completed          Keep
10          300         2009-01-01                   Job Apply     Completed          Keep
11          300         2010-01-01    Phone Interview Kick Off     Completed          Keep
12          300         2014-01-01         Schedule Interviews     Completed          Keep
13          300         2015-01-01                    Decision     Completed          Keep
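
Since the question asks to keep the original dataframe and only add a flag column, the same loop could be adapted to record a label instead of dropping rows. The following is a minimal sketch under assumptions: the helper name `flag_rows` and the column name `FLAG` are hypothetical, and the sample frame is just the Employee ID 200 rows from the question.

```python
import pandas as pd


def flag_rows(frame):
    """Return one "Keep"/"Remove" label per row of `frame`.

    Hypothetical helper: it follows the same state machine as the loop
    in the answer, but records a label instead of collecting rows to drop.
    """
    flags = []
    removing = False
    for event, status in zip(frame["Event"], frame["Event Status"]):
        if not removing:
            # A CANCELED row starts a run of rows to remove
            if status == "CANCELED":
                removing = True
        elif event == "Schedule Interviews":
            # The next "Schedule Interviews" row ends the run and is kept
            removing = False
        flags.append("Remove" if removing else "Keep")
    return flags


# Sample rows for Employee ID 200, already sorted by date
sample = pd.DataFrame({
    "Event": ["On-Site Interview Kick Off", "Schedule Interviews", "Decision",
              "Decision", "Phone Interview Kick Off", "Schedule Interviews",
              "Decision"],
    "Event Status": ["Completed", "CANCELED", "Manually Skipped", "Completed",
                     "Completed", "Completed", "Completed"],
})
sample["FLAG"] = flag_rows(sample)
print(sample)
```

Filtering on the new column afterwards (for example `sample[sample["FLAG"] == "Keep"]`) gives the same rows as dropping them directly, while the original rows stay available for other flags.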

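I did some additional testing on the real data, and this logic does not confine itself to an Employee ID ... it should only execute your solution within each respective set of Employee IDs. – Christopher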

Below is an inelegant and partial solution. In a following step, I still need to filter out those whose last step is Schedule Interview Team .... if (df.iloc[i]['Event Status'] == 'CANCELED') and (df.iloc[i]['Employee ID'] == df.iloc[i + 1]['Employee ID']): – Christopher
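
The comments above ask for the logic to stay within each Employee ID. One possible way to do that, sketched here under assumptions, is to reset the "removing" state whenever the Employee ID changes. The function name `flag_per_employee`, the `id_col` parameter, the `FLAG` column, and the sample data are all hypothetical, and the frame is assumed to be sorted by Employee ID and date already.

```python
import pandas as pd


def flag_per_employee(frame, id_col="Employee ID"):
    """Label rows "Keep"/"Remove", restarting the logic for each employee.

    Hypothetical helper; assumes `frame` is sorted by `id_col` and date.
    """
    flags = []
    removing = False
    previous_id = None
    for emp_id, event, status in zip(
        frame[id_col], frame["Event"], frame["Event Status"]
    ):
        if emp_id != previous_id:
            # New employee: a cancellation must not leak into their rows
            removing = False
            previous_id = emp_id
        if not removing:
            if status == "CANCELED":
                removing = True
        elif event == "Schedule Interviews":
            removing = False
        flags.append("Remove" if removing else "Keep")
    return flags


# Made-up data: employee A ends in a canceled run, employee B starts fresh
sample = pd.DataFrame({
    "Employee ID": ["A", "A", "A", "B", "B", "B"],
    "Event": ["On-Site Interview Kick Off", "Schedule Interviews", "Decision",
              "Job Apply", "Phone Interview Kick Off", "Schedule Interviews"],
    "Event Status": ["Completed", "CANCELED", "Completed",
                     "Completed", "Completed", "Completed"],
})
sample["FLAG"] = flag_per_employee(sample)
print(sample)
```

With a single global loop, employee B's first two rows would be marked for removal because employee A's cancellation is still open; resetting on each ID change keeps them.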
