2017-05-15 69 views

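Calculating the number of consecutive days and missing days in time-series data

I have a dataframe that looks like this (normally it has many users):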

userid | activityday 
222  2015-01-09 12:00 
222  2015-01-10 12:00 
222  2015-01-11 12:00 
222  2015-01-13 12:00 
222  2015-01-14 12:00 
222  2015-01-15 12:00 
222  2015-01-17 12:00 
222  2015-01-18 12:00 
222  2015-01-19 12:00 
222  2015-01-20 12:00 
222  2015-01-20 12:00 

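I want to get the number of consecutive active and inactive days, plus the total number of active and inactive days, up to a given date. For example, if the given date is 2015-01-23, then: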

userid | days_active_jb | days_inactive_jb | ttl_days_active | ttl_days_inactive
222    | 3              | 2                | 10              | 2

Or, if the given date is 2015-01-15, then:

userid | days_active_jb | days_inactive_jb | ttl_days_active | ttl_days_inactive
222    | 2              | 0                | 5               | 1

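I have around 300,000 rows to process to get this final dataframe. I am wondering what would be an efficient way to achieve this. Any ideas?

Here is an explanation of each column:

days_active_jb: the number of consecutive days on which the student was active just before the given date.

days_inactive_jb: the number of consecutive days on which the student had no activity just before the given date.

ttl_days_active: the number of days before the given date on which the student had activity.

ttl_days_inactive: the number of days before the given date on which the student had no activity.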

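How do you define days_active_jb and days_inactive_jb? If days_inactive_jb is the number of gaps of more than 1 day, shouldn't the second example have 1 for days_inactive_jb? – Allen

@Allen Thanks for your answer. I have added an explanation of the columns. I will try your solution soon and let you know. – renakre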

Answers

Setup:

df 
Out[1714]: 
    userid   activityday 
0  222 2015-01-09 12:00:00 
1  222 2015-01-10 12:00:00 
2  222 2015-01-11 12:00:00 
3  222 2015-01-13 12:00:00 
4  222 2015-01-14 12:00:00 
5  222 2015-01-15 12:00:00 
6  222 2015-01-17 12:00:00 
7  222 2015-01-18 12:00:00 
8  222 2015-01-19 12:00:00 
9  222 2015-01-20 12:00:00 
11  322 2015-01-09 12:00:00 
12  322 2015-01-10 12:00:00 
13  322 2015-01-11 12:00:00 
14  322 2015-01-13 12:00:00 
15  322 2015-01-14 12:00:00 
16  322 2015-01-15 12:00:00 
17  322 2015-01-17 12:00:00 
18  322 2015-01-18 12:00:00 
19  322 2015-01-19 12:00:00 
20  322 2015-01-20 12:00:00 
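
The setup above only shows the printed frame. A minimal sketch for rebuilding it is given below, assuming both users share the same ten active days, as in the output shown; the variable names `days` and `rows` are illustrative and not part of the original answer.

```python
import pandas as pd

# Active days shared by both sample users (2015-01-12 and 2015-01-16 are gaps).
days = ["2015-01-09", "2015-01-10", "2015-01-11", "2015-01-13", "2015-01-14",
        "2015-01-15", "2015-01-17", "2015-01-18", "2015-01-19", "2015-01-20"]

# One (userid, timestamp string) pair per user and day.
rows = [(uid, day + " 12:00") for uid in (222, 322) for day in days]
df = pd.DataFrame(rows, columns=["userid", "activityday"])

# Parse the strings into real timestamps so that date comparisons work.
df["activityday"] = pd.to_datetime(df["activityday"])

print(df.head())
```

Unlike the printed frame, this version uses a default 0..19 index; the index values play no role in the solution.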

Solution:

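import numpy as np
import pandas as pd

def days_active_jb(x):
    # Length of the run of consecutive active days that ends at the
    # last active day before the cut-off date.
    x = x[x < pd.to_datetime(cut_off_days)]
    if len(x) == 0:
        return 0
    # Walk backwards from the most recent day while the days stay adjacent.
    x = [e.date() for e in x.sort_values(ascending=False)]
    prev = x.pop(0)
    i = 1
    for e in x:
        if (prev - e).days == 1:
            i += 1
            prev = e
        else:
            break
    return i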

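def days_inactive_jb(x):
    # Whole days between the last activity before the cut-off and the cut-off.
    # Filtering first keeps activity after the cut-off from forcing a 0.
    cut_off = pd.to_datetime(cut_off_days)
    x = x[x < cut_off]
    if len(x) == 0:
        return 0
    return (cut_off - x.max()).days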

def ttl_days_active(x):
    # Number of activity rows before the cut-off (duplicates are dropped below).
    return len(x[x < pd.to_datetime(cut_off_days)])

def ttl_days_inactive(x):
    # Count the missing days between the first and last activity dates.
    x = x[x < pd.to_datetime(cut_off_days)]
    if len(x) == 0:
        return 0
    return len(pd.date_range(min(x), max(x))) - len(x)

# Drop duplicate userid-activityday pairs.
df = df.drop_duplicates(subset=['userid', 'activityday'])

cut_off_days = '2015-01-23'
(df.sort_values(by=['userid', 'activityday'], ascending=False)
   .groupby('userid')['activityday']
   .agg([days_active_jb,
         days_inactive_jb,
         ttl_days_active,
         ttl_days_inactive])
   .astype(np.int64))

Out[1856]: 
        days_active_jb  days_inactive_jb  ttl_days_active  ttl_days_inactive
userid
222                  4                 2               10                  2
322                  4                 2               10                  2


cut_off_days = '2015-01-15'
(df.sort_values(by=['userid', 'activityday'], ascending=False)
   .groupby('userid')['activityday']
   .agg([days_active_jb,
         days_inactive_jb,
         ttl_days_active,
         ttl_days_inactive])
   .astype(np.int64))

Out[1863]: 
        days_active_jb  days_inactive_jb  ttl_days_active  ttl_days_inactive
userid
222                  2                 0                5                  1
322                  2                 0                5                  1
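
Since the question mentions roughly 300,000 rows, calling four Python functions per group may become slow. Below is a rough vectorized sketch of the same four metrics. It is not part of the original answer: the function name `activity_summary` is hypothetical, activity is reduced to calendar days, and users with no activity before the cut-off are simply left out of the result.

```python
import pandas as pd

def activity_summary(df, cut_off):
    """Per-user activity metrics over calendar days strictly before cut_off."""
    cut = pd.Timestamp(cut_off).normalize()

    # One row per user and calendar day, strictly before the cut-off.
    d = df.assign(day=df["activityday"].dt.normalize())
    d = d.loc[d["day"] < cut, ["userid", "day"]].drop_duplicates()
    d = d.sort_values(["userid", "day"])

    # A new streak starts whenever the gap to the user's previous day is not 1 day
    # (the first row of each user has a NaT gap, which also starts a streak).
    gap = d.groupby("userid")["day"].diff()
    new_streak = (gap != pd.Timedelta(days=1)).astype(int)
    d["streak"] = new_streak.groupby(d["userid"]).cumsum()

    # Rows that belong to each user's most recent streak.
    last_streak = d.groupby("userid")["streak"].transform("max")
    in_last = d["streak"] == last_streak

    g = d.groupby("userid")["day"]
    first, last, active = g.min(), g.max(), g.size()

    return pd.DataFrame({
        "days_active_jb": in_last.groupby(d["userid"]).sum(),
        # Full calendar days between the last active day and the cut-off.
        "days_inactive_jb": (cut - last).dt.days - 1,
        "ttl_days_active": active,
        # Missing days between the first and last active day.
        "ttl_days_inactive": (last - first).dt.days + 1 - active,
    })

# Sample data from the question, including the duplicated 2015-01-20 row.
days = ["2015-01-09", "2015-01-10", "2015-01-11", "2015-01-13", "2015-01-14",
        "2015-01-15", "2015-01-17", "2015-01-18", "2015-01-19", "2015-01-20",
        "2015-01-20"]
df = pd.DataFrame({
    "userid": [222] * len(days),
    "activityday": pd.to_datetime([day + " 12:00" for day in days]),
})

print(activity_summary(df, "2015-01-23"))
print(activity_summary(df, "2015-01-15"))
```

The cut-off is passed as an argument rather than read from a global, so the same frame can be summarized for several dates without redefining anything.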
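
I am interested in the days of consecutive activity just before the given date. For example, if the student was active only on the last 3 days before the given date, the returned value should be 3. It should not matter whether they had activity 5 days earlier, and so on. Do you think 'days_active_jb(x):' checks this? – renakre

@renakre, I have updated the code. If the cut-off date is 2015-01-23, then days_active_jb should be 4, because the dates from '2015-01-17' to '2015-01-20' span 4 days. – Allen

Thank you, sir! – renakre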

'''
This code works for different user ids in the same file.
The data must be strictly in the format you provided.
'''
import datetime

# Read the file into a list of [uid, date, time] rows, skipping blank lines.
with open('c:/python34/stack.txt') as f:
    data = [line.split() for line in f if line.strip()]
data.pop(0)  # drop the first row, i.e. the header


def active_dates(active_list, uid):
    '''Return a list of [year, month, day] lists holding the active dates
    of the given user id `uid`.'''
    return [[int(part) for part in row[1].split('-')]
            for row in active_list if row[0] == uid]


def active_days(from_, to, dates):
    # Return the number of active days from the start date `from_`
    # (inclusive) up to the end date `to` (exclusive), so that the
    # window has exactly (to - from_).days days.
    count = 0
    for item in dates:
        d1 = datetime.date(item[0], item[1], item[2])
        if from_ <= d1 < to:
            count += 1
    return count


def remove_duplicates(lst):
    # Remove the duplicates that arise when a user is active at
    # different times on the same day.
    lst.sort()
    i = len(lst) - 1
    while i > 0:
        if lst[i] == lst[i - 1]:
            lst.pop(i)
        i -= 1
    return lst


active = remove_duplicates(active_dates(data, '222'))  # pass the uid as a string
from_ = datetime.date(2015, 1, 1)
to = datetime.date(2015, 1, 26)
activedays = active_days(from_, to, active)
total_days = to - from_
inactive_days = total_days.days - activedays
print('activedays: %s and inactive days: %s' % (activedays, inactive_days))
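
The script above reports totals only. As a complement, here is a standard-library sketch of the two "just before" counts from the question. The function name `streak_before` is hypothetical, and returning (0, 0) for a user with no activity before the cut-off is an assumption, since the question does not cover that case.

```python
import datetime

def streak_before(active, cut_off):
    """Return (days_active_jb, days_inactive_jb) for an iterable of active dates."""
    # Keep only days strictly before the cut-off; the set also removes duplicates.
    days = {d for d in active if d < cut_off}
    if not days:
        return 0, 0  # assumed convention for users with no earlier activity

    last = max(days)
    # Full calendar days between the last active day and the cut-off.
    inactive = (cut_off - last).days - 1

    # Walk backwards from the last active day while each day is present.
    one_day = datetime.timedelta(days=1)
    run, day = 0, last
    while day in days:
        run += 1
        day -= one_day
    return run, inactive

# Active days of user 222 from the question (all in January 2015).
active = [datetime.date(2015, 1, n) for n in (9, 10, 11, 13, 14, 15, 17, 18, 19, 20, 20)]

print(streak_before(active, datetime.date(2015, 1, 23)))
print(streak_before(active, datetime.date(2015, 1, 15)))
```

Using a set makes each membership check constant time, so the backwards walk costs only as many steps as the streak is long.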