2017-09-13 32 views

How do I perform a groupby on a pandas DataFrame without losing the other columns?

I have a DataFrame like the one below:

df = pd.DataFrame({'sport_name': ['football','football','football','football','football','football','football','football','basketball','basketball'], 
      'person_name': ['ramesh','ramesh','ramesh','ramesh','ramesh','ramesh','ramesh','ramesh','mahesh','mahesh'], 
       'city': ['mumbai', 'mumbai','delhi','delhi','mumbai', 'mumbai','delhi','delhi','pune','nagpur'], 
     'person_symbol': ['ram','mum','mum','ram','ram','mum','mum','ram','mah','mah'], 
     'person_count': ['10','14','25','20','34','23','43','34','10','20'], 
     'month': ['2017-01-23','2017-01-23','2017-01-23','2017-01-23','2017-02-26','2017-02-26','2017-02-26','2017-02-26','2017-03-03','2017-03-03'], 
     'sir': ['a','a','a','a','b','b','b','b','c','c']}) 
df = df[['sport_name','person_name','city','person_symbol','person_count','month','sir']] 

print(df)

    sport_name person_name city person_symbol person_count  month sir 
0 football  ramesh mumbai   ram   10 2017-01-23 a 
1 football  ramesh mumbai   mum   14 2017-01-23 a 
2 football  ramesh delhi   mum   25 2017-01-23 a 
3 football  ramesh delhi   ram   20 2017-01-23 a 
4 football  ramesh mumbai   ram   34 2017-02-26 b 
5 football  ramesh mumbai   mum   23 2017-02-26 b 
6 football  ramesh delhi   mum   43 2017-02-26 b 
7 football  ramesh delhi   ram   34 2017-02-26 b 
8 basketball  mahesh pune   mah   10 2017-03-03 c 
9 basketball  mahesh nagpur   mah   20 2017-03-03 c 

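From this DataFrame, I want to create a new DataFrame with two columns named "derived_symbol" and "person_count". To build it, I need to follow these conditions:

  • derived_symbol needs to be formed for each unique city and each unique person_symbol.
  • person_count is calculated according to what the derived_symbol is.

I did the above as follows, and it works fine: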

df = pd.DataFrame({'sport_name': ['football','football','football','football','football','football','football','football','basketball','basketball'], 
      'person_name': ['ramesh','ramesh','ramesh','ramesh','ramesh','ramesh','ramesh','ramesh','mahesh','mahesh'], 
       'city': ['mumbai', 'mumbai','delhi','delhi','mumbai', 'mumbai','delhi','delhi','pune','nagpur'], 
     'person_symbol': ['ram','mum','mum','ram','ram','mum','mum','ram','mah','mah'], 
     'person_count': ['10','14','25','20','34','23','43','34','10','20'], 
     'month': ['2017-01-23','2017-01-23','2017-01-23','2017-01-23','2017-02-26','2017-02-26','2017-02-26','2017-02-26','2017-03-03','2017-03-03'], 
     'sir': ['a','a','a','a','b','b','b','b','c','c']}) 
df = df[['sport_name','person_name','city','person_symbol','person_count','month','sir']] 

df['person_count'] = df['person_count'].astype(int) 

df1=df.set_index(['sport_name','person_name','person_count','month','sir']).stack().reset_index(name='val') 

df1['derived_symbol'] = df1['sport_name'] + '.' + df1['person_name'] + '.TOTAL.' + df1['val'] + '_count' 

df2 = df1.groupby(['derived_symbol','month','sir','person_name'])['person_count'].sum().reset_index(name='person_count') 
print (df2) 

Output of the code above:

  derived_symbol     month  sir sport_name person_name person_count 
0  basketball.mahesh.TOTAL.mah_count 2017-03-03 c basketball mahesh   30 
1 basketball.mahesh.TOTAL.nagpur_count 2017-03-03 c basketball mahesh   20 
2  basketball.mahesh.TOTAL.pune_count 2017-03-03 c basketball mahesh   10 
3  football.ramesh.TOTAL.delhi_count 2017-01-23 a football ramesh   45 
4  football.ramesh.TOTAL.delhi_count 2017-02-26 b football ramesh   77 
5  football.ramesh.TOTAL.mum_count 2017-01-23 a football ramesh   39 
6  football.ramesh.TOTAL.mum_count 2017-02-26 b football ramesh   66 
7  football.ramesh.TOTAL.mumbai_count 2017-01-23 a football ramesh   24 
8  football.ramesh.TOTAL.mumbai_count 2017-02-26 b football ramesh   57 
9  football.ramesh.TOTAL.ram_count 2017-01-23 a football ramesh   30 
10  football.ramesh.TOTAL.ram_count 2017-02-26 b football ramesh   68 

However, I want the DataFrame to have two more columns, "city" and "person_symbol", like below:

      derived_symbol  month sir person_name sport_name person_count city  person_symbol 
0  basketball.mahesh.TOTAL.mah_count 2017-03-03 c  mahesh basketball 30   NO_ENTRY  mah 
1 basketball.mahesh.TOTAL.nagpur_count 2017-03-03 c  mahesh basketball 20   nagpur  NO_ENTRY 
2  basketball.mahesh.TOTAL.pune_count 2017-03-03 c  mahesh basketball 10   pune  NO_ENTRY 
3  football.ramesh.TOTAL.delhi_count 2017-01-23 a  ramesh football  45   delhi  NO_ENTRY 
4  football.ramesh.TOTAL.delhi_count 2017-02-26 b  ramesh football  77   delhi  NO_ENTRY 
5  football.ramesh.TOTAL.mum_count 2017-01-23 a  ramesh football  39   NO_ENTRY mum 
6  football.ramesh.TOTAL.mum_count 2017-02-26 b  ramesh football  66   NO_ENTRY mum 
7  football.ramesh.TOTAL.mumbai_count 2017-01-23 a  ramesh football  24   mumbai  NO_ENTRY 
8  football.ramesh.TOTAL.mumbai_count 2017-02-26 b  ramesh football  57   mumbai  NO_ENTRY 
9  football.ramesh.TOTAL.ram_count 2017-01-23 a  ramesh football  30   NO_ENTRY ram 
10  football.ramesh.TOTAL.ram_count 2017-02-26 b  ramesh football  68   NO_ENTRY ram 

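The logic behind these two columns is:

  • If the current row was created for a city, the city column contains the city value and person_symbol contains "NO_ENTRY".
  • If the current row was created for a particular symbol, the person_symbol column contains the person_symbol value and city contains "NO_ENTRY".

How can I do this kind of data manipulation without losing my previous behavior?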

Answer


You can first add the level_5 and val columns to the groupby:

df2 = df1.groupby(['derived_symbol', 
        'month','sir', 
        'person_name', 
        'level_5', 
        'val'])['person_count'].sum().reset_index(name='person_count') 
print (df2) 
          derived_symbol  month sir person_name \ 
0  basketball.mahesh.TOTAL.mah_count 2017-03-03 c  mahesh 
1 basketball.mahesh.TOTAL.nagpur_count 2017-03-03 c  mahesh 
2  basketball.mahesh.TOTAL.pune_count 2017-03-03 c  mahesh 
3  football.ramesh.TOTAL.delhi_count 2017-01-23 a  ramesh 
4  football.ramesh.TOTAL.delhi_count 2017-02-26 b  ramesh 
5  football.ramesh.TOTAL.mum_count 2017-01-23 a  ramesh 
6  football.ramesh.TOTAL.mum_count 2017-02-26 b  ramesh 
7  football.ramesh.TOTAL.mumbai_count 2017-01-23 a  ramesh 
8  football.ramesh.TOTAL.mumbai_count 2017-02-26 b  ramesh 
9  football.ramesh.TOTAL.ram_count 2017-01-23 a  ramesh 
10  football.ramesh.TOTAL.ram_count 2017-02-26 b  ramesh 

      level_5  val person_count 
0 person_symbol  mah   30 
1   city nagpur   20 
2   city pune   10 
3   city delhi   45 
4   city delhi   77 
5 person_symbol  mum   39 
6 person_symbol  mum   66 
7   city mumbai   24 
8   city mumbai   57 
9 person_symbol  ram   30 
10 person_symbol  ram   68 
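
The level_5 column above is the name pandas generates for the unnamed index level that stack() creates. As a hedged alternative sketch, not part of the original answer, melt() can build the same long format with an explicit column name. The name source_col is an assumption made for this example, and the data is a small stand-in for the question's DataFrame:

```python
import pandas as pd

# Small stand-in for the question's data (a subset of rows, for illustration).
df = pd.DataFrame({
    'sport_name': ['football', 'football', 'basketball'],
    'person_name': ['ramesh', 'ramesh', 'mahesh'],
    'city': ['mumbai', 'delhi', 'pune'],
    'person_symbol': ['ram', 'mum', 'mah'],
    'person_count': [10, 25, 10],
    'month': ['2017-01-23', '2017-01-23', '2017-03-03'],
    'sir': ['a', 'a', 'c'],
})

# melt() replaces set_index().stack() here; 'source_col' is a hypothetical
# name chosen for this sketch and plays the role of 'level_5'.
long_df = df.melt(
    id_vars=['sport_name', 'person_name', 'person_count', 'month', 'sir'],
    value_vars=['city', 'person_symbol'],
    var_name='source_col',
    value_name='val',
)

long_df['derived_symbol'] = (long_df['sport_name'] + '.' + long_df['person_name']
                             + '.TOTAL.' + long_df['val'] + '_count')

# Grouping by source_col and val as well keeps both columns in the result.
totals = (long_df
          .groupby(['derived_symbol', 'month', 'sir', 'person_name',
                    'source_col', 'val'])['person_count']
          .sum()
          .reset_index(name='person_count'))
print(totals)
```

Naming the column explicitly means later steps do not depend on how many index levels were set before stacking.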

Then reshape back with unstack, and replace the missing values with NO_ENTRY using fillna:

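# Parentheses replace the backslash continuations; axis is passed by keyword
# because recent pandas versions no longer accept it positionally.
df3 = (df2.set_index(['derived_symbol',
                      'month',
                      'sir',
                      'person_name',
                      'person_count',
                      'level_5'])['val']
          .unstack()
          .fillna('NO_ENTRY')
          .rename_axis(None, axis=1)
          .reset_index())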

print (df3) 
          derived_symbol  month sir person_name \ 
0  basketball.mahesh.TOTAL.mah_count 2017-03-03 c  mahesh 
1 basketball.mahesh.TOTAL.nagpur_count 2017-03-03 c  mahesh 
2  basketball.mahesh.TOTAL.pune_count 2017-03-03 c  mahesh 
3  football.ramesh.TOTAL.delhi_count 2017-01-23 a  ramesh 
4  football.ramesh.TOTAL.delhi_count 2017-02-26 b  ramesh 
5  football.ramesh.TOTAL.mum_count 2017-01-23 a  ramesh 
6  football.ramesh.TOTAL.mum_count 2017-02-26 b  ramesh 
7  football.ramesh.TOTAL.mumbai_count 2017-01-23 a  ramesh 
8  football.ramesh.TOTAL.mumbai_count 2017-02-26 b  ramesh 
9  football.ramesh.TOTAL.ram_count 2017-01-23 a  ramesh 
10  football.ramesh.TOTAL.ram_count 2017-02-26 b  ramesh 

    person_count  city person_symbol 
0    30 NO_ENTRY   mah 
1    20 nagpur  NO_ENTRY 
2    10  pune  NO_ENTRY 
3    45  delhi  NO_ENTRY 
4    77  delhi  NO_ENTRY 
5    39 NO_ENTRY   mum 
6    66 NO_ENTRY   mum 
7    24 mumbai  NO_ENTRY 
8    57 mumbai  NO_ENTRY 
9    30 NO_ENTRY   ram 
10   68 NO_ENTRY   ram 
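
The NO_ENTRY rule stated in the question can also be written directly, without reshaping. The following is a minimal sketch under the assumption that df2 has the level_5 and val columns shown above. It swaps unstack() for Series.where and uses two rows taken from the sample output:

```python
import pandas as pd

# Two rows in the shape of df2 from the answer, taken from the sample output.
df2 = pd.DataFrame({
    'derived_symbol': ['basketball.mahesh.TOTAL.mah_count',
                       'basketball.mahesh.TOTAL.pune_count'],
    'level_5': ['person_symbol', 'city'],
    'val': ['mah', 'pune'],
    'person_count': [30, 10],
})

# Series.where keeps val where the condition holds and uses 'NO_ENTRY' elsewhere.
for col in ['city', 'person_symbol']:
    df2[col] = df2['val'].where(df2['level_5'] == col, 'NO_ENTRY')

result = df2.drop(columns=['level_5', 'val'])
print(result)
```

Because no index is built, this variant does not depend on the index being unique or on the order of its levels.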

@jezrael - I am seeing hundreds of different columns when I run the last unstack() step, i.e. df2.set_index(['derived_symbol', 'month', 'sir', 'person_name', 'person_count', 'level_5'])['val'].unstack(), on the real data? – kit


Oops, is there some difference from your sample? – jezrael


@jezrael - I tried the same command, df2.set_index(['derived_symbol', 'month', 'sir', 'person_name', 'level_5', 'person_count'])['val'].unstack(), with the argument positions changed. That was the problem. Solved it. – kit
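
The comments suggest that the result of unstack() depends on the order of the index levels. A minimal sketch of that behavior, using made-up rows rather than the real data: called without arguments, unstack() pivots the innermost index level, while passing the level name pivots that level wherever it sits.

```python
import pandas as pd

# Made-up rows for illustration only.
df2 = pd.DataFrame({
    'derived_symbol': ['x.city_a_count', 'x.sym_b_count'],
    'person_count': [10, 20],
    'level_5': ['city', 'person_symbol'],
    'val': ['city_a', 'sym_b'],
})

# level_5 is deliberately NOT the last index level here.
indexed = df2.set_index(['derived_symbol', 'level_5', 'person_count'])['val']

# Default: the innermost level (person_count) becomes the columns,
# giving one column per distinct count.
by_position = indexed.unstack()

# Explicit: level_5 becomes the columns regardless of its position.
by_name = indexed.unstack('level_5')

print(list(by_position.columns))
print(list(by_name.columns))
```

With many distinct person_count values, the default call would produce one column per value, which may be what shows up as hundreds of columns.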