How can I perform a groupby on a pandas DataFrame without losing the other columns?
I have a DataFrame like the one below:
import pandas as pd

df = pd.DataFrame({'sport_name': ['football','football','football','football','football','football','football','football','basketball','basketball'],
'person_name': ['ramesh','ramesh','ramesh','ramesh','ramesh','ramesh','ramesh','ramesh','mahesh','mahesh'],
'city': ['mumbai', 'mumbai','delhi','delhi','mumbai', 'mumbai','delhi','delhi','pune','nagpur'],
'person_symbol': ['ram','mum','mum','ram','ram','mum','mum','ram','mah','mah'],
'person_count': ['10','14','25','20','34','23','43','34','10','20'],
'month': ['2017-01-23','2017-01-23','2017-01-23','2017-01-23','2017-02-26','2017-02-26','2017-02-26','2017-02-26','2017-03-03','2017-03-03'],
'sir': ['a','a','a','a','b','b','b','b','c','c']})
df = df[['sport_name','person_name','city','person_symbol','person_count','month','sir']]
print(df)
   sport_name  person_name    city  person_symbol  person_count       month  sir
0    football       ramesh  mumbai            ram            10  2017-01-23    a
1    football       ramesh  mumbai            mum            14  2017-01-23    a
2    football       ramesh   delhi            mum            25  2017-01-23    a
3    football       ramesh   delhi            ram            20  2017-01-23    a
4    football       ramesh  mumbai            ram            34  2017-02-26    b
5    football       ramesh  mumbai            mum            23  2017-02-26    b
6    football       ramesh   delhi            mum            43  2017-02-26    b
7    football       ramesh   delhi            ram            34  2017-02-26    b
8  basketball       mahesh    pune            mah            10  2017-03-03    c
9  basketball       mahesh  nagpur            mah            20  2017-03-03    c
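From this DataFrame, I want to create a DataFrame with two columns named "derived_symbol" and "person_count". To build it, I need to follow these conditions:
- derived_symbol needs to be formed for each unique city and each unique person_symbol.
- person_count is calculated based on what the derived_symbol is.
I did the following to achieve this, and it works fine: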
df = pd.DataFrame({'sport_name': ['football','football','football','football','football','football','football','football','basketball','basketball'],
'person_name': ['ramesh','ramesh','ramesh','ramesh','ramesh','ramesh','ramesh','ramesh','mahesh','mahesh'],
'city': ['mumbai', 'mumbai','delhi','delhi','mumbai', 'mumbai','delhi','delhi','pune','nagpur'],
'person_symbol': ['ram','mum','mum','ram','ram','mum','mum','ram','mah','mah'],
'person_count': ['10','14','25','20','34','23','43','34','10','20'],
'month': ['2017-01-23','2017-01-23','2017-01-23','2017-01-23','2017-02-26','2017-02-26','2017-02-26','2017-02-26','2017-03-03','2017-03-03'],
'sir': ['a','a','a','a','b','b','b','b','c','c']})
df = df[['sport_name','person_name','city','person_symbol','person_count','month','sir']]
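# person_count is stored as strings; convert it so it can be summed
df['person_count'] = df['person_count'].astype(int)
# stack city and person_symbol into a single 'val' column
df1 = df.set_index(['sport_name','person_name','person_count','month','sir']).stack().reset_index(name='val')
# build the symbol from the sport, the person and the stacked value
df1['derived_symbol'] = df1['sport_name'] + '.' + df1['person_name'] + '.TOTAL.' + df1['val'] + '_count'
# sum person_count per derived symbol (sport_name is a key so it stays in the output)
df2 = df1.groupby(['derived_symbol','month','sir','sport_name','person_name'])['person_count'].sum().reset_index(name='person_count')
print(df2)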
Output of the above code:
                          derived_symbol       month  sir  sport_name  person_name  person_count
0      basketball.mahesh.TOTAL.mah_count  2017-03-03    c  basketball       mahesh            30
1   basketball.mahesh.TOTAL.nagpur_count  2017-03-03    c  basketball       mahesh            20
2     basketball.mahesh.TOTAL.pune_count  2017-03-03    c  basketball       mahesh            10
3      football.ramesh.TOTAL.delhi_count  2017-01-23    a    football       ramesh            45
4      football.ramesh.TOTAL.delhi_count  2017-02-26    b    football       ramesh            77
5        football.ramesh.TOTAL.mum_count  2017-01-23    a    football       ramesh            39
6        football.ramesh.TOTAL.mum_count  2017-02-26    b    football       ramesh            66
7     football.ramesh.TOTAL.mumbai_count  2017-01-23    a    football       ramesh            24
8     football.ramesh.TOTAL.mumbai_count  2017-02-26    b    football       ramesh            57
9        football.ramesh.TOTAL.ram_count  2017-01-23    a    football       ramesh            30
10       football.ramesh.TOTAL.ram_count  2017-02-26    b    football       ramesh            68
However, I want the DataFrame to have two more columns, "city" and "person_symbol", like this:
                          derived_symbol       month  sir  person_name  sport_name  person_count      city  person_symbol
0      basketball.mahesh.TOTAL.mah_count  2017-03-03    c       mahesh  basketball            30  NO_ENTRY            mah
1   basketball.mahesh.TOTAL.nagpur_count  2017-03-03    c       mahesh  basketball            20    nagpur       NO_ENTRY
2     basketball.mahesh.TOTAL.pune_count  2017-03-03    c       mahesh  basketball            10      pune       NO_ENTRY
3      football.ramesh.TOTAL.delhi_count  2017-01-23    a       ramesh    football            45     delhi       NO_ENTRY
4      football.ramesh.TOTAL.delhi_count  2017-02-26    b       ramesh    football            77     delhi       NO_ENTRY
5        football.ramesh.TOTAL.mum_count  2017-01-23    a       ramesh    football            39  NO_ENTRY            mum
6        football.ramesh.TOTAL.mum_count  2017-02-26    b       ramesh    football            66  NO_ENTRY            mum
7     football.ramesh.TOTAL.mumbai_count  2017-01-23    a       ramesh    football            24    mumbai       NO_ENTRY
8     football.ramesh.TOTAL.mumbai_count  2017-02-26    b       ramesh    football            57    mumbai       NO_ENTRY
9        football.ramesh.TOTAL.ram_count  2017-01-23    a       ramesh    football            30  NO_ENTRY            ram
10       football.ramesh.TOTAL.ram_count  2017-02-26    b       ramesh    football            68  NO_ENTRY            ram
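The logic behind these two columns is:
- If the current row was created for a city, the city column contains the city value and person_symbol contains "NO_ENTRY".
- If the current row was created for a particular symbol, the person_symbol column contains the person_symbol value and city contains "NO_ENTRY".
How can I do this kind of data manipulation without losing my previous behavior?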
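
One way this could be approached, as a sketch rather than a definitive answer: remember which source column each stacked value came from, group on that as well, and fill the two new columns from it afterwards. The version below uses `melt` in place of `set_index(...).stack()`. The names `source`, `long_df` and `out` are placeholders, and it assumes a city never has exactly the same text as a person_symbol within one group (if it did, the two would be summed separately here, unlike in the code above).

```python
import pandas as pd

df = pd.DataFrame({
    'sport_name': ['football'] * 8 + ['basketball'] * 2,
    'person_name': ['ramesh'] * 8 + ['mahesh'] * 2,
    'city': ['mumbai', 'mumbai', 'delhi', 'delhi', 'mumbai', 'mumbai', 'delhi', 'delhi', 'pune', 'nagpur'],
    'person_symbol': ['ram', 'mum', 'mum', 'ram', 'ram', 'mum', 'mum', 'ram', 'mah', 'mah'],
    'person_count': ['10', '14', '25', '20', '34', '23', '43', '34', '10', '20'],
    'month': ['2017-01-23'] * 4 + ['2017-02-26'] * 4 + ['2017-03-03'] * 2,
    'sir': ['a'] * 4 + ['b'] * 4 + ['c'] * 2,
})
df['person_count'] = df['person_count'].astype(int)

# Long format: one row per (original row, source column), where the
# hypothetical 'source' column records whether 'val' is a city or a symbol.
long_df = df.melt(
    id_vars=['sport_name', 'person_name', 'person_count', 'month', 'sir'],
    value_vars=['city', 'person_symbol'],
    var_name='source',
    value_name='val',
)
long_df['derived_symbol'] = (
    long_df['sport_name'] + '.' + long_df['person_name'] + '.TOTAL.' + long_df['val'] + '_count'
)

# Keep 'source' and 'val' among the group keys so they survive the aggregation.
keys = ['derived_symbol', 'month', 'sir', 'sport_name', 'person_name', 'source', 'val']
out = long_df.groupby(keys, as_index=False)['person_count'].sum()

# Fill each new column from 'val' only when the row came from that column.
out['city'] = out['val'].where(out['source'] == 'city', 'NO_ENTRY')
out['person_symbol'] = out['val'].where(out['source'] == 'person_symbol', 'NO_ENTRY')
out = out.drop(columns=['source', 'val'])
print(out)
```

On the sample data this gives 11 rows, with `city` filled for the city-derived rows and `person_symbol` filled for the symbol-derived rows.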
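@jezrael - I am seeing hundreds of different columns when doing the last step, unstack(), like df2.set_index(['derived_symbol', 'month', 'sir', 'person_name', 'person_count', 'level_5'])['val'].unstack() on real-time data? – kit
Oops, is there some difference in your sample? – jezrael
@jezrael - I tried the same command, df2.set_index(['derived_symbol', 'month', 'sir', 'person_name', 'level_5', 'person_count'])['val'].unstack(), with the argument positions changed. That was the problem. Solved it. – kit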
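
For context on the comment exchange above: called with no arguments, `unstack()` pivots the innermost (last) level of the index into columns, so the order of the keys passed to `set_index` decides what ends up as a column. A small sketch with a three-row excerpt (the frame `df2_long` is illustrative, not the real data) shows the difference between ending the key list with 'level_5' and ending it with 'person_count':

```python
import pandas as pd

# Illustrative excerpt: 'level_5' holds the source column name and 'val' its value.
df2_long = pd.DataFrame({
    'derived_symbol': [
        'football.ramesh.TOTAL.delhi_count',
        'football.ramesh.TOTAL.ram_count',
        'football.ramesh.TOTAL.ram_count',
    ],
    'month': ['2017-01-23', '2017-01-23', '2017-02-26'],
    'person_count': [45, 30, 68],
    'level_5': ['city', 'person_symbol', 'person_symbol'],
    'val': ['delhi', 'ram', 'ram'],
})

# 'level_5' is the last index level, so it becomes the columns.
by_source = (
    df2_long.set_index(['derived_symbol', 'month', 'person_count', 'level_5'])['val']
    .unstack()
    .fillna('NO_ENTRY')
)

# 'person_count' is the last index level, so every distinct count becomes a column.
by_count = (
    df2_long.set_index(['derived_symbol', 'month', 'level_5', 'person_count'])['val']
    .unstack()
)

print(list(by_source.columns))
print(list(by_count.columns))
```

With 'level_5' last, the result has one column per source column (`city`, `person_symbol`). With 'person_count' last, it has one column per distinct count, which on real data can easily run into the hundreds.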