import numpy as np
import pandas as pd

results = pd.DataFrame({'Contractor':[1,1,0,0,0,1],
                        'President':[1,0,0,0,1,1],
                        'Item 1':[1,1,0,0,1,np.nan],
                        'Item 2':[1,0,0,1,0,1]})
reference = pd.DataFrame({'Position':['Contractor','President'],
                          'Item(s)':[(1,), (1,2)]})

# one row per (Item, Position) pair listed in reference
longref = pd.DataFrame([('Item {}'.format(item), row['Position'])
                        for index, row in reference.iterrows()
                        for item in row['Item(s)']], columns=['Item', 'Position'])

# wide -> long: fold the Contractor/President columns into Position
melted = pd.melt(results, id_vars=['Item 1','Item 2'], var_name='Position')
melted = melted.loc[melted['value']==1]
# drop the old 'value' column first: recent pandas versions refuse to melt
# a frame that already has a column named like the new value column
melted = pd.melt(melted.drop(columns='value'), id_vars=['Position'],
                 value_vars=['Item 1','Item 2'], var_name='Item')

merged = pd.merge(longref, melted, how='left')
grouped = merged.groupby(['Position'])
result = (grouped['value'].sum()/grouped['value'].count())*100
result = result.rename('Overall%').reset_index()
print(result)
yields
Position Overall%
0 Contractor 100.0
1 President 80.0
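Explanation: There is an article by Hadley Wickham (PDF) that espouses the virtues of making data "tidy". The main principle is that each row should represent an "observation" and each column should represent some factor or variable.
Often you will find that once the data is tidy, the tools you need to express the computation fall into place quite naturally. The difficulty of this problem arises mainly from the data not being tidy.
Consider results: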
In [405]: results
Out[405]:
Contractor Item 1 Item 2 President
0 1 1.0 1 1
1 1 1.0 0 0
2 0 0.0 0 0
3 0 0.0 1 0
4 0 1.0 0 1
5 1 NaN 1 1
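Instead of having separate columns for Contractor and President, it would be better to have one column called Position, since Position is the variable, and each observation (row) can take one value for Position: either Contractor or President. Similarly, Item 1 and Item 2 should be coalesced into a single column, Item: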
In [416]: melted
Out[416]:
Position Item value
0 Contractor Item 1 1.0
1 Contractor Item 1 1.0
2 Contractor Item 1 NaN
3 President Item 1 1.0
4 President Item 1 1.0
5 President Item 1 NaN
6 Contractor Item 2 1.0
7 Contractor Item 2 0.0
8 Contractor Item 2 1.0
9 President Item 2 1.0
10 President Item 2 0.0
11 President Item 2 1.0
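melted contains the same information as results, but in a tidy format. The value column contains the values from results[['Item 1', 'Item 2']]. Each row corresponds to an "observation" where results['Contractor'] or results['President'] equals 1, since the logic of your calculation only requires those values.
Similarly, instead of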
In [407]: reference
Out[407]:
Item(s) Position
0 (1,) Contractor
1 (1, 2) President
it would be tidier to have a DataFrame whose columns are Item and Position:
In [408]: longref
Out[408]:
Item Position
0 Item 1 Contractor
1 Item 1 President
2 Item 2 President
Once you have tidy versions of your data in the form of melted and longref, computing the desired result is fairly straightforward:
merged = pd.merge(longref, melted, how='left')
# Item Position value
# 0 Item 1 Contractor 1.0
# 1 Item 1 Contractor 1.0
# 2 Item 1 Contractor NaN
# 3 Item 1 President 1.0
# 4 Item 1 President 1.0
# 5 Item 1 President NaN
# 6 Item 2 President 1.0
# 7 Item 2 President 0.0
# 8 Item 2 President 1.0
grouped = merged.groupby(['Position'])
result = (grouped['value'].sum()/grouped['value'].count())*100
result = result.rename('Overall%').reset_index()
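Since sum() and count() both skip NaN, the ratio above is the same as the group mean. A minimal sketch of that equivalence, with merged rebuilt by hand from the table shown in the comments above (the names via_ratio and via_mean are illustrative only):

```python
import numpy as np
import pandas as pd

# merged, copied from the table in the comments above
merged = pd.DataFrame({
    'Item': ['Item 1'] * 6 + ['Item 2'] * 3,
    'Position': ['Contractor'] * 3 + ['President'] * 6,
    'value': [1, 1, np.nan, 1, 1, np.nan, 1, 0, 1],
})

grouped = merged.groupby('Position')
via_ratio = grouped['value'].sum() / grouped['value'].count() * 100
# mean() skips NaN by default, so it matches sum()/count()
via_mean = grouped['value'].mean() * 100

print(via_mean.rename('Overall%').reset_index())
```

The explicit sum/count form makes the handling of missing values visible, while the mean form is shorter; which one reads better is a matter of taste.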
How to tidy up reference to make longref:
Just iterate through the rows of reference, and for each row iterate through the tuple of items to build up a new DataFrame, longref:
longref = pd.DataFrame([('Item {}'.format(item), row['Position'])
                        for index, row in reference.iterrows()
                        for item in row['Item(s)']], columns=['Item', 'Position'])
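As an alternative to the explicit loop, the same frame can probably be built with DataFrame.explode. This is a sketch, not the method used in this answer: it assumes pandas 0.25 or newer (where explode was added), and the name longref_alt is made up for illustration.

```python
import pandas as pd

reference = pd.DataFrame({'Position': ['Contractor', 'President'],
                          'Item(s)': [(1,), (1, 2)]})

# explode() gives each element of a tuple its own row
# (assumption: pandas >= 0.25)
longref_alt = reference.explode('Item(s)').reset_index(drop=True)
# build the same 'Item N' labels that longref uses
longref_alt['Item'] = 'Item ' + longref_alt['Item(s)'].astype(str)
longref_alt = longref_alt[['Item', 'Position']]
print(longref_alt)
```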
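How to tidy up results to make melted:
It can be done with two calls to pd.melt. pd.melt converts "wide" format DataFrames to "long" format. It can coalesce multiple columns into one column. For example, to coalesce the Contractor and President columns into one Position column, you could use: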
melted = pd.melt(results, id_vars=['Item 1','Item 2'], var_name='Position')
# we only care about rows where Contractor or President value was 1. So use .loc to select those rows.
melted = melted.loc[melted['value']==1]
# Item 1 Item 2 Position value
# 0 1.0 1 Contractor 1
# 1 1.0 0 Contractor 1
# 5 NaN 1 Contractor 1
# 6 1.0 1 President 1
# 10 1.0 0 President 1
# 11 NaN 1 President 1
Similarly, to coalesce the Item 1 and Item 2 columns into a single Item column, use:
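# drop the old 'value' column first: recent pandas versions refuse to melt
# a frame that already has a column named like the new value column
melted = pd.melt(melted.drop(columns='value'), id_vars=['Position'],
                 value_vars=['Item 1','Item 2'], var_name='Item')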
# Position Item value
# 0 Contractor Item 1 1.0
# 1 Contractor Item 1 1.0
# 2 Contractor Item 1 NaN
# 3 President Item 1 1.0
# 4 President Item 1 1.0
# 5 President Item 1 NaN
# 6 Contractor Item 2 1.0
# 7 Contractor Item 2 0.0
# 8 Contractor Item 2 1.0
# 9 President Item 2 1.0
# 10 President Item 2 0.0
# 11 President Item 2 1.0
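The whole pipeline (two melts, a merge, and a groupby) can be collected in one helper. This is only a sketch: the function name overall_percent, its parameters, and the intermediate column name flag are illustrative and not part of the original answer. Naming the first melt's value column flag and dropping it afterwards keeps the second melt from colliding with an existing value column, which recent pandas versions reject.

```python
import numpy as np
import pandas as pd

def overall_percent(results, longref, positions, items):
    """Percent of non-missing item values equal to 1, per position."""
    # wide -> long over the position flags; keep rows where the flag is 1
    long_pos = pd.melt(results, id_vars=items, value_vars=positions,
                       var_name='Position', value_name='flag')
    long_pos = long_pos.loc[long_pos['flag'] == 1].drop(columns='flag')
    # wide -> long over the item columns
    tidy = pd.melt(long_pos, id_vars=['Position'], value_vars=items,
                   var_name='Item')
    # keep only the (Item, Position) pairs that longref asks for
    merged = pd.merge(longref, tidy, how='left')
    grouped = merged.groupby('Position')['value']
    return (grouped.sum() / grouped.count() * 100).rename('Overall%').reset_index()

results = pd.DataFrame({'Contractor': [1, 1, 0, 0, 0, 1],
                        'President': [1, 0, 0, 0, 1, 1],
                        'Item 1': [1, 1, 0, 0, 1, np.nan],
                        'Item 2': [1, 0, 0, 1, 0, 1]})
longref = pd.DataFrame({'Item': ['Item 1', 'Item 1', 'Item 2'],
                        'Position': ['Contractor', 'President', 'President']})

out = overall_percent(results, longref,
                      positions=['Contractor', 'President'],
                      items=['Item 1', 'Item 2'])
print(out)
```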
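Wow! Good point about the benefits of tidy data. This exceeded my expectations. Thank you very much!
If you have time, please consider answering this related question: http://stackoverflow.com/questions/37245220/pandas-create-columns-from-rows-in-other-data-frame-with-criteria/37245696#37245696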