2016-05-16 104 views

Pandas pivot table with fancy stacking

Given the following data frame:

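import numpy as np
import pandas as pd

results = pd.DataFrame({'Contractor':[1,1,0,0,0,1], 
        'President':[1,0,0,0,1,1], 
        'Item 1':[1,1,0,0,1,np.nan], 
        'Item 2':[1,0,0,1,0,1]}) 
# reassign so the column order shown below actually takes effect
results = results[['Contractor','President','Item 1','Item 2']] 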

results 

   Contractor  President  Item 1  Item 2
0           1          1       1       1
1           1          0       1       0
2           0          0       0       0
3           0          0       0       1
4           0          1       1       0
5           1          1     NaN       1

Each position is concerned with the following items:

     Position  Item(s)
0  Contractor  1
1   President  1,2

...I would like to pivot the data to produce this:

     Position  Overall%
0  Contractor       100
1   President        80

...based on this logic:

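Since the President is concerned with Items 1 and 2, there are 5 numbers to consider: (1 and 1) from Item 1 and (1, 0, and 1) from Item 2. The sum across items is 4 and the count across items is 5 (not counting 'NaN'), which gives 80%.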

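Since the Contractor is only concerned with Item 1, there are 2 numbers to consider: 1 and 1 (the 'NaN' should not be counted), taken from the rows of interest. So the sum is 2 and the count is 2, which gives 100%.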
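To make the arithmetic above concrete, here is a minimal sketch that checks the two percentages directly, without any reshaping. The `items_for` dictionary is a hypothetical stand-in for the position/item table above; it is not part of the original data.

```python
import numpy as np
import pandas as pd

results = pd.DataFrame({'Contractor': [1, 1, 0, 0, 0, 1],
                        'President': [1, 0, 0, 0, 1, 1],
                        'Item 1': [1, 1, 0, 0, 1, np.nan],
                        'Item 2': [1, 0, 0, 1, 0, 1]})

# Hypothetical lookup: which item columns each position is concerned with.
items_for = {'Contractor': ['Item 1'], 'President': ['Item 1', 'Item 2']}

overall = {}
for position, cols in items_for.items():
    # Keep only the rows flagged for this position, and only its items.
    values = results.loc[results[position] == 1, cols]
    total = values.sum().sum()     # sum() skips NaN
    count = values.count().sum()   # count() ignores NaN
    overall[position] = 100 * total / count

print(overall)
```

This only confirms the expected numbers (100 for Contractor, 80 for President); it does not produce the pivoted table asked for.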

Thanks in advance!

Answer

import numpy as np 
import pandas as pd 

results = pd.DataFrame({'Contractor':[1,1,0,0,0,1], 
        'President':[1,0,0,0,1,1], 
        'Item 1':[1,1,0,0,1,np.nan], 
        'Item 2':[1,0,0,1,0,1]}) 
reference = pd.DataFrame({'Position':['Contractor','President'], 
          'Item(s)':[(1,), (1,2)]}) 

longref = pd.DataFrame([('Item {}'.format(item), row['Position']) 
         for index, row in reference.iterrows() 
         for item in row['Item(s)']], columns=['Item', 'Position']) 
melted = pd.melt(results, id_vars=['Item 1','Item 2'], var_name='Position') 
melted = melted.loc[melted['value']==1] 
melted = pd.melt(melted, id_vars=['Position'], 
       value_vars=['Item 1','Item 2'], var_name='Item') 
merged = pd.merge(longref, melted, how='left') 
grouped = merged.groupby(['Position']) 
result = (grouped['value'].sum()/grouped['value'].count())*100 
result = result.rename('Overall%').reset_index() 
print(result) 

yields

 Position Overall% 
0 Contractor  100.0 
1 President  80.0 

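Explanation: There is an article by Hadley Wickham (PDF) on the virtues of making data "tidy". The main tenet is that each row should represent one "observation" and each column should represent some factor or variable.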

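You will often find that once the data is tidy, the tools you need to express the computation fall into place naturally. The difficulty of this problem comes mainly from the data not being tidy.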

Consider results:

In [405]: results 
Out[405]: 
    Contractor Item 1 Item 2 President 
0   1  1.0  1   1 
1   1  1.0  0   0 
2   0  0.0  0   0 
3   0  0.0  1   0 
4   0  1.0  0   1 
5   1  NaN  1   1 

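Instead of having separate columns for Contractor and President, it would be better to have one column called Position, since Position is the variable, and each observation (row) can take one value for Position: Contractor or President. Similarly, Item 1 and Item 2 should be coalesced into a single column, Item: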

In [416]: melted 
Out[416]: 
     Position Item value 
0 Contractor Item 1 1.0 
1 Contractor Item 1 1.0 
2 Contractor Item 1 NaN 
3 President Item 1 1.0 
4 President Item 1 1.0 
5 President Item 1 NaN 
6 Contractor Item 2 1.0 
7 Contractor Item 2 0.0 
8 Contractor Item 2 1.0 
9 President Item 2 1.0 
10 President Item 2 0.0 
11 President Item 2 1.0 

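melted contains the same information as results, but in a tidy format. The value column contains the values from results[['Item 1', 'Item 2']]. Each row corresponds to an "observation" where results['Contractor'] or results['President'] equals 1, since the logic of the calculation only needs those values.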

Similarly, instead of

In [407]: reference 
Out[407]: 
    Item(s) Position 
0 (1,) Contractor 
1 (1, 2) President 

it would be tidier to have a DataFrame whose columns are Item and Position:

In [408]: longref 
Out[408]: 
    Item Position 
0 Item 1 Contractor 
1 Item 1 President 
2 Item 2 President 

Once you have tidy versions of your data, in the form of melted and longref, computing the desired result is fairly straightforward:

merged = pd.merge(longref, melted, how='left') 
#  Item Position value 
# 0 Item 1 Contractor 1.0 
# 1 Item 1 Contractor 1.0 
# 2 Item 1 Contractor NaN 
# 3 Item 1 President 1.0 
# 4 Item 1 President 1.0 
# 5 Item 1 President NaN 
# 6 Item 2 President 1.0 
# 7 Item 2 President 0.0 
# 8 Item 2 President 1.0 

grouped = merged.groupby(['Position']) 
result = (grouped['value'].sum()/grouped['value'].count())*100 
result = result.rename('Overall%').reset_index() 
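As a side note, dividing the group sum by the group count is the same thing as taking the group mean, because both `count()` and `mean()` skip NaN by default. The sketch below rebuilds the merged frame by hand from the commented output above (so it runs on its own) and compares the two forms.

```python
import numpy as np
import pandas as pd

# The merged frame as shown in the comment above, rebuilt by hand.
merged = pd.DataFrame({
    'Item': ['Item 1'] * 6 + ['Item 2'] * 3,
    'Position': ['Contractor'] * 3 + ['President'] * 6,
    'value': [1, 1, np.nan, 1, 1, np.nan, 1, 0, 1],
})

grouped = merged.groupby('Position')['value']
by_ratio = grouped.sum() / grouped.count() * 100   # form used in the answer
by_mean = grouped.mean() * 100                     # NaN is skipped by default

result = by_mean.rename('Overall%').reset_index()
print(result)
```

Either form should give the same percentages; the mean is just a shorter spelling.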

How to tidy up reference to make longref:

Just iterate through the rows of reference, and for each row iterate through its tuple of items, to build the new DataFrame, longref:

longref = pd.DataFrame([('Item {}'.format(item), row['Position']) 
         for index, row in reference.iterrows() 
         for item in row['Item(s)']], columns=['Item', 'Position']) 
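If your pandas version has `DataFrame.explode` (an assumption here: it was added in pandas 0.25, after this answer was written), the same longref can probably be built without an explicit loop. This is an alternative sketch, not the method used above.

```python
import pandas as pd

reference = pd.DataFrame({'Position': ['Contractor', 'President'],
                          'Item(s)': [(1,), (1, 2)]})

# Assumes pandas >= 0.25, where DataFrame.explode is available.
longref = (reference
           .explode('Item(s)')                    # one row per (Position, item)
           .rename(columns={'Item(s)': 'Item'})
           .reset_index(drop=True))

# Turn the item numbers into the column labels used in results.
longref['Item'] = 'Item ' + longref['Item'].astype(str)
longref = longref[['Item', 'Position']]
print(longref)
```

The nested list comprehension works on any pandas version, so it remains the safer choice if you cannot control the installed version.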

How to tidy up results to make melted:

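It can be done with two calls to pd.melt. pd.melt converts "wide" format DataFrames to "long" format, and it can coalesce multiple columns into a single column. For example, to coalesce the Contractor and President columns into one Position column, you could use: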

melted = pd.melt(results, id_vars=['Item 1','Item 2'], var_name='Position') 
# we only care about rows where Contractor or President value was 1. So use .loc to select those rows. 
melted = melted.loc[melted['value']==1] 
#  Item 1 Item 2 Position value 
# 0  1.0  1 Contractor  1 
# 1  1.0  0 Contractor  1 
# 5  NaN  1 Contractor  1 
# 6  1.0  1 President  1 
# 10  1.0  0 President  1 
# 11  NaN  1 President  1 

Similarly, to coalesce the Item 1 and Item 2 columns into a single Item column, use:

melted = pd.melt(melted, id_vars=['Position'], 
       value_vars=['Item 1','Item 2'], var_name='Item') 
#  Position Item value 
# 0 Contractor Item 1 1.0 
# 1 Contractor Item 1 1.0 
# 2 Contractor Item 1 NaN 
# 3 President Item 1 1.0 
# 4 President Item 1 1.0 
# 5 President Item 1 NaN 
# 6 Contractor Item 2 1.0 
# 7 Contractor Item 2 0.0 
# 8 Contractor Item 2 1.0 
# 9 President Item 2 1.0 
# 10 President Item 2 0.0 
# 11 President Item 2 1.0 
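If pd.melt is new to you, a toy example may make the wide-to-long conversion easier to see. The frame `wide` and its column names are hypothetical and unrelated to the data in the question.

```python
import pandas as pd

# Hypothetical toy frame, not part of the original data.
wide = pd.DataFrame({'id': ['a', 'b'], 'x': [1, 2], 'y': [3, 4]})

# Every column not listed in id_vars is stacked into a
# (variable, value) pair, one original column after another.
tidy = pd.melt(wide, id_vars=['id'], var_name='variable', value_name='value')
print(tidy)
# rows, in order: (a, x, 1), (b, x, 2), (a, y, 3), (b, y, 4)
```

Two wide columns and two rows become four long rows, which is the same pattern that turns the 6 selected rows and 2 item columns above into 12 rows.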

Wow! Great point about the benefits of tidy data. This went beyond my expectations. Thank you so much!


If you have time, please consider answering this related question: http://stackoverflow.com/questions/37245220/pandas-create-columns-from-rows-in-other-data-frame-with-criteria/37245696#37245696