2016-09-22 207 views
1

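How can I efficiently append multiple KPI values for each individual customer using pandas? (pandas: append multiple columns for a single one)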

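Combining the KPI df with the customers df causes some problems, because country is the index of the KPI data frame while nationality is not in the index.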

countryKPI = pd.DataFrame({'country':['Austria','Germany', 'Germany', 'Austria'], 
          'indicator':['z','x','z','x'], 
          'value':[7,8,9,7]}) 
customers = pd.DataFrame({'customer':['first','second'], 
          'nationality':['Germany','Austria'], 
          'value':[7,8]}) 

See the desired result highlighted in pink (image not included).

Answers

1

You can counter the mismatch in categories through merge:

df = pd.pivot_table(data=countryKPI, index=['country'], columns=['indicator']) 
df.index.name = 'nationality'  
customers.merge(df['value'].reset_index(), on='nationality', how='outer') 


Data:

countryKPI = pd.DataFrame({'country':['Austria','Germany', 'Germany', 'Austria'], 
          'indicator':['z','x','z','x'], 
          'value':[7,8,9,7]}) 
customers = pd.DataFrame({'customer':['first','second'], 
          'nationality':['Slovakia','Austria'], 
          'value':[7,8]}) 
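
Since the output image is not available here, the merge above can be reproduced on this data roughly as follows. This is a minimal sketch, assuming all columns have plain object/string dtype (no category columns); the names kpi_wide and merged are illustrative and do not come from the original answer.

```python
import pandas as pd

countryKPI = pd.DataFrame({'country': ['Austria', 'Germany', 'Germany', 'Austria'],
                           'indicator': ['z', 'x', 'z', 'x'],
                           'value': [7, 8, 9, 7]})
customers = pd.DataFrame({'customer': ['first', 'second'],
                          'nationality': ['Slovakia', 'Austria'],
                          'value': [7, 8]})

# One row per country, one column per indicator (pivot_table averages by default).
kpi_wide = pd.pivot_table(data=countryKPI, index=['country'], columns=['indicator'])

# Rename the index so that, after reset_index, the key column matches customers.
kpi_wide.index.name = 'nationality'

# how='outer' keeps customers without KPIs (Slovakia) as well as countries
# without customers (Germany); the missing cells become NaN.
merged = customers.merge(kpi_wide['value'].reset_index(), on='nationality', how='outer')
print(merged)
```

With how='left' instead, only the rows of customers would be kept, which may be closer to "append KPIs to each customer".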

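The problem seems to be that the pivot operation leaves a CategoricalIndex in your DF, and reset_index then fails on it with that error.

Simply work backwards: check the dtypes of the countryKPI and customers DataFrames and, wherever category appears, convert those columns to their string representation via astype(str).

Reproducing the error and countering it:

Assume the DFs are as given above: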

countryKPI['indicator'] = countryKPI['indicator'].astype('category') 
countryKPI['country'] = countryKPI['country'].astype('category') 
customers['nationality'] = customers['nationality'].astype('category') 

countryKPI.dtypes 
country  category 
indicator category 
value   int64 
dtype: object 

customers.dtypes 
customer   object 
nationality category 
value    int64 
dtype: object 

After the pivot operation:

df = pd.pivot_table(data=countryKPI, index=['country'], columns=['indicator']) 
df.index 
CategoricalIndex(['Austria', 'Germany'], categories=['Austria', 'Germany'], ordered=False, 
        name='country', dtype='category') 
# ^^ See the categorical index 

When you call reset_index on it:

df.reset_index() 

TypeError: cannot insert an item into a CategoricalIndex that is not already an existing category

To counter that error, simply cast the categorical columns to str type.

countryKPI['indicator'] = countryKPI['indicator'].astype('str') 
countryKPI['country'] = countryKPI['country'].astype('str') 
customers['nationality'] = customers['nationality'].astype('str') 

Now the reset_index part works, and so does the merge.
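
The steps above can be put together in one place. The following is only a sketch: decategorize is a hypothetical helper (not part of pandas) that casts every category column to str, and the data is the sample from this thread. Whether the TypeError actually appears without the cast may depend on the pandas version, so the sketch only shows the workaround.

```python
import pandas as pd

def decategorize(df):
    """Return a copy of df with every category column cast to str (hypothetical helper)."""
    out = df.copy()
    for col in out.select_dtypes(include='category').columns:
        out[col] = out[col].astype(str)
    return out

countryKPI = pd.DataFrame({'country': ['Austria', 'Germany', 'Germany', 'Austria'],
                           'indicator': ['z', 'x', 'z', 'x'],
                           'value': [7, 8, 9, 7]})
customers = pd.DataFrame({'customer': ['first', 'second'],
                          'nationality': ['Slovakia', 'Austria'],
                          'value': [7, 8]})

# Simulate input that arrives with categorical columns.
countryKPI['indicator'] = countryKPI['indicator'].astype('category')
countryKPI['country'] = countryKPI['country'].astype('category')
customers['nationality'] = customers['nationality'].astype('category')

# Cast the categorical columns to str before pivoting and merging.
kpi = decategorize(countryKPI)
cust = decategorize(customers)

df = pd.pivot_table(data=kpi, index=['country'], columns=['indicator'])
df.index.name = 'nationality'
result = cust.merge(df['value'].reset_index(), on='nationality', how='outer')
print(result)
```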

+0

Interesting and simple. But http://imgur.com/a/PeCyh why do I get several other values for the initial dataset (0, 1, 2, 3)? –

+0

I see, your latest edit makes my last comment obsolete. –

+0

However, the following problem remains: cannot insert an item into a CategoricalIndex that is not already an existing category –

2

I think you can use concat:

df_pivoted = countryKPI.pivot_table(index='country', 
           columns='indicator', 
           values='value', 
           fill_value=0) 
print (df_pivoted)  
indicator x z 
country   
Austria 7 7 
Germany 8 9 

print (pd.concat([customers.set_index('nationality'), df_pivoted], axis=1)) 
     customer value x z 
Austria second  8 7 7 
Germany first  7 8 9      


print (pd.concat([customers.set_index('nationality'), df_pivoted], axis=1) 
     .reset_index() 
     .rename(columns={'index':'nationality'}) 
     [['customer','nationality','value','x','z']]) 

    customer nationality value x z 
0 second  Austria  8 7 7 
1 first  Germany  7 8 9 
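
As a possible alternative to the concat call above (this is a swapped-in technique, not what the answer uses), DataFrame.join with on= matches a column of the left frame against the index of the right frame, which is exactly the column-versus-index mismatch described in the question. A minimal sketch, assuming plain string columns and the sample data from the question:

```python
import pandas as pd

countryKPI = pd.DataFrame({'country': ['Austria', 'Germany', 'Germany', 'Austria'],
                           'indicator': ['z', 'x', 'z', 'x'],
                           'value': [7, 8, 9, 7]})
customers = pd.DataFrame({'customer': ['first', 'second'],
                          'nationality': ['Germany', 'Austria'],
                          'value': [7, 8]})

df_pivoted = countryKPI.pivot_table(index='country',
                                    columns='indicator',
                                    values='value',
                                    fill_value=0)

# Match customers['nationality'] against the index of df_pivoted (the countries).
# The default how='left' keeps every customer and preserves the row order.
joined = customers.join(df_pivoted, on='nationality')
print(joined)
```

This keeps nationality as an ordinary column, so the reset_index and rename steps are not needed.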

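Edit by comment:

The problem is that the dtypes of the columns customers.nationality and countryKPI.country are category, and if some categories are missing in one of them, it raises an error: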

ValueError: incompatible categories in categorical concat

The solution is to build the full set of categories with union and then apply it to both columns with set_categories:

import pandas as pd 
import numpy as np 

countryKPI = pd.DataFrame({'country':['Austria','Germany', 'Germany', 'Austria'], 
          'indicator':['z','x','z','x'], 
          'value':[7,8,9,7]}) 
customers = pd.DataFrame({'customer':['first','second'], 
          'nationality':['Slovakia','Austria'], 
          'value':[7,8]}) 

customers.nationality = customers.nationality.astype('category') 
countryKPI.country = countryKPI.country.astype('category') 

print (countryKPI.country.cat.categories) 
Index(['Austria', 'Germany'], dtype='object') 

print (customers.nationality.cat.categories) 
Index(['Austria', 'Slovakia'], dtype='object') 

all_categories =countryKPI.country.cat.categories.union(customers.nationality.cat.categories) 
print (all_categories) 
Index(['Austria', 'Germany', 'Slovakia'], dtype='object') 

customers.nationality = customers.nationality.cat.set_categories(all_categories) 
countryKPI.country = countryKPI.country.cat.set_categories(all_categories) 
df_pivoted = countryKPI.pivot_table(index='country', 
           columns='indicator', 
           values='value', 
           fill_value=0) 
print (df_pivoted)  
indicator x z 
country   
Austria 7 7 
Germany 8 9 
Slovakia 0 0   

print (pd.concat([customers.set_index('nationality'), df_pivoted], axis=1) 
     .reset_index() 
     .rename(columns={'index':'nationality'}) 
     [['customer','nationality','value','x','z']]) 

    customer nationality value x z 
0 second  Austria 8.0 7 7 
1  NaN  Germany NaN 8 9 
2 first Slovakia 7.0 0 0 
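
The category alignment step can also be wrapped in a small function. This is only a sketch: align_categories is a hypothetical helper name, and it assumes both inputs are Series with category dtype.

```python
import pandas as pd

def align_categories(left, right):
    """Give two categorical Series the same category set (hypothetical helper)."""
    # Union of both category sets, as in the answer above.
    all_categories = left.cat.categories.union(right.cat.categories)
    return (left.cat.set_categories(all_categories),
            right.cat.set_categories(all_categories))

nationality = pd.Series(['Slovakia', 'Austria'], dtype='category')
country = pd.Series(['Austria', 'Germany', 'Germany', 'Austria'], dtype='category')

nationality, country = align_categories(nationality, country)

# Both Series now share the same categories, so they can be combined
# without losing the category dtype.
print(list(nationality.cat.categories))
print(list(country.cat.categories))
```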

If you need better performance, use groupby instead of pivot_table:

df_pivoted1 = (countryKPI.groupby(['country','indicator']) 
         .mean() 
         .squeeze() 
         .unstack() 
         .fillna(0)) 
print (df_pivoted1) 
indicator x z 
country    
Austria 7.0 7.0 
Germany 8.0 9.0 
Slovakia 0.0 0.0 

Timings:

In [177]: %timeit countryKPI.pivot_table(index='country', columns='indicator', values='value', fill_value=0) 
100 loops, best of 3: 6.24 ms per loop 

In [178]: %timeit countryKPI.groupby(['country','indicator']).mean().squeeze().unstack().fillna(0) 
100 loops, best of 3: 4.28 ms per loop 
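
Outside IPython, a similar comparison can be sketched with the standard timeit module. The function names and the run count n are illustrative; absolute numbers depend on the machine, the pandas version and the data size, and the four-row sample is far too small to draw firm conclusions from.

```python
import timeit
import pandas as pd

countryKPI = pd.DataFrame({'country': ['Austria', 'Germany', 'Germany', 'Austria'],
                           'indicator': ['z', 'x', 'z', 'x'],
                           'value': [7, 8, 9, 7]})

def with_pivot_table():
    return countryKPI.pivot_table(index='country', columns='indicator',
                                  values='value', fill_value=0)

def with_groupby():
    return (countryKPI.groupby(['country', 'indicator'])
                      .mean()
                      .squeeze()
                      .unstack()
                      .fillna(0))

n = 50  # number of calls per measurement (arbitrary choice)
t_pivot = timeit.timeit(with_pivot_table, number=n)
t_groupby = timeit.timeit(with_groupby, number=n)

print(f'pivot_table: {t_pivot / n * 1e3:.2f} ms per call')
print(f'groupby:     {t_groupby / n * 1e3:.2f} ms per call')
```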
+0

This almost works, but I get the error "incompatible categories in categorical concat" –

+1

The problem is with the real data, right? I think the sample works perfectly. – jezrael

+0

Unfortunately, yes. –