pandas: append multiple columns for a single one

How can I efficiently append multiple KPI values for each single customer using pandas?

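Joining the pivoted df with the customers df raises some problems, because the country is the index of the pivoted data frame and the nationality is not in the index.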

countryKPI = pd.DataFrame({'country':['Austria','Germany', 'Germany', 'Austria'], 
          'indicator':['z','x','z','x'], 
          'value':[7,8,9,7]}) 
customers = pd.DataFrame({'customer':['first','second'], 
          'nationality':['Germany','Austria'], 
          'value':[7,8]})

The desired result is highlighted in pink:


2016-09-22 Georg Heiler

You can counter the mismatch in categories with merge:

df = pd.pivot_table(data=countryKPI, index=['country'], columns=['indicator']) 
df.index.name = 'nationality'  
customers.merge(df['value'].reset_index(), on='nationality', how='outer')

Data:

countryKPI = pd.DataFrame({'country':['Austria','Germany', 'Germany', 'Austria'], 
          'indicator':['z','x','z','x'], 
          'value':[7,8,9,7]}) 
customers = pd.DataFrame({'customer':['first','second'], 
          'nationality':['Slovakia','Austria'], 
          'value':[7,8]})
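
The merge above can be sketched end to end as follows. This is a minimal illustration, not the answerer's exact code: the names `kpi_wide` and `result` are made up here, and `how='left'` is an assumption chosen so that every customer is kept even when the country has no KPI rows.

```python
import pandas as pd

countryKPI = pd.DataFrame({'country': ['Austria', 'Germany', 'Germany', 'Austria'],
                           'indicator': ['z', 'x', 'z', 'x'],
                           'value': [7, 8, 9, 7]})
customers = pd.DataFrame({'customer': ['first', 'second'],
                          'nationality': ['Slovakia', 'Austria'],
                          'value': [7, 8]})

# One row per country, one column per indicator.
kpi_wide = countryKPI.pivot_table(index='country', columns='indicator', values='value')
kpi_wide.index.name = 'nationality'   # match the key column name in customers
kpi_wide.columns.name = None          # drop the leftover 'indicator' label
kpi_wide = kpi_wide.reset_index()

# A left join keeps both customers; Slovakia has no KPI rows, so x and z are NaN there.
result = customers.merge(kpi_wide, on='nationality', how='left')
print(result)
```

With `how='outer'`, as in the answer, countries that have KPIs but no customers (Germany here) would also appear as extra rows.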

The problem seems to be that the pivot operation leaves a CategoricalIndex in your DF, and when you call reset_index it raises that error.

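Simply work backwards: check the dtypes of the countryKPI and customers DataFrames, and wherever category is mentioned, convert those columns to their string representation via astype(str).

Reproducing the error and countering it: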

Assuming the DataFrames are the ones given above:

countryKPI['indicator'] = countryKPI['indicator'].astype('category') 
countryKPI['country'] = countryKPI['country'].astype('category') 
customers['nationality'] = customers['nationality'].astype('category') 

countryKPI.dtypes 
country  category 
indicator category 
value   int64 
dtype: object 

customers.dtypes 
customer   object 
nationality category 
value    int64 
dtype: object

After the pivot operation:

df = pd.pivot_table(data=countryKPI, index=['country'], columns=['indicator']) 
df.index 
CategoricalIndex(['Austria', 'Germany'], categories=['Austria', 'Germany'], ordered=False, 
        name='country', dtype='category') 
# ^^ See the categorical index

When you call reset_index on it:

df.reset_index()

TypeError: cannot insert an item into a CategoricalIndex that is not already an existing category

To counter that error, simply cast the categorical columns to str type.

countryKPI['indicator'] = countryKPI['indicator'].astype('str') 
countryKPI['country'] = countryKPI['country'].astype('str') 
customers['nationality'] = customers['nationality'].astype('str')

Now the reset_index part works, and so does the merge.
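
The whole fix can be put together in one runnable sketch. The helper `categoricals_to_str` is a hypothetical name introduced here for illustration; it simply applies the astype(str) cast to every category column it finds instead of listing the columns by hand.

```python
import pandas as pd

countryKPI = pd.DataFrame({'country': ['Austria', 'Germany', 'Germany', 'Austria'],
                           'indicator': ['z', 'x', 'z', 'x'],
                           'value': [7, 8, 9, 7]})
customers = pd.DataFrame({'customer': ['first', 'second'],
                          'nationality': ['Germany', 'Austria'],
                          'value': [7, 8]})

# Reproduce the problematic setup: the key columns are categoricals.
countryKPI['country'] = countryKPI['country'].astype('category')
countryKPI['indicator'] = countryKPI['indicator'].astype('category')
customers['nationality'] = customers['nationality'].astype('category')


def categoricals_to_str(frame):
    """Return a copy in which every category column is cast to plain strings."""
    out = frame.copy()
    for col in out.select_dtypes(include='category').columns:
        out[col] = out[col].astype(str)
    return out


kpi = categoricals_to_str(countryKPI)
cust = categoricals_to_str(customers)

# With string keys the pivoted index is an ordinary Index, so reset_index is safe.
wide = kpi.pivot_table(index='country', columns='indicator', values='value')
wide.index.name = 'nationality'
wide.columns.name = None
merged = cust.merge(wide.reset_index(), on='nationality', how='left')
print(merged)
```

The cast gives up the memory savings of the category dtype, which is usually acceptable for small key columns like these.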


2016-09-22 09:43:56

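Interesting and simple. But why do I get several additional values for the initial data set (0,1,2,3)? See http://imgur.com/a/PeCyh

I see. Your latest edit makes my last comment obsolete.

However, the following problem remains: cannot insert an item into a CategoricalIndex that is not already an existing category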

I think you can use concat:

df_pivoted = countryKPI.pivot_table(index='country', 
           columns='indicator', 
           values='value', 
           fill_value=0) 
print (df_pivoted)  
indicator x z 
country   
Austria 7 7 
Germany 8 9 

print (pd.concat([customers.set_index('nationality'), df_pivoted], axis=1)) 
     customer value x z 
Austria second  8 7 7 
Germany first  7 8 9      


print (pd.concat([customers.set_index('nationality'), df_pivoted], axis=1) 
     .reset_index() 
     .rename(columns={'index':'nationality'}) 
     [['customer','nationality','value','x','z']]) 

    customer nationality value x z 
0 second  Austria  8 7 7 
1 first  Germany  7 8 9

Edit based on the comments:

The problem is that the dtypes of the columns customers.nationality and countryKPI.country are category, and if some categories are missing in one of them, it raises an error:

ValueError: incompatible categories in categorical concat

The solution is to find the combined set of categories with union and then apply set_categories:

import pandas as pd 
import numpy as np 

countryKPI = pd.DataFrame({'country':['Austria','Germany', 'Germany', 'Austria'], 
          'indicator':['z','x','z','x'], 
          'value':[7,8,9,7]}) 
customers = pd.DataFrame({'customer':['first','second'], 
          'nationality':['Slovakia','Austria'], 
          'value':[7,8]}) 

customers.nationality = customers.nationality.astype('category') 
countryKPI.country = countryKPI.country.astype('category') 

print (countryKPI.country.cat.categories) 
Index(['Austria', 'Germany'], dtype='object') 

print (customers.nationality.cat.categories) 
Index(['Austria', 'Slovakia'], dtype='object') 

all_categories =countryKPI.country.cat.categories.union(customers.nationality.cat.categories) 
print (all_categories) 
Index(['Austria', 'Germany', 'Slovakia'], dtype='object') 

customers.nationality = customers.nationality.cat.set_categories(all_categories) 
countryKPI.country = countryKPI.country.cat.set_categories(all_categories)

df_pivoted = countryKPI.pivot_table(index='country', 
           columns='indicator', 
           values='value', 
           fill_value=0) 
print (df_pivoted)  
indicator x z 
country   
Austria 7 7 
Germany 8 9 
Slovakia 0 0   

print (pd.concat([customers.set_index('nationality'), df_pivoted], axis=1) 
     .reset_index() 
     .rename(columns={'index':'nationality'}) 
     [['customer','nationality','value','x','z']]) 

    customer nationality value x z 
0 second  Austria 8.0 7 7 
1  NaN  Germany NaN 8 9 
2 first Slovakia 7.0 0 0
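
The category alignment step can also be wrapped in a small helper. This is a sketch under assumptions: `align_categories` is a hypothetical name, and it recasts both columns to one shared `CategoricalDtype` with astype rather than calling set_categories as the answer does; for values already present in the union the effect on the data is the same.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype


def align_categories(left, right):
    """Recast two categorical Series to one dtype that covers both category sets."""
    shared = CategoricalDtype(left.cat.categories.union(right.cat.categories))
    return left.astype(shared), right.astype(shared)


nationality = pd.Series(['Slovakia', 'Austria'], dtype='category')
country = pd.Series(['Austria', 'Germany', 'Germany', 'Austria'], dtype='category')

nationality, country = align_categories(nationality, country)

# Both columns now share the same categories, so they are compatible.
print(nationality.cat.categories)
print(nationality.dtype == country.dtype)
```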

If you need better performance, use groupby instead of pivot_table:

# Parentheses let the method chain span several lines
df_pivoted1 = (countryKPI.groupby(['country','indicator'])
                         .mean()
                         .squeeze()
                         .unstack()
                         .fillna(0))
print (df_pivoted1) 
indicator x z 
country    
Austria 7.0 7.0 
Germany 8.0 9.0 
Slovakia 0.0 0.0

Timings:

In [177]: %timeit countryKPI.pivot_table(index='country', columns='indicator', values='value', fill_value=0) 
100 loops, best of 3: 6.24 ms per loop 

In [178]: %timeit countryKPI.groupby(['country','indicator']).mean().squeeze().unstack().fillna(0) 
100 loops, best of 3: 4.28 ms per loop
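
A slightly shorter variant of the groupby route is sketched below. It is not the code that was timed above: it selects the value column before aggregating, so squeeze is not needed, and it passes `fill_value=0` to unstack instead of calling fillna afterwards. The name `wide` is made up for this example, and plain string columns are used rather than categoricals.

```python
import pandas as pd

countryKPI = pd.DataFrame({'country': ['Austria', 'Germany', 'Germany', 'Austria'],
                           'indicator': ['z', 'x', 'z', 'x'],
                           'value': [7, 8, 9, 7]})

# Selecting 'value' first yields a Series with a (country, indicator) MultiIndex;
# unstack then moves the indicator level into the columns.
wide = (countryKPI.groupby(['country', 'indicator'])['value']
                  .mean()
                  .unstack(fill_value=0))
print(wide)
```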


2016-09-22 08:50:59 jezrael

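This almost works, but I get the error: incompatible categories in categorical concat

The problem is with the real data, right? I think the sample works perfectly. – jezrael

Unfortunately, yes.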
