2015-08-21 114 views

Pandas standard deviation returns NaN

I have the following Pandas DataFrame in Python 2.7.

CODE:

import pandas as pd 
import numpy as np 
df = pd.DataFrame(np.random.rand(10,6),columns=list('ABCDEF')) 
df.insert(0,'Category',['A','C','D','D','B','E','F','F','G','H']) 
print(df.groupby('Category').std())

Here is df:

Category   A   B   C   D   E   F 
     A 0.500200 0.791039 0.498083 0.360320 0.965992 0.537068 
     C 0.295330 0.638823 0.133570 0.272600 0.647285 0.737942 
     D 0.912966 0.051288 0.055766 0.906490 0.078384 0.928538 
     D 0.416582 0.441684 0.605967 0.516580 0.458814 0.823692 
     B 0.714371 0.636975 0.153347 0.936872 0.000649 0.692558 
     E 0.639271 0.486151 0.860172 0.870838 0.831571 0.404813 
     F 0.375279 0.555228 0.020599 0.120947 0.896505 0.424233 
     F 0.952112 0.299520 0.150623 0.341139 0.186734 0.807519 
     G 0.384157 0.858391 0.278563 0.677627 0.998458 0.829019 
     H 0.109465 0.085861 0.440557 0.925500 0.767791 0.626924 

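I want to perform a group-by and then compute the mean and standard deviation. After grouping, the standard deviation is sometimes computed over a group that has only 1 row. Dividing by N-1 then means dividing by 0, which prints NaN.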

Here is the output of the code above:

OUTPUT:

   A   B   C   D   E   F 
Category                
A    NaN  NaN  NaN  NaN  NaN  NaN 
B    NaN  NaN  NaN  NaN  NaN  NaN 
C    NaN  NaN  NaN  NaN  NaN  NaN 
D   0.350996 0.276052 0.389051 0.275708 0.269004 0.074137 
E    NaN  NaN  NaN  NaN  NaN  NaN 
F   0.407882 0.180813 0.091941 0.155699 0.501884 0.271025 
G    NaN  NaN  NaN  NaN  NaN  NaN 
H    NaN  NaN  NaN  NaN  NaN  NaN 

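For the cases where the group-by covers only 1 row, is there a way to skip the standard deviation and just return the value itself? For example, I would like to get this: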

DESIRED OUTPUT:

    A   B   C   D   E   F 
Category                
A   0.500200 0.791039 0.498083 0.360320 0.965992 0.537068 
B   0.714371 0.636975 0.153347 0.936872 0.000649 0.692558 
C   0.295330 0.638823 0.133570 0.272600 0.647285 0.737942 
D   0.350996 0.276052 0.389051 0.275708 0.269004 0.074137 
E   0.639271 0.486151 0.860172 0.870838 0.831571 0.404813 
F   0.407882 0.180813 0.091941 0.155699 0.501884 0.271025 
G   0.384157 0.858391 0.278563 0.677627 0.998458 0.829019 
H   0.109465 0.085861 0.440557 0.925500 0.767791 0.626924 

Is it possible to do this with pandas?

EDIT: To create the exact pandas DataFrame above, select it, copy it to the clipboard, and then use this:

import pandas as pd 
df = pd.read_clipboard() 
print(df)
print(df.groupby('Category').std())
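
If the clipboard is not available (for example on a headless machine), the same frame can be built from an inline string instead. This is only a sketch, assuming Python 3 and a reasonably recent pandas; the variable name `raw` and the use of `io.StringIO` with `pd.read_csv` are illustrative choices, not part of the original post.

```python
import io

import pandas as pd

# The ten rows from the question, as whitespace-separated text.
raw = """Category A B C D E F
A 0.500200 0.791039 0.498083 0.360320 0.965992 0.537068
C 0.295330 0.638823 0.133570 0.272600 0.647285 0.737942
D 0.912966 0.051288 0.055766 0.906490 0.078384 0.928538
D 0.416582 0.441684 0.605967 0.516580 0.458814 0.823692
B 0.714371 0.636975 0.153347 0.936872 0.000649 0.692558
E 0.639271 0.486151 0.860172 0.870838 0.831571 0.404813
F 0.375279 0.555228 0.020599 0.120947 0.896505 0.424233
F 0.952112 0.299520 0.150623 0.341139 0.186734 0.807519
G 0.384157 0.858391 0.278563 0.677627 0.998458 0.829019
H 0.109465 0.085861 0.440557 0.925500 0.767791 0.626924
"""

# Split on runs of whitespace; the first line supplies the column names.
df = pd.read_csv(io.StringIO(raw), sep=r"\s+")

print(df)
print(df.groupby("Category").std())
```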

How did you calculate the values for categories D and F in your desired output (e.g. 0.403709 for column A of category D)? – Alexander


Note that 'df.groupby('Category').apply(np.std)' returns the std as '0.0', as you would expect. – dhrumeel
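
The difference between NaN and 0.0 comes down to the delta degrees of freedom: `groupby(...).std()` defaults to `ddof=1` (divide by N-1), while `np.std` defaults to `ddof=0` (divide by N). A small sketch on a made-up frame shows both behaviours. The names `toy`, `key` and `x` are hypothetical, and the sketch passes `ddof=0` to `std` directly rather than going through `apply(np.std)`.

```python
import numpy as np
import pandas as pd

# Group "a" has one row, group "b" has two rows.
toy = pd.DataFrame({"key": ["a", "b", "b"], "x": [1.0, 2.0, 4.0]})

# Sample standard deviation (ddof=1): a single-row group divides by zero -> NaN.
sample_std = toy.groupby("key")["x"].std()

# Population standard deviation (ddof=0): a single-row group gives 0.0.
population_std = toy.groupby("key")["x"].std(ddof=0)

print(sample_std)
print(population_std)
```

Note that 0.0 is still not the row value asked for in the question, and `ddof=0` also changes the result for the multi-row groups.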

Answer


You can use fillna to replace the missing values by passing a DataFrame that holds the last value of each group.

In [86]: (df.groupby('Category').std() 
    ...: .fillna(df.groupby('Category').last())) 

Out[86]: 
       A   B   C   D   E   F 
Category                
A   0.500200 0.791039 0.498083 0.360320 0.965992 0.537068 
B   0.714371 0.636975 0.153347 0.936872 0.000649 0.692558 
C   0.295330 0.638823 0.133570 0.272600 0.647285 0.737942 
D   0.350996 0.276052 0.389051 0.275708 0.269005 0.074137 
E   0.639271 0.486151 0.860172 0.870838 0.831571 0.404813 
F   0.407883 0.180813 0.091941 0.155699 0.501884 0.271024 
G   0.384157 0.858391 0.278563 0.677627 0.998458 0.829019 
H   0.109465 0.085861 0.440557 0.925500 0.767791 0.626924
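
The same idea can be written as a standalone script. This is a sketch on a small hypothetical frame: the column names mirror the question, but the values are made up so the result is easy to check by hand.

```python
import pandas as pd

# Categories "A" and "B" have one row each; "D" has two rows.
df = pd.DataFrame({
    "Category": ["A", "D", "D", "B"],
    "A": [0.5, 1.0, 3.0, 0.7],
    "B": [0.8, 2.0, 2.0, 0.6],
})

grouped = df.groupby("Category")

std = grouped.std()        # NaN wherever a group has a single row
fallback = grouped.last()  # for a single-row group this is the row itself

# fillna aligns on index and columns and only replaces the NaN cells.
result = std.fillna(fallback)
print(result)
```

Because `fillna` only touches cells that are NaN, multi-row groups keep their computed standard deviation (including a genuine 0.0, as in column B of category D here). One caveat: any other NaN in the `std()` result, for example from a multi-row group with only one non-missing value in a column, would be filled in the same way.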