2016-12-02 51 views

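Pandas: how to select groups with a small within-group standard deviation?

I have a dataframe with roughly 100 rows per group ID. I want to group by group ID and then keep only the groups where the standard deviation of a column is below a threshold. I use the following code: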

# df is the dataframe with all rows 
# group on groupID 
df_grouped = df.groupby('groupID') 

# this gives a table with groupID and the std within a group 
df_grouped_std = df_grouped.std() 

# from the df with standard deviations, I select only the groups 
# where the standard deviation is within limits 
selection = df_grouped_std[df_grouped_std['col1']<1][df_grouped_std['col2']<0.05] 

# now I try to select from the original dataframe 'df_grouped' the groups that were selected in the previous step. 
df_plot = df_grouped[selection] 

Stack trace:

Traceback (most recent call last): 

    File "<ipython-input-72-2cd045ecb262>", line 1, in <module> 
    runfile('C:/Documents and Settings/a708818/Desktop/coloredByRol.py', wdir='C:/Documents and Settings/a708818/Desktop') 

    File "C:\Anaconda\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 682, in runfile 
    execfile(filename, namespace) 

    File "C:\Anaconda\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 71, in execfile 
    exec(compile(scripttext, filename, 'exec'), glob, loc) 

    File "C:/Documents and Settings/a708818/Desktop/coloredByRol.py", line 50, in <module> 
    df_plot = df_grouped[selection] 

    File "C:\Anaconda\lib\site-packages\pandas\core\groupby.py", line 3170, in __getitem__ 
    if key not in self.obj: 

    File "C:\Anaconda\lib\site-packages\pandas\core\generic.py", line 688, in __contains__ 
    return key in self._info_axis 

    File "C:\Anaconda\lib\site-packages\pandas\core\index.py", line 885, in __contains__ 
    hash(key) 

    File "C:\Anaconda\lib\site-packages\pandas\core\generic.py", line 647, in __hash__ 
    ' hashed'.format(self.__class__.__name__)) 

TypeError: 'DataFrame' objects are mutable, thus they cannot be hashed 

I can't figure out how to select the data I want. Any hints?

Answer

I think you can use:

df_grouped = df.groupby('groupID') 
#get std per groups 
df_grouped_std = df_grouped.std() 
print (df_grouped_std) 
#select by conditions 
selection = df_grouped_std[ (df_grouped_std['col1']<1) & (df_grouped_std['col2']<0.05)] 
print (selection) 

#select all rows of original df where groupID is same as index of 'selection' 
df_plot = df[df.groupID.isin(selection.index)] 
print (df_plot) 

Sample:

import pandas as pd

df = pd.DataFrame({'groupID':[1,1,1,2,3,3,2], 
        'col1':[5,3,6,4,7,8,9], 
        'col2':[7,8,9,1,2,3,8]}) 

print (df) 
   col1  col2  groupID
0     5     7        1
1     3     8        1
2     6     9        1
3     4     1        2
4     7     2        3
5     8     3        3
6     9     8        2
df_grouped = df.groupby('groupID') 
#get std per group
df_grouped_std = df_grouped.std() 
print (df_grouped_std) 
             col1      col2
groupID                    
1        1.527525  1.000000
2        3.535534  4.949747
3        0.707107  0.707107

#change conditions for testing only 
selection = df_grouped_std[ (df_grouped_std['col1']>1) & (df_grouped_std['col2']>3)] 
print (selection) 
             col1      col2
groupID                    
2        3.535534  4.949747

#select all rows of original df where groupID is in the index of 'selection'
df_plot = df[df.groupID.isin(selection.index)] 
print (df_plot) 
   col1  col2  groupID
3     4     1        2
6     9     8        2
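
The steps above can be wrapped into a small helper. This is only a sketch: the function name `select_low_std_groups` and its `limits` argument are hypothetical, and the limits used below are illustrative values chosen so that the sample data returns something, not the thresholds from the question.

```python
import pandas as pd

def select_low_std_groups(df, group_col, limits):
    """Keep rows of groups whose std is below every limit.

    `limits` maps a column name to its maximum allowed standard deviation.
    (Hypothetical helper, not part of pandas.)
    """
    stds = df.groupby(group_col)[list(limits)].std()
    # Start with every group selected, then apply each column's limit
    mask = pd.Series(True, index=stds.index)
    for col, limit in limits.items():
        mask &= stds[col] < limit
    # mask[mask].index holds the IDs of the groups that passed
    return df[df[group_col].isin(mask[mask].index)]

df = pd.DataFrame({'groupID': [1, 1, 1, 2, 3, 3, 2],
                   'col1': [5, 3, 6, 4, 7, 8, 9],
                   'col2': [7, 8, 9, 1, 2, 3, 8]})

# Illustrative limits: only group 3 has std < 1 in both columns
df_plot = select_low_std_groups(df, 'groupID', {'col1': 1, 'col2': 1})
print(df_plot)
```

The original attempt, `df_grouped[selection]`, fails because indexing a groupby object selects columns, so pandas tries to hash the DataFrame passed as the key. Going back to the ungrouped `df` and matching on `groupID` avoids that.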

Edit:

Another possible solution is to use filter:

print (df.groupby('groupID') 
     .filter(lambda x: (x.col1.std() > 1) & (x.col2.std() > 3))) 

   col1  col2  groupID
3     4     1        2
6     9     8        2
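
For the direction asked about in the question (keeping groups with a small standard deviation), the same `filter` call might look like the sketch below. The limits are again illustrative, and the extra single-row group 4 is an assumption added here to show one edge case: its standard deviation is NaN, the comparison is False, and the group is dropped.

```python
import pandas as pd

# Sample data from above plus a hypothetical single-row group (groupID 4)
df = pd.DataFrame({'groupID': [1, 1, 1, 2, 3, 3, 2, 4],
                   'col1': [5, 3, 6, 4, 7, 8, 9, 2],
                   'col2': [7, 8, 9, 1, 2, 3, 8, 5]})

# Keep only the groups whose std is below both (illustrative) limits
low_std = df.groupby('groupID').filter(
    lambda g: g['col1'].std() < 1 and g['col2'].std() < 1)
print(low_std)
```

Because the function passed to `filter` returns a single boolean per group, plain `and` works here; `&` is only needed when combining whole boolean Series.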

The solution using filter looks cleaner. Thanks! – marqram