2016-03-20
4

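How do I group by a range of values in pandas?

I have a DataFrame that I want to group by categorical variables and by a range of values. You can think of it as grouping rows with similar values (clusters?). For example: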

import pandas as pd

df = pd.DataFrame({'symbol': ['IP', 'IP', 'IP', 'IP', 'IP', 'IP', 'IP'],
                   'serie': ['A', 'B', 'A', 'B', 'A', 'B', 'B'],
                   'strike': [10, 10, 12, 13, 12, 13, 14],
                   'last': [1, 2, 2.5, 3, 4.5, 5, 6],
                   'price': [11, 11, 11, 11, 11, 11, 11],
                   'type': ['call', 'put', 'put', 'put', 'call', 'put', 'call']})

If I use

grouped = df.groupby(['symbol', 'serie', 'strike']) 

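part of my problem is solved, but I would also like strike values that are close to each other, such as 10 and 11, 12 and 13, and so on, to be combined into one group. Preferably within a percentage range.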

Looks like a duplicate of: http://stackoverflow.com/questions/21441259/pandas-groupby-值范围

Could you show the expected output?

You first need a well-defined criterion for clustering/grouping the strike values. – Goyo

Answers

1

I'm guessing the OP wants to group by the categorical variables and then by intervals of the numeric one. In that case, you can use np.digitize():

import numpy as np

smallest = np.min(df['strike'])
largest = np.max(df['strike'])
num_edges = 3
# np.digitize(input_array, bin_edges) returns, for each value, the index of the bin it falls into
ind = np.digitize(df['strike'], np.linspace(smallest, largest, num_edges))

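ind now holds the bin index of each strike. For the strike values

[10, 10, 12, 13, 12, 13, 14]

the bin indices are

array([1, 1, 2, 2, 2, 2, 3], dtype=int64)

with the bin edges

array([ 10., 12., 14.]) # == np.linspace(smallest, largest, num_edges)

Finally, group as before, but with this extra bin column: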

df['binned_strike'] = ind
# Iterating over a groupby yields (key, sub-DataFrame) pairs
for key, group in df.groupby(['symbol', 'serie', 'binned_strike']):
    print("group key")
    print(key)
    print("group content")
    print(group)
    print("=============")

This should print:

group key
('IP', 'A', 1)
group content
   last  price serie  strike symbol  type  binned_strike
0   1.0     11     A      10     IP  call              1
=============
group key
('IP', 'A', 2)
group content
   last  price serie  strike symbol  type  binned_strike
2   2.5     11     A      12     IP   put              2
4   4.5     11     A      12     IP  call              2
=============
group key
('IP', 'B', 1)
group content
   last  price serie  strike symbol  type  binned_strike
1   2.0     11     B      10     IP   put              1
=============
group key
('IP', 'B', 2)
group content
   last  price serie  strike symbol  type  binned_strike
3   3.0     11     B      13     IP   put              2
5   5.0     11     B      13     IP   put              2
=============
group key
('IP', 'B', 3)
group content
   last  price serie  strike symbol  type  binned_strike
6   6.0     11     B      14     IP  call              3
=============
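
The question also asks for bins defined by a percentage range. One possible way to get that with np.digitize() is to space the edges geometrically, so that each bin spans the same percentage of its lower edge. This is only a sketch: the 15% width (pct) and the column name pct_bin are assumptions chosen for illustration, not something the question specifies.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'symbol': ['IP'] * 7,
    'serie': ['A', 'B', 'A', 'B', 'A', 'B', 'B'],
    'strike': [10, 10, 12, 13, 12, 13, 14],
})

pct = 0.15  # assumed bin width: each bin spans 15% of its lower edge
smallest = df['strike'].min()
largest = df['strike'].max()

# Number of geometric steps needed to get past the largest strike
n_steps = int(np.ceil(np.log(largest / smallest) / np.log(1 + pct)))
# Edges grow by a factor of (1 + pct): roughly 10, 11.5, 13.2, 15.2 here
edges = smallest * (1 + pct) ** np.arange(n_steps + 1)

# Bin index of each strike (a bin includes its lower edge, not its upper edge)
df['pct_bin'] = np.digitize(df['strike'], edges)

sizes = df.groupby(['symbol', 'serie', 'pct_bin']).size()
print(sizes)
```

With these assumed settings, strikes 12 and 13 land in the same bin while 10 and 14 each get their own. A smaller pct gives narrower bins.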
2

Create bins for the strike data with pd.cut, then groupby() on that information:

import pandas as pd

# Create DataFrame
df = pd.DataFrame({
    'symbol': ['IP', 'IP', 'IP', 'IP', 'IP', 'IP', 'IP'],
    'serie': ['A', 'B', 'A', 'B', 'A', 'B', 'B'],
    'strike': [10, 10, 12, 13, 12, 13, 14],
    'last': [1, 2, 2.5, 3, 4.5, 5, 6],
    'price': [11, 11, 11, 11, 11, 11, 11],
    'type': ['call', 'put', 'put', 'put', 'call', 'put', 'call']
})
# Create bins (for example, three equal-width bins across the data)
df['strikebins'] = pd.cut(df['strike'], bins=3)

print('Binned DataFrame:')
print(df)
print()

# Group the DataFrame; observed=False keeps the empty bins in the result
grouped = df.groupby(['symbol', 'serie', 'strikebins'], observed=False)

# Do something with the groups, for example sum the numeric columns
gp_sum = grouped.sum(numeric_only=True)

print('Grouped Sum (for example):')
print(gp_sum)
print()

Binned DataFrame:
   last  price serie  strike symbol  type        strikebins
0   1.0     11     A      10     IP  call   (9.996, 11.333]
1   2.0     11     B      10     IP   put   (9.996, 11.333]
2   2.5     11     A      12     IP   put  (11.333, 12.667]
3   3.0     11     B      13     IP   put      (12.667, 14]
4   4.5     11     A      12     IP  call  (11.333, 12.667]
5   5.0     11     B      13     IP   put      (12.667, 14]
6   6.0     11     B      14     IP  call      (12.667, 14]

Grouped Sum (for example):
                               last  price  strike
symbol serie strikebins
IP     A     (9.996, 11.333]      1     11      10
             (11.333, 12.667]     7     22      24
             (12.667, 14]       NaN    NaN     NaN
       B     (9.996, 11.333]      2     11      10
             (11.333, 12.667]   NaN    NaN     NaN
             (12.667, 14]        14     33      40

You can drop() the strike column if you want, or replace strikebins with the mean of each range ...
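
As a rough sketch of that last suggestion: the bins produced by pd.cut are pandas Interval objects, and each one has a mid attribute giving the midpoint of its range. The column name strike_mid is an assumption made for this example, and the frame is trimmed to the columns needed here.

```python
import pandas as pd

df = pd.DataFrame({
    'serie': ['A', 'B', 'A', 'B', 'A', 'B', 'B'],
    'strike': [10, 10, 12, 13, 12, 13, 14],
    'last': [1, 2, 2.5, 3, 4.5, 5, 6],
})

# Three equal-width bins; each element is a pandas Interval
bins = pd.cut(df['strike'], bins=3)

# Replace each interval by its midpoint (a plain float column)
df['strike_mid'] = [interval.mid for interval in bins]

# Drop the original strike column and group on the midpoint instead
result = df.drop(columns='strike').groupby(['serie', 'strike_mid'])['last'].sum()
print(result)
```

Because strike_mid is an ordinary float column rather than a categorical one, combinations that contain no rows do not show up in the grouped result.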
