Python groupwise winsorization and linear regression

2016-05-16

I'm new to Python and have to work with a financial dataset. Say I have a DataFrame like this:

TradingDate StockCode  Size  ILLIQ 
0 20050131 000001 13.980320 77.7522 
1 20050131 000002 14.071253 19.1471 
2 20050131 000004 10.805564 696.2428 
3 20050131 000005 11.910485 621.3723 
4 20050131 000006 11.631550 339.0952 
*** *** 

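What I want to do is run a groupwise OLS regression, where the grouping variable is TradingDate, the dependent variable is "Size", and the independent variable is "ILLIQ". I want to append the regression residuals back to the original DataFrame, say as a new column named "residuals". How should I go about this?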

The following code does not seem to work:

def regress(data,yvar,xvars): 
    Y = data[yvar] 
    X = data[xvars] 
    X['intercept']=1. 
    result = sm.OLS(Y,X).fit() 
    return result.resid() 

by_Date = df.groupby('TradingDate') 
by_Date.apply(regress,'ILLIQ',['Size']) 

Answers

You only need to access the residuals through .resid, which is an attribute rather than a method (see docs). A simplified illustration:

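import pandas as pd
import statsmodels.formula.api as sm
df = df.set_index('TradingDate')  # no inplace=True: that returns None, which would overwrite df
# .values assigns by position, so this assumes the rows are already ordered by TradingDate
df['residuals'] = df.groupby(level=0).apply(lambda x: pd.DataFrame(sm.ols(formula="Size ~ ILLIQ", data=x).fit().resid)).values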

      StockCode  Size  ILLIQ residuals 
TradingDate           
20050131    1 13.980320 77.7522 0.299278 
20050131    2 14.071253 19.1471 0.132318 
20050131    4 10.805564 696.2428 -0.153800 
20050131    5 11.910485 621.3723 0.621652 
20050131    6 11.631550 339.0952 -0.899448 
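
I tried your code and it raises the following error: ValueError: Length of values does not match length of index – Vincent

I think I moved 'TradingDate' to the index first; let me update the answer. – Stefan

Actually, I set the index to the TradingDate column when importing the data from the SQL DB: df = pd.read_sql_query(query, con, index_col=['TradingDate']) – Vincent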
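
The last comment says TradingDate was already the index when the data was loaded. In that case the index labels repeat (one per stock per date), so aligning a new column by label is ambiguous. A minimal sketch of one cautious option, assuming nothing beyond pandas: move TradingDate back to an ordinary column with reset_index so that every row has a unique label again. The rows below are made up for illustration and are not taken from the question's dataset.

```python
import pandas as pd

# Made-up rows for illustration; TradingDate is the index, as in the comment above.
df = pd.DataFrame(
    {
        "StockCode": ["000001", "000002", "000001", "000002"],
        "Size": [13.98, 14.07, 13.90, 14.10],
        "ILLIQ": [77.75, 19.15, 80.10, 21.30],
    },
    index=pd.Index([20050131, 20050131, 20050228, 20050228], name="TradingDate"),
)

# Dates repeat, so the index does not identify rows uniquely.
index_unique_before = df.index.is_unique

# reset_index turns TradingDate into a column and installs a fresh RangeIndex.
flat = df.reset_index()
index_unique_after = flat.index.is_unique

# Grouping by the column now works the same way as grouping by index level 0.
group_sizes = flat.groupby("TradingDate").size()
print(index_unique_before, index_unique_after)
print(group_sizes)
```

With a unique row index in place, per-group results can be assigned back by label instead of by position.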

Setup

from io import StringIO  # Python 3 location of StringIO
import pandas as pd

text = """TradingDate StockCode  Size  ILLIQ
0 20050131 000001 13.980320 77.7522
1 20050131 000002 14.071253 19.1471
2 20050131 000004 10.805564 696.2428
3 20050131 000005 11.910485 621.3723
4 20050131 000006 11.631550 339.0952"""

# sep=r'\s+' splits on any run of whitespace
df = pd.read_csv(StringIO(text), sep=r'\s+',
                 converters=dict(TradingDate=pd.to_datetime))

Solution

import statsmodels.api as sm

def regress(data, yvar, xvars):
    # I changed this a bit to ensure proper dimensional alignment
    Y = data[[yvar]].copy()
    X = data[xvars].copy()
    X['intercept'] = 1
    result = sm.OLS(Y, X).fit()
    # resid is an attribute not a method
    return result.resid

def append_resids(df, yvar, xvars):
    """New helper to return DataFrame object within groupby apply"""
    df = df.copy()
    df['residuals'] = regress(df, yvar, xvars)
    return df

# Size is the dependent variable and ILLIQ the regressor, as stated in the question
df.groupby('TradingDate').apply(lambda x: append_resids(x, 'Size', ['ILLIQ']))
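
For comparison, here is a self-contained sketch of the same idea that does not rely on the rows being sorted by date: each group's residuals are returned as a Series carrying the original row labels, and pandas aligns them on assignment. It assumes the frame has a unique row index, and it uses numpy.linalg.lstsq in place of statsmodels only so the example has no extra dependency; the residuals of an OLS fit with an intercept are the same quantity either way. The names ols_residuals and add_group_residuals are hypothetical, and the demo values are made up.

```python
import numpy as np
import pandas as pd

def ols_residuals(group, yvar, xvars):
    """Residuals of regressing yvar on xvars plus an intercept, indexed like group."""
    y = group[yvar].to_numpy(dtype=float)
    X = np.column_stack([np.ones(len(group)), group[xvars].to_numpy(dtype=float)])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares coefficients
    return pd.Series(y - X @ beta, index=group.index)

def add_group_residuals(df, by, yvar, xvars, name="residuals"):
    """Return a copy of df with per-group OLS residuals in a new column."""
    if not df.index.is_unique:
        raise ValueError("row index must be unique; call reset_index() first")
    pieces = [ols_residuals(g, yvar, xvars) for _, g in df.groupby(by)]
    out = df.copy()
    out[name] = pd.concat(pieces)  # aligned on row labels, not on position
    return out

# Made-up rows, deliberately not sorted by date.
demo = pd.DataFrame({
    "TradingDate": [20050228, 20050131, 20050131, 20050228, 20050131, 20050228],
    "Size": [12.0, 13.98, 14.07, 11.5, 10.81, 12.9],
    "ILLIQ": [300.0, 77.75, 19.15, 640.0, 696.24, 55.0],
})
result = add_group_residuals(demo, "TradingDate", "Size", ["ILLIQ"])
print(result)
```

Because the regression includes an intercept, the residuals within each TradingDate should sum to approximately zero, which gives a quick sanity check on the output.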