2016-07-07 39 views
I am trying to compute the coefficients of a multiple linear regression, using the statsmodels library to do the calculation. The problem is that with this code I get the error ValueError: endog and exog matrices are different sizes. I get this error because, in this example, y ends up with 4 elements while X is built as a list of 7 rows, each with 5 elements.

But what I don't understand is that the x set (not X) contains 4 lists (and y has 4 elements), where each of those lists consists of 7 variables. To me, x and y have the same number of elements.

How can I fix this error?

import numpy as np 
import statsmodels.api as sm 

def test_linear_regression(): 
    x = [[0.0, 1102249463.0, 44055788.0, 9.0, 2.0, 32000.0, 49222464.0], [0.0, 1102259506.0, 44049537.0, 9.0, 2.0, 32000.0, 49222464.0], [0.0, 1102249463.0, 44055788.0, 9.0, 2.0, 32000.0, 49222464.0], [0.0, 1102259506.0, 44049537.0, 10.0, 2.0, 32000.0, 49222464.0]] 

    y = [71.7554421425, 37.5205008984, 44.9945571423, 53.5441429615] 
    reg_m(y, x) 

def reg_m(y, x): 
    ones = np.ones(len(x[0])) 
    X = sm.add_constant(np.column_stack((x[0], ones))) 
    y.append(1) 
    for ele in x[1:]: 
        X = sm.add_constant(np.column_stack((ele, X))) 
    results = sm.OLS(y, X).fit() 
    return results 


if __name__ == "__main__": 
    test_linear_regression() 
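A quick shape check (a minimal sketch reusing the question's data) shows where the mismatch comes from: x viewed as an array already has one row per observation, but the code column-stacks the 7 elements of each inner list, so the design matrix ends up with 7 rows instead of 4:

```python
import numpy as np

# The asker's data: 4 samples, each with 7 features.
x = [[0.0, 1102249463.0, 44055788.0, 9.0, 2.0, 32000.0, 49222464.0],
     [0.0, 1102259506.0, 44049537.0, 9.0, 2.0, 32000.0, 49222464.0],
     [0.0, 1102249463.0, 44055788.0, 9.0, 2.0, 32000.0, 49222464.0],
     [0.0, 1102259506.0, 44049537.0, 10.0, 2.0, 32000.0, 49222464.0]]
y = [71.7554421425, 37.5205008984, 44.9945571423, 53.5441429615]

# As an array, x is already oriented correctly: rows are observations,
# columns are features, and the 4 rows match the 4 values in y.
print(np.asarray(x).shape)   # (4, 7)

# reg_m instead stacks the 7 elements of one inner list as a column,
# producing a matrix with 7 rows -- hence "endog and exog ... different sizes".
print(np.column_stack((x[0], np.ones(len(x[0])))).shape)   # (7, 2)
```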

I think you have a pretty good handle on the problem. Each list in 'x' (little x) has seven elements, and your 'y' is a single element per list in 'x'. In the end, 'X' (big X) has shape (7, 5), but your 'y' (which is a list) has a len of 5. Since a regression needs the same number of samples, predicting 5 elements ('y') from 7 samples will not work. What are you trying to do? Do you want the 7 elements in each 'x' list to predict 'y'? – Jarad

Answer

Assuming each list in x corresponds to one value of y:

x = [[0.0, 1102249463.0, 44055788.0, 9.0, 2.0, 32000.0, 49222464.0], 
    [0.0, 1102259506.0, 44049537.0, 9.0, 2.0, 32000.0, 49222464.0], 
    [0.0, 1102249463.0, 44055788.0, 9.0, 2.0, 32000.0, 49222464.0], 
    [0.0, 1102259506.0, 44049537.0, 10.0, 2.0, 32000.0, 49222464.0] 
    ] 

y = [71.7554421425, 37.5205008984, 44.9945571423, 53.5441429615] 

def reg_m(x, y): 
    x = np.array(x) 
    y = np.array(y) 

    # adds a constant of ones for y intercept 
    X = np.insert(x, 0, np.ones((1,)), axis=1) 

    # or, if you REALLY want to use add_constant, to add ones, use this 
    # X = sm.add_constant(x, has_constant='add') 

    return sm.OLS(y, X).fit() 

model = reg_m(x, y) 

To get a summary printout of the model, just call model.summary():

""" 
          OLS Regression Results        
============================================================================== 
Dep. Variable:      y R-squared:      0.450 
Model:       OLS Adj. R-squared:     -0.649 
Method:     Least Squares F-statistic:     0.4096 
Date:    Thu, 07 Jul 2016 Prob (F-statistic):    0.741 
Time:      21:50:12 Log-Likelihood:    -14.665 
No. Observations:     4 AIC:        35.33 
Df Residuals:      1 BIC:        33.49 
Df Model:       2           
Covariance Type:   nonrobust           
============================================================================== 
       coef std err   t  P>|t|  [95.0% Conf. Int.] 
------------------------------------------------------------------------------ 
const  -1.306e-07 2.18e-07  -0.599  0.657  -2.9e-06 2.64e-06 
x1   -3.086e-11 5.15e-11  -0.599  0.657  -6.86e-10 6.24e-10 
x2   -0.0001  0.000  -0.900  0.534  -0.002  0.002 
x3    0.0031  0.003  0.900  0.534  -0.041  0.047 
x4   16.0236  26.761  0.599  0.657  -324.006 356.053 
x5   8.321e-12 9.25e-12  0.900  0.534  -1.09e-10 1.26e-10 
x6   1.331e-07 1.48e-07  0.900  0.534  -1.75e-06 2.01e-06 
x7    0.0002  0.000  0.900  0.534  -0.003  0.003 
============================================================================== 
Omnibus:       nan Durbin-Watson:     1.500 
Prob(Omnibus):     nan Jarque-Bera (JB):    0.167 
Skew:       -0.000 Prob(JB):      0.920 
Kurtosis:      2.000 Cond. No.       inf 
============================================================================== 

Warnings: 
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified. 
[2] The input rank is higher than the number of observations. 
[3] The smallest eigenvalue is  0. This might indicate that there are 
strong multicollinearity problems or that the design matrix is singular. 
""" 

To emphasize: see warnings (2) and (3). The estimates are based on the generalized inverse, pinv, and the parameters are not identified. – user333700
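The rank deficiency this comment refers to is easy to verify with NumPy: the design matrix has 8 columns but only 4 rows, so its rank cannot exceed 4 and the coefficients are not uniquely determined. A minimal check, reusing the data above:

```python
import numpy as np

x = np.array([[0.0, 1102249463.0, 44055788.0, 9.0, 2.0, 32000.0, 49222464.0],
              [0.0, 1102259506.0, 44049537.0, 9.0, 2.0, 32000.0, 49222464.0],
              [0.0, 1102249463.0, 44055788.0, 9.0, 2.0, 32000.0, 49222464.0],
              [0.0, 1102259506.0, 44049537.0, 10.0, 2.0, 32000.0, 49222464.0]])

# Prepend the intercept column, as in reg_m.
X = np.insert(x, 0, 1.0, axis=1)

print(X.shape)                    # (4, 8): more columns than observations
print(np.linalg.matrix_rank(X))   # well below 8, so the design matrix is singular
```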