
I'm quite new to programming, and I've jumped into Python to get familiar with data analysis and machine learning. [Statsmodels]: How can I get statsmodels to return the p-values of an OLS object?

I'm following a tutorial on backward elimination for multiple linear regression. Here is the code so far:

# Importing the libraries 
import numpy as np 
import matplotlib.pyplot as plt 
import pandas as pd 

# Importing the dataset 
dataset = pd.read_csv('50_Startups.csv') 
X = dataset.iloc[:, :-1].values 
y = dataset.iloc[:, 4].values 

#Taking care of missing data 
#np.set_printoptions(threshold=100) 
from sklearn.preprocessing import Imputer 
imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0) 
imputer = imputer.fit(X[:, 1:3]) 
X[:, 1:3] = imputer.transform(X[:, 1:3]) 

#Encoding categorical data 
from sklearn.preprocessing import LabelEncoder, OneHotEncoder 
labelEncoder_X = LabelEncoder() 
X[:, 3] = labelEncoder_X.fit_transform(X[:, 3]) 
onehotencoder = OneHotEncoder(categorical_features = [3]) 
X = onehotencoder.fit_transform(X).toarray() 

#Avoid the Dummy Variables Trap 
X = X[:, 1:] 

#Splitting data in train and test 
from sklearn.model_selection import train_test_split 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0) 

#Fitting multiple Linear Regression to Training set 
from sklearn.linear_model import LinearRegression 
regressor = LinearRegression() 
regressor.fit(X_train, y_train) 

#Predict Test set 
y_pred = regressor.predict(X_test) 

#Building the optimal model using Backward Elimination 
import statsmodels.api as sm 
a, b = X.shape 
# Prepend a column of ones as the intercept term for the statsmodels OLS fit 
X = np.append(arr = np.ones((a, 1)).astype(int), values = X, axis = 1) 
print(X.shape) 

X_optimal = X[:,[0,1,2,3,4,5]] 
regressor_OLS = sm.OLS(endog = y, exog = X_optimal).fit() 
regressor_OLS.summary() 
X_optimal = X[:,[0,1,3,4,5]] 
regressor_OLS = sm.OLS(endog = y, exog = X_optimal).fit() 
regressor_OLS.summary() 
X_optimal = X[:,[0,3,4,5]] 
regressor_OLS = sm.OLS(endog = y, exog = X_optimal).fit() 
regressor_OLS.summary() 
X_optimal = X[:,[0,3,5]] 
regressor_OLS = sm.OLS(endog = y, exog = X_optimal).fit() 
regressor_OLS.summary() 
X_optimal = X[:,[0,3]] 
regressor_OLS = sm.OLS(endog = y, exog = X_optimal).fit() 
regressor_OLS.summary() 

Now, the way the elimination is carried out seems really manual to me, and I would like to automate it. To do that, I'd like to know whether there is a way to get the p-values of the regressor returned to me (for example, whether there is a method in statsmodels that implements this). That way, I figure I could loop over the features of the X_optimal array, check whether a p-value is greater than my significance level (SL), and eliminate that feature.

Thanks!

Answer


Had the same problem.

You can access the p-values through

regressor_OLS.pvalues 

They are stored in an array of float64s in scientific notation. I'm a bit new to Python myself, and I'm sure there are cleaner, more elegant solutions, but here is mine:

sigLevel = 0.05 

X_opt = X[:,[0,1,2,3,4,5]] 
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit() 
regressor_OLS.summary() 
pVals = regressor_OLS.pvalues 

# Drop the feature with the highest p-value until all are below sigLevel 
while np.max(pVals) > sigLevel: 
    droppedDimIndex = np.argmax(regressor_OLS.pvalues) 
    keptDims = list(range(len(X_opt[0]))) 
    keptDims.pop(droppedDimIndex) 
    print("pval of dim removed: " + str(np.argmax(pVals))) 
    X_opt = X_opt[:,keptDims] 
    regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit() 
    pVals = regressor_OLS.pvalues 
    print(str(len(pVals)-1) + " dimensions remaining...") 
    print(pVals) 

regressor_OLS.summary()
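
If you want the whole elimination in one reusable piece, here is a minimal sketch of the same idea as a function. The name backward_elimination and its signature are my own, not part of statsmodels, and like the loop above it can also drop the intercept column if that ever becomes the least significant term:

import numpy as np 
import statsmodels.api as sm 

def backward_elimination(X, y, sig_level=0.05): 
    # Start with every column, then repeatedly drop the column with 
    # the highest p-value until all remaining p-values are <= sig_level. 
    kept = list(range(X.shape[1])) 
    while kept: 
        model = sm.OLS(endog = y, exog = X[:, kept]).fit() 
        worst = int(np.argmax(model.pvalues)) 
        if model.pvalues[worst] <= sig_level: 
            break 
        kept.pop(worst) 
    return model, kept 

final_model, kept_columns = backward_elimination(X, y) 
final_model.summary() 

Called on the X that already has the intercept column appended, this should reproduce the sequence of fits from the question; returning the surviving column indices alongside the fitted model lets you see which of the original features remain.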