2017-08-07 18 views

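Multiple OLS regression with statsmodels raises ValueError: zero-size array to reduction operation maximum which has no identity

I have a multiple regression problem on a dataset of about 7,500 data points with missing data (NaN) scattered across some rows and columns. Every row has at least one NaN value. Some rows contain only NaN values.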

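I am using the statsmodels OLS for the regression analysis. I am trying not to use scikit-learn for the OLS regression because (I may be wrong about this) I would have to impute the missing data in my dataset, which would distort the dataset to some extent.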

My dataset looks like this: KPI

Here is what I did (the target variable is KPI6, and the predictors are all the remaining variables):

from statsmodels.formula.api import ols
est2 = ols(formula = KPI.KPI6.name + ' ~ ' + ' + '.join(KPI.drop('KPI6', axis = 1).columns.tolist()), data = KPI).fit()

This returns a ValueError: zero-size array to reduction operation maximum which has no identity.

--------------------------------------------------------------------------- 
ValueError        Traceback (most recent call last) 
<ipython-input-207-b24ba316a452> in <module>() 
     3 #test = KPI.dropna(how='all') 
     4 #test = KPI.fillna(0) 
----> 5 est2 = ols(formula = KPI.KPI6.name + ' ~ ' + ' + '.join(KPI.drop('KPI6', axis = 1).columns.tolist()), data = KPI).fit() 
     6 print(est2.summary()) 

/Users/anhtran/anaconda/lib/python3.6/site-packages/statsmodels/base/model.py in from_formula(cls, formula, data, subset, drop_cols, *args, **kwargs) 
    172      'formula': formula, # attach formula for unpckling 
    173      'design_info': design_info}) 
--> 174   mod = cls(endog, exog, *args, **kwargs) 
    175   mod.formula = formula 
    176 

/Users/anhtran/anaconda/lib/python3.6/site-packages/statsmodels/regression/linear_model.py in __init__(self, endog, exog, missing, hasconst, **kwargs) 
    629     **kwargs): 
    630   super(OLS, self).__init__(endog, exog, missing=missing, 
--> 631         hasconst=hasconst, **kwargs) 
    632   if "weights" in self._init_keys: 
    633    self._init_keys.remove("weights") 

/Users/anhtran/anaconda/lib/python3.6/site-packages/statsmodels/regression/linear_model.py in __init__(self, endog, exog, weights, missing, hasconst, **kwargs) 
    524    weights = weights.squeeze() 
    525   super(WLS, self).__init__(endog, exog, missing=missing, 
--> 526         weights=weights, hasconst=hasconst, **kwargs) 
    527   nobs = self.exog.shape[0] 
    528   weights = self.weights 

/Users/anhtran/anaconda/lib/python3.6/site-packages/statsmodels/regression/linear_model.py in __init__(self, endog, exog, **kwargs) 
    93  """ 
    94  def __init__(self, endog, exog, **kwargs): 
---> 95   super(RegressionModel, self).__init__(endog, exog, **kwargs) 
    96   self._data_attr.extend(['pinv_wexog', 'wendog', 'wexog', 'weights']) 
    97 

/Users/anhtran/anaconda/lib/python3.6/site-packages/statsmodels/base/model.py in __init__(self, endog, exog, **kwargs) 
    210 
    211  def __init__(self, endog, exog=None, **kwargs): 
--> 212   super(LikelihoodModel, self).__init__(endog, exog, **kwargs) 
    213   self.initialize() 
    214 

/Users/anhtran/anaconda/lib/python3.6/site-packages/statsmodels/base/model.py in __init__(self, endog, exog, **kwargs) 
    61   hasconst = kwargs.pop('hasconst', None) 
    62   self.data = self._handle_data(endog, exog, missing, hasconst, 
---> 63          **kwargs) 
    64   self.k_constant = self.data.k_constant 
    65   self.exog = self.data.exog 

/Users/anhtran/anaconda/lib/python3.6/site-packages/statsmodels/base/model.py in _handle_data(self, endog, exog, missing, hasconst, **kwargs) 
    86 
    87  def _handle_data(self, endog, exog, missing, hasconst, **kwargs): 
---> 88   data = handle_data(endog, exog, missing, hasconst, **kwargs) 
    89   # kwargs arrays could have changed, easier to just attach here 
    90   for key in kwargs: 

/Users/anhtran/anaconda/lib/python3.6/site-packages/statsmodels/base/data.py in handle_data(endog, exog, missing, hasconst, **kwargs) 
    628  klass = handle_data_class_factory(endog, exog) 
    629  return klass(endog, exog=exog, missing=missing, hasconst=hasconst, 
--> 630     **kwargs) 

/Users/anhtran/anaconda/lib/python3.6/site-packages/statsmodels/base/data.py in __init__(self, endog, exog, missing, hasconst, **kwargs) 
    77 
    78   # this has side-effects, attaches k_constant and const_idx 
---> 79   self._handle_constant(hasconst) 
    80   self._check_integrity() 
    81   self._cache = resettable_cache() 

/Users/anhtran/anaconda/lib/python3.6/site-packages/statsmodels/base/data.py in _handle_constant(self, hasconst) 
    129    # detect where the constant is 
    130    check_implicit = False 
--> 131    const_idx = np.where(self.exog.ptp(axis=0) == 0)[0].squeeze() 
    132    self.k_constant = const_idx.size 
    133 

ValueError: zero-size array to reduction operation maximum which has no identity 

I suspected the error occurred because the target variable (KPI6) contains some NaNs, so I tried dropping all rows where KPI6 is NaN, but the problem persists:

KPI.dropna(subset = ['KPI6']) 

I also tried dropping all rows that contain only NaN values, but the problem persists:

KPI.dropna(how = 'all') 

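I combined the two steps, and the problem still persists. The only way to get rid of the error is to actually impute the missing data with something (e.g. 0, the mean, the median, etc.). However, I would like to avoid that if possible, because I want to run the OLS regression on the original data.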

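The OLS regression also works when I select only a few variables as predictors, but that is not what I am after. I want to include every variable other than KPI6 as a predictor.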

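Is there a way to solve this problem? I have been stressed about this all week. Any help is appreciated. I am not a professional Python coder, so I would be grateful if you could explain the problem (and suggest a solution) in layman's terms.

Thank you very much.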

Answers


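When you use formulas, the default handling of missing values is to drop every row that contains at least one NaN. If every row contains a NaN, there are no observations left. I think that is what the ValueError: zero-size array at the end of the traceback is telling you.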
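A quick way to check this on your own data is to count how many rows survive when every row with a NaN is dropped. The sketch below uses a small made-up stand-in for the KPI table (the values and the KPI1/KPI2 column names are assumptions, not your data). It also shows why a formula with only a few predictors can still work: fewer columns means fewer rows are thrown away.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the KPI table: every row has at least one NaN.
KPI = pd.DataFrame({
    "KPI1": [1.0, np.nan, 3.0, np.nan],
    "KPI2": [np.nan, 2.0, np.nan, np.nan],
    "KPI6": [5.0, 6.0, np.nan, np.nan],
})

# Rows that survive listwise deletion (no NaN in any column).
n_complete = len(KPI.dropna(how="any"))

# Share of missing values per column, largest first, to see which
# predictors cost the most rows.
missing_share = KPI.isna().mean().sort_values(ascending=False)

# Rows left when only a subset of the columns enters the formula.
n_subset = len(KPI[["KPI1", "KPI6"]].dropna(how="any"))

print("complete rows, all columns:", n_complete)
print("complete rows, KPI1 and KPI6 only:", n_subset)
print(missing_share)
```

If the first number is 0 on your real DataFrame, the model receives an empty design matrix, which is consistent with the error you see.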

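If you have enough data overall, you could try imputation with MICE, which iteratively imputes the missing values of each variable by estimating it from the other variables.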


Thanks, now I finally understand what was wrong. I will try your suggestion.
