2016-10-06 34 views

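GridSearch with SVM produces IndexError

I am building a classifier with an SVM and want to run a grid search to help find the best model automatically. Here is the code: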

from sklearn.svm import SVC 
from sklearn.model_selection import train_test_split 
from sklearn.model_selection import GridSearchCV 
from sklearn.multiclass import OneVsRestClassifier 

X.shape  # (22343, 323) 
y.shape  # (22343, 1) 

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
) 

tuned_parameters = [ 
    { 
    'estimator__kernel': ['rbf'], 
    'estimator__gamma': [1e-3, 1e-4], 
    'estimator__C': [1, 10, 100, 1000] 
    }, 
    { 
    'estimator__kernel': ['linear'], 
    'estimator__C': [1, 10, 100, 1000] 
    } 
] 

model_to_set = OneVsRestClassifier(SVC(), n_jobs=-1) 
clf = GridSearchCV(model_to_set, tuned_parameters) 
clf.fit(X_train, y_train) 

I get the following error message (this is not the whole stack trace, just the last 3 calls):

---------------------------------------------------- 
/anaconda/lib/python3.5/site-packages/sklearn/model_selection/_split.py in split(self, X, y, groups) 
    88   X, y, groups = indexable(X, y, groups) 
    89   indices = np.arange(_num_samples(X)) 
---> 90   for test_index in self._iter_test_masks(X, y, groups): 
    91    train_index = indices[np.logical_not(test_index)] 
    92    test_index = indices[test_index] 

/anaconda/lib/python3.5/site-packages/sklearn/model_selection/_split.py in _iter_test_masks(self, X, y, groups) 
    606 
    607  def _iter_test_masks(self, X, y=None, groups=None): 
--> 608   test_folds = self._make_test_folds(X, y) 
    609   for i in range(self.n_splits): 
    610    yield test_folds == i 

/anaconda/lib/python3.5/site-packages/sklearn/model_selection/_split.py in _make_test_folds(self, X, y, groups) 
    593   for test_fold_indices, per_cls_splits in enumerate(zip(*per_cls_cvs)): 
    594    for cls, (_, test_split) in zip(unique_y, per_cls_splits): 
--> 595     cls_test_folds = test_folds[y == cls] 
    596     # the test split can be too big because we used 
    597     # KFold(...).split(X[:max(c, n_splits)]) when data is not 100% 

IndexError: too many indices for array 

In addition, when I reshape the array so that y has shape (22343,), the grid search never finishes, even with tuned_parameters set to default values.
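The reshape mentioned above is consistent with the last frame of the traceback, `test_folds[y == cls]`. The following is a minimal sketch of that indexing step using NumPy only. The small arrays are hypothetical stand-ins for the real data, and it assumes `test_folds` is one-dimensional with one entry per sample.

```python
import numpy as np

# Hypothetical stand-in for the labels in the question: a column vector.
y = np.arange(6).reshape(-1, 1) % 2        # shape (6, 1)

# Assumed to mirror the traceback: a 1-D array with one entry per sample.
test_folds = np.zeros(len(y), dtype=int)   # shape (6,)

# A 2-D boolean mask cannot index a 1-D array, so this raises IndexError.
try:
    test_folds[y == 0]
    raised = False
except IndexError:
    raised = True

# Flattening the labels to shape (n_samples,) makes the mask 1-D,
# and the same indexing expression then works.
y_flat = y.ravel()
selected = test_folds[y_flat == 0]
```

With the real data, the equivalent step would be to pass `y.ravel()` (or `y_train.ravel()`) to `fit`, which gives the (22343,) shape described above.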

Here are the versions of all the packages, in case that helps:

Python: 3.5.2

scikit-learn: 0.18

pandas: 0.19.0


Have you tried reducing the number of samples and running it? – MMF

Answer


There does not seem to be an error in your implementation.

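However, as the sklearn documentation mentions, "The fit time complexity is more than quadratic with the number of samples which makes it hard to scale to dataset with more than a couple of 10000 samples." See documentation here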

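In your case you have 22343 samples, which can lead to computational or memory problems. That is why the default CV takes so long. Try reducing your training set to 10000 samples or fewer.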
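One way to cap the training set at 10000 rows is sketched below. The helper name `subsample` and the synthetic arrays are hypothetical; the array size assumes the 60% training split from the question (about 13405 of the 22343 rows). Sampling here is uniformly random and not stratified, so rare classes may end up under-represented.

```python
import numpy as np

def subsample(X, y, max_samples=10000, seed=0):
    """Return at most max_samples randomly chosen rows of X and y."""
    n = X.shape[0]
    if n <= max_samples:
        return X, y
    rng = np.random.RandomState(seed)
    # Draw row indices without replacement so no sample is repeated.
    idx = rng.choice(n, size=max_samples, replace=False)
    return X[idx], y[idx]

# Synthetic stand-ins for X_train and y_train (not the real data).
X_train = np.zeros((13405, 4))
y_train = np.arange(13405) % 3

X_small, y_small = subsample(X_train, y_train)
# X_small and y_small can then be passed to clf.fit
# in place of X_train and y_train.
```

If the reduced search finishes in reasonable time, the sample cap can be raised step by step to see how the fit time grows.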