2017-04-10

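min_samples_split must be at least 2 or in (0, 1], got 1

I have defined a binary classifier as shown below. When I call it with the 'gbc' method (gradient boosting classifier), I get the error "min_samples_split must be at least 2 or in (0, 1], got 1" on the last line. featuresClasses is a DataFrame and featureLabels is a list of features.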

def Binary_classifier(method, featureLabels, featuresClasses):

    membershipIds = list(set(featuresClasses['membershipId'])) 
    n_membershipIds = len(membershipIds) 

    index_rand = np.random.permutation(n_membershipIds) 
    test_size = int(0.3 * n_membershipIds) 

    membershipIds_test = list(itemgetter(*index_rand[:test_size])(membershipIds)) 
    membershipIds_train = list(itemgetter(*index_rand[test_size+1:])(membershipIds)) 

    data_test = featuresClasses[featuresClasses['membershipId'].isin(membershipIds_test)] 
    data_train = featuresClasses[featuresClasses['membershipId'].isin(membershipIds_train)] 

    data_test = data_test[data_test['standing'].isin([0, 1])] 
    data_train = data_train[data_train['standing'].isin([0, 1])] 

    X_test = data_test[featureLabels].as_matrix() 
    y_test = data_test['standing'].values.astype(int) 

    X_train = data_train[featureLabels].as_matrix() 
    y_train = data_train['standing'].values.astype(int) 

    # -------------------------- Run classifier 
    print 'Binary classification by', method 

    if method == 'svm':
        classifier = svm.SVC(kernel='linear', probability=True)
        y_score = classifier.fit(X_train, y_train).decision_function(X_test)

    elif method == 'gbc':
        params = {'n_estimators': 200, 'max_depth': 3, 'min_samples_split': 1, 'learning_rate': 0.1, 'loss': 'deviance'}

        classifier = GradientBoostingClassifier(**params)
        y_score = classifier.fit(X_train, y_train).predict(X_test)

Answer


According to the GradientBoostingClassifier documentation:

min_samples_split : int, float, optional (default=2)

The minimum number of samples required to split an internal node: 

    If int, then consider min_samples_split as the minimum number. 
    If float, then min_samples_split is a percentage and ceil(min_samples_split * n_samples) 
       are the minimum number of samples for each split. 
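The rule quoted above can be sketched in plain Python. This is only an illustration of the documented behaviour, not scikit-learn's actual code; the helper name effective_min_samples_split and the sample count of 400 are assumptions made up for the example.

```python
import math


def effective_min_samples_split(value, n_samples):
    """Hypothetical helper mirroring the documented rule (not scikit-learn's code)."""
    if isinstance(value, int):
        # An int is taken literally as a number of samples, so 1 is rejected.
        if value < 2:
            raise ValueError(
                "min_samples_split must be at least 2 or in (0, 1], got %s" % value
            )
        return value
    # A float is read as a fraction of the training samples.
    if not 0.0 < value <= 1.0:
        raise ValueError(
            "min_samples_split must be at least 2 or in (0, 1], got %s" % value
        )
    return int(math.ceil(value * n_samples))


print(effective_min_samples_split(2, 400))     # int: used as-is
print(effective_min_samples_split(0.25, 400))  # float: ceil(0.25 * 400)
print(effective_min_samples_split(1.0, 400))   # float: ceil(1.0 * 400)
```

Under this reading the type decides the interpretation: the int 1 fails the "at least 2" check, while the float 1.0 falls inside the fractional range.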

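In your code you specify 'min_samples_split': 1. That is not a valid value: the smallest allowed int is 2. If you meant 1 as a float (that is, 1 * the number of samples, per the formula quoted above, so that every sample is required for a split), then specify it as 'min_samples_split': 1.0. When written as 1 it is treated as an integer, which is why the error occurs.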
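As a minimal sketch of the two possible corrections, here is the parameter dictionary from the question with only min_samples_split changed. The names base_params, params_int and params_float are made up for this example; all other values are copied unchanged from the question.

```python
# Parameters from the question, minus the offending key.
base_params = {'n_estimators': 200, 'max_depth': 3,
               'learning_rate': 0.1, 'loss': 'deviance'}

# Option A: the smallest valid integer (also the documented default).
params_int = dict(base_params, min_samples_split=2)

# Option B: a float, read as a fraction of the training samples.
params_float = dict(base_params, min_samples_split=1.0)

# Either dict is then passed exactly as in the question, e.g.:
#     classifier = GradientBoostingClassifier(**params_int)
print(params_int['min_samples_split'], params_float['min_samples_split'])
```

Going by the documented formula, 1.0 would require a node to hold every training sample before it may split, so the integer 2 is probably closer to what was intended.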

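The error message shows the range as (0, 1] rather than (0.0, 1.0], which is what causes the confusion. This has also been raised as an issue on scikit-learn's GitHub, and a fix has been implemented for the next release.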


Thanks, @Vivek Kumar – YNr
