How do I split/partition a dataset into training and test datasets for cross validation?

71

If you want to split the dataset once into two parts, you can use numpy.random.shuffle, or numpy.random.permutation if you need to keep track of the indices:

import numpy 
# x is your dataset 
x = numpy.random.rand(100, 5) 
numpy.random.shuffle(x) 
training, test = x[:80,:], x[80:,:]

or

import numpy 
# x is your dataset 
x = numpy.random.rand(100, 5) 
indices = numpy.random.permutation(x.shape[0]) 
training_idx, test_idx = indices[:80], indices[80:] 
training, test = x[training_idx,:], x[test_idx,:]
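
One reason to keep track of the indices is to apply the same split to a separate label array. A minimal sketch, assuming a hypothetical label array `y` with one entry per row of `x` (neither `y` nor the seeded generator is part of the original answer):

```python
import numpy as np

rng = np.random.default_rng(42)    # seeded only so the run is repeatable
x = rng.random((100, 5))
y = rng.integers(0, 2, size=100)   # hypothetical labels, one per row of x

# One permutation of the row indices drives both splits
indices = rng.permutation(x.shape[0])
training_idx, test_idx = indices[:80], indices[80:]

# Index data and labels with the same arrays so they stay aligned
x_train, x_test = x[training_idx], x[test_idx]
y_train, y_test = y[training_idx], y[test_idx]
```

Because both arrays are indexed with the same `training_idx` and `test_idx`, row i of `x_train` still corresponds to entry i of `y_train`.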

There are many ways to repeatedly partition the same data set for cross validation. One strategy is to resample from the dataset, with repetition:

import numpy 
# x is your dataset 
x = numpy.random.rand(100, 5) 
training_idx = numpy.random.randint(x.shape[0], size=80) 
test_idx = numpy.random.randint(x.shape[0], size=20) 
training, test = x[training_idx,:], x[test_idx,:]

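Finally, sklearn contains several cross validation methods (k-fold, leave-n-out, ...). It also includes more advanced "stratified sampling" methods that create a partition of the data that is balanced with respect to some features, for example to make sure that the training and test sets contain the same proportion of positive and negative examples.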
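
As a rough sketch of the k-fold and stratified k-fold splitters mentioned above, assuming a recent scikit-learn where they live in `sklearn.model_selection` (the toy arrays and the choice of 5 folds are illustrative, not from the original answer):

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy data (illustrative): 20 samples, 3 features, two balanced classes
X = np.arange(60).reshape(20, 3)
y = np.array([0, 1] * 10)

# Plain k-fold: every sample lands in exactly one test fold
kf = KFold(n_splits=5, shuffle=True, random_state=0)
kfold_test_sizes = [len(test_idx) for _, test_idx in kf.split(X)]

# Stratified k-fold: each test fold keeps the class ratio of y
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_class_counts = [
    np.bincount(y[test_idx], minlength=2).tolist()
    for _, test_idx in skf.split(X, y)
]

print(kfold_test_sizes)    # size of each test fold
print(fold_class_counts)   # per-class counts inside each stratified test fold
```

Each iteration of `split` yields a pair of index arrays, so the same loop can be used to train and evaluate a model once per fold.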

Source

2010-09-09 14:00:59 pberkes

+7

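Thanks for these solutions. But doesn't the last method, using randint, have a good chance of giving the same indices for both the test and training sets? – ggauravr 2013-11-05 22:21:28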
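
The concern in the comment above can be checked directly. The sketch below draws indices with replacement, as the randint snippet does, and counts how many indices end up in both sets. The seeded generator and the variable names `overlap`, `train_perm`, and `test_perm` are assumptions made for this illustration:

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded only so the run is repeatable
n = 100

# Independent draws with replacement, as in the randint-based snippet
training_idx = rng.integers(n, size=80)
test_idx = rng.integers(n, size=20)

# Indices that appear in both sets (possible because the draws are independent)
overlap = np.intersect1d(training_idx, test_idx)
print("distinct training indices:", len(np.unique(training_idx)))
print("indices shared by both sets:", len(overlap))

# For comparison, slicing a single permutation can never produce an overlap
perm = rng.permutation(n)
train_perm, test_perm = perm[:80], perm[80:]
print("shared after permutation split:", len(np.intersect1d(train_perm, test_perm)))
```

So the resampling approach behaves like a bootstrap: rows can repeat within a set and can appear in both sets, which is fine for some estimators but leaks test rows into training if a clean hold-out is wanted.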

0

I wrote a function for my own project to do this (it doesn't use numpy, though):

def partition(seq, chunks):
    """Splits the sequence into equal sized chunks and returns them as a list"""
    result = []
    for i in range(chunks):
        # Take every `chunks`-th element, starting at offset i
        chunk = []
        for element in seq[i:len(seq):chunks]:
            chunk.append(element)
        result.append(chunk)
    return result

If you want the chunks to be randomized, just shuffle the list before passing it in.
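
A short usage sketch of the shuffle-then-partition idea, using only the standard library. The `partition` here is a compact restatement of the function above, and the fixed seed and the choice of fold 0 as the test set are assumptions for the example:

```python
import random

def partition(seq, chunks):
    """Split seq into `chunks` roughly equal chunks by striding."""
    return [list(seq[i::chunks]) for i in range(chunks)]

data = list(range(10))
random.seed(0)         # fixed seed only so the run is repeatable
random.shuffle(data)   # shuffle in place before partitioning
folds = partition(data, 5)

# Use fold 0 as the test set and the remaining folds as training data
test = folds[0]
training = [item for fold in folds[1:] for item in fold]
print(test, training)
```

Rotating which fold plays the test role gives a simple k-fold cross validation loop.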

Source

2010-09-09 18:23:16 Colin

24

There's another option that just entails using scikit-learn. As scikit's wiki describes, you can just use the following instructions:

import numpy as np
from sklearn.model_selection import train_test_split

data, labels = np.arange(10).reshape((5, 2)), range(5)

data_train, data_test, labels_train, labels_test = train_test_split(data, labels, test_size=0.20, random_state=42)

This way the labels stay in sync with the data you are splitting into training and test sets.

Source

2013-08-31 05:45:30

4

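You may also consider a stratified division into training and testing sets. Stratified division also generates the training and testing sets randomly, but in such a way that the original class proportions are preserved. This makes the training and testing sets better reflect the properties of the original dataset.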

import numpy as np 

def get_train_test_inds(y,train_proportion=0.7): 
    '''Generates indices, making random stratified split into training set and testing sets 
    with proportions train_proportion and (1-train_proportion) of initial sample. 
    y is any iterable indicating classes of each observation in the sample. 
    Initial proportions of classes inside training and 
    testing sets are preserved (stratified sampling). 
    ''' 

    y=np.array(y) 
    train_inds = np.zeros(len(y),dtype=bool) 
    test_inds = np.zeros(len(y),dtype=bool) 
    values = np.unique(y) 
    for value in values:
        value_inds = np.nonzero(y==value)[0]
        np.random.shuffle(value_inds)
        n = int(train_proportion*len(value_inds))

        train_inds[value_inds[:n]]=True
        test_inds[value_inds[n:]]=True

    return train_inds,test_inds 

y = np.array([1,1,2,2,3,3]) 
train_inds,test_inds = get_train_test_inds(y,train_proportion=0.5) 
print(y[train_inds])
print(y[test_inds])

This code outputs:

[1 2 3] 
[1 2 3]
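
For comparison, scikit-learn's `train_test_split` accepts a `stratify` argument that performs a similar class-preserving split. This is a sketch of that alternative rather than part of the answer above, and the imbalanced toy labels are made up for the example:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(24).reshape(12, 2)
y = np.array([0] * 8 + [1] * 4)   # imbalanced toy labels, ratio 2:1

# stratify=y asks for the class ratio of y to be kept in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

print(np.bincount(y_train))   # class counts in the training part
print(np.bincount(y_test))    # class counts in the test part
```

With 12 samples and `test_size=0.25`, the 2:1 ratio is kept in both the 9 training rows and the 3 test rows.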

Source

2014-12-10 22:10:10 Apogentus

+0

Thank you! The naming is somewhat misleading: 'value_inds' are truly indices, but the outputs are not indices, only masks. – greenoldman 2017-09-02 12:53:50
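
As the comment above notes, the function returns boolean masks. If integer indices are needed, `np.flatnonzero` converts a mask to positions. A small sketch with a hand-written mask (the mask values are illustrative, not output of the function):

```python
import numpy as np

y = np.array([1, 1, 2, 2, 3, 3])
# Example boolean mask of the kind the function returns (values are illustrative)
train_mask = np.array([True, False, False, True, True, False])

train_idx = np.flatnonzero(train_mask)    # positions where the mask is True
test_idx = np.flatnonzero(~train_mask)    # the complement gives the test positions

print(train_idx)   # [0 3 4]
print(test_idx)    # [1 2 5]
```

Indexing with either form selects the same rows, so `y[train_mask]` and `y[train_idx]` are equal.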

18

Just a note. In case you want train, test, AND validation sets, you can do this:

from sklearn.model_selection import train_test_split  # replaces the removed sklearn.cross_validation module

X = get_my_X() 
y = get_my_y() 
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3) 
x_test, x_val, y_test, y_val = train_test_split(x_test, y_test, test_size=0.5)

These parameters give 70% to training and 15% each to the test and validation sets. Hope this helps.
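
The arithmetic behind the 70/15/15 split can be checked on toy data: the first call holds out 30% of the rows, and the second call halves that held-out part. The arrays, the `random_state` values, and the intermediate names `x_rest` and `y_rest` are assumptions made for this sketch:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(100, 2)   # 100 toy samples
y = np.arange(100)

# Step 1: 70% training, 30% held out
x_train, x_rest, y_train, y_rest = train_test_split(X, y, test_size=0.3, random_state=0)
# Step 2: split the held-out 30% in half, giving 15% test and 15% validation
x_test, x_val, y_test, y_val = train_test_split(x_rest, y_rest, test_size=0.5, random_state=0)

print(len(x_train), len(x_test), len(x_val))   # 70 15 15
```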

Source

2016-05-12 17:20:57 offwhitelotus

+4

You should probably add this to your code: 'from sklearn.cross_validation import train_test_split', to make it clear which module you are using – Radix 2016-07-14 20:01:10

+0

Does this have to be random? – liang 2017-01-21 12:21:55

+0

That is, is it possible to split according to the given order of X and y? – liang 2017-01-21 12:27:59
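
On the question above about splitting in the given order: `train_test_split` in `sklearn.model_selection` has a `shuffle` parameter, and passing `shuffle=False` keeps the rows in their original order. A minimal sketch on toy arrays (the data is illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# shuffle=False keeps the given order: the leading rows train, the trailing rows test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=False)

print(y_train)   # [0 1 2 3 4 5 6]
print(y_test)    # [7 8 9]
```

Plain slicing such as `X[:7]` and `X[7:]` gives the same result without scikit-learn.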

0

Here is code to split the data into n = 5 folds in a stratified manner:

# X = data array
# y = class labels
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5)
for train_index, test_index in skf.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

Source

2016-10-24 12:34:50 prashanth

0

Thanks to pberkes for the answer. I just modified it to avoid (1) sampling with replacement and (2) duplicate instances occurring in both training and testing:

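# Option 1: draw 80% of the row indices without replacement
training_idx = np.random.choice(X.shape[0], int(np.round(X.shape[0] * 0.8)), replace=False)
# Option 2 (alternative to option 1): take the first 80% of a random permutation
training_idx = np.random.permutation(np.arange(X.shape[0]))[:int(np.round(X.shape[0] * 0.8))]
# The test set is every index that was not picked for training
test_idx = np.setdiff1d(np.arange(0, X.shape[0]), training_idx)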

Source

2016-11-28 22:30:24 Zahran

6

Since the sklearn.cross_validation module was deprecated, you can use:

import numpy as np 
from sklearn.model_selection import train_test_split 
X, y = np.arange(10).reshape((5, 2)), range(5) 

X_trn, X_tst, y_trn, y_tst = train_test_split(X, y, test_size=0.2, random_state=42)

Source

2017-03-31 18:18:18
