2010-09-09 224 views

Answers

71

If you want to split the data set once into two parts, you can use numpy.random.shuffle, or numpy.random.permutation if you need to keep track of the indices:

import numpy 
# x is your dataset 
x = numpy.random.rand(100, 5) 
numpy.random.shuffle(x) 
training, test = x[:80,:], x[80:,:] 

import numpy 
# x is your dataset 
x = numpy.random.rand(100, 5) 
indices = numpy.random.permutation(x.shape[0]) 
training_idx, test_idx = indices[:80], indices[80:] 
training, test = x[training_idx,:], x[test_idx,:] 

There are many ways to repeatedly partition the same data set for cross validation. One strategy is to resample from the dataset, with repetition:

import numpy 
# x is your dataset 
x = numpy.random.rand(100, 5) 
training_idx = numpy.random.randint(x.shape[0], size=80) 
test_idx = numpy.random.randint(x.shape[0], size=20) 
training, test = x[training_idx,:], x[test_idx,:] 

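Finally, sklearn contains several cross validation methods (k-fold, leave-n-out, ...). It also includes more advanced "stratified sampling" methods that create a partition of the data that is balanced with respect to some features, for example to make sure that the proportion of positive and negative examples is the same in the training and test sets.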
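As a rough sketch of the stratified option mentioned above: this assumes a reasonably recent scikit-learn in which `train_test_split` accepts a `stratify` argument, and the toy `X` and `y` are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 20% of them labelled positive
X = np.random.rand(100, 5)
y = np.array([1] * 20 + [0] * 80)

# stratify=y asks for the same class ratio in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Fraction of positives in each subset
print(y_train.mean(), y_test.mean())
```

Without `stratify`, the fraction of positives in a small test set can drift noticeably from the fraction in the full data.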

+7

Thanks for these solutions. But couldn't the last method, using randint, give the same indices for both the test and training sets? – ggauravr 2013-11-05 22:21:28
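To illustrate the concern raised in this comment, here is a small sketch (the variable names are illustrative, not from the answer). `randint` draws every index independently, so the two sets can share rows, while slicing a single permutation cannot produce overlap.

```python
import numpy as np

n = 100

# randint samples with replacement: duplicates and train/test overlap are possible
training_idx = np.random.randint(n, size=80)
test_idx = np.random.randint(n, size=20)
overlap = np.intersect1d(training_idx, test_idx)
print("indices shared by both sets:", overlap.size)

# A permutation contains each index exactly once, so the two slices are disjoint
indices = np.random.permutation(n)
train_p, test_p = indices[:80], indices[80:]
print("shared after permutation:", np.intersect1d(train_p, test_p).size)
```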

0

I wrote a function for my own project to do this (it doesn't use numpy, though):

def partition(seq, chunks):
    """Splits the sequence into roughly equal-sized chunks and returns them as a list"""
    result = []
    for i in range(chunks):
        chunk = []
        # take every chunks-th element, starting at offset i
        for element in seq[i:len(seq):chunks]:
            chunk.append(element)
        result.append(chunk)
    return result

If you want the chunks to be randomly grouped, just shuffle the list before passing it in.
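A minimal sketch of that shuffle-then-partition idea, using only the standard library. The compact `partition` here is a stand-in that follows the same striding scheme as the function above, and the sample `data` is made up.

```python
import random

def partition(seq, chunks):
    """Stand-in with the same striding idea: element i goes to chunk i % chunks."""
    return [list(seq[i::chunks]) for i in range(chunks)]

data = list(range(10))
random.shuffle(data)        # shuffles the list in place
folds = partition(data, 3)  # 10 items over 3 chunks gives sizes 4, 3, 3
print(folds)
```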

24

There is another option that just entails using scikit-learn. As scikit's wiki describes, you can use the following instructions:

import numpy as np
from sklearn.model_selection import train_test_split

data, labels = np.arange(10).reshape((5, 2)), range(5) 

data_train, data_test, labels_train, labels_test = train_test_split(data, labels, test_size=0.20, random_state=42) 

This way the labels stay in sync with the data you are splitting into training and test sets.

4

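You may also consider a stratified division into training and testing sets. A stratified division also generates the training and testing sets randomly, but in such a way that the original class proportions are preserved. This makes the training and testing sets better reflect the properties of the original dataset.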

import numpy as np 

def get_train_test_inds(y,train_proportion=0.7): 
    '''Generates indices, making random stratified split into training set and testing sets 
    with proportions train_proportion and (1-train_proportion) of initial sample. 
    y is any iterable indicating classes of each observation in the sample. 
    Initial proportions of classes inside training and 
    testing sets are preserved (stratified sampling). 
    ''' 

    y=np.array(y) 
    train_inds = np.zeros(len(y),dtype=bool) 
    test_inds = np.zeros(len(y),dtype=bool) 
    values = np.unique(y) 
    for value in values:
        value_inds = np.nonzero(y==value)[0]
        np.random.shuffle(value_inds)
        n = int(train_proportion*len(value_inds))

        train_inds[value_inds[:n]]=True
        test_inds[value_inds[n:]]=True

    return train_inds,test_inds 

y = np.array([1,1,2,2,3,3]) 
train_inds,test_inds = get_train_test_inds(y,train_proportion=0.5) 
print(y[train_inds])
print(y[test_inds])

This code outputs:

[1 2 3] 
[1 2 3] 
+0

Thanks! The naming is a bit misleading: 'value_inds' are truly indices, but the output is not indices, only masks. – greenoldman 2017-09-02 12:53:50
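For readers who need integer indices rather than boolean masks, one possible conversion is sketched below. The masks are hypothetical examples shaped like the ones the function returns.

```python
import numpy as np

# Hypothetical boolean masks, shaped like the output of get_train_test_inds
train_mask = np.array([True, False, True, False, True, False])
test_mask = ~train_mask

# flatnonzero returns the positions at which the mask is True
train_idx = np.flatnonzero(train_mask)
test_idx = np.flatnonzero(test_mask)
print(train_idx, test_idx)  # [0 2 4] [1 3 5]
```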

18

Just a note. In case you want training, test, and validation sets, you can do this:

from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20

X = get_my_X() 
y = get_my_y() 
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3) 
x_test, x_val, y_test, y_val = train_test_split(x_test, y_test, test_size=0.5) 

These parameters give 70% of the data to training and 15% each to the test and validation sets. Hope this helps.
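The arithmetic can be checked on toy data: the first call holds out 30%, and the second splits that held-out part in half. In this sketch the arrays stand in for `get_my_X()` and `get_my_y()`, the intermediate names `x_rest` and `y_rest` are mine, and `random_state` is added only for reproducibility.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for get_my_X() and get_my_y(): 100 samples
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# Step 1: 70% training, 30% held out
x_train, x_rest, y_train, y_rest = train_test_split(X, y, test_size=0.3, random_state=0)
# Step 2: split the held-out 30% in half: 15% test, 15% validation
x_test, x_val, y_test, y_val = train_test_split(x_rest, y_rest, test_size=0.5, random_state=0)

print(len(x_train), len(x_test), len(x_val))  # 70 15 15
```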

+4

You should probably add this to your code: 'from sklearn.cross_validation import train_test_split', to make it clear which module you are using – Radix 2016-07-14 20:01:10

+0

Does this have to be random? – liang 2017-01-21 12:21:55

+0

That is, is it possible to split according to the given order of X and y? – liang 2017-01-21 12:27:59
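On the question of an ordered split: one simple possibility is plain slicing, sketched below with made-up data. Recent scikit-learn versions also accept `shuffle=False` in `train_test_split`, which should have a similar effect, though that is not shown here.

```python
import numpy as np

# Hypothetical ordered data, e.g. observations sorted by time
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Integer arithmetic avoids float rounding: the first 70% of rows go to training
split = len(X) * 7 // 10
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]
print(y_train, y_test)
```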

0

Here is code to split the data into n = 5 folds in a stratified manner:

# X = data array
# y = class labels
from sklearn.model_selection import StratifiedKFold  # sklearn.cross_validation was removed in scikit-learn 0.20
skf = StratifiedKFold(n_splits=5)
for train_index, test_index in skf.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
0

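Thanks to pberkes for the answer. I just modified it to avoid (1) sampling with replacement and (2) duplicate instances occurring in both the training and test sets: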

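import numpy as np
# X is your dataset; draw 80% of the row indices without replacement
training_idx = np.random.choice(X.shape[0], int(np.round(X.shape[0] * 0.8)), replace=False)
# or, equivalently (slice bounds must be integers, hence the int() cast):
training_idx = np.random.permutation(np.arange(X.shape[0]))[:int(np.round(X.shape[0] * 0.8))]
test_idx = np.setdiff1d(np.arange(0, X.shape[0]), training_idx)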
6

Since the sklearn.cross_validation module has been deprecated, you can use:

import numpy as np 
from sklearn.model_selection import train_test_split 
X, y = np.arange(10).reshape((5, 2)), range(5) 

X_trn, X_tst, y_trn, y_tst = train_test_split(X, y, test_size=0.2, random_state=42) 