2017-08-29 79 views

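R caretEnsemble CV length incorrect

I am trying to ensemble models in R using the caretEnsemble package. Here is a minimal reproducible example. Please let me know if any additional information is needed.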

library(caret) 
library(caretEnsemble) 
library(xgboost) 
library(plyr) 


# Load iris data and convert to binary classification problem 
data(iris) 
data = iris 
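# caret needs a factor target whose levels are valid R names when classProbs=TRUE
data$target = factor(ifelse(data$Species == "setosa", "setosa", "other"), levels = c("setosa", "other"))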
data = subset(data,select = -c(Species)) 

# Train control for models. 5 fold CV 
set.seed(123) 
index=createFolds(data$target, k=5,returnTrain = FALSE) 
myControl = trainControl(method='cv', number=5, 
          returnResamp='none', classProbs=TRUE, 
          returnData=FALSE, savePredictions=TRUE, 
          verboseIter=FALSE, allowParallel=TRUE, 
          summaryFunction=twoClassSummary, 
          index=index) 

# Layer 1 models 
model1 = train(target ~ Sepal.Length, data=data, trControl=myControl,
               method="glm", family="binomial", metric="ROC")
model2 = train(target ~ Sepal.Length, data=data, trControl=myControl,
               method="xgbTree", metric="ROC",
               tuneGrid=expand.grid(nrounds=50, max_depth=1, eta=.05,
                                    gamma=.5, colsample_bytree=1,
                                    min_child_weight=1, subsample=1))

# Stack models 
all.models <- list(model1, model2) 
names(all.models) <- c("glm","xgb") 
class(all.models) <- "caretList" 

stacked <- caretStack(all.models, method="glm", family="binomial", metric="ROC",
                      trControl=trainControl(method='cv', number=5,
                                             returnResamp='none', classProbs=TRUE,
                                             returnData=FALSE, savePredictions=TRUE,
                                             verboseIter=FALSE, allowParallel=TRUE,
                                             summaryFunction=twoClassSummary))

stacked 

This is the main output that concerns me.

A glm ensemble of 2 base models: glm, xgb 

Ensemble results: 
Generalized Linear Model 

No pre-processing 
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 480, 480, 480, 480, 480 
Resampling results: 

  ROC        Sens  Spec
  0.9509688  0.92  0.835

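My concern is this: the base data set has 150 rows, so each fold of the 5-fold CV should contain 30 rows. If you look at "index", you will see that this is the case. However, if you look at the results of "stacked", the sample size reported for each of the 5 folds of the meta/stacked model is 480. That comes to 480 * 5 = 2400 in total, 16 times larger than the original data set. I don't know why this is.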
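A quick way to check the fold sizes, as a rough sketch: it rebuilds the folds on their own with createFolds and counts the rows in each one. The name fold_sizes is only for illustration, and the target is written here as a factor, which is an assumption about how the binary target is encoded.

```r
library(caret)

data(iris)
# Binary target: setosa versus everything else
target <- factor(ifelse(iris$Species == "setosa", "setosa", "other"))

set.seed(123)
# returnTrain = FALSE returns the rows of each fold, not the training rows
index <- createFolds(target, k = 5, returnTrain = FALSE)

# Number of rows in each fold, and the total across all folds
fold_sizes <- sapply(index, length)
print(fold_sizes)
print(sum(fold_sizes))
```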

My main questions are:
1) Is this number of observations in each fold correct?
2) If so, why does this happen?

Answer


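Figured out the problem, in case anyone else stumbles on this. The index I created lists the out-of-sample (held-out) rows, whereas index= in trainControl expects the rows used for training, so the code should be: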

myControl = trainControl(method='cv', number=5, 
          returnResamp='none', classProbs=TRUE, 
          returnData=FALSE, savePredictions=TRUE, 
          verboseIter=FALSE, allowParallel=TRUE, 
          summaryFunction=twoClassSummary, 
          indexOut=index) 

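Instead of index= it should be indexOut=. With index=, each model was trained on 20% of the data and predicted on the other 80%, which explains the overlap: every row was predicted in four of the five resamples, giving 4 * 150 = 600 stacked predictions, and a 5-fold CV over 600 rows trains on 480 rows per fold, which matches the output above. Now that the option is set correctly, there is no overlap.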
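The arithmetic behind the 480 can be sketched in base R. This is only an illustration under stated assumptions: the folds are drawn with a plain random split instead of caret's own sampling, and every name in it (fold_id, folds, n_pred_wrong and so on) is made up for the example.

```r
set.seed(123)
n <- 150  # rows in the data set
k <- 5    # number of folds

# Five disjoint folds of 30 rows each (plain random split, not caret's sampling)
fold_id <- sample(rep(seq_len(k), length.out = n))
folds <- split(seq_len(n), fold_id)

# Folds passed as index=: train on each fold, predict on the other 120 rows
heldout_wrong <- lapply(folds, function(f) setdiff(seq_len(n), f))
n_pred_wrong <- sum(lengths(heldout_wrong))  # 5 * 120 held-out predictions

# Folds passed as indexOut=: predict only on each fold's own 30 rows
n_pred_right <- sum(lengths(folds))          # 5 * 30 held-out predictions

# Training rows per fold when the meta-model is itself 5-fold cross-validated
meta_train_wrong <- n_pred_wrong * (k - 1) / k
meta_train_right <- n_pred_right * (k - 1) / k

print(c(n_pred_wrong, meta_train_wrong))
print(c(n_pred_right, meta_train_right))
```

Another route that should give the same split, as far as I can tell, is to build the training rows with createFolds(..., returnTrain = TRUE) and pass those as index=, leaving caret to hold out the complement of each training set.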