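R caretEnsemble: CV fold length is incorrect
I am trying to ensemble models in R using the caretEnsemble package. Here is a minimal reproducible example. Please let me know if I should provide any additional information.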
library(caret)
library(caretEnsemble)
library(xgboost)
library(plyr)
# Load iris data and convert to a binary classification problem
data(iris)
data = iris
# classProbs=TRUE needs a factor target whose levels are valid R names
data$target = factor(ifelse(data$Species == "setosa", "yes", "no"))
data = subset(data, select = -c(Species))
# Train control for models. 5 fold CV
set.seed(123)
index = createFolds(data$target, k = 5, returnTrain = FALSE)
myControl = trainControl(method = 'cv', number = 5,
                         returnResamp = 'none', classProbs = TRUE,
                         returnData = FALSE, savePredictions = TRUE,
                         verboseIter = FALSE, allowParallel = TRUE,
                         summaryFunction = twoClassSummary,
                         index = index)
# Layer 1 models
model1 = train(target ~ Sepal.Length, data = data, trControl = myControl,
               method = "glm", family = "binomial", metric = "ROC")
model2 = train(target ~ Sepal.Length, data = data, trControl = myControl,
               method = "xgbTree", metric = "ROC",
               tuneGrid = expand.grid(nrounds = 50, max_depth = 1, eta = .05,
                                      gamma = .5, colsample_bytree = 1,
                                      min_child_weight = 1, subsample = 1))
# Stack models
all.models <- list(model1, model2)
names(all.models) <- c("glm","xgb")
class(all.models) <- "caretList"
stacked <- caretStack(all.models, method = "glm", family = "binomial", metric = "ROC",
                      trControl = trainControl(method = 'cv', number = 5,
                                               returnResamp = 'none', classProbs = TRUE,
                                               returnData = FALSE, savePredictions = TRUE,
                                               verboseIter = FALSE, allowParallel = TRUE,
                                               summaryFunction = twoClassSummary)
)
stacked
This is the main output I am concerned about.
A glm ensemble of 2 base models: glm, xgb
Ensemble results:
Generalized Linear Model
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 480, 480, 480, 480, 480
Resampling results:
  ROC        Sens  Spec
  0.9509688  0.92  0.835
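My problem is this: the base dataset has 150 rows, so each fold of the 5-fold CV should contain 30 rows. If you look at "index", you will see that this is the case. However, if you look at the results of "stacked", the sample size reported for each of the 5 folds of the meta/stacked model is 480. That totals 480 * 5 = 2400, which is 16 times larger than the original dataset. I have no idea why.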
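The fold arithmetic above can be checked with a minimal sketch in base R. It rests on one assumption that is not confirmed anywhere in this post: that caret treats the folds passed through index as training rows (this is how the trainControl documentation describes index), so each resample would train on 30 rows and hold out the other 120. The variable names are made up for illustration.

```r
# Sizes taken from the post: 150 rows, 5 folds.
n_rows  <- 150
n_folds <- 5

# Rows per fold when the data are split evenly.
fold_size <- n_rows / n_folds

# ASSUMPTION: the folds built with returnTrain = FALSE are used as
# training rows, so every resample holds out all remaining rows.
held_out_per_resample <- n_rows - fold_size

# With savePredictions = TRUE, hold-out predictions from all resamples
# are kept; under the assumption above that is this many rows.
saved_predictions <- n_folds * held_out_per_resample

# 5-fold CV on the meta-model trains on 4/5 of its input each time.
meta_train_size <- saved_predictions * (n_folds - 1) / n_folds

# Conversely, the reported size of 480 implies this many meta-model rows.
implied_meta_rows <- 480 * n_folds / (n_folds - 1)

print(c(fold_size = fold_size,
        saved_predictions = saved_predictions,
        meta_train_size = meta_train_size,
        implied_meta_rows = implied_meta_rows))
```

If the assumption holds, the meta-model would be fit on 600 saved hold-out predictions rather than 150, which would be consistent with the 480 reported per fold. It would also mean that 480 * 5 counts every row four times, so the stacked data would be 4 times, not 16 times, the size of the original dataset.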
My main questions are:
1) Is this number of observations in each fold correct?
2) If so, why does this happen?