Resampling in R

Suppose we have the following data:

set.seed(123) 
dat <- data.frame(var1=c(10,35,13,19,15,20,19), id=c(1,1,2,2,2,3,4)) 
(sampledIDs <- sample(min(dat$id):max(dat$id), size=3, replace=TRUE)) 
> [1] 2 4 2

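sampledIDs is a vector of IDs sampled (with replacement) from dat$id. I need code (that also works for larger data with more variables) that returns the rows belonging to each sampled ID, once for every time that ID was drawn.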

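The code dat[which(dat$id %in% sampledIDs), ] does not give me what I want, because in its result the subject with dat$id == 2 appears only once (I understand why this is the result, but I don't know how to get what I want). Can anyone help?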
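To make the difference concrete, here is a minimal sketch in base R. It assumes the draw shown above (2 4 2) and hard-codes it, since the output of sample() can differ between R versions; the names once, idx and twice are only illustrative.

```r
dat <- data.frame(var1 = c(10, 35, 13, 19, 15, 20, 19),
                  id   = c(1, 1, 2, 2, 2, 3, 4))
sampledIDs <- c(2, 4, 2)  # assumed draw, fixed here for reproducibility

# %in% only tests membership, so every matching row is returned exactly once
once <- dat[dat$id %in% sampledIDs, ]
nrow(once)   # 4 rows: id 2 (three rows) and id 4 (one row)

# Looking up each drawn ID separately keeps the repeated draw of id 2
idx <- unlist(lapply(sampledIDs, function(x) which(dat$id == x)))
twice <- dat[idx, ]
nrow(twice)  # 7 rows: id 2, id 4, then id 2 again
```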

Edit: Thank you for the answers. Here are the run times of all the answers (for those who are interested):

  test                                                                 replications  elapsed  relative  user.self
3 dat[unlist(lapply(sampledIDs, function(x) which(x == dat$id))), ]            1000     0.67     1.000       0.64
1 dat[which(sapply(sampledIDs, "==", dat$id), arr.ind = TRUE)[, 1], ]          1000     0.67     1.000       0.67
2 do.call(rbind, split(dat, dat$id)[as.character(sampledIDs)])                 1000     1.83     2.731       1.83
4 setkey(setDT(dat), id)[J(sampledIDs)]                                        1000     1.33     1.985       1.33

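Source

2014-11-03 Giuseppe

+1 for providing an analysis of the answers as well as a clearly stated question. – 2014-11-03 12:19:56

How big is your data? You mentioned that you have a large data set. – 2014-11-03 12:21:56

Not very large, but with more observations/variables than in the example: 'data.frame': 4454 obs. of 15 variables. – Giuseppe 2014-11-03 12:28:19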

This will probably be the fastest approach for a big data set, using data.table's binary search:

library(data.table) 
setkey(setDT(dat), id)[J(sampledIDs)] 
#    var1 id
# 1:   13  2
# 2:   19  2
# 3:   15  2
# 4:   19  4
# 5:   13  2
# 6:   19  2
# 7:   15  2
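As a side note, the same join can be sketched without setting a key, which leaves the row order of the table untouched. This assumes a reasonably recent data.table version that supports the on= argument, and it hard-codes the draw 2 4 2 from the question; dt and res are illustrative names, not part of the answer above.

```r
library(data.table)

dt <- data.table(var1 = c(10, 35, 13, 19, 15, 20, 19),
                 id   = c(1, 1, 2, 2, 2, 3, 4))
sampledIDs <- c(2, 4, 2)  # assumed draw, fixed here for reproducibility

# Join dt to the drawn IDs on the id column: one block of rows per draw,
# in the order of the draws, and dt itself is not reordered
res <- dt[.(id = sampledIDs), on = "id"]
res
```

Whether this is faster or slower than the keyed version on a given data set would have to be measured; it is shown only as an alternative spelling of the join.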

Edit: Here is a benchmark on a not-so-big data set (1e+05 rows) that shows which approach is the clear winner:

library(data.table) 
library(microbenchmark) 

set.seed(123) 
n <- 1e5 
dat <- data.frame(var1 = sample(seq_len(100), n, replace = TRUE), id = sample(seq_len(10), n, replace = TRUE)) 
(sampledIDs <- sample(min(dat$id) : max(dat$id), size = 3, replace = TRUE)) 
dat2 <- copy(dat) 

Sven1 <- function(dat) dat[unlist(lapply(sampledIDs, function(x) which(x == dat$id))), ] 
Sven2 <- function(dat) dat[which(sapply(sampledIDs, "==", dat$id), arr.ind = TRUE)[ , 1], ] 
flodel <- function(dat) do.call(rbind, split(dat, dat$id)[as.character(sampledIDs)]) 
David <- function(dat2) setkey(setDT(dat2), id)[J(sampledIDs)] 

Res <- microbenchmark(Sven1(dat), 
         Sven2(dat), 
         flodel(dat), 
         David(dat2)) 
Res 
# Unit: milliseconds
#         expr       min        lq    median        uq       max neval
#   Sven1(dat)  4.356151  4.817557  6.715533  7.313877 45.407768   100
#   Sven2(dat)  9.750984 12.385677 14.324671 16.655005 54.797096   100
#  flodel(dat) 36.097602 39.680006 42.236017 44.314981 82.261879   100
#  David(dat2)  1.813387  2.068749  2.154774  2.335442  8.665379   100

boxplot(Res)


If, for example, we want to sample not just 3 IDs but, say, 10, the benchmark becomes ridiculous:

(sampledIDs <- sample(min(dat$id) : max(dat$id), size = 10, replace = TRUE)) 
# [1]  7  6 10  9  5  9  5  3  7  3
# Unit: milliseconds
#         expr        min         lq     median         uq       max neval
#   Sven1(dat)  80.124502  89.141162  97.908365 104.111738 175.40919   100
#   Sven2(dat)  99.010410 127.797966 159.404395 170.751069 209.96887   100
#  flodel(dat) 129.722435 144.847505 157.737362 178.242103 232.41293   100
#  David(dat2)   2.431682   2.721038   2.855103   3.057796  19.60826   100


Source

2014-11-03 11:49:22

You could do:

do.call(rbind, split(dat, dat$id)[as.character(sampledIDs)])
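The one-liner above can be unpacked into its three steps. This is only an illustrative breakdown under the assumption that the draw is 2 4 2, as in the question; pieces, picked and res are hypothetical names.

```r
dat <- data.frame(var1 = c(10, 35, 13, 19, 15, 20, 19),
                  id   = c(1, 1, 2, 2, 2, 3, 4))
sampledIDs <- c(2, 4, 2)  # assumed draw, fixed here for reproducibility

# Step 1: one data frame per ID; the list names are the IDs as strings
pieces <- split(dat, dat$id)
names(pieces)

# Step 2: indexing by name selects pieces, and a repeated name repeats a piece
picked <- pieces[as.character(sampledIDs)]
names(picked)

# Step 3: stack the selected pieces back into a single data frame
res <- do.call(rbind, picked)
res$var1
```

The as.character() call matters: a numeric index would select list elements by position rather than by ID, which only coincides with the IDs when they happen to be 1, 2, 3, ... without gaps.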

Source

2014-11-03 11:28:26 flodel

One approach:

dat[unlist(lapply(sampledIDs, function(x) which(x == dat$id))), ] 
#     var1 id
# 3     13  2
# 4     19  2
# 5     15  2
# 7     19  4
# 3.1   13  2
# 4.1   19  2
# 5.1   15  2

Another approach:

dat[which(sapply(sampledIDs, "==", dat$id), arr.ind = TRUE)[ , 1], ]
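A short sketch of what this second expression does internally, again assuming the draw 2 4 2 from the question; m, hits and idx are illustrative names.

```r
dat <- data.frame(var1 = c(10, 35, 13, 19, 15, 20, 19),
                  id   = c(1, 1, 2, 2, 2, 3, 4))
sampledIDs <- c(2, 4, 2)  # assumed draw, fixed here for reproducibility

# Column k of m answers: does this row's id equal the k-th drawn ID?
m <- sapply(sampledIDs, "==", dat$id)
dim(m)  # one row per row of dat, one column per draw

# which(..., arr.ind = TRUE) lists the TRUE cells column by column,
# so the row indices come out grouped by draw
hits <- which(m, arr.ind = TRUE)
idx <- unname(hits[, 1])
res <- dat[idx, ]
```

Because the comparison matrix has nrow(dat) rows and one column per draw, its size grows with both the data and the number of sampled IDs.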

Source

2014-11-03 11:30:39
