Random sampling with conditions

My problem sits inside a loop. I have a large dataset (df), a subset of which looks like this:

ID  Site Species 
101  4 x 
101  4 y 
101  4 z 
102  6 x 
102  6 z 
102  6 a 
102  6 b 
103  6 a 
103  6 z 
103  6 c 
103  6 x 
103  6 y 
105  6 x 
105  6 y 
105  6 a 
105  6 z 
108  1 x 
108  1 a 
108  1 c 
108  1 z

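On each iteration of my loop (so, i), I want to randomly select all rows for an individual ID from each site. Crucially, there must be only one ID per site. I have a separate function that subsets my large dataset by number of sites, so if i = 1 then only one of the above sites (for example) would appear in the subset.

If i = 3, as in the example in this post, then I want all rows for 101, all rows for one of 102, 103 or 105, and all rows for 108.

I think something like ddply() with sample() should do this, but I can't get it to sample randomly.

Any suggestions would be much appreciated. Thanks

James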

Source

2014-01-15 user3122022

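Can you explain why 'i = 3' means those IDs should be selected, and why '108' is different from '102, 103, 105'? Could you show some code that illustrates what you are doing, along with some general setup? At the moment it is unclear what "i" is.

OK, sorry, here is more context. I am using specaccum() to bootstrap the generation of species accumulation curves across different numbers of remote cameras (ID column) and different numbers of sites (Site column). So I need curves for one site with one camera, two cameras, etc., then for two sites with one camera, two cameras, etc. My first loop, for (l in 1:length(sitelist)), subsets into l possible sites and generates a list of all possible cameras at those sites. My next nested loop, for (i in 1:l), is where I want to sample one camera, two cameras (from different sites), etc. – user3122022

108 differs from 102, 103 and 105 because it is at a different site (Site column). I want to randomly select one ID from each site. The dataset I provided shows the iteration for i = 3 (3 sites); other iterations (with more sites) contain more IDs, but I still only want one ID from each site, regardless of how large i is (how many sites there are). I hope this makes more sense. – user3122022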

How about this? I added a function to simulate what I think your data looks like.

#dependencies 
library(plyr) 

#function to make data (just to work with) 
make_data<-function(id){ 
    set.seed(id) 
    num_sites<-round(runif(1)*3,0)+1 
    num_sp<-round(runif(1)*7,0)+1 
    sites<-sample(1:10,num_sites,FALSE) 
    ldply(sites,function(x)data.frame(sites=x,sp=sample(letters[1:26],num_sp,FALSE))) 
} 

#make a data frame for example use (as per question) 
ids<-100:200 
df<-ldply(ids,function(x)data.frame(id=x,make_data(x))) 

################################################ 
# HERE'S THE CODE FOR THE ANSWER    # 
# use ddply to pick one randomly sampled id per site # 
filter<-ddply(df,.(sites),summarise,set=sample(id,1)) 
# then apply this filter to the original data frame, keeping all rows of the chosen id at each site 
ddply(filter,.(sites),.fun=function(x){return(df[df$sites==x$sites & df$id==x$set,])})
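
One caveat with sample(id, 1) inside summarise: id holds one entry per row, so an id with more rows is more likely to be drawn, and when a site has a single row, sample() treats the lone number n as the range 1:n. The following is a minimal base R sketch that avoids both, assuming the column names id and sites from the simulated data above. The helper names pick_one and sample_one_id_per_site are hypothetical, not part of plyr.

```r
# Hypothetical helper: draw one element by index, so a length-1 numeric
# vector is never expanded to 1:n by sample().
pick_one <- function(x) x[sample.int(length(x), 1)]

# Hypothetical helper: keep all rows of one randomly chosen id per site.
# Assumes columns named `id` and `sites`.
sample_one_id_per_site <- function(df) {
  # Unique (sites, id) pairs, so each id is equally likely within its site
  pairs <- unique(df[, c("sites", "id")])
  # One chosen id per site, named by site
  chosen <- tapply(pairs$id, pairs$sites, pick_one)
  # Look up the chosen id for the site of every row
  chosen_for_row <- as.vector(chosen[as.character(df$sites)])
  df[df$id == chosen_for_row, , drop = FALSE]
}

# Small example frame (made up for illustration)
toy <- data.frame(
  id    = c(101, 101, 102, 102, 103, 108),
  sites = c(4, 4, 6, 6, 6, 1),
  sp    = c("x", "y", "x", "z", "a", "c")
)
res <- sample_one_id_per_site(toy)
```

Because rows are matched on the chosen id of their own site, an id that occurs at several sites only contributes the rows from the site where it was drawn.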

Source

2014-01-15 08:27:59 Troy

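Thanks, both answers are great, but I used this one because it is only 2 lines of code. – user3122022

I think you can use unique to find all the possible ID/site combinations, and then sample from the unique subsets.

For example, let's create a dataset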

# Set the RNG seed for reproducibility 
set.seed(12345) 
ID <- rep(100:110, c(2, 6, 3, 1, 3, 8, 9, 2, 4, 5, 6)) 
site <- rep(1:6, c(8, 7, 8, 11, 4, 11)) 
species <- sample(letters[1:5], length(ID), replace=TRUE) 

df <- data.frame(ID=ID, Site=site, Species=species)

So df looks like this:

> head(df, 15) 
    ID Site Species 
1 100 1  d 
2 100 1  e 
3 101 1  d 
4 101 1  e 
5 101 1  c 
6 101 1  a 
7 101 1  b 
8 101 1  c 
9 102 2  d 
10 102 2  e 
11 102 2  a 
12 103 2  a 
13 104 2  d 
14 104 2  a 
15 104 2  b

Summarizing the data, we have:

Site 1 -> 100, 101 
Site 2 -> 102, 103, 104 
Site 3 -> 105 
Site 4 -> 106, 107 
Site 5 -> 108 
Site 6 -> 109, 110

Now, let's say I want to select from 3 sites

# The number of sites we want to sample 
num.sites <- 3 
# Find all the sites 
all.sites <- unique(df$Site) 
# Pick the sites. 
# You may also want to check that num.sites <= length(all.sites) 
chosen.sites <- sample(all.sites, num.sites)

In this case, we picked

> chosen.sites 
[1] 4 5 6

OK, now we find the IDs available at each site

# Now find the IDs in each of those sites 
# simplify=F is VERY important to ensure we get a list even if every 
# site has the same number of IDs 
IDs <- sapply(chosen.sites, function(s) 
    { 
    unique(df$ID[df$Site==s]) 
    }, simplify=FALSE)

This gives us

> IDs 
[[1]] 
[1] 106 107 

[[2]] 
[1] 108 

[[3]] 
[1] 109 110
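
To illustrate why simplify=FALSE matters here, the sketch below uses a made-up two-site data frame in which every site has the same number of IDs. The names toy and ids_of are hypothetical and exist only for this illustration. With the default simplification sapply collapses the result into a matrix, while lapply (which behaves like sapply with simplify=FALSE, apart from names) always returns a list.

```r
# Made-up data: both sites have exactly two IDs
toy <- data.frame(ID = c(1, 2, 3, 4), Site = c(10, 10, 20, 20))

# Hypothetical helper: the unique IDs recorded at site s
ids_of <- function(s) unique(toy$ID[toy$Site == s])

# Default sapply: equal-length results are bound into a 2 x 2 matrix
simplified <- sapply(c(10, 20), ids_of)

# lapply: always a list with one vector of IDs per site
as_list <- lapply(c(10, 20), ids_of)
```

The list form is the one the later per-site sampling step expects, since it iterates over one vector of IDs per site.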

Now pick one ID for each site

# NOTE: this assumes the same ID is not found in multiple sites 
# but it's easy to deal with the opposite case 
# Again, we return a list, because sapply does not seem 
# to play well with data frames... (try it!) 
res <- sapply(IDs, function(i) 
    { 
    chosen.ID <- sample(as.list(i), 1) 
    df[df$ID==chosen.ID,] 
    }, simplify=FALSE) 

# Finally convert the list to a data frame 
res <- do.call(rbind, res) 


> res 
    ID Site Species 
24 106 4  d 
25 106 4  d 
26 106 4  b 
27 106 4  d 
28 106 4  c 
29 106 4  b 
30 106 4  c 
31 106 4  d 
32 106 4  a 
35 108 5  b 
36 108 5  b 
37 108 5  e 
38 108 5  e 
44 110 6  d 
45 110 6  b 
46 110 6  b 
47 110 6  a 
48 110 6  a 
49 110 6  a

So, putting everything into a single function

pickSites <- function(df, num.sites) 
    { 
    all.sites <- unique(df$Site) 
    chosen.sites <- sample(all.sites, num.sites) 

    IDs <- sapply(chosen.sites, function(s) 
     { 
     unique(df$ID[df$Site==s]) 
     }, simplify=FALSE) 

    res <- sapply(IDs, function(i) 
     { 
     chosen.ID <- sample(as.list(i), 1) 
     df[df$ID==chosen.ID,] 
     }, simplify=FALSE) 

    # Return the combined data frame visibly 
    do.call(rbind, res) 
    }
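
The note in the code above mentions that the opposite case, where the same ID is found at several sites, is easy to handle. A minimal sketch of one way to do it follows, assuming the ID and Site column names used in this answer. pickSitesStrict is a hypothetical name, and the toy data is made up so that ID 101 appears at two sites. It also samples by index, which sidesteps sample() expanding a single number n to 1:n when only one site or one ID is available.

```r
# Hypothetical variant of pickSites(): matches on Site AND ID, so an ID
# that occurs at several sites only contributes rows from the chosen site.
pickSitesStrict <- function(df, num.sites) {
  all.sites <- unique(df$Site)
  stopifnot(num.sites <= length(all.sites))
  # Sample positions, not values, so a lone numeric site is not read as 1:n
  chosen.sites <- all.sites[sample.int(length(all.sites), num.sites)]
  res <- lapply(chosen.sites, function(s) {
    ids <- unique(df$ID[df$Site == s])
    chosen.ID <- ids[sample.int(length(ids), 1)]
    df[df$Site == s & df$ID == chosen.ID, , drop = FALSE]
  })
  do.call(rbind, res)
}

# Made-up data: ID 101 deliberately appears at sites 1 and 2
toy <- data.frame(
  ID      = c(101, 101, 102, 101, 103, 103),
  Site    = c(1, 1, 1, 2, 2, 3),
  Species = c("x", "y", "z", "x", "a", "b")
)

# One draw per number of sites, mirroring the outer loop in the question
draws <- lapply(1:3, function(n) pickSitesStrict(toy, n))
```

Each element of draws holds all rows of exactly one ID for each of the n sampled sites, which is the shape the question asks for on every loop iteration.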

Source

2014-01-15 08:38:14 nico
