Merging data frames by common groups

2016-12-07
3

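I have two datasets of lobster egg size data collected by different samplers, which will be used to assess measurement variability. Each sampler measured ~50 eggs per lobster from multiple lobsters. Occasionally, however, a lobster was processed by sampler 1 but not by sampler 2, or vice versa. I would like to combine the data from both samplers into a new dataset, but drop all data for lobsters that were processed by only one sampler. I have played with semi_join and intersect from dplyr, but I need the match to run in both directions (dataset 1 to 2 and 2 to 1). I was able to create a new dataset that binds the rows from both samplers, but it is not clear to me how to remove all lobster IDs that appear in only one of the two datasets.

Here is a simplified version of my data, in which multiple egg area measurements are taken from multiple lobsters, but the sampling does not always overlap (i.e., some lobsters' eggs were measured by only one sampler and not the other):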

install.packages("dplyr")  # package name must be quoted; only needed once
library(dplyr)

sampler1 <- data.frame(LobsterID=c("Lobster1","Lobster1","Lobster2", 
            "Lobster2","Lobster2","Lobster2", 
            "Lobster2","Lobster3","Lobster3","Lobster3"), 
         Area=c(.4,.35,1.1,1.04,1.14,1.1,1.05,1.7,1.63,1.8), 
         Sampler=c(rep("Sampler1", 10))) 
sampler2 <- data.frame(LobsterID=c("Lobster1","Lobster1","Lobster1", 
            "Lobster1","Lobster1","Lobster2", 
            "Lobster2","Lobster2","Lobster4","Lobster4"), 
         Area=c(.41,.44,.47,.43,.38,1.14,1.11,1.09,1.41,1.4), 
         Sampler=c(rep("Sampler2", 10))) 

combined <- bind_rows(sampler1, sampler2) 

desiredresult <- combined[-c(8, 9, 10, 19, 20), ] 

The last line of the script produces the desired result for the mock data. I was hoping to limit myself to base R or dplyr.
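For reference, the two-directional semi_join idea mentioned above can be sketched as follows. This is only an illustration under the assumption that `LobsterID` is the sole matching key; the names `kept1`, `kept2` and `result` are made up for the example, and the mock data is rebuilt with `rep()` so the block runs on its own.

```r
library(dplyr)

# Mock data equivalent to the example above (rebuilt compactly with rep())
sampler1 <- data.frame(
  LobsterID = rep(c("Lobster1", "Lobster2", "Lobster3"), times = c(2, 5, 3)),
  Area = c(.4, .35, 1.1, 1.04, 1.14, 1.1, 1.05, 1.7, 1.63, 1.8),
  Sampler = "Sampler1",
  stringsAsFactors = FALSE
)
sampler2 <- data.frame(
  LobsterID = rep(c("Lobster1", "Lobster2", "Lobster4"), times = c(5, 3, 2)),
  Area = c(.41, .44, .47, .43, .38, 1.14, 1.11, 1.09, 1.41, 1.4),
  Sampler = "Sampler2",
  stringsAsFactors = FALSE
)

# Keep rows of sampler1 whose LobsterID also occurs in sampler2 ...
kept1 <- semi_join(sampler1, sampler2, by = "LobsterID")
# ... and rows of sampler2 whose LobsterID also occurs in sampler1
kept2 <- semi_join(sampler2, sampler1, by = "LobsterID")

# Stack the two filtered halves
result <- bind_rows(kept1, kept2)
```

Running semi_join once in each direction and then binding the halves drops every lobster that only one sampler processed.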

Answers

6
sampler1 %>% rbind(sampler2) %>% filter(LobsterID %in% intersect(sampler1$LobsterID, sampler2$LobsterID)) 

Worked great for subsetting the rows! Thanks! – user24537

2
combined <- bind_rows(sampler1, sampler2) 


Lobsters.2.sample <- as.character(unique(sampler1$LobsterID)[unique(sampler1$LobsterID) %in% unique(sampler2$LobsterID)]) 

combined <- combined[combined$LobsterID %in% Lobsters.2.sample,] 
1

Bind the rows, group by LobsterID, and filter on the number of distinct samplers in each group:

sampler1 %>% bind_rows(sampler2) %>% 
    group_by(LobsterID) %>% 
    filter(n_distinct(Sampler) == 2) 

## Source: local data frame [15 x 3] 
## Groups: LobsterID [2] 
## 
## LobsterID Area Sampler 
##  <chr> <dbl> <chr> 
## 1 Lobster1 0.40 Sampler1 
## 2 Lobster1 0.35 Sampler1 
## 3 Lobster2 1.10 Sampler1 
## 4 Lobster2 1.04 Sampler1 
## 5 Lobster2 1.14 Sampler1 
## 6 Lobster2 1.10 Sampler1 
## 7 Lobster2 1.05 Sampler1 
## 8 Lobster1 0.41 Sampler2 
## 9 Lobster1 0.44 Sampler2 
## 10 Lobster1 0.47 Sampler2 
## 11 Lobster1 0.43 Sampler2 
## 12 Lobster1 0.38 Sampler2 
## 13 Lobster2 1.14 Sampler2 
## 14 Lobster2 1.11 Sampler2 
## 15 Lobster2 1.09 Sampler2 
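The same counting logic can be sketched in base R, which may help to see what the grouped filter is doing. This is a rough equivalent rather than the answer's own code; `n_samplers`, `keep_ids` and `result` are illustrative names, and the hard-coded 2 assumes there are exactly two samplers.

```r
# Mock data equivalent to the question's example (rebuilt compactly with rep())
sampler1 <- data.frame(
  LobsterID = rep(c("Lobster1", "Lobster2", "Lobster3"), times = c(2, 5, 3)),
  Area = c(.4, .35, 1.1, 1.04, 1.14, 1.1, 1.05, 1.7, 1.63, 1.8),
  Sampler = "Sampler1",
  stringsAsFactors = FALSE
)
sampler2 <- data.frame(
  LobsterID = rep(c("Lobster1", "Lobster2", "Lobster4"), times = c(5, 3, 2)),
  Area = c(.41, .44, .47, .43, .38, 1.14, 1.11, 1.09, 1.41, 1.4),
  Sampler = "Sampler2",
  stringsAsFactors = FALSE
)
combined <- rbind(sampler1, sampler2)

# Count how many distinct samplers measured each lobster
n_samplers <- tapply(combined$Sampler, combined$LobsterID,
                     function(s) length(unique(s)))

# Keep only the lobsters seen by both samplers (assumes two samplers in total)
keep_ids <- names(n_samplers)[n_samplers == 2]
result <- combined[combined$LobsterID %in% keep_ids, ]
```

Printing `n_samplers` before filtering is a quick way to check which lobsters are about to be dropped.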
2

Using base R:

combined <- rbind(sampler1, sampler2)
inBoth <- intersect(sampler1[["LobsterID"]], sampler2[["LobsterID"]]) 
output <- combined[combined[["LobsterID"]] %in% inBoth, ] 

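intersect finds the intersection of the two vectors, giving you the lobsters that appear in both samples. All of these functions are vectorized, so it should run quite fast.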
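To make the set logic concrete, here is a small sketch on the question's mock data that also uses setdiff to list the lobsters each sampler has alone. The names `only1` and `only2` are added for illustration and are not part of the answer.

```r
# Mock data equivalent to the question's example (rebuilt compactly with rep())
sampler1 <- data.frame(
  LobsterID = rep(c("Lobster1", "Lobster2", "Lobster3"), times = c(2, 5, 3)),
  Area = c(.4, .35, 1.1, 1.04, 1.14, 1.1, 1.05, 1.7, 1.63, 1.8),
  Sampler = "Sampler1",
  stringsAsFactors = FALSE
)
sampler2 <- data.frame(
  LobsterID = rep(c("Lobster1", "Lobster2", "Lobster4"), times = c(5, 3, 2)),
  Area = c(.41, .44, .47, .43, .38, 1.14, 1.11, 1.09, 1.41, 1.4),
  Sampler = "Sampler2",
  stringsAsFactors = FALSE
)

# IDs present in both samplers: "Lobster1" "Lobster2"
inBoth <- intersect(sampler1$LobsterID, sampler2$LobsterID)

# IDs present in only one sampler: "Lobster3" and "Lobster4" respectively
only1 <- setdiff(sampler1$LobsterID, sampler2$LobsterID)
only2 <- setdiff(sampler2$LobsterID, sampler1$LobsterID)

# Filter the stacked data and compare with the question's hand-built target
combined <- rbind(sampler1, sampler2)
output <- combined[combined$LobsterID %in% inBoth, ]
desiredresult <- combined[-c(8, 9, 10, 19, 20), ]
all.equal(output, desiredresult)
```

Both intersect and setdiff return unique values, so duplicated IDs in the input vectors do not matter here.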

1

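Here is an option using data.table. Bind the datasets with rbindlist, group by "LobsterID", and keep a group only if the logical condition holds that the number of unique elements in "Sampler" equals 2.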

library(data.table) 
rbindlist(list(sampler1, sampler2))[, if(uniqueN(Sampler)==2) .SD , by = LobsterID]
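If more than two samplers were ever involved, the hard-coded 2 could be replaced by the total number of samplers in the data. The following is a sketch of that variation, not part of the answer above; `dt`, `n_samplers` and `result` are illustrative names.

```r
library(data.table)

# Mock data equivalent to the question's example (rebuilt compactly with rep())
sampler1 <- data.frame(
  LobsterID = rep(c("Lobster1", "Lobster2", "Lobster3"), times = c(2, 5, 3)),
  Area = c(.4, .35, 1.1, 1.04, 1.14, 1.1, 1.05, 1.7, 1.63, 1.8),
  Sampler = "Sampler1",
  stringsAsFactors = FALSE
)
sampler2 <- data.frame(
  LobsterID = rep(c("Lobster1", "Lobster2", "Lobster4"), times = c(5, 3, 2)),
  Area = c(.41, .44, .47, .43, .38, 1.14, 1.11, 1.09, 1.41, 1.4),
  Sampler = "Sampler2",
  stringsAsFactors = FALSE
)

# Stack the data frames into one data.table
dt <- rbindlist(list(sampler1, sampler2))

# Total number of samplers present in the stacked data
n_samplers <- uniqueN(dt$Sampler)

# Keep a lobster's rows only if every sampler measured that lobster
result <- dt[, if (uniqueN(Sampler) == n_samplers) .SD, by = LobsterID]
```

With the two-sampler mock data this keeps the same lobsters as the original one-liner, since `n_samplers` evaluates to 2.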