2017-08-17 39 views
3

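Search for values across multiple columns and return the row's ID

Suppose I have two data frames:

A = a data frame containing unique phone numbers and an additional factor column. Assume nrow(A) = 20.

B = a data frame whose rows represent unique households, with four columns listing phone numbers and a fifth column holding the unique household ID. The same number may be repeated across several of B's columns. Assume nrow(B) = 100.

After checking whether each phone number in A appears in any of the four phone columns, I want to return a table that pairs A's unique phone numbers with the matching household ID from B.

So, for example: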

a <- data.frame(phone=c("12345","12346","12456"), 
       factor=c("OK","BAD","BAD")) 
b <- data.frame(ph1 = c("12345","","12346","12347",""), 
       ph2 = c("","","12346","","12348"), 
       ph3 = c("","","","12456","67890"), 
       hhid = seq(1121,1125)) 

How can I get back c, which would look like this:

c <- data.frame(phone = c("12345","12346","12456"), 
       factor = c("OK","BAD","BAD"), 
       hhid = c("1121","1123","1124")) 

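I'm sure there is a very elegant way to do this, or one that needs very little code. I thought about using a for loop or merge, but felt I was on the wrong track. I'm open to using any package.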


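Update - I received a bunch of different suggestions using different packages. That helps me learn about the various packages, and also about what base R can do. My need has been met, but please feel free to share anything else that might be useful for this problem.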

Answers

2

Here is one option with data.table:

library(data.table) 
setDT(a)[unique(setDT(b)[, .(phone = unlist(.SD)), hhid][phone != ""]), 
      hhid := hhid, on = .(phone)] 
a 
# phone factor hhid 
#1: 12345  OK 1121 
#2: 12346 BAD 1123 
#3: 12456 BAD 1124 
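
The one-liner above packs three steps into a single expression. As a rough sketch of the same idea written step by step (the intermediate names `long` and `lookup` are made up for illustration, and `melt()` is swapped in for the `unlist(.SD)` reshaping), it could look like this:

```r
library(data.table)

a <- data.table(phone  = c("12345", "12346", "12456"),
                factor = c("OK", "BAD", "BAD"))
b <- data.table(ph1  = c("12345", "", "12346", "12347", ""),
                ph2  = c("", "", "12346", "", "12348"),
                ph3  = c("", "", "", "12456", "67890"),
                hhid = seq(1121, 1125))

# Step 1: stack the phone columns into one long "phone" column, keeping hhid
long <- melt(b, id.vars = "hhid", value.name = "phone")

# Step 2: drop blank entries and repeated (hhid, phone) pairs
lookup <- unique(long[phone != "", .(hhid, phone)])

# Step 3: update join; for each phone in a, copy hhid over from lookup
# (i.hhid refers to the hhid column of the table in the i position)
a[lookup, hhid := i.hhid, on = "phone"]
a
```

Phones in `a` with no match in `b` would be left with `NA` in `hhid`. If a phone belonged to more than one household, this update join would keep only one of the matches, so that case would need separate handling.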

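Ah, I keep hearing about the infamous data.table package. Thank you very much. I tried it and it ran beautifully. Still, I'll need more time to understand the magic you pulled off!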

3
library(dplyr) 
library(tidyr) 

a <- data.frame(phone=c("12345","12346","12456"), 
       factor=c("OK","BAD","BAD")) 
b <- data.frame(ph1 = c("12345","","12346","12347",""), 
       ph2 = c("","","12346","","12348"), 
       ph3 = c("","","","12456","67890"), 
       hhid = seq(1121,1125)) 

# reshape data and keep unique combinations 
b2 = b %>% 
    gather(ph, phone, -hhid) %>% 
    select(-ph) %>% 
    distinct() 

# join data frames 
left_join(a, b2, by = "phone") 

# phone factor hhid 
# 1 12345  OK 1121 
# 2 12346 BAD 1123 
# 3 12456 BAD 1124 
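
For readers on a recent tidyr, where `gather()` is documented as superseded by `pivot_longer()`, a roughly equivalent sketch might look like the following. The `filter()` step that drops blank phone entries and the result name `res` are additions for illustration, not part of the answer above, and the data is created with `stringsAsFactors = FALSE` so the phone columns are plain character vectors:

```r
library(dplyr)
library(tidyr)

a <- data.frame(phone  = c("12345", "12346", "12456"),
                factor = c("OK", "BAD", "BAD"),
                stringsAsFactors = FALSE)
b <- data.frame(ph1  = c("12345", "", "12346", "12347", ""),
                ph2  = c("", "", "12346", "", "12348"),
                ph3  = c("", "", "", "12456", "67890"),
                hhid = seq(1121, 1125),
                stringsAsFactors = FALSE)

# reshape to one row per (hhid, phone), drop blanks and duplicates
b2 <- b %>%
  pivot_longer(cols = -hhid, names_to = "ph", values_to = "phone") %>%
  filter(phone != "") %>%
  distinct(hhid, phone)

# attach the household ID to each phone number in a
res <- left_join(a, b2, by = "phone")
res
```

Because `left_join()` keeps every row of `a`, a phone number with no household would appear with `NA` in `hhid`, and a phone number shared by several households would appear once per household.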

Ah - this is great and elegant. Thank you very much! I wish I had thought of gather.

0

Here is a base R solution. It assumes you read the data in as character, or set: options(stringsAsFactors = F)

tmp <- unique(reshape(b, 
    direction="long", 
    varying = 1:3, 
    v.names="phone", 
    timevar = "variable")[,c(1, 3)]) 
tmp <- tmp[tmp$phone != "", ]  # drop blank phone entries (assign the result back)
merge(tmp, a, by="phone") 
# phone hhid factor 
#1 12345 1121  OK 
#2 12346 1123 BAD 
#3 12456 1124 BAD
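
As a variation on the base R idea, here is a minimal sketch that skips `reshape()` and builds the lookup table by hand, then uses `match()` to pull the household ID. The names `phone_cols` and `lookup` are made up for illustration, and the data is assumed to be character rather than factor:

```r
a <- data.frame(phone  = c("12345", "12346", "12456"),
                factor = c("OK", "BAD", "BAD"),
                stringsAsFactors = FALSE)
b <- data.frame(ph1  = c("12345", "", "12346", "12347", ""),
                ph2  = c("", "", "12346", "", "12348"),
                ph3  = c("", "", "", "12456", "67890"),
                hhid = seq(1121, 1125),
                stringsAsFactors = FALSE)

phone_cols <- c("ph1", "ph2", "ph3")

# stack the phone columns end to end; repeat hhid once per phone column
lookup <- data.frame(phone = unlist(b[phone_cols], use.names = FALSE),
                     hhid  = rep(b$hhid, times = length(phone_cols)),
                     stringsAsFactors = FALSE)

# drop blank entries and repeated (phone, hhid) pairs
lookup <- unique(lookup[lookup$phone != "", ])

# match() gives the first matching row in lookup for each phone in a (NA if none)
a$hhid <- lookup$hhid[match(a$phone, lookup$phone)]
a
```

Unlike `merge()`, this keeps the rows of `a` in their original order and keeps unmatched phones with `NA`, but it only reports the first household when a phone number is listed under more than one.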