2017-02-16

Compare and merge two dataframes

I have the following two dataframes in R:

df1 = data.frame(c("A", "A", "A", "B", "B"), c(1, 11, 21, 35, 45), c(6, 20, 30, 40, 60), c(1, 2, 3, 4, 5)) 
colnames(df1) = c("X", "Y", "Z", "score") 

df1 
    X Y Z score 
1 A 1 6  1 
2 A 11 20  2 
3 A 21 30  3 
4 B 35 40  4 
5 B 45 60  5 

df2 = data.frame(c("A", "A", "A", "A", "B", "B", "B", "C"), c(1, 6, 21, 50, 20, 31, 50, 10), c(5, 20, 30, 60, 30, 40, 60, 20), c("x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8")) 
colnames(df2) = c("X", "Y", "Z", "out") 

df2 
    X Y Z out 
1 A 1 5 x1 
2 A 6 20 x2 
3 A 21 30 x3 
4 A 50 60 x4 
5 B 20 30 x5 
6 B 31 40 x6 
7 B 50 60 x7 
8 C 10 20 x8 

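For each row in df1, I want to check:

  • whether the value in 'X' has a match in df2
  • if it does, whether the values of 'Y' and 'Z' fall within the range of 'Y' and 'Z' in df2
  • if both are true, I want to add the value from 'out' to df1.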

This is what the output should look like:

output = data.frame(c("A", "A", "A", "B", "B"), c(1, 11, 21, 35, 45), c(6, 20, 30, 40, 60), c(1, 2, 3, 4, 5), c("x1, x2", "x2", "x3", "x6", "x7")) 
colnames(output) = c("X", "Y", "Z", "score", "out") 

    X Y Z score out 
1 A 1 6  1 x1, x2 
2 A 11 20  2  x2 
3 A 21 30  3  x3 
4 B 35 40  4  x6 
5 B 45 60  5  x7 

The original df1 is kept, with an extra column 'out' added.

Row 1 of 'output' contains 'x1, x2' in column 'out'. Reason: the value in column 'X' matches, and the range 1 to 6 overlaps with rows 1 and 2 of df2.

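I asked this question before (Compare values from two dataframes and merge), and it was suggested that I use the foverlaps function. But because the columns differ between df1 and df2, and because of the extra rows in df2, I could not get it to work.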

Answers

library(dplyr) 

df1 = data.frame(c("A", "A", "A", "B", "B"), c(1, 11, 21, 35, 45), 
       c(6, 20, 30, 40, 60), c(1, 2, 3, 4, 5), stringsAsFactors = FALSE) 
colnames(df1) = c("X", "Y", "Z", "score") 

df2 = data.frame(c("A", "A", "A", "A", "B", "B", "B", "C"), c(1, 6, 21, 50, 20, 31, 50, 10), 
       c(5, 20, 30, 60, 30, 40, 60, 20), 
       c("x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8"), stringsAsFactors = FALSE) 
colnames(df2) = c("X", "Y", "Z", "out") 


df1 %>% 
    left_join(df2, by="X") %>%   # join on main column 
    rowwise() %>%      # for each row 
    mutate(counter = sum(seq(Y.x, Z.x) %in% seq(Y.y, Z.y))) %>% # get how many elements of those ranges overlap 
    filter(counter > 0) %>%   # keep rows with overlap 
    group_by(X, Y.x, Z.x, score) %>% # for each combination of those columns 
    summarise(out = paste(out, collapse=", ")) %>%    # combine out column 
    ungroup() %>% 
    rename(Y = Y.x, 
     Z = Z.x) 

# # A tibble: 5 × 5 
#  X  Y  Z score out 
# <chr> <dbl> <dbl> <dbl> <chr> 
# 1  A  1  6  1 x1, x2 
# 2  A 11 20  2  x2 
# 3  A 21 30  3  x3 
# 4  B 35 40  4  x6 
# 5  B 45 60  5  x7 

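The process above is based on the dplyr package and involves a join plus some grouping and filtering. If your initial datasets (df1, df2) are very large, the join will create an even larger dataset, which will take some time to build.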
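If the rowwise seq() comparison turns out to be the slow part, one possible variation is to test the overlap arithmetically on the joined table. The following is only a sketch in base R, not the dplyr pipeline above: the names joined, hits and result are made up for illustration, and it assumes the ranges are closed intervals (so it also accepts non-integer endpoints, which seq() %in% seq() does not handle).

```r
df1 <- data.frame(X = c("A", "A", "A", "B", "B"),
                  Y = c(1, 11, 21, 35, 45),
                  Z = c(6, 20, 30, 40, 60),
                  score = c(1, 2, 3, 4, 5),
                  stringsAsFactors = FALSE)
df2 <- data.frame(X = c("A", "A", "A", "A", "B", "B", "B", "C"),
                  Y = c(1, 6, 21, 50, 20, 31, 50, 10),
                  Z = c(5, 20, 30, 60, 30, 40, 60, 20),
                  out = c("x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8"),
                  stringsAsFactors = FALSE)

# Join on X; df2's range columns get the suffix ".2" (Y.2, Z.2)
joined <- merge(df1, df2, by = "X", suffixes = c("", ".2"))

# Two closed ranges overlap when each starts no later than the other ends
hits <- joined[joined$Y <= joined$Z.2 & joined$Z >= joined$Y.2, ]

# Fix the row order so the pasted labels come out in a predictable order
hits <- hits[order(hits$score, hits$out), ]

# Collapse the matching 'out' labels for each original df1 row
result <- aggregate(out ~ X + Y + Z + score, data = hits,
                    FUN = paste, collapse = ", ")
result <- result[order(result$score), ]
print(result)
```

The intermediate join is still built in full, so this does not remove the memory cost mentioned above; it only replaces the per-row sequence comparison with two vectorised comparisons.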

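Also, note that this process works with character variables rather than factor variables. If it tries to join factor variables that have different levels, the process may convert them to character.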
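If your data frame already exists and holds factor columns, one way to sidestep the issue is to convert them to character before joining. A minimal sketch in base R, where df is a hypothetical data frame standing in for your own data:

```r
# Hypothetical data frame whose X column was created as a factor
df <- data.frame(X = factor(c("A", "A", "B")), Y = c(1, 11, 35))

# Find the factor columns and convert only those to character
is_fac <- vapply(df, is.factor, logical(1))
df[is_fac] <- lapply(df[is_fac], as.character)

str(df)
```

Columns that are not factors (Y here) are left untouched.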

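I suggest you run the chained commands step by step to see how they work, and to spot anything I may have missed that could cause errors in the code.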

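How can I set "stringsAsFactors = F" for a data frame that already exists? – user1987607

First, try running the same process with your 'factor' variables, because the join may convert them to 'character' when it tries to join factors with different levels. – AntoniosK

@AntioniosK: my df1 has 9000 rows and my df2 has 862 rows. Your code runs smoothly on a small subset, but on the full data I think it takes a very long time... – user1987607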


Here is an approach using sqldf:

library(sqldf) 
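# two closed ranges overlap when each one starts no later than the other ends;
# this also catches a df2 range that fully contains the df1 range
xx=sqldf('select t1.*,t2.out from df1 t1 left join df2 t2 on t1.X=t2.X and t2.Y <= t1.Z and t2.Z >= t1.Y') 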
aggregate(xx[ncol(xx)], xx[-ncol(xx)], FUN = function(X) paste(unique(X), collapse=", ")) 

Here are two other possible ways: a) using the newly implemented non-equi joins feature, and b) foverlaps, since you mentioned it specifically.

a) Non-equi joins

dt2[dt1, on=.(X, Z>=Y, Y<=Z), 
     .(score, out=paste(out, collapse=",")), 
    by=.EACHI] 

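where dt1 and dt2 are the data.tables corresponding to df1 and df2. Note that you will have to swap the column names Z and Y back in the result, since the column names come from dt2 but the values come from dt1.

The rows of dt2 that match each row of dt1 are found using the conditions given to the on argument, and .() is evaluated on those matching rows for each row of dt1 (because of by=.EACHI).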
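To try the non-equi join end to end, something like the following should work. It is a sketch that assumes the data.table package is installed in a version that supports non-equi joins; the name res and the final renaming step are my own additions for illustration.

```r
library(data.table)

dt1 <- data.table(X = c("A", "A", "A", "B", "B"),
                  Y = c(1, 11, 21, 35, 45),
                  Z = c(6, 20, 30, 40, 60),
                  score = c(1, 2, 3, 4, 5))
dt2 <- data.table(X = c("A", "A", "A", "A", "B", "B", "B", "C"),
                  Y = c(1, 6, 21, 50, 20, 31, 50, 10),
                  Z = c(5, 20, 30, 60, 30, 40, 60, 20),
                  out = c("x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8"))

# For each row of dt1, find rows of dt2 with the same X whose range overlaps:
# dt2$Z >= dt1$Y and dt2$Y <= dt1$Z
res <- dt2[dt1, on = .(X, Z >= Y, Y <= Z),
           .(score, out = paste(out, collapse = ",")),
           by = .EACHI]

# The join columns come back in the order X, Z, Y, named after dt2 but
# holding dt1's Y and Z values, so rename them by position
setnames(res, c("X", "Y", "Z", "score", "out"))
print(res)
```

Renaming by position relies on the join columns being returned in the order they appear in on, so check names(res) before renaming if you change the join conditions.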

b) foverlaps

setkey(dt1, X, Y, Z) 
olaps <- foverlaps(dt2, dt1, type="any", nomatch=0L) 
olaps[, .(score=score[1L], out=paste(out, collapse=",")), by=.(X,Y,Z)]