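Numeric comparison

Data frame d1:

Data frame d2:

How can I merge d1 and d2 by "x", where each d1$x should be matched against either an exact match or the next higher value in d2$x? The output should look like this: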

x y z 
4 10 200 # (4 is matched against next higher value that is 6) 
6 20 200 # (6 is matched against 6) 
7 30 300 # (7 is matched against next higher value that is 9)

If merge() cannot do this, is there another way to do it? A for loop is painfully slow.

Source

2014-06-07 user3718192

Input data:

d1 <- data.frame(x=c(4,6,7), y=c(10,20,30)) 
d2 <- data.frame(x=c(3,6,9), z=c(100,200,300))

Basically, you want to extend d1 with a new column. So let's copy it first.

d3 <- d1

Next, I assume that d2$x is sorted in nondecreasing order and that max(d1$x) <= max(d2$x).

d3$z <- sapply(d1$x, function(x) d2$z[which(x <= d2$x)[1]])

This reads: for each x in d1$x, take the z that belongs to the smallest value in d2$x that is not smaller than x.

Under these assumptions, the above can also be written as follows (and should be a bit faster):

d3$z <- sapply(d1$x, function(x) d2$z[which.max(x <= d2$x)])

As a result we get:

d3 
## x y z 
## 1 4 10 200 
## 2 6 20 200 
## 3 7 30 300
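The assumption max(d1$x) <= max(d2$x) matters more for the which.max variant than for the which variant. The following is a minimal sketch of why, using a hypothetical extra value 10 that exceeds every d2$x (the names x_ext, z_which and z_which_max are illustrative only, not part of the original answer):

```r
# d2 as in the input data; x_ext adds a value (10) larger than max(d2$x)
d2 <- data.frame(x = c(3, 6, 9), z = c(100, 200, 300))
x_ext <- c(4, 6, 7, 10)

# which(...)[1] yields NA when nothing matches, so the lookup returns NA
z_which <- sapply(x_ext, function(x) d2$z[which(x <= d2$x)[1]])

# which.max() on an all-FALSE vector returns 1, so the first z is picked silently
z_which_max <- sapply(x_ext, function(x) d2$z[which.max(x <= d2$x)])

z_which      # 200 200 300  NA
z_which_max  # 200 200 300 100
```

If values above max(d2$x) can occur, the which variant is the safer choice because the missing match stays visible as NA.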

EDIT1: Inspired by @MatthewLundberg's cut-based solution, here is another one using findInterval:

d3$z <- d2$z[findInterval(d1$x, d2$x+1)+1]
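Note that the d2$x+1 offset relies on the x values being integers. As a hedged alternative, assuming R 3.3.0 or later (where findInterval gained the left.open argument), the intervals can be made right-closed directly, which also handles fractional values. The names x_frac, z_offset and z_open below are illustrative only:

```r
d1 <- data.frame(x = c(4, 6, 7), y = c(10, 20, 30))
d2 <- data.frame(x = c(3, 6, 9), z = c(100, 200, 300))

# left.open = TRUE uses intervals (d2$x[i], d2$x[i+1]], so an exact match
# stays in the lower interval; +1 then points at the matching or next higher x
d3 <- d1
d3$z <- d2$z[findInterval(d1$x, d2$x, left.open = TRUE) + 1]
d3$z  # 200 200 300

# With fractional values the integer-offset trick can pick a value that is too low
x_frac <- c(4.5, 6, 6.5)
z_offset <- d2$z[findInterval(x_frac, d2$x + 1) + 1]                 # 200 200 200
z_open <- d2$z[findInterval(x_frac, d2$x, left.open = TRUE) + 1]     # 200 200 300
```

As with the other variants, d2$x must be sorted in nondecreasing order; a value above max(d2$x) produces NA here.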

EDIT2: (benchmark)

Example data:

set.seed(123) 
d1 <- data.frame(x=sort(sample(1:10000, 1000)), y=sort(sample(1:10000, 1000))) 
d2 <- data.frame(x=sort(c(sample(1:10000, 999), 10000)), z=sort(sample(1:10000, 1000)))

Results:

library(data.table) # needed for setDT(), setkey() and the rolling join

# each expression is named so that the labels in the results match the code
microbenchmark::microbenchmark(
  findInterval = {d3 <- d1; d3$z <- d2$z[findInterval(d1$x, d2$x+1)+1] },
  which = {d3 <- d1; d3$z <- sapply(d1$x, function(x) d2$z[which(x <= d2$x)[1]]) },
  which.max = {d3 <- d1; d3$z <- sapply(d1$x, function(x) d2$z[which.max(x <= d2$x)]) },
  cut = {d1$x2 <- d2$x[as.numeric(cut(d1$x, c(-Inf, d2$x, Inf)))]; merge(d1, d2, by.x='x2', by.y='x')},
  data.table = {d1a <- d1; setkey(setDT(d1a), x); d2a <- d2; setkey(setDT(d2a), x); d2a[d1a, roll=-Inf] }
)
## Unit: microseconds
##          expr       min        lq    median        uq       max neval
##  findInterval   221.102  1357.558  1394.246  1429.767  17810.55   100
##         which 66311.738 70619.518 85170.175 87674.762 220613.09   100
##     which.max 69832.069 73225.755 83347.842 89549.326 118266.20   100
##           cut  8095.411  8347.841  8498.486  8798.226  25531.58   100
##    data.table  1668.998  1774.442  1878.028  1954.583  17974.10   100

Source

2014-06-07 16:59:14 gagolews

This is pretty straightforward using rolling joins with data.table:

require(data.table) ## >= 1.9.2 
setkey(setDT(d1), x) ## convert to data.table, set key for the column to join on 
setkey(setDT(d2), x) ## same as above 

d2[d1, roll=-Inf] 

# x z y 
# 1: 4 200 10 
# 2: 6 200 20 
# 3: 7 300 30
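For readers who prefer not to set keys or convert the inputs by reference, the same rolling join can be sketched with the on= argument. This assumes data.table 1.9.6 or later, and the names dt1, dt2 and res are illustrative only:

```r
library(data.table)

d1 <- data.frame(x = c(4, 6, 7), y = c(10, 20, 30))
d2 <- data.frame(x = c(3, 6, 9), z = c(100, 200, 300))

# as.data.table() makes copies, so d1 and d2 stay plain data frames
dt1 <- as.data.table(d1)
dt2 <- as.data.table(d2)

# roll = -Inf rolls the next observation backwards: each x in dt1 is matched
# to the same or the next higher x in dt2
res <- dt2[dt1, on = "x", roll = -Inf]
res
```

The result has the columns x, z and y, with x taken from dt1, as in the keyed version above.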

Source

2014-06-07 17:02:01 Arun

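cut can be used to find the appropriate matches in d2$x for the values in d1$x.

The computation that finds the matches with cut is as follows: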

as.numeric(cut(d1$x, c(-Inf, d2$x, Inf))) 
## [1] 2 2 3

These are the values:

d2$x[as.numeric(cut(d1$x, c(-Inf, d2$x, Inf)))] 
## [1] 6 6 9

These can be added to d1, and then the merge can be performed:

d1$x2 <- d2$x[as.numeric(cut(d1$x, c(-Inf, d2$x, Inf)))] 
merge(d1, d2, by.x='x2', by.y='x') 
## x2 x y z 
## 1 6 4 10 200 
## 2 6 6 20 200 
## 3 9 7 30 300

If desired, the added column can then be removed.
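A minimal sketch of that cleanup step, assuming the same d1 and d2 as in the input data (the name res is illustrative only):

```r
d1 <- data.frame(x = c(4, 6, 7), y = c(10, 20, 30))
d2 <- data.frame(x = c(3, 6, 9), z = c(100, 200, 300))

# helper column: the d2$x value each d1$x is matched to
d1$x2 <- d2$x[as.numeric(cut(d1$x, c(-Inf, d2$x, Inf)))]
res <- merge(d1, d2, by.x = "x2", by.y = "x")

res$x2 <- NULL              # drop the helper column from the result
res <- res[order(res$x), ]  # restore ordering by the original x
d1$x2 <- NULL               # and remove it from d1 as well
res
```

A d1$x above max(d2$x) gets an NA helper value and is dropped by merge() unless all.x = TRUE is passed.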

Source

2014-06-07 18:50:46

+1 for 'cut'. Also, I think 'findInterval' would do a similar job. – gagolews

@gagolews 'findInterval' uses left-closed intervals. 'cut' gives a choice (right-closed is the default). –

'rightmost.closed'? – gagolews

Try: sapply(d1$x, function(y) d2$z[d2$x >= y][which.min(abs(y - d2$x[d2$x >= y]))])

Source

2014-06-08 01:59:38 userNaN
