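Merge data.table by two variables and the closest third

I want to perform an operation on data.tables that I can currently do successfully with data.frames. Essentially, it is a merge of two data.frames that, for each row of df1, finds the closest match in df2 on one of the matching variables. The code is below.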
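I want to do this with data.tables because my data.frames are very large, and my current setup crashes if I try to run this operation on all of my data. data.table may allow me to do it directly on the full data set, but even if it does not, I find data.table easier to work with when handling multiple subsets of the data.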
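I want to find the Id (and its corresponding value) in df2 that is the closest match to each State's value in df1, matching on MM and variable. (With this data.frame method, multiple pairings can occur when there is a tie for the closest match, e.g. candidates at both plus 1 and minus 1.) When using data.frames, I get the solution final below. I am not sure how to set up the data.tables to give me the same result. I have tried variations of my keys; one example is below. There is an answer to the data.frame question I reference in the code that uses data.table, but I cannot get it to work with my example data.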
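To make the tie behaviour concrete, here is a minimal base R sketch of what I mean by "closest match". The names target, candidates and closest and all of the numbers are made up for illustration; they are not taken from df1 or df2.

```r
# Hypothetical toy values, not taken from df1 or df2
target <- 3
candidates <- data.frame(Id = c(1L, 2L, 3L), value = c(2, 4, 7))

# Absolute distance of every candidate value from the target
candidates$dif <- abs(candidates$value - target)

# Keep every candidate that attains the minimum distance, so ties are retained
closest <- candidates[candidates$dif == min(candidates$dif), ]
closest
```

Here Id 1 (minus 1) and Id 2 (plus 1) are equally close, so both rows are kept. That is the kind of multiple pairing I would also like the data.table version to return.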
# data.frame method
# used info from this thread: https://stackoverflow.com/questions/16095680
df1 <- structure(list(State = structure(c(1L, 1L, 3L, 3L, 2L, 2L, 1L,
1L, 1L), .Label = c("AK", "CO", "MS"), class = "factor"), MM = c(1L,
2L, 1L, 2L, 3L, 4L, 3L, 4L, 2L), variable = structure(c(1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L), .Label = c("TMN", "TMX"), class = "factor"),
value = c(1L, 2L, 3L, 4L, 2L, 3L, 5L, 6L, 7L)), .Names = c("State",
"MM", "variable", "value"), class = "data.frame", row.names = c(NA,
-9L))
df2 <- structure(list(Id = c(1L, 2L, 3L, 1L, 2L, 3L, 5L, 6L, 7L, 5L,
6L, 7L, 8L), MM = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L,
4L, 5L), variable = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 2L, 2L), .Label = c("TMN", "TMX"), class = "factor"),
value = c(1, 2, 3, 2, 3, 4, 2, 3, 5.5, 6.5, 3.5, 2.5, 8)), .Names = c("Id",
"MM", "variable", "value"), class = "data.frame", row.names = c(NA,
-13L))
# Find rows that match by MM and variable
res <- merge(df1, df2, by = c("MM", "variable"), all.x = TRUE)
res$dif <- abs(res$value.x - res$value.y)
#Find rows that need to be merged
res1 <- merge(aggregate(dif ~ MM + variable, data = res, FUN = min), res)
#Finally merge the result back into df1
final <- merge(df1, res1[res1$dif <= 1, c("MM", "variable", "State", "Id", "value.y")], all.x = TRUE)
### One data.table attempt
library(data.table)
# create data.tables with the same key columns
keycols1 = c("MM", "variable", "value")
df1t <- data.table(df1, key = keycols1)
df2t <- data.table(df2, key = key(df1t))
setkey(df1t, value)
setkey(df2t, value)
test.final <- df2t[df1t, roll='nearest', allow.cartesian=TRUE]
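The result in the data frame 'final' in your example does not seem to match the description of what you hope to get. For example, why does the combination (State = AK, variable = TMN, MM = 1) produce two rows in 'final'? Shouldn't it produce only the one Id with the closest match? –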
@YT Thanks, 'State' was missing in the data.frame 'final' code. – nofunsally