2011-10-27 53 views
14

non-join with data.tables

I have a question about the data.table "non-join" idiom, prompted by Iterator's question. Here is an example:

library(data.table) 

dt1 <- data.table(A1=letters[1:10], B1=sample(1:5,10, replace=TRUE)) 
dt2 <- data.table(A2=letters[c(1:5, 11:15)], B2=sample(1:5,10, replace=TRUE)) 

setkey(dt1, A1) 
setkey(dt2, A2) 

The data.tables look like this:

> dt1             > dt2
      A1 B1             A2 B2
 [1,]  a  1        [1,]  a  2
 [2,]  b  4        [2,]  b  5
 [3,]  c  2        [3,]  c  2
 [4,]  d  5        [4,]  d  1
 [5,]  e  1        [5,]  e  1
 [6,]  f  2        [6,]  k  5
 [7,]  g  3        [7,]  l  2
 [8,]  h  3        [8,]  m  4
 [9,]  i  2        [9,]  n  1
[10,]  j  4       [10,]  o  1

To find which rows of dt2 have a matching key in dt1, set the which option to TRUE:

> dt1[dt2, which=TRUE] 
[1] 1 2 3 4 5 NA NA NA NA NA 

In this answer, Matthew suggests a "non-join" idiom:

dt1[-dt1[dt2, which=TRUE]] 

to subset dt1 to those rows whose keys do not appear in dt2. On my machine, with data.table v1.7.1, I get an error:

Error in `[.default`(x[[s]], irows): only 0's may be mixed with negative subscripts 

With the option nomatch=0, however, the "non-join" works:

> dt1[-dt1[dt2, which=TRUE, nomatch=0]] 
    A1 B1 
[1,] f 2 
[2,] g 3 
[3,] h 3 
[4,] i 2 
[5,] j 4 
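A rough base R analogy may help show why nomatch=0 makes a difference. This is only a sketch: it uses match() on plain character vectors (keys1 and keys2 are illustrative names, not part of the example above) and is not a description of data.table's internals. With the default, unmatched keys yield NA, which cannot be negated into a valid subscript. With nomatch=0 they yield 0, which negative indexing is allowed to mix in.

```r
# Plain vectors standing in for the key columns of dt1 and dt2 (illustrative names)
keys1 <- letters[1:10]
keys2 <- letters[c(1:5, 11:15)]

# Default lookup: unmatched keys come back as NA
idx_na <- match(keys2, keys1)
# 1 2 3 4 5 NA NA NA NA NA

# With nomatch = 0 the unmatched keys come back as 0 instead
idx_zero <- match(keys2, keys1, nomatch = 0L)
# 1 2 3 4 5 0 0 0 0 0

# Zeros may be mixed with negative subscripts, so this drops positions 1 to 5
not_joined <- keys1[-idx_zero]
# "f" "g" "h" "i" "j"
```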

Is this the intended behavior?

+2

Just added to v1.8.3 is _not-join_ syntax. In this case `dt1[!dt2]`. Will add a detailed answer...

Answers

5

As far as I know, this is part of base R.

# This works 
(1:4)[c(-2,-3)] 

# But this gives you the same error you described above 
(1:4)[c(-2, -3, NA)] 
# Error in (1:4)[c(-2, -3, NA)] : 
# only 0's may be mixed with negative subscripts 

The text of the error message indicates that this is intended behavior.

Here is my best guess as to why this is the intended behavior:

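From the way they treat NAs elsewhere (e.g. typically defaulting to na.rm=FALSE), it seems that R's designers view NAs as carrying important information, and are reluctant to drop that information without an explicit instruction to do so. (Fortunately, setting nomatch=0 gives you a clean way to pass that instruction along!)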

In that context, the designers' preference probably explains why NAs are accepted for positive indexing but not for negative indexing:

# Positive indexing: works, because the return value retains info about NA's 
(1:4)[c(2,3,NA)] 

# Negative indexing: doesn't work, because it can't easily retain such info 
(1:4)[c(-2,-3,NA)] 
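The contrast can also be checked programmatically. The following is a minimal sketch in base R (the variable names are illustrative): positive indexing keeps a placeholder for the NA, negative indexing fails, and stripping the NAs by hand is one way to give the "explicit instruction" mentioned above.

```r
x   <- 1:4
idx <- c(2, 3, NA)

# Positive indexing: the NA is carried through to the result
pos <- x[idx]
# 2 3 NA

# Negative indexing: capture the error instead of stopping the script
neg <- tryCatch(x[-idx], error = function(e) e)
is_error <- inherits(neg, "error")
# TRUE

# Explicitly dropping the NA first makes negative indexing legal
explicit <- x[-idx[!is.na(idx)]]
# 1 4
```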
+1

+1 Good answer! Yes, it comes from base. [FR#1384](https://r-forge.r-project.org/tracker/index.php?func=detail&aid=1384&group_id=240&atid=978) is to make the `X[-Y]` syntax mean 'not join'. In the meantime, `which=TRUE, nomatch=0` is needed.

2

New in version 1.7.3 of data.table:

New option datatable.nomatch allows the default for nomatch to be changed from NA to 0, ...

+3

That change might help, but it isn't really intended for 'not join'. [FR#1384](https://r-forge.r-project.org/tracker/index.php?func=detail&aid=1384&group_id=240&atid=978) still stands. Nice to see someone reads NEWS, though :)

17

New in v1.8.3:

A new "!" prefix on i signals 'not-join' (a.k.a. 'not-where'), #1384. 
    DT[-DT["a", which=TRUE, nomatch=0]]   # old not-join idiom, still works
    DT[!"a"]                              # same result, now preferred.
    DT[!J(6),...]                         # !J == not-join
    DT[!2:3,...]                          # ! on all types of i
    DT[colA!=6L | colB!=23L,...]          # multiple vector scanning approach
    DT[!J(6L,23L)]                        # same result, faster binary search
'!' has been used rather than '-' : 
    * to match the 'not-join' and 'not-where' nomenclature 
    * with '-', DT[-0] would return DT rather than DT[0] and not be backwards
      compatible. With '!', DT[!0] returns DT both before (since !0 is TRUE in
      base R) and after this new feature.
    * to leave DT[+...] and DT[-...] available for future use
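
Applied to the tables from the question, the new syntax could look like the sketch below. It assumes data.table v1.8.3 or later is installed, and it replaces the random B1 and B2 columns with fixed values (my choice, not the original data) so that the result is reproducible.

```r
library(data.table)

# Same keys as in the question; B1 and B2 use fixed values for reproducibility
dt1 <- data.table(A1 = letters[1:10], B1 = 1:10)
dt2 <- data.table(A2 = letters[c(1:5, 11:15)], B2 = 10:1)

setkey(dt1, A1)
setkey(dt2, A2)

# Not-join: rows of dt1 whose key does not appear in dt2
res <- dt1[!dt2]
# keeps the rows with A1 equal to f, g, h, i, j
```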