2015-12-24 69 views
0

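I have a data frame mydf. I also have a vector, myvec <- c("chr5:11", "chr3:112", "chr22:334"). If any element of the vector matches a key in mydf, I want to select a range of rows around it (the matched value, the 3 values above it, and the 3 values below it) and return them as a subset of mydf (result). How do I select a range of rows in R?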

Since chr5:11 in myvec matches a key in mydf, result should contain the rows from chr5:8 (three values below) through chr5:14 (three values above).

mydf<- structure(list(key = structure(c(5L, 2L, 7L, 8L, 4L, 1L, 6L, 
3L, 11L, 10L, 9L), .Names = c("34", "35", "36", "37", "38", "39", 
"40", "41", "42", "43", "44"), .Label = c("chr5:10", "chr5:11", 
"chr5:1123", "chr5:118", "chr5:12", "chr5:123", "chr5:13", "chr5:14", 
"chr5:19", "chr5:8", "chr5:9"), class = "factor"), variantId = structure(1:11, .Names = c("34", 
"35", "36", "37", "38", "39", "40", "41", "42", "43", "44"), .Label = c("9920068", 
"9920069", "9920070", "9920071", "9920072", "9920073", "9920074", 
"9920075", "9920076", "9920077", "9920078"), class = "factor")), .Names = c("key", 
"variantId"), row.names = c("34", "35", "36", "37", "38", "39", 
"40", "41", "42", "43", "44"), class = "data.frame") 

Result

 key   variant 
43 "chr5:8" "9920077" 
42 "chr5:9" "9920076" 
39 "chr5:10" "9920073" 
35 "chr5:11" "9920069" 
34 "chr5:12" "9920068" 
36 "chr5:13" "9920070" 
37 "chr5:14" "9920071" 
+1

According to your dput, 'mydf' is a matrix, not a data.frame. Please fix it. –

+0

@Pascal Thanks, I have fixed it. – MAPK

Answers

2

You can use the GenomicRanges package.

library(GenomicRanges) 

myvec <- c("chr5:11", "chr3:112", "chr22:334") 
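# Build a window of 3 positions on either side of each target.
# The "+ 3" belongs inside IRanges() as the end coordinate; adding 3 to the
# IRanges object itself would widen both ends and give pos-6 to pos+3.
myvec.gr <- GRanges(gsub(":.+", "", myvec), 
        IRanges(as.numeric(gsub(".+:", "", myvec)) - 3, 
          as.numeric(gsub(".+:", "", myvec)) + 3)) 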

mydf.gr <- GRanges(gsub(":.+", "", mydf[,"key"]), 
        IRanges(as.numeric(gsub(".+:", "", mydf[,"key"])), 
          as.numeric(gsub(".+:", "", mydf[,"key"])))) 

d.v.op <- findOverlaps(mydf.gr, myvec.gr) 

mydf[queryHits(d.v.op), ] 
# key  variantId 
# 34 "chr5:12" "9920068" 
# 35 "chr5:11" "9920069" 
# 36 "chr5:13" "9920070" 
# 37 "chr5:14" "9920071" 
# 39 "chr5:10" "9920073" 
# 42 "chr5:9" "9920076" 
# 43 "chr5:8" "9920077" 
+0

Thanks a lot, I think GRanges is useful for many things. – MAPK

3

How about the following? (I use data.table; the base version is almost identical.)

library(data.table) 
mydf <- as.data.table(mydf) #(if mydf really is stored as a matrix currently) 

myvec2 <- lapply(strsplit(gsub("chr", "", myvec), split=":"), as.integer) 

mydf[unique(Reduce(c, sapply(myvec2, function(x){ 
    which(key %in% paste0("chr", x[1], ":", seq((x2 <- x[2]) - 3L, x2 + 3L)))} 
))), ] 

(In base, replace as.data.table with as.data.frame and key with mydf$key, and make sure the closing bracket is , ] rather than ].)
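A minimal base R sketch of the same approach, for readers who do not use data.table. The data.frame below is rebuilt by hand from the values in the question's dput (stored as character rather than factor), and the names `wanted` and `hits` are introduced here for illustration only.

```r
# Rebuild the example data as a plain data.frame (values taken from the dput)
mydf <- data.frame(
  key = c("chr5:12", "chr5:11", "chr5:13", "chr5:14", "chr5:118", "chr5:10",
          "chr5:123", "chr5:1123", "chr5:9", "chr5:8", "chr5:19"),
  variantId = as.character(9920068:9920078),
  row.names = as.character(34:44),
  stringsAsFactors = FALSE
)
myvec <- c("chr5:11", "chr3:112", "chr22:334")

# Split each target into c(chromosome number, position)
myvec2 <- lapply(strsplit(gsub("chr", "", myvec), split = ":"), as.integer)

# For each target, find the rows whose key lies within 3 positions of it
hits <- unlist(lapply(myvec2, function(x) {
  wanted <- paste0("chr", x[1], ":", seq(x[2] - 3L, x[2] + 3L))
  which(mydf$key %in% wanted)
}))

# Drop duplicates in case two windows overlap, then subset
result <- mydf[unique(hits), ]
result
```

The rows come back in their original order in mydf, not in ascending position order. Like the code above, this assumes every chromosome label is numeric; a label such as chrX would turn into NA in as.integer.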

Extra option for sorting

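Actually, I think this option is better overall, because it starts by storing your information in a more flexible form. This version leans a bit more heavily on data.table idioms.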

mydf <- as.data.table(mydf) 

#Split your `key` variable into its pre- and post-colon components 
# (of course using better names if those numbers mean something 
# more specific to you) 
mydf[ , c("chr", "sub") := 
     .(as.integer(gsub("chr|:.*", "", key)), 
      as.integer(gsub(".*:", "", key)))] 

Now proceed as before, with slight adjustments:

myvec2<-lapply(strsplit(gsub("chr","",myvec),split=":"),as.integer) 

mydf[unique(Reduce(c, sapply(myvec2, function(x){ 
    which(chr == x[1] & sub %in% seq((x2 <- x[2]) - 3L, x2 + 3L))} 
)))][order(chr, sub)] 

Output:

 key variantId chr sub 
1: chr5:8 9920077 5 8 
2: chr5:9 9920076 5 9 
3: chr5:10 9920073 5 10 
4: chr5:11 9920069 5 11 
5: chr5:12 9920068 5 12 
6: chr5:13 9920070 5 13 
7: chr5:14 9920071 5 14 
+0

@Pascal Fixed. OP: getting things to print in the order you want is tough (not straightforward). Is that crucial? – MichaelChirico

+0

Thanks, the order just needs to be ascending. I think I can use a sort option now. – MAPK

+0

@MAPK The problem is that, because the key is stored as a string, 'sort' won't actually work correctly: "chr5:1123" comes right after "chr5:11". – MichaelChirico
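A small sketch of the sorting issue raised in the last comment, using a handful of keys from the question. The variable names `keys`, `chr`, and `pos` are introduced here for illustration, and the parsing assumes numeric chromosome labels only.

```r
keys <- c("chr5:12", "chr5:11", "chr5:1123", "chr5:8", "chr5:118")

# Character sorting compares the strings one character at a time,
# so "chr5:1123" lands right after "chr5:11" and "chr5:8" comes last
by_string <- sort(keys)
by_string
# "chr5:11" "chr5:1123" "chr5:118" "chr5:12" "chr5:8"

# Parse the chromosome number and the position into integers instead
chr <- as.integer(gsub("chr|:.*", "", keys))
pos <- as.integer(gsub(".*:", "", keys))

# Ordering on the numeric parts gives the ascending order the OP wants
by_number <- keys[order(chr, pos)]
by_number
# "chr5:8" "chr5:11" "chr5:12" "chr5:118" "chr5:1123"
```

This is the same idea as the chr and sub columns in the answer above: once the two parts are stored as integers, ordering on them behaves as expected.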