2017-07-06

Rearranging multiple rows of a large data frame in R

I am relatively new to R. I have a data frame test that looks like this:

PMID # id 
LID 
STAT 
MH 
RN 
OT 
PST  # cue 
LID 
STAT 
MH 
PMID # id 
OT 
PST  # cue 
LID 
DEP 
RN 
PMID # id 
PST  # cue 

and I would like it to look like this:

PMID # id 
LID 
STAT 
MH 
RN 
OT 
PST  # cue 
PMID # id 
LID 
STAT 
MH 
OT 
PST  # cue 
PMID # id 
LID 
DEP 
RN 
PST  # cue 

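Basically, I want the entries that follow a PMID to belong to that particular PMID, which is already the case for the first PMID. After the first one, however, each PMID sits at a random position among its own entries. Every record ends with PST, so I want each PMID after the first to be moved to the position right after the previous PST. I have two data frames that hold the index positions of every PMID and every PST. For example, for PMID, the data frame a_new contains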

1 
11 
17 

and for PST, the data frame b contains

7 
13 
18 

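This is what I have tried so far, but because I have over 24 million rows it never finishes running, and when I stopped it my data frame had not changed: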

for (i in 1:nrow(test)) 
{ 
    if (i %in% a_new$X1) # if it's a PMID 
    { 
        entry <- match(i, a_new$X1) # find entry index of PMID 
        if (entry != 1) # as long as not first row from a_new (that's corrected) 
        { 
            r <- b[i, 1] # row of PST 
            test <- rbind(test[1:r, ], test[entry, 1], test[-(1:r), ]) 
            test <- test[-c(i+1), ] # remove duplicate PMID 
        } 
    } 
} 

As you can see, rbind is extremely inefficient in this case. Please advise.

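'test' does not look like a 'data.frame': it has no column names and no row numbers – HubertL

It is 24 million observations/rows and 1 column – sweetmusicality

I don't know how to add column names and row numbers on stackoverflow (without doing it manually) – sweetmusicality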

Answers

2

Here is an answer using data.table.

library(data.table) 

dat <- fread("Origcol 
      PMID 
      LID 
      STAT 
      MH 
      RN 
      OT 
      PST  
      LID 
      STAT 
      MH 
      PMID  
      OT 
      PST  
      LID 
      DEP 
      RN 
      PMID 
      PST") 

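# remember the original row position 
dat[, old_order := 1:.N] 
# positions of the PST rows, with a leading 0 so each gap is one record 
pst_index <- c(0, which(dat$Origcol == "PST")) 
# record id: repeat 1, 2, 3, ... once per row of each record 
dat[, grp := unlist(lapply(1:(length(pst_index)-1), 
          function(x) rep(x, 
              times = (pst_index[x+1] - pst_index[x]))))] 
# factor levels define the sort order inside a record (PMID first, PST last) 
dat[, Origcol := factor(Origcol, levels = c("PMID", "LID", "STAT", 
              "MH", "RN", "OT", 
              "DEP", "PST"))] 
dat[order(grp, Origcol)] 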

Result:

    Origcol old_order grp
 1:    PMID         1   1
 2:     LID         2   1
 3:    STAT         3   1
 4:      MH         4   1
 5:      RN         5   1
 6:      OT         6   1
 7:     PST         7   1
 8:    PMID        11   2
 9:     LID         8   2
10:    STAT         9   2
11:      MH        10   2
12:      OT        12   2
13:     PST        13   2
14:    PMID        17   3
15:     LID        14   3
16:      RN        16   3
17:     DEP        15   3
18:     PST        18   3
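
Note that sorting by factor level reorders every tag inside a record (RN ends up before DEP in the third record), whereas the desired output in the question keeps the non-PMID rows in their original order. A minimal base R sketch of that variant follows. It assumes every record ends with PST and relies on order() leaving tied rows in their original order; the vector x is just the toy column from the question, not the real data.

```r
# Toy column from the question (stand-in for the real single-column data)
x <- c("PMID", "LID", "STAT", "MH", "RN", "OT", "PST",
       "LID", "STAT", "MH", "PMID", "OT", "PST",
       "LID", "DEP", "RN", "PMID", "PST")

is_pst <- x == "PST"

# Record id: number of PST rows seen strictly before the current row, plus 1
grp <- cumsum(c(0L, head(is_pst, -1))) + 1L

# Sort by record, then put PMID first (FALSE sorts before TRUE);
# all other rows are tied and keep their original relative order
ord <- order(grp, x != "PMID")
rearranged <- x[ord]
print(rearranged)
```

The cumsum() call also replaces the lapply/rep construction for the record id, so the same idea might carry over to the data.table version by ordering on grp and Origcol != "PMID".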

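The advantage of this approach is that data.table performs many operations by reference, so it should stay fast once you scale up. You said you have 14 million rows, so let's try it. Generate some synthetic data of that size: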

dat_big <- data.table(Origcol = c("PMID", "LID", "STAT", "MH", "RN", "OT", "PST")) 
dat_big_add <- rbindlist(lapply(1:10000, 
           function(x) data.table(Origcol = c(sample(c("PMID", "LID", "STAT", 
                      "MH", "RN", "OT")), 
                    "PST")))) 
dat_big <- rbindlist(list(dat_big, 
          dat_big_add, dat_big_add, dat_big_add, dat_big_add, dat_big_add, 
          dat_big_add, dat_big_add, dat_big_add, dat_big_add, dat_big_add, 
          dat_big_add, dat_big_add, dat_big_add, dat_big_add, dat_big_add, 
          dat_big_add, dat_big_add, dat_big_add, dat_big_add, dat_big_add)) 

dat <- rbindlist(list(dat_big, dat_big, dat_big, dat_big, dat_big, 
         dat_big, dat_big, dat_big, dat_big, dat_big)) 

We now have:

          Origcol
       1:    PMID
       2:     LID
       3:    STAT
       4:      MH
       5:      RN
      ---
14000066:    STAT
14000067:      MH
14000068:      OT
14000069:    PMID
14000070:     PST

Applying the same code as above:

dat[, old_order := 1:.N] 
pst_index <- c(0, which(dat$Origcol == "PST")) 
dat[, grp := unlist(lapply(1:(length(pst_index)-1), 
          function(x) rep(x, 
              times = (pst_index[x+1] - pst_index[x]))))] 
dat[, Origcol := factor(Origcol, levels = c("PMID", "LID", "STAT", 
              "MH", "RN", "OT", 
              "DEP", "PST"))] 
dat[order(grp, Origcol)] 

Now we get:

          Origcol old_order     grp
       1:    PMID         1       1
       2:     LID         2       1
       3:    STAT         3       1
       4:      MH         4       1
       5:      RN         5       1
      ---
14000066:    STAT  14000066 2000010
14000067:      MH  14000067 2000010
14000068:      RN  14000064 2000010
14000069:      OT  14000068 2000010
14000070:     PST  14000070 2000010

How long does it take?

library(microbenchmark) 
microbenchmark(
    "data.table" = { 
    dat[, old_order := 1:.N] 
    pst_index <- c(0, which(dat$Origcol == "PST")) 
    dat[, grp := unlist(lapply(1:(length(pst_index)-1), 
           function(x) rep(x, 
               times = (pst_index[x+1] - pst_index[x]))))] 
    dat[, Origcol := factor(Origcol, levels = c("PMID", "LID", "STAT", 
               "MH", "RN", "OT", 
               "DEP", "PST"))] 
    dat[order(grp, Origcol)] 
    }, 
    times = 10) 

And it takes:

Unit: seconds
       expr      min       lq     mean  median       uq      max neval
 data.table 5.755276 5.813267 6.059665 5.87151 6.034506 7.310169    10

Under 10 seconds for 14 million rows. Generating the test data took a lot longer.

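Wow, thank you for such a thorough answer! Your explanation looks very promising. However, when running the 'test[, grp := unlist....' line, I ran into this error: 'Error in rep(x, times = (pst_index[x + 1] - pst_index[x])) : invalid 'times' argument' – sweetmusicality

Is there anything different about the 'test' dataset? Does my code fail on your machine when copied as is? –

Oh, your code works perfectly when copied as is. My actual dataset has some rows that start the same way (you will see what I mean in the next sentence) and are consecutive. Also, each row is not just a single word, for example "PMID - 234254" or "MH - Humans", but I don't know why that would cause the error. After seeing your code, I used 'setDT(df)' to change my data frame into a data.table... was that the appropriate thing to do? – sweetmusicality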
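
One plausible cause of that error, sketched here under assumptions: if the real rows look like "PST - something" rather than the bare word "PST", the exact comparison Origcol == "PST" matches nothing, pst_index collapses to a single 0, and rep() receives NA for times. The rows below are hypothetical; only "PMID - 234254" and "MH - Humans" come from the comment, and the "PST - ppublish" value and the second PMID are made up for illustration.

```r
# Hypothetical rows in the "TAG - value" shape described in the comment
x <- c("PMID - 234254", "MH - Humans", "PST - ppublish",
       "MH - Humans", "PMID - 234255", "PST - ppublish")

# Exact comparison finds nothing, so pst_index is just 0
pst_index <- c(0, which(x == "PST"))

# 1:(length(pst_index) - 1) is then 1:0, i.e. c(1, 0), and pst_index[2] is NA,
# so rep() is called with times = NA and fails
msg <- tryCatch(
  rep(1, times = pst_index[2] - pst_index[1]),
  error = function(e) conditionMessage(e)
)
print(msg)

# Extract the tag in front of the first hyphen and compare on that instead
tag <- trimws(sub("-.*$", "", x))
pst_index_fixed <- c(0, which(tag == "PST"))
print(pst_index_fixed)
```

If the real separator differs from " - ", the regular expression would need adjusting. With a tag column like this, the grouping and ordering steps could run on tag while the full text stays in its own column.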

1

Here is an indexing approach using which.

# get positions of PST, the final value 
endSpot <- which(temp$V1 == "PST") 
# increment to get the desired positions of the PMID 
# (dropping final value as we don't need to change it) 
startSpot <- head(endSpot + 1, -1) 
# get the current positions of the PMID, except the first one 
PMIDSpot <- tail(which(temp$V1 == "PMID"), -1) 

Now use these indices to swap the rows

temp[c(startSpot, PMIDSpot), ] <- temp[c(PMIDSpot, startSpot), ] 

This returns the following (I added a variable called count, holding the original row position, to keep track).

temp
     V1 count
1  PMID     1
2   LID     2
3  STAT     3
4    MH     4
5    RN     5
6    OT     6
7   PST     7
8  PMID    11
9  STAT     9
10   MH    10
11  LID     8
12   OT    12
13  PST    13
14 PMID    17
15  DEP    15
16   RN    16
17  LID    14
18  PST    18
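
Because the swap moves the row that used to follow each PST into the old PMID slot, it can be worth checking the result. The following is a small base R sketch of such a check; blocks_ok is a hypothetical helper name, and the two vectors are simply the V1 column before and after the swap shown above.

```r
# Hypothetical helper: TRUE if the vector ends with PST, every record
# (the rows up to and including each PST) starts with PMID, and there is
# exactly one PMID per record
blocks_ok <- function(v) {
  end <- which(v == "PST")
  if (length(end) == 0 || tail(end, 1) != length(v)) return(FALSE)
  start <- c(1, head(end, -1) + 1)
  all(v[start] == "PMID") && sum(v == "PMID") == length(end)
}

before <- c("PMID", "LID", "STAT", "MH", "RN", "OT", "PST",
            "LID", "STAT", "MH", "PMID", "OT", "PST",
            "LID", "DEP", "RN", "PMID", "PST")
after <- c("PMID", "LID", "STAT", "MH", "RN", "OT", "PST",
           "PMID", "STAT", "MH", "LID", "OT", "PST",
           "PMID", "DEP", "RN", "LID", "PST")

print(blocks_ok(before))
print(blocks_ok(after))
```

The check only looks at record boundaries, so it says nothing about the order of the remaining tags inside a record.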

Data

temp <- 
structure(list(V1 = c("PMID", "LID", "STAT", "MH", "RN", "OT", 
"PST", "LID", "STAT", "MH", "PMID", "OT", "PST", "LID", "DEP", 
"RN", "PMID", "PST"), count = 1:18), .Names = c("V1", "count" 
), row.names = c(NA, -18L), class = "data.frame")