Rearranging multiple rows of a large data frame in R

I am fairly new to R. I have a data frame test that looks like this:

PMID # id 
LID 
STAT 
MH 
RN 
OT 
PST  # cue 
LID 
STAT 
MH 
PMID # id 
OT 
PST  # cue 
LID 
DEP 
RN 
PMID # id 
PST  # cue

and I would like it to look like this:

PMID # id 
LID 
STAT 
MH 
RN 
OT 
PST  # cue 
PMID # id 
LID 
STAT 
MH 
OT 
PST  # cue 
PMID # id 
LID 
DEP 
RN 
PST  # cue

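Basically, I want the entries that follow a PMID to belong to that PMID, which is already the case for the first PMID. After the first one, however, each PMID sits at a random position among its own entries. Every record ends with PST, though, so I want each PMID after the first to be moved to the position right after the previous PST. I have two data frames that hold the index positions of every PMID and every PST. For example, for PMID, the data frame a_new contains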

1 
11 
17

and for PST, the data frame b contains

7 
13 
18

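This is what I have tried, but since I have over 24 million rows, it had not finished after hours of running, and when I stopped it my data frame had not changed: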

for (i in 1:nrow(test)) {
  if (i %in% a_new$X1) {                  # if it's a PMID
    entry <- match(i, a_new$X1)           # find entry index of PMID
    if (entry != 1) {                     # as long as not first row from a_new (that's corrected)
      r <- b[i, 1]                        # row of PST
      test <- rbind(test[1:r, ], test[entry, 1], test[-(1:r), ])
      test <- test[-c(i + 1), ]           # remove duplicate PMID
    }
  }
}

As you can see, rbind is extremely inefficient in this case. Please advise.

Source

2017-07-06 sweetmusicality

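'test' doesn't look like a 'data.frame': it has no column names or row numbers – HubertL

It is 24 million observations/rows and 1 column – sweetmusicality

I don't know how to add the column name and row numbers on Stack Overflow (without doing it manually) – sweetmusicality

Here is an answer using data.table.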

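library(data.table)

# Read the example column; the first line is the column name
dat <- fread("Origcol
      PMID
      LID
      STAT
      MH
      RN
      OT
      PST
      LID
      STAT
      MH
      PMID
      OT
      PST
      LID
      DEP
      RN
      PMID
      PST")

# Remember the original row positions
dat[, old_order := 1:.N]
# Positions of the closing PST rows, with 0 as a sentinel before the first record
pst_index <- c(0, which(dat$Origcol == "PST"))
# Group id per row: record x spans rows pst_index[x] + 1 to pst_index[x + 1]
dat[, grp := unlist(lapply(1:(length(pst_index) - 1),
                           function(x) rep(x, times = pst_index[x + 1] - pst_index[x])))]
# Make the tag a factor whose level order is the desired order within a record
dat[, Origcol := factor(Origcol, levels = c("PMID", "LID", "STAT", "MH",
                                            "RN", "OT", "DEP", "PST"))]
# Sort by record, then by tag level
dat[order(grp, Origcol)]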

Result:

    Origcol old_order grp
 1:    PMID         1   1
 2:     LID         2   1
 3:    STAT         3   1
 4:      MH         4   1
 5:      RN         5   1
 6:      OT         6   1
 7:     PST         7   1
 8:    PMID        11   2
 9:     LID         8   2
10:    STAT         9   2
11:      MH        10   2
12:      OT        12   2
13:     PST        13   2
14:    PMID        17   3
15:     LID        14   3
16:      RN        16   3
17:     DEP        15   3
18:     PST        18   3
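Note that sorting by factor level also reorders the other tags inside a record (RN lands before DEP in group 3 above), whereas the desired output in the question keeps them in their original order. The following is a minimal sketch of a variant, not the answerer's code: it derives grp with cumsum() and shift() instead of lapply(), and uses old_order as a tie-breaker so that only PMID moves. It assumes every record ends with exactly one PST row; the name res is chosen here for illustration.

```r
library(data.table)

dat <- data.table(Origcol = c("PMID", "LID", "STAT", "MH", "RN", "OT", "PST",
                              "LID", "STAT", "MH", "PMID", "OT", "PST",
                              "LID", "DEP", "RN", "PMID", "PST"))

# Remember the original row positions
dat[, old_order := .I]

# A row belongs to record k when k - 1 PST rows come before it:
# lag the PST flag by one row, then take a running count
dat[, grp := cumsum(shift(Origcol == "PST", fill = FALSE)) + 1L]

# Within each record put PMID first (FALSE sorts before TRUE),
# then keep all other rows in their original order
res <- dat[order(grp, Origcol != "PMID", old_order)]
res
```

With the 18-row example, group 3 then comes out as PMID, LID, DEP, RN, PST, which matches the layout asked for in the question.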

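The advantage of this approach is that data.table performs many operations by reference, so it should stay fast once you scale up. You say you have 14 million rows, so let's try it. Generating some synthetic data of that size: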

dat_big <- data.table(Origcol = c("PMID", "LID", "STAT", "MH", "RN", "OT", "PST"))
dat_big_add <- rbindlist(lapply(1:10000,
                                function(x) data.table(Origcol = c(sample(c("PMID", "LID", "STAT",
                                                                            "MH", "RN", "OT")),
                                                                   "PST"))))
dat_big <- rbindlist(list(dat_big,
                          dat_big_add, dat_big_add, dat_big_add, dat_big_add, dat_big_add,
                          dat_big_add, dat_big_add, dat_big_add, dat_big_add, dat_big_add,
                          dat_big_add, dat_big_add, dat_big_add, dat_big_add, dat_big_add,
                          dat_big_add, dat_big_add, dat_big_add, dat_big_add, dat_big_add))

dat <- rbindlist(list(dat_big, dat_big, dat_big, dat_big, dat_big,
                      dat_big, dat_big, dat_big, dat_big, dat_big))

We now have:

          Origcol
       1:    PMID
       2:     LID
       3:    STAT
       4:      MH
       5:      RN
      ---
14000066:    STAT
14000067:      MH
14000068:      OT
14000069:    PMID
14000070:     PST

Applying the same code as above:

dat[, old_order := 1:.N]
pst_index <- c(0, which(dat$Origcol == "PST"))
dat[, grp := unlist(lapply(1:(length(pst_index) - 1),
                           function(x) rep(x, times = pst_index[x + 1] - pst_index[x])))]
dat[, Origcol := factor(Origcol, levels = c("PMID", "LID", "STAT", "MH",
                                            "RN", "OT", "DEP", "PST"))]
dat[order(grp, Origcol)]

Now we get:

          Origcol old_order     grp
       1:    PMID         1       1
       2:     LID         2       1
       3:    STAT         3       1
       4:      MH         4       1
       5:      RN         5       1
      ---
14000066:    STAT  14000066 2000010
14000067:      MH  14000067 2000010
14000068:      RN  14000064 2000010
14000069:      OT  14000068 2000010
14000070:     PST  14000070 2000010

How long does it take?

library(microbenchmark)
microbenchmark(
  "data.table" = {
    dat[, old_order := 1:.N]
    pst_index <- c(0, which(dat$Origcol == "PST"))
    dat[, grp := unlist(lapply(1:(length(pst_index) - 1),
                               function(x) rep(x, times = pst_index[x + 1] - pst_index[x])))]
    dat[, Origcol := factor(Origcol, levels = c("PMID", "LID", "STAT", "MH",
                                                "RN", "OT", "DEP", "PST"))]
    dat[order(grp, Origcol)]
  },
  times = 10)

And it takes:

Unit: seconds
       expr      min       lq     mean  median       uq      max neval
 data.table 5.755276 5.813267 6.059665 5.87151 6.034506 7.310169    10

Under 10 seconds for 14 million rows. Generating the test data took a long time by comparison.

Source

2017-07-06 19:04:27

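Wow, thank you for such a thorough answer! Your explanation looks very promising. However, when running the 'test[, grp := unlist....' line, I ran into this error: 'Error in rep(x, times = (pst_index[x + 1] - pst_index[x])) : invalid 'times' argument' – sweetmusicality

Is there anything different about the 'test' dataset? Does my code fail on your machine when copied as-is? –

Oh, your code works fine when copied as-is. My actual dataset has some rows that start the same way (you'll see what I mean in the next sentence) and that are consecutive. Also, each row is not just a single word, for example "PMID - 234254" or "MH - humans", but I don't know why that would cause the error. After seeing your code, I used 'setDT(df)' to convert my data frame to a data.table... is that the appropriate approach? – sweetmusicality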
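One plausible explanation for that error, not confirmed anywhere in the thread: if every row has the form "TAG - value", no row is exactly equal to "PST", so which() returns an empty vector and 1:(length(pst_index) - 1) becomes 1:0, which counts down through 1 and 0 instead of being empty. The sketch below reproduces this in base R and then matches on the tag prefix instead. The three lines in x are made up for illustration (only the "TAG - value" shape comes from the comment above), and the names tag and err are assumptions.

```r
# Hypothetical rows in the "TAG - value" shape described in the comment
x <- c("PMID - 234254", "MH - humans", "PST - example")

# Exact matching finds nothing, because no row is exactly "PST"
pst_exact <- which(x == "PST")
pst_index <- c(0, pst_exact)             # only the sentinel 0 is left

# 1:0 is c(1, 0), so lapply() would still run for x = 1 and x = 0
idx <- 1:(length(pst_index) - 1)

# pst_index[2] does not exist, so the difference is NA and rep() rejects it
err <- tryCatch(rep(1, times = pst_index[2] - pst_index[1]),
                error = function(e) conditionMessage(e))

# Extract the tag in front of the first "-" and compare that instead
tag <- trimws(sub("-.*$", "", x))
pst_pos <- which(tag == "PST")
```

If this is indeed the cause, building the tag column first and running the grouping code on it should avoid the error; seq_len(length(pst_index) - 1) would also make the loop empty rather than counting down when no PST is found.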

Here is an indexing approach using which.

# get positions of PST, the final value
# (comparing the V1 column only, so the result is a vector of row numbers)
endSpot <- which(temp$V1 == "PST")
# increment to get the desired positions of the PMID
# (dropping final value as we don't need to change it)
startSpot <- head(endSpot + 1, -1)
# get the current positions of the PMID, except the first one
PMIDSpot <- tail(which(temp$V1 == "PMID"), -1)

Now use these indices to swap the rows:

temp[c(startSpot, PMIDSpot), ] <- temp[c(PMIDSpot, startSpot), ]

This returns the following (I added a variable called count, holding the original row position, to keep track).

temp
     V1 count
1  PMID     1
2   LID     2
3  STAT     3
4    MH     4
5    RN     5
6    OT     6
7   PST     7
8  PMID    11
9  STAT     9
10   MH    10
11  LID     8
12   OT    12
13  PST    13
14 PMID    17
15  DEP    15
16   RN    16
17  LID    14
18  PST    18
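The swap puts each PMID in the right place, but the row it displaces lands in the PMID's old slot (LID ends up at row 11 above). If the remaining rows should keep their order, as in the desired output of the question, one option is to shift rows instead of swapping them. This is a minimal sketch, not part of the original answer; it assumes each record has exactly one PMID and ends with PST, and the names grp, start, target, perm and out are chosen here for illustration.

```r
temp <- data.frame(V1 = c("PMID", "LID", "STAT", "MH", "RN", "OT", "PST",
                          "LID", "STAT", "MH", "PMID", "OT", "PST",
                          "LID", "DEP", "RN", "PMID", "PST"),
                   count = 1:18, stringsAsFactors = FALSE)

is_pst  <- temp$V1 == "PST"
is_pmid <- temp$V1 == "PMID"
i <- seq_len(nrow(temp))

# Record id of every row: one more than the number of PST rows before it
grp <- cumsum(c(0L, head(is_pst, -1L))) + 1L
# First row of each record, and the current row of each record's PMID
start <- c(1L, head(which(is_pst), -1L) + 1L)
pmid  <- which(is_pmid)

# New position of every row: the PMID goes to the start of its record,
# rows between the start and the PMID move down by one, the rest stay put
target <- ifelse(is_pmid, start[grp],
                 ifelse(i < pmid[grp], i + 1L, i))

# Invert the mapping into a permutation and apply it once
perm <- integer(nrow(temp))
perm[target] <- i
out <- temp[perm, ]
```

Everything here is vectorised and no sort is involved, so the cost grows linearly with the number of rows.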

Data

temp <- 
structure(list(V1 = c("PMID", "LID", "STAT", "MH", "RN", "OT", 
"PST", "LID", "STAT", "MH", "PMID", "OT", "PST", "LID", "DEP", 
"RN", "PMID", "PST"), count = 1:18), .Names = c("V1", "count" 
), row.names = c(NA, -18L), class = "data.frame")

Source

2017-07-06 18:38:12 lmo
