Here is an answer using data.table.
library(data.table)
dat <- fread("Origcol
PMID
LID
STAT
MH
RN
OT
PST
LID
STAT
MH
PMID
OT
PST
LID
DEP
RN
PMID
PST")
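# Remember each row's original position
dat[, old_order := 1:.N]
# Row indices of every "PST", which closes a record; the leading 0 marks the start
pst_index <- c(0, which(dat$Origcol == "PST"))
# Group x covers rows pst_index[x] + 1 through pst_index[x + 1]
dat[, grp := unlist(lapply(1:(length(pst_index)-1),
                           function(x) rep(x,
                                           times = (pst_index[x+1] - pst_index[x]))))]
# Encode the desired within-group order as factor levels
dat[, Origcol := factor(Origcol, levels = c("PMID", "LID", "STAT",
                                            "MH", "RN", "OT",
                                            "DEP", "PST"))]
# Sort by group, then by factor level
dat[order(grp, Origcol)]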
Result:
Origcol old_order grp
1: PMID 1 1
2: LID 2 1
3: STAT 3 1
4: MH 4 1
5: RN 5 1
6: OT 6 1
7: PST 7 1
8: PMID 11 2
9: LID 8 2
10: STAT 9 2
11: MH 10 2
12: OT 12 2
13: PST 13 2
14: PMID 17 3
15: LID 14 3
16: RN 16 3
17: DEP 15 3
18: PST 18 3
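The group index can also be built without the lapply/rep loop. The following is a minimal sketch, not part of the original answer: it assumes every record ends with "PST" and uses data.table's shift() to mark the row after each "PST" as the start of a new group. The table dat_demo is a made-up example.

```r
library(data.table)

# Hypothetical toy input: three records, each terminated by "PST"
dat_demo <- data.table(Origcol = c("PMID", "LID", "PST",
                                   "LID", "PMID", "PST",
                                   "RN", "PMID", "PST"))

# A row starts a new group when the previous row was "PST".
# shift() lags the logical vector by one row; fill = FALSE covers row 1.
dat_demo[, grp := cumsum(shift(Origcol == "PST", fill = FALSE)) + 1L]

print(dat_demo)
```

Unlike the rep()-based version, this also assigns a group to trailing rows that are not followed by a "PST", which may or may not be what you want.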
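The advantage of this approach is that data.table does much of its work by reference, so it stays fast as you scale up. You said you have 14 million rows, so let's try it. Generating some synthetic data of that size: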
dat_big <- data.table(Origcol = c("PMID", "LID", "STAT", "MH", "RN", "OT", "PST"))
dat_big_add <- rbindlist(lapply(1:10000,
                                function(x) data.table(Origcol = c(sample(c("PMID", "LID", "STAT",
                                                                            "MH", "RN", "OT")),
                                                                   "PST"))))
dat_big <- rbindlist(list(dat_big,
                          dat_big_add, dat_big_add, dat_big_add, dat_big_add, dat_big_add,
                          dat_big_add, dat_big_add, dat_big_add, dat_big_add, dat_big_add,
                          dat_big_add, dat_big_add, dat_big_add, dat_big_add, dat_big_add,
                          dat_big_add, dat_big_add, dat_big_add, dat_big_add, dat_big_add))
dat <- rbindlist(list(dat_big, dat_big, dat_big, dat_big, dat_big,
                      dat_big, dat_big, dat_big, dat_big, dat_big))
We now have:
Origcol
1: PMID
2: LID
3: STAT
4: MH
5: RN
---
14000066: STAT
14000067: MH
14000068: OT
14000069: PMID
14000070: PST
Applying the same code as above:
dat[, old_order := 1:.N]
pst_index <- c(0, which(dat$Origcol == "PST"))
dat[, grp := unlist(lapply(1:(length(pst_index)-1),
                           function(x) rep(x,
                                           times = (pst_index[x+1] - pst_index[x]))))]
dat[, Origcol := factor(Origcol, levels = c("PMID", "LID", "STAT",
                                            "MH", "RN", "OT",
                                            "DEP", "PST"))]
dat[order(grp, Origcol)]
Now we get:
Origcol old_order grp
1: PMID 1 1
2: LID 2 1
3: STAT 3 1
4: MH 4 1
5: RN 5 1
---
14000066: STAT 14000066 2000010
14000067: MH 14000067 2000010
14000068: RN 14000064 2000010
14000069: OT 14000068 2000010
14000070: PST 14000070 2000010
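Note that dat[order(grp, Origcol)] returns a sorted copy and leaves dat itself in its original order. If the sorted table should replace the original, data.table's setorder() reorders the rows by reference instead. A small sketch on a made-up table (dat_demo and its values are illustrative, not from the question):

```r
library(data.table)

# Hypothetical toy table with the group index and factor already in place
dat_demo <- data.table(
  Origcol = factor(c("LID", "PMID", "PST", "PST", "PMID"),
                   levels = c("PMID", "LID", "PST")),
  grp = c(1L, 1L, 1L, 2L, 2L)
)

# Reorder rows in place: by group first, then by factor level order
setorder(dat_demo, grp, Origcol)

print(dat_demo)
```

Because no copy of the table is made, this can help with memory on a table of many millions of rows; I have not benchmarked it against the version above.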
How long does it take?
library(microbenchmark)
microbenchmark(
  "data.table" = {
    dat[, old_order := 1:.N]
    pst_index <- c(0, which(dat$Origcol == "PST"))
    dat[, grp := unlist(lapply(1:(length(pst_index)-1),
                               function(x) rep(x,
                                               times = (pst_index[x+1] - pst_index[x]))))]
    dat[, Origcol := factor(Origcol, levels = c("PMID", "LID", "STAT",
                                                "MH", "RN", "OT",
                                                "DEP", "PST"))]
    dat[order(grp, Origcol)]
  },
  times = 10)
And it takes:
Unit: seconds
expr min lq mean median uq max neval
data.table 5.755276 5.813267 6.059665 5.87151 6.034506 7.310169 10
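So under 10 seconds for 14 million rows (about 6 seconds on average in the run above). Generating the test data took much longer.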
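Much of the generation cost above probably comes from building 10,000 tiny data.tables and binding them. A possible shortcut, sketched here under the assumption that only the shuffled tag order matters, is to build one character vector and wrap it once. The names tags, n_groups, vals and dat_fast are illustrative, and I have not timed this against the original.

```r
library(data.table)

set.seed(1)  # only to make the sketch reproducible
tags <- c("PMID", "LID", "STAT", "MH", "RN", "OT")
n_groups <- 10000L

# replicate() returns a 7 x n_groups character matrix, one shuffled record
# per column; as.vector() reads it column by column, so each record stays
# contiguous and ends with "PST".
vals <- as.vector(replicate(n_groups, c(sample(tags), "PST")))

# One data.table() call instead of 10,000
dat_fast <- data.table(Origcol = vals)

print(dat_fast)
```

To reach the 14 million row scale, the vector can be repeated with rep(vals, times = 200) before it is wrapped in data.table().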
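'test' doesn't look like a 'data.frame': it has no column names or row numbers – HubertL
It is 24 million observations/rows and 1 column – sweetmusicality
I don't know how to add the column name and row numbers on Stack Overflow (without doing it manually) – sweetmusicality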