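Custom index column

I have a dataset with an irregular date column, and I want to create an index column from it. The index ID (e.g. 1) should stay the same for three distinct consecutive dates, then change (e.g. to 2) for the next three distinct consecutive dates, and so on. Below is a sample of the dates together with the desired column: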

structure(list(Date = c(42370, 42371, 42371, 42371, 42372, 42372, 
42375, 42375, 42375, 42377, 42377, 42383, 42383, 42385, 42386, 
42386, 42386, 42393, 42393, 42394, 42394, 42395, 42398, 42398, 
42398, 42398), Index = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 
2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4)), .Names = c("Date", 
"Index"), row.names = c(NA, 26L), class = "data.frame")

Source

2016-05-13 Polar Bear

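This question may look strange, but it is very important for my project. –

Please read the information on [how to ask a good question](http://stackoverflow.com/help/how-to-ask) and on how to give a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610). This will make it much easier for others to help you. – Jaap

Is it OK now? I have attached a file containing the date column and the desired column (Index). As for code, I have no idea. –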

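This builds a grouping index that puts the unique values into groups of 3, then uses character names as a lookup table to map the groups back onto the rows: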

fac <- ((seq(length(unique(dat$Date)))-1) %/%3) +1 
names(fac) <- unique(dat$Date) 

dat$myIndex <- fac[as.character(dat$Date)] 
dat 
#------- 
    Date Index myIndex 
1 42370  1  1 
2 42371  1  1 
3 42371  1  1 
4 42371  1  1 
5 42372  1  1 
6 42372  1  1 
7 42375  2  2 
8 42375  2  2 
9 42375  2  2 
10 42377  2  2 
11 42377  2  2 
12 42383  2  2 
13 42383  2  2 
14 42385  3  3 
15 42386  3  3 
16 42386  3  3 
17 42386  3  3 
18 42393  3  3 
19 42393  3  3 
20 42394  4  4 
21 42394  4  4 
22 42395  4  4 
23 42398  4  4 
24 42398  4  4 
25 42398  4  4 
26 42398  4  4
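
The lookup step can be sketched on a toy vector. The dates below are made up for illustration; the point is only that indexing a named vector by character keys returns one group id per row.

```r
# Toy dates (made up for illustration), in the order they occur
dates <- c(42370, 42371, 42371, 42372, 42375, 42375, 42377)

# One group id per distinct date: distinct dates 1-3 get 1, 4-6 get 2, ...
fac <- ((seq_along(unique(dates)) - 1) %/% 3) + 1

# Naming the ids by date turns `fac` into a lookup table
names(fac) <- unique(dates)

# Indexing by the character form of each date pulls the matching id per row
my_index <- unname(fac[as.character(dates)])
my_index
```

Because the lookup goes through `unique()`, a date that reappears later in the column would get the same id as its first occurrence.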

Source

2016-05-13 20:23:44

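Thanks! Could you explain how the code works? –

I shift the integer sequence so that it starts at zero rather than at 1, then apply integer division with `%/%`, then add 1 back to the result so that the grouping vector starts at 1. I suppose I could have added 2 instead of subtracting 1, and then I would not need the second step. –

What does this step do: names(fac) <- unique(dat$Date)? –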
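
The arithmetic in that comment can be checked on a short sequence. This is just a sketch with seven hypothetical positions; both forms should give the same grouping.

```r
i <- 1:7                        # positions of seven distinct dates

# Shift to zero-based, integer-divide by 3, shift back to one-based
grp_a <- ((i - 1) %/% 3) + 1

# Alternative from the comment: add 2 first, then no second step is needed
grp_b <- (i + 2) %/% 3

grp_a
grp_b
```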

Using rleid from the data.table package together with cumsum:

library(data.table) 
setDT(d1)[, index := (rleid(Date)-1) %% 3 
      ][, index := cumsum(index < shift(index, fill=1))][]

gives:

 Date index 
1: 01-01-16  1 
2: 02-01-16  1 
3: 02-01-16  1 
4: 02-01-16  1 
5: 03-01-16  1 
6: 03-01-16  1 
7: 06-01-16  2 
8: 06-01-16  2 
9: 06-01-16  2 
10: 08-01-16  2 
11: 08-01-16  2 
12: 14-01-16  2 
13: 14-01-16  2 
14: 16-01-16  3 
15: 17-01-16  3 
16: 17-01-16  3 
17: 17-01-16  3 
18: 24-01-16  3 
19: 24-01-16  3 
20: 25-01-16  4 
21: 25-01-16  4 
22: 26-01-16  4 
23: 29-01-16  4 
24: 29-01-16  4 
25: 29-01-16  4 
26: 29-01-16  4

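Explanation:

The rleid function creates a run-length id. This means that the id increases by 1 each time Date changes.
By subtracting 1 from that run-length id and taking the modulus (the %% 3 part), you get a vector of repeating 0s, 1s and 2s.
As a last step, you take the cumulative sum of a comparison of these values with the previous values. Each time index < shift(index, fill=1) is TRUE, the cumsum function counts it as one.

To see more clearly what this code does, look at the output of the code below, which creates a separate variable for each step: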

setDT(d1)[, index1 := (rleid(Date)-1) %% 3 
      ][, index2 := cumsum(index1 < shift(index1, fill=1))][]
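
For readers without data.table at hand, the same steps can be traced in base R. This is only a sketch: the `run_id` line is a hand-rolled stand-in for rleid, and the `lagged` line stands in for shift(index1, fill=1); the dates are the numeric values from the data below.

```r
dates <- c(16801, 16802, 16802, 16802, 16803, 16803, 16806, 16806, 16806,
           16808, 16808, 16814, 16814, 16816, 16817, 16817, 16817, 16824,
           16824, 16825, 16825, 16826, 16829, 16829, 16829, 16829)

# Stand-in for rleid(): the id goes up by 1 each time the date changes
run_id <- cumsum(c(TRUE, diff(dates) != 0))

# Step 1: cycle through 0, 1, 2 over consecutive runs
index1 <- (run_id - 1) %% 3

# Stand-in for shift(index1, fill = 1): previous value, with 1 in front
lagged <- c(1, head(index1, -1))

# Step 2: a new group starts wherever the cycle drops back to 0
index2 <- cumsum(index1 < lagged)

data.frame(dates, run_id, index1, index2)
```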

Used data:

d1 <- structure(list(Date = structure(c(16801, 16802, 16802, 16802, 16803, 16803, 16806, 
             16806, 16806, 16808, 16808, 16814, 16814, 16816, 
             16817, 16817, 16817, 16824, 16824, 16825, 16825, 
             16826, 16829, 16829, 16829, 16829), class = "Date")), 
       .Names = "Date", row.names = c(NA, 26L), class = "data.frame")

Source

2016-05-13 20:06:19 Jaap

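Very nice! Could you explain how the code proceeds (I am still at a beginner level)? –

@PolarBear I have updated my answer with an explanation, HTH – Jaap

Base R. We can modify the rle (run-length encoding) object so that its values are grouped in triplets: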

DF$index = with(rle(DF$Date), { 
    g = ceiling(seq_along(values)/3) 
    split(values, g) <- seq(tail(g,1)) 
    inverse.rle(list(lengths = lengths, values = values)) 
})

The weird split(x,g) <- bit is borrowed from ave. If the Date column is increasing, this can be done more simply (thanks @Jaap):

DF$index = ceiling(match(DF$Date, unique(DF$Date))/3) # or... 
DF$index = ceiling(as.integer(factor(DF$Date))/3)
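
Why the "increasing" condition matters can be seen on two toy vectors (made up for illustration). On increasing data the two one-liners agree; on unordered data factor() sorts its levels, so the grouping follows sorted order rather than order of appearance.

```r
# Increasing toy dates: both one-liners give the same groups
inc <- c(5, 5, 7, 9, 9, 12, 15)
by_match_inc  <- ceiling(match(inc, unique(inc)) / 3)
by_factor_inc <- ceiling(as.integer(factor(inc)) / 3)

# Unordered toy dates: match() numbers by first appearance,
# factor() numbers by sorted value, so the results differ
uno <- c(9, 9, 1, 2, 2, 5)
by_match_uno  <- ceiling(match(uno, unique(uno)) / 3)
by_factor_uno <- ceiling(as.integer(factor(uno)) / 3)

rbind(by_match_uno, by_factor_uno)
```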

data.table. The data.table analogue is simple:

library(data.table) 
setDT(DF)[, index := ceiling(rleid(Date)/3)]

Source

2016-05-13 20:50:29 Frank

I used the data from an earlier version of the question:

df <- data.frame(Date = c("01-01-16", "02-01-16", "02-01-16", "02-01-16", 
         "03-01-16", "03-01-16", "06-01-16", "06-01-16", "06-01-16", "08-01-16", 
         "08-01-16", "14-01-16", "14-01-16", "16-01-16", "17-01-16", "17-01-16", 
         "17-01-16", "24-01-16", "24-01-16", "25-01-16", "25-01-16", "26-01-16", 
         "29-01-16", "29-01-16", "29-01-16", "29-01-16"), 
        Index = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 
         3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L))

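I would convert the Date column from character to Date and make sure the data frame is sorted by date to begin with. You do not need this part with the new version of the data, where Date is already numeric, provided you are sure the data frame is already sorted by date: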

df$Date <- as.Date(df$Date, format="%d-%m-%y") 
df <- df[ order(df$Date),]

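Then I would convert the dates to consecutive integers. One way is to convert to a factor and take its integer codes (c() used to work as shorthand for unclass here, but since R 4.1.0 it returns a factor, so as.integer() is the safer call), and then cut the result into equal intervals:

# integer code per distinct date: 1, 2, 3, ... in date order
df$ndx <- as.integer(factor(as.numeric(df$Date)))
# breaks every 3 codes; upper bound padded so a last, incomplete group is covered
df$ndx <- cut(df$ndx, seq(0.5, max(df$ndx) + 3, by = 3), labels = FALSE)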
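
A self-contained sketch of this factor-then-cut idea on a few made-up dates. It takes the factor codes with as.integer() and pads the upper break so that a final, incomplete group of dates is still covered; both are choices made for this sketch.

```r
# Made-up, already sorted dates: 7 distinct values
dates <- as.Date(c("2016-01-01", "2016-01-02", "2016-01-02", "2016-01-03",
                   "2016-01-06", "2016-01-08", "2016-01-14", "2016-01-16"))

# Consecutive integer code per distinct date (1, 2, 3, ... in date order)
code <- as.integer(factor(as.numeric(dates)))

# Breaks at 0.5, 3.5, 6.5, ... so codes 1-3 fall in bin 1, 4-6 in bin 2, ...
breaks <- seq(0.5, max(code) + 3, by = 3)

# labels = FALSE returns the bin number instead of an interval label
ndx <- cut(code, breaks, labels = FALSE)
ndx
```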

Source

2016-05-13 21:43:23 lebatsnok
