Cumulative count of unique values in R

A simplified version of my data set would look like:

depth value 
    1  a 
    1  b 
    2  a 
    2  b 
    2  b 
    3  c

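I would like to build a new data set in which, for each value of "depth", I have the cumulative number of unique values, counted from the top. For example: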

depth cumsum 
1  2 
2  2 
3  3

Any ideas on how to do this? I am relatively new to R.

2013-03-29 user2223405

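A good first step is to create a column of TRUE or FALSE that is TRUE for the first appearance of each value and FALSE for every later appearance of that value. This is easy to do with duplicated: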

mydata$first.appearance = !duplicated(mydata$value)

Reshaping the data is best done with aggregate. In this case, the call says to sum the first.appearance column within each subset of depth:

newdata = aggregate(first.appearance ~ depth, data=mydata, FUN=sum)

The result will look like:

depth first.appearance 
1  1 2 
2  2 0 
3  3 1

This is still not a cumulative sum, though. For that you can use the cumsum function (and then get rid of the old column):

newdata$cumsum = cumsum(newdata$first.appearance) 
newdata$first.appearance = NULL

So, to recap:

mydata$first.appearance = !duplicated(mydata$value) 
newdata = aggregate(first.appearance ~ depth, data=mydata, FUN=sum) 
newdata$cumsum = cumsum(newdata$first.appearance) 
newdata$first.appearance = NULL

Output:

depth cumsum 
1  1  2 
2  2  2 
3  3  3
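
For anyone who wants to run the recap end to end, here is a minimal sketch that rebuilds the question's sample data first. The explicit sort by depth is an added assumption, not part of the answer above: !duplicated() marks first appearances in row order, so the rows need to be ordered by depth for "first" to mean "shallowest".

```r
# Sample data from the question.
mydata <- data.frame(
  depth = c(1, 1, 2, 2, 2, 3),
  value = c("a", "b", "a", "b", "b", "c")
)

# Assumption: sort by depth so that row order matches depth order.
mydata <- mydata[order(mydata$depth), ]

# TRUE on the first appearance of each value, FALSE afterwards.
mydata$first.appearance <- !duplicated(mydata$value)

# Number of newly seen values per depth, then the running total.
newdata <- aggregate(first.appearance ~ depth, data = mydata, FUN = sum)
newdata$cumsum <- cumsum(newdata$first.appearance)
newdata$first.appearance <- NULL

print(newdata)
```

If the data is already sorted by depth, the order() line changes nothing and can be dropped.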

2013-03-29 06:37:44

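Here is another solution using lapply(). unique(df$depth) builds a vector of the unique depth values. For each of those values, take only the value entries whose depth is less than or equal to that depth, then count the distinct entries with length(unique(...)). That count is stored in cumsum, and depth=x records the depth level it belongs to. do.call(rbind, ...) then combines the pieces into one data frame.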

do.call(rbind,lapply(unique(df$depth), 
       function(x) 
      data.frame(depth=x,cumsum=length(unique(df$value[df$depth<=x]))))) 
    depth cumsum 
1  1  2 
2  2  2 
3  3  3
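
The same idea can be wrapped in a small function. This is only a sketch: the name cum_unique_by_depth is made up for illustration, and sorting the unique depths is an added assumption so that unsorted input still yields rows in depth order.

```r
# Hypothetical helper (the name is not from the answer above).
cum_unique_by_depth <- function(df) {
  # Sorted unique depth levels, so the output is in depth order.
  depths <- sort(unique(df$depth))
  # For each level, count distinct values at that depth or shallower.
  counts <- vapply(
    depths,
    function(x) length(unique(df$value[df$depth <= x])),
    integer(1)
  )
  data.frame(depth = depths, cumsum = counts)
}

# The question's data, deliberately shuffled.
df <- data.frame(
  depth = c(3, 1, 2, 1, 2, 2),
  value = c("c", "a", "a", "b", "b", "b")
)
result <- cum_unique_by_depth(df)
print(result)
```

Because each depth level rescans the whole value column, the cost grows with the number of rows times the number of depth levels, which is fine for small data but worth measuring on large inputs.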

2013-03-29 06:45:31

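I think this is a perfect case for using factor and setting the levels carefully. I'll use data.table here to implement the idea. Make sure your value column is character (not an absolute requirement).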

Step 1: Convert your data.frame to a data.table, keeping only the unique rows.

require(data.table) 
dt <- as.data.table(unique(df)) 
setkey(dt, "depth") # just to be sure before factoring "value"

Step 2: Convert value to a factor and coerce it to numeric. Make sure to set the levels yourself (this is important).
```
dt[, id := as.numeric(factor(value, levels = unique(value)))] 
```

Step 3: Set the key columns so you can subset by depth, and pick only the last value for each depth.

setkey(dt, "depth", "id") 
dt.out <- dt[J(unique(depth)), mult="last"][, value := NULL] 

# depth id 
# 1:  1 2 
# 2:  2 2 
# 3:  3 3

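Step 4: Since the value in each row must be at least as large as the value in the previous row as depth increases, use cummax to get the final output.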
```
dt.out[, id := cummax(id)] 
```

Edit: The code above was for illustration. In practice you don't need a third column at all. This is how I'd write the final code.

require(data.table) 
dt <- as.data.table(unique(df)) 
setkey(dt, "depth") 
dt[, value := as.numeric(factor(value, levels = unique(value)))] 
setkey(dt, "depth", "value") 
dt.out <- dt[J(unique(depth)), mult="last"] 
dt.out[, value := cummax(value)]

Here is a trickier example, followed by the output from the code:

df <- structure(list(depth = c(1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 6), 
       value = structure(c(1L, 2L, 3L, 4L, 1L, 3L, 4L, 5L, 6L, 1L, 1L), 
       .Label = c("a", "b", "c", "d", "f", "g"), class = "factor")), 
       .Names = c("depth", "value"), row.names = c(NA, -11L), 
       class = "data.frame") 
# depth value 
# 1:  1  2 
# 2:  2  4 
# 3:  3  4 
# 4:  4  5 
# 5:  5  6 
# 6:  6  6

2013-03-29 09:43:27 Arun

Here is a `dplyr` version: `DF %>% arrange(depth) %>% mutate(value = cummax(as.numeric(factor(value, levels = unique(value))))) %>% arrange(depth, desc(value)) %>% distinct(depth)`. –

This approach also applies more generally, when depth and value are both string values. Thanks! – ecoe

@Arun This is a great solution! Thanks! – asterx

Here is another attempt:

numvals <- cummax(as.numeric(factor(mydf$value))) 
aggregate(numvals, list(depth=mydf$depth), max)

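Which gives:

  depth x
1     1 2
2     2 2
3     3 3

It seems to work with @Arun's example, too:

  depth x
1     1 2
2     2 4
3     3 4
4     4 5
5     5 6
6     6 6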

2013-03-29 10:19:54 juba

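I'm not entirely sure, but it seems that `depth` and `value` both have to be sorted at the same time. For example, this method won't count the unique occurrence of `c`, no matter how you set up this `data.table`: `mydf = data.table(data.frame(depth = c(1, 1, 2, 2, 6, 7), value = c("a", "b", "g", "h", "b", "c")))`. – ecoe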
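
The comment's example can be checked with a short base-R sketch. It suggests the cause is the default factor levels, which are sorted alphabetically, so `c` receives a smaller code than `g` and `h` and is hidden by cummax. Setting levels = unique(...) (the idea used in the data.table answer) numbers the values in order of first appearance instead. The variable names below are illustrative, and the rows are assumed to be sorted by depth already.

```r
mydf <- data.frame(
  depth = c(1, 1, 2, 2, 6, 7),
  value = c("a", "b", "g", "h", "b", "c"),
  stringsAsFactors = FALSE
)

# Default levels are alphabetical: a, b, c, g, h.
sorted_codes <- cummax(as.numeric(factor(mydf$value)))
by_sorted <- aggregate(sorted_codes, list(depth = mydf$depth), max)

# Levels in order of first appearance: a, b, g, h, c.
appearance_codes <- cummax(
  as.numeric(factor(mydf$value, levels = unique(mydf$value)))
)
by_appearance <- aggregate(appearance_codes, list(depth = mydf$depth), max)

print(by_sorted)      # x is 2, 5, 5, 5: overcounts at depth 2, misses "c"
print(by_appearance)  # x is 2, 4, 4, 5: matches a direct count
```

With appearance-ordered levels, each new value gets a code one larger than every earlier value, which is what makes the running maximum equal the running count of distinct values.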

This can be written fairly cleanly as a single SQL statement using the sqldf package. Assume DF is the original data frame:

library(sqldf) 

sqldf("select b.depth, count(distinct a.value) as cumsum 
    from DF a join DF b 
    on a.depth <= b.depth 
    group by b.depth" 
)
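
To see what the self-join computes without installing sqldf, here is a rough base-R sketch of the same logic. It is an illustration under stated assumptions, not what sqldf runs internally: the column names a_depth, a_value, and b_depth are invented, and the right-hand side is reduced to the unique depths, which does not change a distinct count.

```r
DF <- data.frame(
  depth = c(1, 1, 2, 2, 2, 3),
  value = c("a", "b", "a", "b", "b", "c"),
  stringsAsFactors = FALSE
)

# Rename columns so both sides of the join stay distinguishable.
a <- data.frame(a_depth = DF$depth, a_value = DF$value,
                stringsAsFactors = FALSE)
b <- data.frame(b_depth = unique(DF$depth))

# by = NULL asks merge() for the Cartesian product of a and b.
joined <- merge(a, b, by = NULL)

# Keep only the pairs that satisfy the join condition a.depth <= b.depth.
joined <- joined[joined$a_depth <= joined$b_depth, ]

# count(distinct a.value) ... group by b.depth
counts <- tapply(joined$a_value, joined$b_depth,
                 function(v) length(unique(v)))
result <- data.frame(depth = as.numeric(names(counts)),
                     cumsum = as.vector(counts))
print(result)
```

The intermediate product has one row per (row, depth level) pair, which is why this formulation can get expensive as the data grows.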

2013-03-29 15:03:14

This is very useful, assuming `depth` is numeric. If `depth` is a string, or a string representation of a date as in my case, it can be a very expensive operation. – ecoe

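In many cases speed does not matter and clarity is the more important concern. If performance does matter, you have to test it rather than make assumptions, and if it turns out to be too slow, add an index and test again. –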
