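Collapse a column by a grouping variable (in base R)

I have a text variable and a grouping variable. I'd like to collapse the text variable into one string per row (combined) for each consecutive run of the grouping variable. So as long as the group column says m, I want to paste the text together, and so on. I've provided a sample data set showing before and after. I am writing this for a package and have so far avoided all dependencies on other packages except wordcloud, and I'd like to keep it that way.
I suspect rle may be useful together with cumsum, but I haven't been able to figure this one out.
Thank you in advance.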
What the data look like
text group
1 Computer is fun. Not too fun. m
2 No its not, its dumb. m
3 How can we be certain? f
4 There is no way. m
5 I distrust you. m
6 What are you talking about? f
7 Shall we move on? Good then. f
8 Im hungry. Lets eat. You already? m
What I'd like the data to look like
text group
1 Computer is fun. Not too fun. No its not, its dumb. m
2 How can we be certain? f
3 There is no way. I distrust you. m
4 What are you talking about? Shall we move on? Good then. f
5 Im hungry. Lets eat. You already? m
The data
dat <- structure(list(text = c("Computer is fun. Not too fun.", "No its not, its dumb.",
"How can we be certain?", "There is no way.", "I distrust you.",
"What are you talking about?", "Shall we move on? Good then.",
"Im hungry. Lets eat. You already?"), group = structure(c(2L,
2L, 1L, 2L, 2L, 1L, 1L, 2L), .Label = c("f", "m"), class = "factor")), .Names = c("text",
"group"), row.names = c(NA, 8L), class = "data.frame")
EDIT: I found I can add a column that uniquely identifies each run of the group variable with:
x <- rle(as.character(dat$group))$lengths   # length of each run of identical group values
dat$new <- as.factor(rep(seq_along(x), x))  # run id, repeated once per row in that run
Which yields:
text group new
1 Computer is fun. Not too fun. m 1
2 No its not, its dumb. m 1
3 How can we be certain? f 2
4 There is no way. m 3
5 I distrust you. m 3
6 What are you talking about? f 4
7 Shall we move on? Good then. f 4
8 Im hungry. Lets eat. You already? m 5
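One possible way to finish from here, sketched in base R only: paste the text together within each run id and take one group label per run. The names run_id, run_id2 and collapsed are illustrative and not part of the original post, and this is only a minimal sketch rather than a tested package function.

```r
# Sample data from the post (text and group columns)
dat <- data.frame(
  text = c("Computer is fun. Not too fun.", "No its not, its dumb.",
           "How can we be certain?", "There is no way.", "I distrust you.",
           "What are you talking about?", "Shall we move on? Good then.",
           "Im hungry. Lets eat. You already?"),
  group = factor(c("m", "m", "f", "m", "m", "f", "f", "m")),
  stringsAsFactors = FALSE
)

# Run lengths and values of consecutive identical group labels
r <- rle(as.character(dat$group))

# Hypothetical helper: one id per run, repeated for every row in that run
run_id <- rep(seq_along(r$lengths), r$lengths)

# Alternative with cumsum: the id increases whenever the label changes
g <- as.character(dat$group)
run_id2 <- cumsum(c(TRUE, g[-1] != g[-length(g)]))

# Paste the text within each run; r$values already holds one label per run
collapsed <- data.frame(
  text  = as.vector(tapply(dat$text, run_id, paste, collapse = " ")),
  group = r$values,
  stringsAsFactors = FALSE
)
collapsed
```

Both run_id and run_id2 should give the same ids here, so either the rle or the cumsum route can feed the tapply step.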
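I don't believe you need "seq_along(length(k$len))", because seq will act like "seq_along" on the k$length vector and give the appropriate numeric sequence: id <- rep(seq(k$length), k$length) – 2012-03-25 05:04:28
@BryanGoodrich Good catch. Originally I was just going to do 1:length(k$len), but lately I've been using seq and seq_along more, and I guess I ended up mixing the two approaches. – Dason 2012-03-25 05:28:35
I usually just stick with seq, but for clarity I can see how seq_along makes it explicit that you are stepping numerically through a vector of values. I often take the same clarity route when I index with a boolean vector in a technically redundant way, as in x[(some logic here...)]. It isn't necessary, but it gives my code the linguistic clarity I prefer. – 2012-03-26 07:16:24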
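To illustrate the seq versus seq_along point from these comments, here is a small base R sketch. The vectors lens and one_run are made-up examples, not data from the thread. As far as I know, the two functions agree whenever the lengths vector has more than one element, but differ when there is only a single run, which may be one reason to prefer seq_along here.

```r
# Several runs: seq() and seq_along() give the same index sequence
lens <- c(2L, 1L, 2L, 2L, 1L)
seq(lens)
seq_along(lens)

# A single run of length 3: seq() counts up to 3, seq_along() indexes one element
one_run <- 3L
seq(one_run)
seq_along(one_run)

# Effect on the run ids built with rep()
ids_seq   <- rep(seq(one_run), one_run)        # repeats 1:3 three times
ids_along <- rep(seq_along(one_run), one_run)  # a single id for all three rows
```

With a single run, only the seq_along version yields one id per row of that run.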