2012-03-25 41 views
3

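Collapse a column by a grouping variable (in base R)

I have a text variable and a grouping variable. I'd like to collapse the text variable into one string per row (combined), so that consecutive rows where the group column says m have their text pasted together, and likewise for f. I've provided a sample data set, before and after. I am writing this for a package and have so far avoided all dependencies on other packages except wordcloud, and I'd like to keep it that way.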

I suspect rle may be useful together with cumsum, but I haven't been able to figure this one out.

Thank you in advance.

What the data looks like

                               text group
1     Computer is fun. Not too fun.     m
2             No its not, its dumb.     m
3            How can we be certain?     f
4                  There is no way.     m
5                   I distrust you.     m
6       What are you talking about?     f
7      Shall we move on? Good then.     f
8 Im hungry. Lets eat. You already?     m

What I'd like the data to look like

                                                      text group
1      Computer is fun. Not too fun. No its not, its dumb.     m
2                                   How can we be certain?     f
3                         There is no way. I distrust you.     m
4 What are you talking about? Shall we move on? Good then.     f
5                        Im hungry. Lets eat. You already?     m

The data

dat <- structure(list(text = c("Computer is fun. Not too fun.", "No its not, its dumb.", 
"How can we be certain?", "There is no way.", "I distrust you.", 
"What are you talking about?", "Shall we move on? Good then.", 
"Im hungry. Lets eat. You already?"), group = structure(c(2L, 
2L, 1L, 2L, 2L, 1L, 1L, 2L), .Label = c("f", "m"), class = "factor")), .Names = c("text", 
"group"), row.names = c(NA, 8L), class = "data.frame") 

EDIT: I found that I can add a column with a unique id for each run of the group variable using:

x <- rle(as.character(dat$group))[[1]] 
dat$new <- as.factor(rep(1:length(x), x)) 

Yields:

                               text group new
1     Computer is fun. Not too fun.     m   1
2             No its not, its dumb.     m   1
3            How can we be certain?     f   2
4                  There is no way.     m   3
5                   I distrust you.     m   3
6       What are you talking about?     f   4
7      Shall we move on? Good then.     f   4
8 Im hungry. Lets eat. You already?     m   5
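
For what it's worth, the cumsum route suspected above can probably be made to work too. The following is only a rough sketch on a toy vector (the names `group`, `starts`, `run_id_cumsum` and `run_id_rle` are made up for illustration): mark every position where the value changes from the previous one, then take the cumulative sum of those marks. It should give the same run ids as the rle approach.

```r
# Toy grouping vector (an assumption; it mirrors the pattern of the sample data)
group <- c("m", "m", "f", "m", "m", "f", "f", "m")

# TRUE wherever the value differs from the previous element;
# the first element always starts a new run
starts <- c(TRUE, group[-1] != group[-length(group)])

# The cumulative sum of the run starts gives one id per run
run_id_cumsum <- cumsum(starts)

# The same ids built from rle(): repeat each run's index by its length
r <- rle(group)
run_id_rle <- rep(seq_along(r$lengths), r$lengths)

# Compare the two constructions
identical(run_id_cumsum, run_id_rle)
```

Either vector could then serve as the id for collapsing the text.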

Answers

5

This makes use of rle to create an id to group the sentences on. It then uses tapply along with paste to bring the output together.

## Your example data 
dat <- structure(list(text = c("Computer is fun. Not too fun.", "No its not, its dumb.", 
"How can we be certain?", "There is no way.", "I distrust you.", 
"What are you talking about?", "Shall we move on?  Good then.", 
"Im hungry.  Lets eat.  You already?"), group = structure(c(2L, 
2L, 1L, 2L, 2L, 1L, 1L, 2L), .Label = c("f", "m"), class = "factor")), .Names = c("text", 
"group"), row.names = c(NA, 8L), class = "data.frame") 


# Needed for later 
k <- rle(as.numeric(dat$group)) 
# Create a grouping vector 
id <- rep(seq_along(k$len), k$len) 
# Combine the text in the desired manner 
out <- tapply(dat$text, id, paste, collapse = " ") 
# Bring it together into a data frame 
answer <- data.frame(text = out, group = levels(dat$group)[k$val]) 
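
Since this is headed for a package, the same steps could be wrapped in a small function. This is only a sketch under assumptions: the name `collapse_runs` and the toy `example` data are hypothetical and not from the question, and the group is converted to character first so that it works for factors and character vectors alike.

```r
# Hypothetical helper (not from the thread): collapse consecutive rows
# that share the same group into a single string per run
collapse_runs <- function(text, group) {
  group <- as.character(group)
  runs <- rle(group)
  # One id per run, repeated for every row in that run
  id <- rep(seq_along(runs$lengths), runs$lengths)
  # Paste the text of each run together, separated by a space
  collapsed <- tapply(text, id, paste, collapse = " ")
  data.frame(text = as.vector(collapsed), group = runs$values,
             stringsAsFactors = FALSE)
}

# Toy data, assumed for illustration
example <- data.frame(
  text = c("a", "b", "c", "d", "e"),
  group = c("m", "m", "f", "m", "m"),
  stringsAsFactors = FALSE
)
res <- collapse_runs(example$text, example$group)
res
```

With the question's data the call would be `collapse_runs(dat$text, dat$group)`.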
+1

I don't believe you need "seq(length(k$len))", because seq will treat the k$length vector the way seq_along does, giving the appropriate numeric sequence: id <- rep(seq(k$length), k$length) – 2012-03-25 05:04:28

+0

@BryanGoodrich Good catch. Originally I was just going to do 1:length(k$len), but lately I've been using seq and seq_along more, and I guess I ended up mixing the two approaches. – Dason 2012-03-25 05:28:35

+0

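I usually just stick to seq, but I can see how seq_along makes it explicit that you are stepping numerically over a vector of values. I often go that clarity route myself when dealing with redundancy on boolean vectors, as in x[(some logic here...)]. It's not necessary, but it gives the code a linguistic clarity that I prefer. – 2012-03-26 07:16:24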
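
One difference between the two that may be worth keeping in mind, shown as a small sketch with made-up vectors (`lens_many` and `lens_one` are illustrative names): seq and seq_along agree when the vector has more than one element, but when it has exactly one element seq counts up to that value instead of returning a single index. That case could come up here if the whole group column were one run.

```r
# Run lengths with several runs: both calls index the vector
lens_many <- c(2L, 1L, 2L)
many_seq <- seq(lens_many)
many_along <- seq_along(lens_many)

# Run lengths with a single run of 8 rows
lens_one <- 8L
one_seq <- seq(lens_one)          # counts from 1 up to the value 8
one_along <- seq_along(lens_one)  # a single index

# Only the seq_along version gives one id per row in the single-run case
length(rep(one_along, lens_one))
length(rep(one_seq, lens_one))
```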

1

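I worked out an answer and came back to post it, but Dason beat me to it, and with something easier to understand than my own.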

x <- rle(as.character(dat$group))[[1]] 
dat$new <- as.factor(rep(1:length(x), x)) 

Paste <- function(x) paste(x, collapse=" ") 
aggregate(text~new, dat, Paste) 

EDIT: How I would do this with aggregate, using what I learned from your response (though tapply is the better solution):

y <- rle(as.character(dat$group)) 
x <- y[[1]] 
dat$new <- as.factor(rep(1:length(x), x)) 

text <- aggregate(text~new, dat, paste, collapse = " ")[, 2] 
data.frame(text, group = y[[2]]) 
+1

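Note that you don't need to define Paste, since aggregate lets you pass additional arguments on to the function being applied. You should be able to drop Paste and use this instead: aggregate(text ~ new, dat, paste, collapse = " ") – Dason 2012-03-25 04:06:25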
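
A quick way to check that claim, sketched on a toy data frame (`toy`, `with_helper` and `with_dots` are illustrative names, not from the thread): run aggregate once with the Paste wrapper and once with paste plus collapse passed through, then compare the collapsed text.

```r
# Toy data with a precomputed run id column (assumed for illustration)
toy <- data.frame(
  text = c("a", "b", "c", "d"),
  new = factor(c(1, 1, 2, 3)),
  stringsAsFactors = FALSE
)

# Version with a wrapper function, as in the answer above
Paste <- function(x) paste(x, collapse = " ")
with_helper <- aggregate(text ~ new, toy, Paste)

# Version passing collapse through aggregate's extra arguments
with_dots <- aggregate(text ~ new, toy, paste, collapse = " ")

# Compare the collapsed text from both versions
identical(with_helper$text, with_dots$text)
```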