Split strings and generate a frequency table in R

I have a column of firm names in an R data frame that looks like this:

"ABC Industries" 
"ABC Enterprises" 
"123 and 456 Corporation" 
"XYZ Company"

and so on. I am trying to produce a frequency table of every word that appears in this column, for example something like this:

Industries 10 
Corporation 31 
Enterprise 40 
ABC   30 
XYZ   40

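I'm fairly new to R, so I'd like to know a good way to approach this. Should I split the strings and put each distinct word into a new column? Is there a way to split a multi-word row into multiple rows with one word each?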

Source

2011-12-30 aesir

If you like, you can do this in a one-liner:

R> text <- c("ABC Industries", "ABC Enterprises", 
+   "123 and 456 Corporation", "XYZ Company") 
R> table(do.call(c, lapply(text, function(x) unlist(strsplit(x, " "))))) 

        123         456         ABC         and     Company 
          1           1           2           1           1 
Corporation Enterprises  Industries         XYZ 
          1           1           1           1 
R>

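Here I use strsplit() to break each entry into its components; this returns a list (within a list). I use do.call() to simply concatenate all the resulting pieces into one vector, which table() then summarises.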
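To make the steps easier to follow, here is the same one-liner unrolled into intermediate variables. This is only a sketch; the names pieces, words and freq are introduced for illustration and are not part of the original answer.

```r
text <- c("ABC Industries", "ABC Enterprises",
          "123 and 456 Corporation", "XYZ Company")

# Step 1: split each entry on a single space. strsplit() returns a list,
# unlist() flattens it, and lapply() collects one character vector per entry.
pieces <- lapply(text, function(x) unlist(strsplit(x, " ")))

# Step 2: do.call(c, ...) concatenates those vectors into one character vector.
words <- do.call(c, pieces)

# Step 3: table() counts how often each distinct word occurs.
freq <- table(words)
print(freq)
```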

Source

2011-12-30 04:38:35

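Thanks very much. I have been tinkering with the original code, and I found that I get the same result with: table(unlist(strsplit(text, " "))). What is the purpose of lapply() and do.call()? – aesir 2012-01-03 22:31:08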
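A quick check of the comment's observation: strsplit() is vectorised over its first argument, so for a plain character vector the shorter form should produce the same words as the lapply()/do.call() version. The variable names below are illustrative only.

```r
text <- c("ABC Industries", "ABC Enterprises",
          "123 and 456 Corporation", "XYZ Company")

# Original form: split each entry separately, then concatenate.
via_lapply <- do.call(c, lapply(text, function(x) unlist(strsplit(x, " "))))

# Shorter form: strsplit() handles the whole vector at once.
via_vectorised <- unlist(strsplit(text, " "))

# Both approaches yield the same character vector, hence the same table.
print(identical(via_lapply, via_vectorised))
print(table(via_vectorised))
```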

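Here's another one-liner. It uses paste() to combine all of the column entries into one long text string, which it then splits apart and tabulates: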

text <- c("ABC Industries", "ABC Enterprises", 
     "123 and 456 Corporation", "XYZ Company") 

table(strsplit(paste(text, collapse=" "), " "))

Source

2011-12-30 07:00:19

+1 Very nice. I would only add split = "\\s{1,}" to make it more robust – 2011-12-31 12:38:49

@WojciechSobala Yes, I had the same thought, and it is probably better/closer to what the OP wants. split = "\\s+" and split = "[[:space:]]+" are two other options that do exactly the same thing. – 2011-12-31 15:11:35
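The difference the comments describe shows up as soon as the input has repeated spaces, tabs, or padding. The sketch below uses a made-up vector, messy, and adds trimws() as an assumption of mine, since strsplit() still emits a leading empty string when an entry starts with whitespace.

```r
# Hypothetical input with a double space, a tab, and leading/trailing spaces.
messy <- c("ABC  Industries", "ABC\tEnterprises", " XYZ Company ")

# Splitting on one literal space leaves empty strings and does not split on the tab.
naive <- unlist(strsplit(messy, " "))

# Trimming first and splitting on runs of whitespace yields clean tokens.
robust <- unlist(strsplit(trimws(messy), "\\s+"))

print(naive)
print(table(robust))
```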

You can use the packages tidytext and dplyr:

set.seed(42) 

text <- c("ABC Industries", "ABC Enterprises", 
     "123 and 456 Corporation", "XYZ Company") 

data <- data.frame(category = sample(text, 100, replace = TRUE), 
        stringsAsFactors = FALSE) 

library(tidytext) 
library(dplyr) 

data %>% 
    unnest_tokens(word, category) %>% 
    group_by(word) %>% 
    count() 

#> # A tibble: 9 x 2
#> # Groups:   word [9]
#>   word            n
#>   <chr>       <int>
#> 1 123            29
#> 2 456            29
#> 3 abc            45
#> 4 and            29
#> 5 company        26
#> 6 corporation    29
#> 7 enterprises    21
#> 8 industries     24
#> 9 xyz            26
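Note that the words come back in lower case because unnest_tokens() lowercases tokens by default (its to_lower argument). For comparison, a rough base R sketch that produces a similar one-word-per-row data frame is shown below. It uses a small four-row data frame instead of the sampled one, and the column names word and n are chosen to mirror the output above; treat it as an approximation, not an exact equivalent of the tidytext tokenizer.

```r
data <- data.frame(category = c("ABC Industries", "ABC Enterprises",
                                "123 and 456 Corporation", "XYZ Company"),
                   stringsAsFactors = FALSE)

# Lowercase, trim, and split the column on runs of whitespace.
words <- unlist(strsplit(tolower(trimws(data$category)), "\\s+"))

# Turn the frequency table into a data frame with columns `word` and `n`.
freq <- as.data.frame(table(word = words), responseName = "n",
                      stringsAsFactors = FALSE)

# Most frequent words first, ties broken alphabetically.
freq <- freq[order(-freq$n, freq$word), ]
print(freq)
```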

Source

2018-02-02 14:03:27 FilipW
