2017-02-18

N-grams in R: computing word frequencies and the sum of values

I want to perform the following calculation.

Input:

Column_A     Column_B 
Word_A      10 
Word_A Word_B    20 
Word_B Word_A    30 
Word_A Word_B Word_C  40 

Output:

Column_A1     Column_B1 
Word_A      100 = 10+20+30+40 
Word_B      90 = 20+30+40 
Word_C      40 = 40 
Word_A Word_B    90 = 20+30+40 
Word_A Word_C    40 = 40 
Word_B Word_C    40 = 40 
Word_A Word_B Word_C  40 = 40 

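The order of the words in the output does not matter, so Word_A Word_B = 90 = Word_B Word_A. Using the RWeka and tm libraries I was able to extract unigrams (single words only), but I need n-grams with n = 1, 2, 3, and I need to compute Column_B1 for each of them.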

Answer

A tidyverse approach:

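library(tidyverse)
library(tokenizers)

# Example data matching the input table in the question
df <- tibble(Column_A = c("Word_A", "Word_A Word_B", "Word_B Word_A",
                          "Word_A Word_B Word_C"),
             Column_B = c(10L, 20L, 30L, 40L))

df %>%
    rowwise() %>%
    # contiguous 1- to 3-grams plus skip bigrams (e.g. "Word_A Word_C")
    mutate(ngram = list(c(tokenize_ngrams(Column_A, lowercase = FALSE, n = 3, n_min = 1),
                          tokenize_skip_ngrams(Column_A, lowercase = FALSE, n = 2),
                          recursive = TRUE)),
           # sort the words within each n-gram so that word order is ignored
           ngram = list(unique(map_chr(strsplit(ngram, ' '),
                                       ~paste(sort(.x), collapse = ' '))))) %>%
    ungroup() %>%
    unnest(ngram) %>%
    # sum Column_B for every distinct n-gram
    count(ngram, wt = Column_B)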

## # A tibble: 7 × 2 
##     ngram  n 
##     <chr> <int> 
## 1    Word_A 100 
## 2  Word_A Word_B 90 
## 3 Word_A Word_B Word_C 40 
## 4  Word_A Word_C 40 
## 5    Word_B 90 
## 6  Word_B Word_C 40 
## 7    Word_C 40 

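Note that, as written, this only works for strings of up to three words. For longer strings you would have to work out how many words the skip n-grams should skip, or take a different approach.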
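
One possible "different approach", sketched here in base R, is to skip the tokenizer entirely and enumerate word combinations with combn(). This is only a sketch under an assumption: it treats every set of 1 to 3 distinct words in a row as an "n-gram", whether or not the words are adjacent, which matches the expected output in the question (Word_A Word_C is counted) but is not the usual definition of an n-gram. The helper name word_combos and its max_n parameter are made up for illustration.

```r
# Example input mirroring the table in the question.
df <- data.frame(
  Column_A = c("Word_A", "Word_A Word_B", "Word_B Word_A", "Word_A Word_B Word_C"),
  Column_B = c(10, 20, 30, 40),
  stringsAsFactors = FALSE
)

# Hypothetical helper: all order-insensitive combinations of 1..max_n
# distinct words from a single space-separated string.
word_combos <- function(text, max_n = 3) {
  # Sorting first means every combination comes out in a canonical order,
  # so "Word_B Word_A" and "Word_A Word_B" produce the same key.
  words <- sort(unique(strsplit(text, " ", fixed = TRUE)[[1]]))
  sizes <- seq_len(min(max_n, length(words)))
  unlist(lapply(sizes, function(k) {
    combn(words, k, FUN = paste, collapse = " ")
  }))
}

# One row per (combination, value) pair, then sum the values per combination.
combos <- lapply(df$Column_A, word_combos)
long <- data.frame(
  ngram = unlist(combos),
  value = rep(df$Column_B, lengths(combos)),
  stringsAsFactors = FALSE
)
result <- aggregate(value ~ ngram, data = long, FUN = sum)
print(result)
```

Because the combinations are built from the sorted, de-duplicated words of each row, this also handles rows with more than three words without any skip setting to tune. The number of combinations grows quickly with the number of words per row, so for long strings you would probably want to keep max_n small.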