Keep exact words from R corpus

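From the posted answer to "Keep document ID with R corpus" by @MrFlick.

I am trying to slightly modify a great example.

Question: How can I modify the content_transformer function to keep only exact words? You can see in the inspect output that wonderful is counted as wonder and rationale is counted as ratio. I do not have a strong understanding of gregexpr and regmatches.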

Create the data frame:

dd <- data.frame(
    id = 10:13,
    text = c("No wonderful, then, that ever",
      "So that in many cases such a ",
      "But there were still other and",
      "Not even at the rationale"),
    stringsAsFactors = FALSE
)

Now, in order to read special attributes from a data.frame, we will use the readTabular function to make our own custom data.frame reader:

library(tm) 
myReader <- readTabular(mapping = list(content = "text", id = "id"))

Here we specify the columns to use for the content and the id in the data.frame. Now we read it in with DataframeSource, but using our custom reader.

tm <- VCorpus(DataframeSource(dd), readerControl = list(reader = myReader))

Now, if we want to keep only a certain set of words, we can create our own content_transformer function. One way to do this is

keepOnlyWords <- content_transformer(function(x, words) { 
     regmatches(x, 
      gregexpr(paste0("\\b(", paste(words, collapse = "|"), "\\b)"), x) 
     , invert = T) <- " " 
     x 
    })

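This replaces everything that is not in the word list with a space. Note that you may want to run stripWhitespace after this. Thus our transformations would look like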

keep <- c("wonder", "then", "that", "the") 

tm <- tm_map(tm, content_transformer(tolower)) 
tm <- tm_map(tm, keepOnlyWords, keep) 
tm <- tm_map(tm, stripWhitespace)
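
For readers who, like the asker, are unsure what regmatches and gregexpr are doing here, the replacement step can be tried on a single string without tm. This is only a minimal sketch: the pattern "then|that" is a deliberately simple, hypothetical stand-in (not the pattern from the post), and the final gsub/trimws line only approximates what stripWhitespace does.

```r
# Base R only: regmatches() and gregexpr() ship with R itself.
x <- "no wonderful, then, that ever"

# Hypothetical, deliberately simple pattern (not the one used in the post)
m <- gregexpr("then|that", x)

# The matched substrings
matched <- regmatches(x, m)[[1]]

# With invert = TRUE we get the pieces *between* the matches instead
between <- regmatches(x, m, invert = TRUE)[[1]]

# Assigning " " with invert = TRUE overwrites each of those pieces,
# so only the matched words (separated by spaces) survive
kept <- x
regmatches(kept, m, invert = TRUE) <- " "

# Collapse whitespace runs; a rough stand-in for tm's stripWhitespace
kept <- trimws(gsub("[[:space:]]+", " ", kept))

print(matched)
print(between)
print(kept)
```

The value " " is recycled across every non-matching piece, which is why a single string is enough on the right-hand side of the assignment.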

Inspecting the document-term matrix:

> inspect(dtm) 
<<DocumentTermMatrix (documents: 4, terms: 4)>> 
Non-/sparse entries: 7/9 
Sparsity   : 56% 
Maximal term length: 6 
Weighting   : term frequency (tf) 

    Terms
Docs ratio that the wonder
  10     0    1   1      1
  11     0    1   0      0
  12     0    0   1      0
  13     1    0   1      0

Source

2016-12-02 BEMR

Switching grammars to tidytext, your current transformation would be

library(tidyverse) 
library(tidytext) 
library(stringr) 

dd %>% unnest_tokens(word, text) %>%
    mutate(word = str_replace_all(word, setNames(keep, paste0('.*', keep, '.*')))) %>%
    inner_join(tibble(word = keep), by = "word")    # tibble() replaces the deprecated data_frame()

## id word 
## 1 10 wonder 
## 2 10 the 
## 3 10 that 
## 4 11 that 
## 5 12 the 
## 6 12 the 
## 7 13 the

Keeping exact matches is easier, because you can use joins (which compare with ==) instead of regex:

dd %>% unnest_tokens(word, text) %>%
    inner_join(tibble(word = keep), by = "word")    # tibble() replaces the deprecated data_frame()

## id word 
## 1 10 then 
## 2 10 that 
## 3 11 that 
## 4 13 the
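
The exact-match idea above (compare whole tokens with == rather than a regex) can also be sketched in base R, without tidytext. This is an illustration under assumptions, not the answer's method: the tokenizer below simply splits on anything that is not a lowercase letter, which is cruder than unnest_tokens, and the names tokens, exact and counts are made up for the example.

```r
dd <- data.frame(
    id = 10:13,
    text = c("No wonderful, then, that ever",
             "So that in many cases such a ",
             "But there were still other and",
             "Not even at the rationale"),
    stringsAsFactors = FALSE
)
keep <- c("wonder", "then", "that", "the")

# Crude tokenizer (an assumption): lowercase, then split on non-letters
tokens <- strsplit(tolower(dd$text), "[^a-z]+")
tokens <- lapply(tokens, function(w) w[nzchar(w)])

# Exact matching: a token is kept only if it equals one of the keep words
exact <- lapply(tokens, function(w) w[w %in% keep])

# Count each keep word per document: one row per document, one column per word
counts <- t(vapply(
    tokens,
    function(w) vapply(keep, function(k) sum(w == k), integer(1)),
    integer(length(keep))
))
dimnames(counts) <- list(Docs = as.character(dd$id), Terms = keep)

print(exact)
print(counts)
```

Because whole tokens are compared, "wonderful" no longer counts as "wonder", and document 12 ends up with no kept words at all.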

Taking it back to a document-term matrix,

library(tm) 

dd %>% mutate(id = factor(id)) %>% # to keep empty rows of DTM
    unnest_tokens(word, text) %>%
    inner_join(tibble(word = keep), by = "word") %>% # tibble() replaces the deprecated data_frame()
    mutate(i = 1) %>%
    cast_dtm(id, word, i) %>%
    inspect()

## <<DocumentTermMatrix (documents: 4, terms: 3)>> 
## Non-/sparse entries: 4/8 
## Sparsity   : 67% 
## Maximal term length: 4 
## Weighting   : term frequency (tf) 
## 
##  Terms 
## Docs then that the 
## 10 1 1 0 
## 11 0 1 0 
## 12 0 0 0 
## 13 0 0 1

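Currently, your function matches words with a boundary before or after. To change it to before and after, change the collapse parameter to include the boundaries: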

tm <- VCorpus(DataframeSource(dd), readerControl = list(reader = myReader)) 

keepOnlyWords<-content_transformer(function(x,words) { 
     regmatches(x, 
      gregexpr(paste0("(\\b", paste(words, collapse = "\\b|\\b"), "\\b)"), x) 
     , invert = T) <- " " 
     x 
    }) 

tm <- tm_map(tm, content_transformer(tolower)) 
tm <- tm_map(tm, keepOnlyWords, keep) 
tm <- tm_map(tm, stripWhitespace) 

inspect(DocumentTermMatrix(tm)) 

## <<DocumentTermMatrix (documents: 4, terms: 3)>> 
## Non-/sparse entries: 4/8 
## Sparsity   : 67% 
## Maximal term length: 4 
## Weighting   : term frequency (tf) 
## 
##  Terms 
## Docs that the then 
## 10 1 0 1 
## 11 1 0 0 
## 12 0 0 0 
## 13 0 1 0

Source

2016-12-02 15:56:45 alistaire

Thank you for the detailed answer. Great! @alistaire – BEMR

I get the same result as @alistaire using tm, with the following line modified in the keepOnlyWords content transformer first defined by @BEMR:

gregexpr(paste0("\\b(", paste(words, collapse = "|"), ")\\b"), x)

There is a misplaced ")" in the gregexpr first specified by @BEMR: it should be ")\\b", not "\\b)".

I think the gregexpr above is equivalent to the one specified by @alistaire:

gregexpr(paste0("(\\b", paste(words, collapse = "\\b|\\b"), "\\b)"), x)
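
The equivalence suggested above can be checked on the thread's sample sentences in base R, without tm. This is a small sketch with a few assumptions: the helper find_words and the variable names are made up for illustration, and perl = TRUE is my own choice here (the thread uses the default regex engine) so that word boundaries are handled by PCRE.

```r
texts <- tolower(c("No wonderful, then, that ever",
                   "So that in many cases such a ",
                   "But there were still other and",
                   "Not even at the rationale"))
keep <- c("wonder", "then", "that", "the")

# Pattern from the question: the closing ")" sits after the trailing \\b,
# so only the last alternative ("the") gets a trailing boundary
question_pat  <- paste0("\\b(", paste(keep, collapse = "|"), "\\b)")
# Pattern from @alistaire: boundaries wrapped around every alternative
alistaire_pat <- paste0("(\\b", paste(keep, collapse = "\\b|\\b"), "\\b)")
# Pattern with the parenthesis moved: boundaries around the whole group
grouped_pat   <- paste0("\\b(", paste(keep, collapse = "|"), ")\\b")

# Hypothetical helper: extract all matches per text
# (perl = TRUE is an assumption, not what the thread used)
find_words <- function(pattern) {
    regmatches(texts, gregexpr(pattern, texts, perl = TRUE))
}

from_question  <- find_words(question_pat)
from_alistaire <- find_words(alistaire_pat)
from_grouped   <- find_words(grouped_pat)

print(identical(from_alistaire, from_grouped))
print(from_question[[1]])   # still picks "wonder" out of "wonderful"
print(from_grouped[[1]])    # whole words only
```

On these four sentences the two corrected patterns return the same matches, while the pattern from the question still extracts "wonder" from "wonderful".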

Source

2017-09-18 04:33:54
