2016-12-02

Keep exact words from R corpus

From the posted answer "Keep document ID with R corpus" by @MrFlick:

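I would like to slightly modify what is a great example.

Question: How can I modify the content_transformer function so that it keeps only exact words? You can see in the inspect output that "wonderful" is counted as "wonder" and "rationale" is counted as "ratio". I do not have a deep understanding of gregexpr and regmatches.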

Create the data frame:

dd <- data.frame(
  id = 10:13,
  text = c("No wonderful, then, that ever",
           "So that in many cases such a ",
           "But there were still other and",
           "Not even at the rationale"),
  stringsAsFactors = FALSE
)

Now, in order to read special attributes from a data.frame, we will use the readTabular function to make our own custom data.frame reader:

library(tm) 
myReader <- readTabular(mapping = list(content = "text", id = "id")) 

This specifies which columns of the data.frame to use for the content and the id. Now we read it in with DataframeSource, but using our custom reader.

tm <- VCorpus(DataframeSource(dd), readerControl = list(reader = myReader)) 

Now, if we want to keep only a certain set of words, we can create our own content_transformer function. One way to do this is

keepOnlyWords <- content_transformer(function(x, words) {
  regmatches(x,
    gregexpr(paste0("\\b(", paste(words, collapse = "|"), "\\b)"), x),
    invert = TRUE) <- " "
  x
})

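This replaces everything that is not in the word list with a space. Note that you will probably want to run stripWhitespace after this. So our transformations would look like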

keep <- c("wonder", "then", "that", "the") 

tm <- tm_map(tm, content_transformer(tolower)) 
tm <- tm_map(tm, keepOnlyWords, keep) 
tm <- tm_map(tm, stripWhitespace) 
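To see what the transformer does to a single string without going through tm, here is a minimal base R sketch. The helper name keep_only_words_plain and the gsub/trimws cleanup (standing in for stripWhitespace) are illustrative assumptions, not part of the original answer; the pattern is built exactly as in keepOnlyWords above.

```r
# Hypothetical base R helper mirroring the keepOnlyWords transformer above.
keep_only_words_plain <- function(x, words) {
  pattern <- paste0("\\b(", paste(words, collapse = "|"), "\\b)")
  # invert = TRUE selects the text between matches; each stretch becomes " "
  regmatches(x, gregexpr(pattern, x), invert = TRUE) <- " "
  # Collapse runs of whitespace and trim the ends (stand-in for stripWhitespace)
  trimws(gsub("\\s+", " ", x))
}

keep <- c("wonder", "then", "that", "the")
out <- keep_only_words_plain("no wonderful, then, that ever", keep)
print(out)
```

With this pattern the result still contains "wonder", cut out of "wonderful", which is the partial-match behavior the question is about.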

Inspect the DTM:

> inspect(dtm)
<<DocumentTermMatrix (documents: 4, terms: 4)>>
Non-/sparse entries: 7/9
Sparsity           : 56%
Maximal term length: 6
Weighting          : term frequency (tf)

    Terms
Docs ratio that the wonder
  10     0    1   1      1
  11     0    1   0      0
  12     0    0   1      0
  13     1    0   1      0

Answers

1

Switching to tidytext syntax, your current transformation would be

library(tidyverse) 
library(tidytext) 
library(stringr) 

dd %>% unnest_tokens(word, text) %>%
  # rewrite any token that contains a keep word to that keep word
  mutate(word = str_replace_all(word, setNames(keep, paste0('.*', keep, '.*')))) %>%
  inner_join(tibble(word = keep), by = "word")

##   id   word
## 1 10 wonder
## 2 10    the
## 3 10   that
## 4 11   that
## 5 12    the
## 6 12    the
## 7 13    the

Keeping only exact matches is easier, because you can use a join (which compares with ==) instead of a regex:

dd %>% unnest_tokens(word, text) %>%
  inner_join(tibble(word = keep), by = "word")

##   id word
## 1 10 then
## 2 10 that
## 3 11 that
## 4 13  the
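The same equality-based idea can be sketched in base R without tidytext. This is only an illustration: the strsplit call on runs of non-letters is a rough stand-in for unnest_tokens (an assumption, not what tidytext does internally), and the variable names tokens and exact are made up for the example.

```r
dd <- data.frame(
  id = 10:13,
  text = c("No wonderful, then, that ever",
           "So that in many cases such a ",
           "But there were still other and",
           "Not even at the rationale"),
  stringsAsFactors = FALSE
)
keep <- c("wonder", "then", "that", "the")

# Rough tokenizer: lower-case, then split on runs of non-letters
tokens <- strsplit(tolower(dd$text), "[^a-z]+")
names(tokens) <- dd$id

# %in% compares whole tokens, so "wonderful" can never match "wonder"
exact <- lapply(tokens, function(tok) tok[tok %in% keep])
print(exact)
```

Because the comparison is on whole tokens, no word boundaries need to be expressed at all.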

To turn it back into a document-term matrix,

library(tm) 

dd %>% mutate(id = factor(id)) %>%    # to keep empty rows of DTM
  unnest_tokens(word, text) %>%
  inner_join(tibble(word = keep), by = "word") %>%
  mutate(i = 1) %>%
  cast_dtm(id, word, i) %>%
  inspect()

## <<DocumentTermMatrix (documents: 4, terms: 3)>>
## Non-/sparse entries: 4/8
## Sparsity           : 67%
## Maximal term length: 4
## Weighting          : term frequency (tf)
##
##     Terms
## Docs then that the
##   10    1    1   0
##   11    0    1   0
##   12    0    0   0
##   13    0    0   1
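The role of the factor in keeping the empty row for document 12 can be seen in a small base R sketch that builds a plain count table instead of a DocumentTermMatrix. The tokenizer and the names long and counts are assumptions made for illustration; this is not how cast_dtm works internally.

```r
dd <- data.frame(
  id = 10:13,
  text = c("No wonderful, then, that ever",
           "So that in many cases such a ",
           "But there were still other and",
           "Not even at the rationale"),
  stringsAsFactors = FALSE
)
keep <- c("wonder", "then", "that", "the")

# One row per token, carrying the document id along
tokens <- strsplit(tolower(dd$text), "[^a-z]+")
long <- data.frame(id = rep(dd$id, lengths(tokens)),
                   word = unlist(tokens),
                   stringsAsFactors = FALSE)
long <- long[long$word %in% keep, ]

# Explicit factor levels keep document 12 as an all-zero row,
# even though none of its tokens survived the filter
counts <- table(Docs = factor(long$id, levels = dd$id), Terms = long$word)
print(counts)
```

Without the explicit levels, document 12 would simply disappear from the table, since it has no rows left after filtering.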

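Currently, your function requires a word boundary before each of the words but, except for the last one, not after it. To require a boundary both before and after every word, change the collapse argument so that it includes the boundaries: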

tm <- VCorpus(DataframeSource(dd), readerControl = list(reader = myReader)) 

keepOnlyWords <- content_transformer(function(x, words) {
  regmatches(x,
    gregexpr(paste0("(\\b", paste(words, collapse = "\\b|\\b"), "\\b)"), x),
    invert = TRUE) <- " "
  x
})

tm <- tm_map(tm, content_transformer(tolower)) 
tm <- tm_map(tm, keepOnlyWords, keep) 
tm <- tm_map(tm, stripWhitespace) 

inspect(DocumentTermMatrix(tm)) 

## <<DocumentTermMatrix (documents: 4, terms: 3)>>
## Non-/sparse entries: 4/8
## Sparsity           : 67%
## Maximal term length: 4
## Weighting          : term frequency (tf)
##
##     Terms
## Docs that the then
##   10    1   0    1
##   11    1   0    0
##   12    0   0    0
##   13    0   1    0
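To make the difference between the two patterns concrete, the following base R sketch prints both generated regexes and extracts what each one matches in the first document. The variable names p_question, p_bounded, m_question and m_bounded are made up for this illustration.

```r
keep <- c("wonder", "then", "that", "the")
texts <- tolower(c("No wonderful, then, that ever",
                   "So that in many cases such a ",
                   "But there were still other and",
                   "Not even at the rationale"))

# Pattern from the question: a boundary before the group, and after "the" only
p_question <- paste0("\\b(", paste(keep, collapse = "|"), "\\b)")
# Pattern from this answer: a boundary on both sides of every word
p_bounded <- paste0("(\\b", paste(keep, collapse = "\\b|\\b"), "\\b)")
cat(p_question, p_bounded, sep = "\n")

# Extract the matched substrings per document
m_question <- regmatches(texts, gregexpr(p_question, texts))
m_bounded  <- regmatches(texts, gregexpr(p_bounded, texts))
print(m_question[[1]])
print(m_bounded[[1]])
```

On the first document the question's pattern picks "wonder" out of "wonderful", while the bounded pattern returns only "then" and "that".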

Thank you for the detailed answer. Great! @alistaire – BEMR

0

I get the same results as @alistaire with tm by changing the following line in the keepOnlyWords content transformer first defined by @BEMR:

gregexpr(paste0("\\b(", paste(words, collapse = "|"), ")\\b"), x) 

There is a misplaced ")" in the gregexpr first specified by @BEMR: it should be ")\\b", not "\\b)".

I think the gregexpr above is equivalent to the one specified by @alistaire:

gregexpr(paste0("(\\b", paste(words, collapse = "\\b|\\b"), "\\b)"), x)
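A quick way to check that claim on the sample data is to extract matches with both patterns and compare them. This is a sketch in base R with made-up names (p_grouped, p_repeated, extract_matches); agreement on four short texts supports the equivalence but does not prove it for every input.

```r
keep <- c("wonder", "then", "that", "the")
texts <- tolower(c("No wonderful, then, that ever",
                   "So that in many cases such a ",
                   "But there were still other and",
                   "Not even at the rationale"))

# One pair of boundaries around the whole alternation
p_grouped <- paste0("\\b(", paste(keep, collapse = "|"), ")\\b")
# A pair of boundaries repeated around every alternative
p_repeated <- paste0("(\\b", paste(keep, collapse = "\\b|\\b"), "\\b)")

# Return the matched substrings for each text
extract_matches <- function(pattern) regmatches(texts, gregexpr(pattern, texts))

same <- identical(extract_matches(p_grouped), extract_matches(p_repeated))
print(same)
```

The grouped form is shorter and keeps the boundaries in one place, which makes a misplaced parenthesis easier to spot.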