Subsetting a corpus based on the content of its text files

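I am using R and the tm package to do some text analysis. I am trying to build a subset of a corpus based on whether a certain expression is found in the content of the individual text files.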

I create a corpus with 20 text files (thank you lukeA for this example):

library(tm)
reut21578 <- system.file("texts", "crude", package = "tm")
corp <- VCorpus(DirSource(reut21578), list(reader = readReut21578XMLasPlain))

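I would now like to select only those text files that contain the string "price reduction", to create a subset corpus.

Inspecting the first text file of the corpus, I know that there is at least one text file containing that string: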

writeLines(as.character(corp[[1]]))

How would I best go about doing this?


2016-03-24 tarti

Here is one way, using tm_filter:

library(tm) 
reut21578 <- system.file("texts", "crude", package = "tm") 
corp <- VCorpus(DirSource(reut21578), list(reader = readReut21578XMLasPlain)) 

(corp_sub <- tm_filter(corp, function(x) any(grep("price reduction", content(x), fixed=TRUE)))) 
# <<VCorpus>> 
# Metadata: corpus specific: 0, document level (indexed): 0 
# Content: documents: 1 

cat(content(corp_sub[[1]])) 
# Diamond Shamrock Corp said that 
# effective today it had cut its contract prices for crude oil by 
# 1.50 dlrs a barrel. 
#  The reduction brings its posted price for West Texas 
# Intermediate to 16.00 dlrs a barrel, the copany said. 
#  "The price reduction today was made in the light of falling # <===== 
# oil product prices and a weak crude oil market," a company 
# spokeswoman said. 
#  Diamond is the latest in a line of U.S. oil companies that 
# have cut its contract, or posted, prices over the last two days 
# citing weak oil markets. 
# Reuter

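How did I get there? By looking at the package's vignette, searching for "subset", and then looking at the examples for tm_filter (help: ?tm_filter), which is mentioned there. It may also be worth looking at ?grep to check the options for pattern matching.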
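To make the pattern-matching options mentioned above a little more concrete, here is a minimal base R sketch. The three toy strings are made up for illustration and are not taken from the corpus; the point is only how fixed = TRUE, ignore.case = TRUE and a plain regular expression differ.

```r
# Three toy documents (hypothetical, for illustration only)
docs <- c(
  a = "The price reduction today was made in the light of falling prices.",
  b = "PRICE REDUCTION announced by the company.",
  c = "Prices were cut by 1.50 dlrs a barrel."
)

# fixed = TRUE: literal, case-sensitive substring match
hits_fixed <- grepl("price reduction", docs, fixed = TRUE)

# ignore.case = TRUE: regular-expression match that ignores case
hits_nocase <- grepl("price reduction", docs, ignore.case = TRUE)

# With fixed = TRUE a "." is a literal dot; as a regex it matches any character
dot_fixed <- grepl("1.50", "cut by 1x50 dlrs", fixed = TRUE)
dot_regex <- grepl("1.50", "cut by 1x50 dlrs")
```

With fixed = TRUE, as used in the tm_filter call, only an exact, case-sensitive occurrence of the phrase counts as a match.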


2016-03-24 15:41:39 lukeA

@lukeA's solution works. I would like to offer another solution, which I prefer.

library(tm)

reut21578 <- system.file("texts", "crude", package = "tm")
corp <- VCorpus(DirSource(reut21578), list(reader = readReut21578XMLasPlain))

# One TRUE/FALSE per document: does it contain the phrase?
corpTF <- lapply(corp, function(x) any(grep("price reduction", content(x), fixed=TRUE)))

# Store the flag in each document's local metadata
for (i in seq_along(corp))
  corp[[i]]$meta["mySubset"] <- corpTF[i]

# Keep only the documents whose flag is TRUE
idx <- meta(corp, tag = "mySubset") == 'TRUE'
filtered <- corp[idx]

cat(content(filtered[[1]]))

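The advantage of using meta tags is that every element of the corpus now carries the selection tag mySubset, with the value "TRUE" for the documents we selected and "FALSE" for all others.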
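A possible variation on the same idea, sketched under some assumptions: instead of writing the flag into each document's local metadata in a loop, the flags can be stored as indexed (per-document) metadata on the corpus in one assignment, in the style of the `meta(crude, "labels") <- 21:40` example in ?meta. The sketch uses the crude data set shipped with tm rather than reading the XML files, and the name has_phrase is introduced here purely for illustration.

```r
library(tm)
data("crude")  # the 20 Reuters documents, shipped with tm as a VCorpus

# One logical flag per document: does its text contain the literal phrase?
has_phrase <- vapply(
  crude,
  function(x) any(grepl("price reduction", content(x), fixed = TRUE)),
  logical(1)
)

# Store the flags as indexed document-level metadata (assumed usage, see ?meta)
meta(crude, "mySubset") <- unname(has_phrase)

# Inspect how many documents were flagged, then keep only those
table(meta(crude)$mySubset)
filtered <- crude[meta(crude)$mySubset]
```

Either way, the flag stays attached to the corpus, so the selection can be inspected or reused later without running the search again.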


2016-03-24 19:53:15 Vezir

Thank you very much for adding this. I agree, this is very useful! – tarti

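Here is a simpler way, using the quanteda package, and one that is more consistent with how other R objects reuse existing, already defined methods. quanteda has a subset method for corpus objects that works just like the subset method for a data.frame, but selects on logical vectors, including document variables defined in the corpus. Below, I extract the texts from the corpus using the texts() method for a corpus object, and use the result in grepl() to search for your pair of words.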

require(tm) 
data(crude) 

require(quanteda) 
# corpus constructor recognises tm Corpus objects 
(qcorpus <- corpus(crude)) 
## Corpus consisting of 20 documents. 
# use subset method 
(qcorpussub <- subset(qcorpus, grepl("price\\s+reduction", texts(qcorpus)))) 
## Corpus consisting of 1 document. 

# see the context 
## kwic(qcorpus, "price reduction") 
##      contextPre   keyword    contextPost 
## [127, 45:46] copany said." The [ price reduction ] today was made in the
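A note for readers on newer quanteda releases: as far as I know, the subset() method for corpus objects has been superseded by corpus_subset(), and texts() has been deprecated in favour of as.character(). A rough equivalent of the code above might then look like this; treat it as a sketch under that assumption and check it against your installed version. The names has_phrase and qsub are introduced here for illustration.

```r
library(tm)
library(quanteda)
data("crude")

# Convert the tm VCorpus to a quanteda corpus
qcorpus <- corpus(crude)

# as.character() on a corpus returns one text per document
has_phrase <- grepl("price\\s+reduction", as.character(qcorpus))

# corpus_subset() keeps the documents for which the condition is TRUE
qsub <- corpus_subset(qcorpus, has_phrase)
ndoc(qsub)
```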

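Note: I made the regular expression fuzzier with "\\s+", since you could have some variation of spaces, tabs, or newlines between the two words instead of just a single space.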
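A small base R illustration of why the whitespace pattern matters. The sample strings are made up for this sketch; they only vary the whitespace between the two words.

```r
# Hypothetical sample strings that differ only in the whitespace between words
samples <- c(
  single_space  = "The price reduction today",
  line_break    = "The price\nreduction today",
  tab_and_space = "The price\t reduction today"
)

# A literal match only finds the single-space variant
fixed_hits <- grepl("price reduction", samples, fixed = TRUE)

# \\s+ matches one or more whitespace characters (space, tab, newline)
regex_hits <- grepl("price\\s+reduction", samples)
```

This matters for wrapped newswire text, where a phrase can be split across a line break.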


2016-03-24 21:27:50
