2013-08-22
1

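How can I use regular expressions in a TermDocumentMatrix for text mining?

I know I can use the Dictionary function in the tm package to count the occurrences of specific words in a corpus: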

require(tm) 
data(crude) 

dic <- Dictionary("crude") 
tdm <- TermDocumentMatrix(crude, control = list(dictionary = dic, removePunctuation = TRUE)) 
inspect(tdm) 

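I was wondering whether there is a facility for supplying a regular expression to the dictionary instead of a fixed word?

Sometimes stemming may not be what I want (for example, I may want to pick up misspellings), so I would like to do something like this: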

dic <- Dictionary(c("crude", 
        "\\bcrud[[:alnum:]]+", 
        "\\bcrud[de]")) 

and thereby continue to use the facilities of the tm package?

Answers

3

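I'm not sure whether you can put a regular expression in the Dictionary function, since it only accepts a character vector or a term-document matrix. The workaround I'd suggest is to use a regular expression to subset the terms of the term-document matrix, and then do the word count: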

# What I would do instead 
tdm <- TermDocumentMatrix(crude, control = list(removePunctuation = TRUE)) 
# subset the tdm according to the criteria 
# this is where you can use regex 
crit <- grep("cru", tdm$dimnames$Terms) 
# have a look to see what you got (rows are terms, so subset rows) 
inspect(tdm[crit, ]) 
# A term-document matrix (2 terms, 20 documents) 
# 
# Non-/sparse entries: 10/30 
# Sparsity           : 75% 
# Maximal term length: 7 
# Weighting          : term frequency (tf) 
# 
#          Docs 
# Terms     127 144 191 194 211 236 237 242 246 248 273 349 352 353 368 489 502 543 
#   crucial   0   0   0   0   0   0   2   0   0   0   0   0   0   0   0   0   0   0 
#   crude     2   0   2   3   0   2   0   0   0   0   5   2   0   2   0   0   0   2 
#          Docs 
# Terms     704 708 
#   crucial   0   0 
#   crude     0   1 
# and count the number of times that criteria is met in each doc 
colSums(as.matrix(tdm[crit, ])) 
# 127 144 191 194 211 236 237 242 246 248 273 349 352 353 368 489 502 543 704 708 
#   2   0   2   3   0   2   2   0   0   0   5   2   0   2   0   0   0   2   0   1 
# count the total number of times in all docs 
sum(colSums(as.matrix(tdm[crit, ]))) 
# [1] 23 

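If that's not what you want, go ahead and edit your question to include some sample data that accurately represents your actual use case, along with an example of your desired output.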
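The same subsetting idea extends to several patterns at once, which is closer to the regex dictionary asked about. Below is a minimal sketch in base R only. The `counts` matrix is a made-up stand-in for `as.matrix(tdm)`, and the term "crudee" and the `patterns` vector are hypothetical, chosen to mimic a misspelling. Since each term in a term-document matrix is already a single token, the sketch anchors the patterns with `^` and `$` rather than `\\b`.

```r
# Toy term-by-document counts standing in for as.matrix(tdm); values are made up.
counts <- matrix(
  c(2, 0, 1,
    0, 2, 0,
    1, 0, 0,
    0, 1, 3),
  nrow = 4, byrow = TRUE,
  dimnames = list(Terms = c("crude", "crucial", "crudee", "oil"),
                  Docs  = c("d1", "d2", "d3"))
)

# Hypothetical patterns: the exact word, plus longer variants starting with "crud".
patterns <- c("^crude$", "^crud[[:alnum:]]+$")

# Collapse the patterns into one alternation and match it against the term names.
hits <- grepl(paste(patterns, collapse = "|"), rownames(counts))

# Keep only the matching terms; drop = FALSE keeps a matrix even for a single hit.
matched <- counts[hits, , drop = FALSE]

rownames(matched)   # terms picked up by the patterns
colSums(matched)    # per-document count of matching terms
sum(matched)        # total across all documents
```

With a real term-document matrix, the same logical vector could presumably be used to index its rows, as with `crit` above.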

2

The text analysis package quanteda allows regular expressions for feature selection if you specify valuetype = "regex".

require(tm) 
require(quanteda) 
data(crude) 

dfm(corpus(crude), keptFeatures = "^cru", valuetype = "regex", verbose = FALSE) 
# Document-feature matrix of: 20 documents, 2 features. 
# 20 x 2 sparse Matrix of class "dfmSparse" 
#  features 
# docs crude crucial 
# 127  2  0 
# 144  0  0 
# 191  2  0 
# 194  3  0 
# 211  0  0 
# 236  2  0 
# 237  0  2 
# 242  0  0 
# 246  0  0 
# 248  0  0 
# 273  5  0 
# 349  2  0 
# 352  0  0 
# 353  2  0 
# 368  0  0 
# 489  0  0 
# 502  0  0 
# 543  2  0 
# 704  0  0 
# 708  1  0 

See also ?selectFeatures
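In more recent quanteda releases, the `keptFeatures` argument no longer seems to be available, and regex selection is done on an existing dfm with `dfm_select()`. A rough sketch, assuming quanteda version 3 or later is installed; the two short documents are made up for illustration and are not the crude corpus:

```r
library(quanteda)

# Two made-up documents; "crudee" is a deliberate misspelling.
txt <- c(d1 = "crude oil prices are crucial",
         d2 = "crudee oil")

# Tokenize, build the document-feature matrix, then keep features matching the regex.
dfmat <- dfm(tokens(txt))
sel <- dfm_select(dfmat, pattern = "^cru", selection = "keep", valuetype = "regex")

featnames(sel)   # features that survived the selection
```

Anchoring with `^` matters here, because quanteda applies the pattern to each feature name and an unanchored "cru" would match anywhere within it.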