2015-06-24 77 views

Text mining with quanteda in R

I have a dataset of Facebook posts (exported via Netvizz) that I am analyzing with the quanteda package in R. Here is my R code.

# Load the relevant dictionary (relevant for analysis) 
liwcdict <- dictionary(file = "D:/LIWC2001_English.dic", format = "LIWC") 

# Read File 
# Facebook posts can be exported with the Netvizz app
# https://apps.facebook.com/netvizz 
# Load FB posts as .csv-file from .zip-file 
fbpost <- read.csv("D:/FB-com.csv", sep=";") 

# Define the relevant column(s) 
fb_test <- as.character(fbpost$comment_message) # one column with 2700 entries; uses the fbpost object read above
# Define as corpus 
fb_corp <-corpus(fb_test) 
class(fb_corp) 

# LIWC Application 
fb_liwc<-dfm(fb_corp, dictionary=liwcdict) 
View(fb_liwc) 

Everything works until:

> fb_liwc<-dfm(fb_corp, dictionary=liwcdict) 
Creating a dfm from a corpus ... 
    ... indexing 2,760 documents 
    ... tokenizing texts, found 77,923 total tokens 
    ... cleaning the tokens, 1584 removed entirely 
    ... applying a dictionary consisting of 68 key entries 
Error in `dimnames<-.data.frame`(`*tmp*`, value = list(docs = c("text1", : 
    invalid 'dimnames' given for data frame 

How would you interpret the error message? Are there any suggestions for solving the problem?

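Hard to say, since I don't have your input text file, but what happens if you try `dfm(inaugTexts, dictionary = liwcdict)`? I have the `LIWC2001_English.dic` file, and the `dfm` command works fine with `inaugTexts`, although it is slow and needs a rewrite to optimize it (next on the list).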

This is now fixed in the dev branch, which you can install as shown in the answer below.

Answer

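There was a bug in quanteda version 0.7.2 that caused dfm() to fail when a dictionary was used and one of the documents contained no features. Your example fails because, during the cleaning stage, some of the Facebook "documents" end up with all of their features removed.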
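To make the error message itself less mysterious, here is a minimal base R sketch of the condition that produces this kind of message. It is an illustration only and not quanteda's internal code: the data frame `df` and its names are made up, and the assumption is simply that the number of document names no longer matched the number of rows once empty documents had dropped out.

```r
# Hypothetical two-row data frame standing in for per-document counts
df <- data.frame(count = c(3, 5))

# Supplying three row names for two rows is rejected by `dimnames<-.data.frame`
msg <- tryCatch({
  dimnames(df) <- list(docs = c("text1", "text2", "text3"), features = "count")
  "no error"
}, error = function(e) conditionMessage(e))

print(msg)
```

If the message printed here matches the one in your log, a mismatch between the list of document names and the rows that survived cleaning is a plausible explanation.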

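This is not only fixed in 0.8.0; we also changed the underlying implementation of dictionaries in dfm(), which improves speed significantly. (The LIWC is still a large and complicated dictionary, and its reliance on regular expressions means it is still much slower than simply indexing tokens. We will keep working on optimizing this.)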
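As a rough illustration of why wildcard dictionary entries cost more than plain token lookup, the base R sketch below compares the two. It is not how quanteda implements dictionaries; the tokens and entries are invented for the example, and the only assumption is that LIWC-style entries such as "abandon*" have to be treated as patterns.

```r
# Made-up tokens and dictionary entries, for illustration only
tokens <- c("abandon", "abandoned", "happy", "happiness", "table")

# Exact entries: a single vectorized membership test
exact_entries <- c("happy", "table")
exact_hits <- tokens %in% exact_entries

# Wildcard entries: each glob is converted to a regular expression
# and tested against every token
glob_entries <- c("abandon*", "happ*")
regexes <- utils::glob2rx(glob_entries)
glob_hits <- vapply(regexes, function(rx) grepl(rx, tokens),
                    logical(length(tokens)))

# A token counts as matched if any wildcard entry matched it
glob_any <- unname(rowSums(glob_hits) > 0)

print(data.frame(token = tokens, exact = exact_hits, wildcard = glob_any))
```

The wildcard branch does one pattern scan per entry, so its cost grows with the number of dictionary entries times the number of tokens, whereas the exact lookup is a single pass.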

devtools::install_github("kbenoit/quanteda") 
liwcdict <- dictionary(file = "LIWC2001_English.dic", format = "LIWC") 
mydfm <- dfm(inaugTexts, dictionary = liwcdict) 
## Creating a dfm from a character vector ... 
## ... indexing 57 documents 
## ... lowercasing 
## ... tokenizing 
## ... shaping tokens into data.table, found 134,024 total tokens 
## ... applying a dictionary consisting of 68 key entries 
## ... summing dictionary-matched features by document 
## ... indexing 68 feature types 
## ... building sparse matrix 
## ... created a 57 x 68 sparse dfm 
## ... complete. Elapsed time: 14.005 seconds. 
topfeatures(mydfm, decreasing=FALSE) 
## Fillers Nonfl Swear  TV Eating Sleep Groom Death Sports Sexual 
##  0  0  0  42  47  49  53  76  81  100 

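It will also work if a document ends up containing zero features after tokenization and cleaning, which is probably what was breaking the older dfm on your Facebook texts: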

mytexts <- inaugTexts 
mytexts[3] <- "" 
mydfm <- dfm(mytexts, dictionary = liwcdict, verbose = FALSE) 
which(rowSums(mydfm)==0) 
## 1797-Adams 
##   3
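If you want to check in advance which of your own texts are likely to end up empty, a base R sketch along these lines may help. The cleaning rules here (lowercase, keep only a-z and spaces) are an assumption that only approximates what dfm() does, and the example texts are made up.

```r
# Made-up examples of short Facebook comments
texts <- c("Great post!", ":-) !!!", "12345", "", "I agree with this")

# Approximate cleaning: lowercase, replace everything except letters and spaces
clean <- tolower(texts)
clean <- gsub("[^a-z ]", " ", clean)
clean <- gsub("^ +| +$", "", clean)  # strip leading and trailing spaces

# Texts that are missing or have nothing left after cleaning
empty_after_cleaning <- is.na(clean) | !nzchar(clean)
print(which(empty_after_cleaning))

# Possible workaround on an older version: keep only the non-empty texts
kept <- texts[!empty_after_cleaning]
print(kept)
```

Dropping such texts before building the corpus would be a stopgap on 0.7.2; with the fixed version it should not be necessary.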