How to keep intra-word periods in unigrams? R quanteda

I want to keep two-letter acronyms in my unigram frequency table where the letters are separated by periods, such as "t.v." and "u.s.". When I build my unigram frequency table with quanteda, the terminating period is being truncated. Here is a small test corpus to illustrate. I have removed the periods that served as sentence delimiters:

SOS This is the u.s. where our politics is crazy EOS

SOS In the US we watch a lot of t.v. aka TV EOS

SOS TV is an important part of life in the US EOS

SOS folks outside the u.s. probably don't watch so much t.v. EOS

SOS living in other countries is probably not any less crazy EOS

SOS i enjoy my sanity when it comes to visit EOS

I load this into R as a character vector:

acro.test <- c("SOS This is the u.s. where our politics is crazy EOS", "SOS In the US we watch a lot of t.v. aka TV EOS", "SOS TV is an important part of life in the US EOS", "SOS folks outside the u.s. probably don't watch so much t.v. EOS", "SOS living in other countries is probably not any less crazy EOS", "SOS i enjoy my sanity when it comes to visit EOS") 

Here is the code I use to build my unigram frequency table:

library(quanteda) 
dat.dfm <- dfm(acro.test, ngrams=1, verbose=TRUE, concatenator=" ", toLower=FALSE, removeNumbers=TRUE, removePunct=FALSE, stopwords=FALSE) 
dat.mat <- as.data.frame(as.matrix(docfreq(dat.dfm))) 
ng.sorted <- sort(rowSums(dat.mat), decreasing=TRUE) 
freqTable <- data.frame(ngram=names(ng.sorted), frequency = ng.sorted) 
row.names(freqTable) <- NULL 
freqTable 

This produces the following:

      ngram frequency 
1       SOS         6 
2       EOS         6 
3       the         4 
4        is         3 
5         .         3 
6       u.s         2 
7     crazy         2 
8        US         2 
9     watch         2 
10       of         2 
11      t.v         2 
12       TV         2 
13       in         2 
14 probably         2 
15     This         1 
16    where         1 
17      our         1 
18 politics         1 
19       In         1 
20       we         1 
21        a         1 
22      lot         1 
23      aka         1 

etc...

I'd like to keep the terminal period on t.v. and u.s., and to eliminate the . entry in the table, with its frequency of 3.

I also don't understand why the period (.) has a count of 3 in this table, while the u.s and t.v unigrams are counted correctly (2 each).

Answer

The reason for this behaviour is that quanteda's default word tokeniser uses the ICU-based definition of word boundaries (from the stringi package). Under those rules, u.s. is tokenised as the word u.s followed by a separate period token (.). This is great if your name is will.i.am, but maybe not so great for your purposes. However, you can easily switch to the whitespace tokeniser by passing the argument what = "fasterword" to tokens() (an option available in dfm() through the ... part of the call).

tokens(acro.test, what = "fasterword")[[1]] 
## [1] "SOS"  "This"  "is"  "the"  "u.s."  "where" "our"  "politics" "is"  "crazy" "EOS" 

You can see here that u.s. is preserved. In answer to your last question, the terminal . has a document frequency of 3 because it appeared as a standalone token in three documents, which is the default word-tokenisation behaviour when remove_punct = FALSE.
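
For comparison, here is a minimal sketch of what the default ICU tokeniser does with the same first document (the output shown is inferred from the frequency table above, where u.s and . appear as separate entries):

# the default ICU ("word") tokeniser splits off the trailing period, 
# which is why "u.s" and "." showed up as separate unigrams above 
tokens(acro.test, what = "word")[[1]] 
## [1] "SOS" "This" "is" "the" "u.s" "." "where" "our" "politics" "is" "crazy" "EOS" 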

To pass this through to dfm() and then construct the data.frame of document frequencies, the following code works (I've tidied it up a bit for efficiency). Note the comment about the difference between document and term frequency - I've noticed that some users get a bit confused by docfreq().

# I removed the options that were the same as the default 
# note also that stopwords = FALSE is not a valid argument - see the remove parameter 
dat.dfm <- dfm(acro.test, tolower = FALSE, remove_punct = FALSE, what = "fasterword") 

# sort in descending document frequency 
dat.dfm <- dat.dfm[, names(sort(docfreq(dat.dfm), decreasing = TRUE))] 
# Note: this would sort the dfm in descending total term frequency 
#  not the same as docfreq 
# dat.dfm <- sort(dat.dfm) 

# this creates the data.frame in one more efficient step 
freqTable <- data.frame(ngram = featnames(dat.dfm), frequency = docfreq(dat.dfm), 
         row.names = NULL, stringsAsFactors = FALSE) 
head(freqTable, 10) 
##    ngram frequency 
## 1    SOS         6 
## 2    EOS         6 
## 3    the         4 
## 4     is         3 
## 5   u.s.         2 
## 6  crazy         2 
## 7     US         2 
## 8  watch         2 
## 9     of         2 
## 10  t.v.         2 
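
To make the document-frequency versus term-frequency distinction concrete, here is a small sketch using "is", which occurs twice in the first document (counts taken from the corpus above):

# docfreq() counts the number of documents containing a feature, 
# while colSums() on the dfm gives the total term frequency 
docfreq(dat.dfm)["is"] 
## is 
##  3 
colSums(dat.dfm)["is"] 
## is 
##  4 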

In my view, the named vector produced by calling docfreq() on the dfm is a more efficient way of storing the results than your data.frame approach, unless you need to add other variables.
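
For instance, here is a minimal sketch of that named-vector approach (sorted once, then indexable by feature name):

# store the document frequencies as a sorted named vector instead of a data.frame 
freq.vec <- sort(docfreq(dat.dfm), decreasing = TRUE) 
head(freq.vec, 3) 
## SOS EOS the 
##   6   6   4 
freq.vec["t.v."] 
## t.v. 
##    2 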

Perfect. This is exactly what I was looking for. Nice edit of the title. Thanks for such a thorough reply, and for all the great work on this package. –
