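How to keep intra-word periods in unigrams? R quanteda

I would like to preserve two-letter acronyms separated by periods, such as "t.v." and "u.s.", in my unigram frequency table. When I build the table with quanteda, the terminating period gets truncated. Here is a small test corpus to illustrate. I have removed the periods that act as sentence separators: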
SOS This is the u.s. where our politics is crazy EOS
SOS In the US we watch a lot of t.v. aka TV EOS
SOS TV is an important part of life in the US EOS
SOS folks outside the u.s. probably don't watch so much t.v. EOS
SOS living in other countries is probably not any less crazy EOS
SOS i enjoy my sanity when it comes to visit EOS
I load this into R as a character vector:
acro.test <- c("SOS This is the u.s. where our politics is crazy EOS", "SOS In the US we watch a lot of t.v. aka TV EOS", "SOS TV is an important part of life in the US EOS", "SOS folks outside the u.s. probably don't watch so much t.v. EOS", "SOS living in other countries is probably not any less crazy EOS", "SOS i enjoy my sanity when it comes to visit EOS")
Here is the code I use to build my unigram frequency table:
library(quanteda)
dat.dfm <- dfm(acro.test, ngrams=1, verbose=TRUE, concatenator=" ", toLower=FALSE, removeNumbers=TRUE, removePunct=FALSE, stopwords=FALSE)
dat.mat <- as.data.frame(as.matrix(docfreq(dat.dfm)))
ng.sorted <- sort(rowSums(dat.mat), decreasing=TRUE)
freqTable <- data.frame(ngram=names(ng.sorted), frequency = ng.sorted)
row.names(freqTable) <- NULL
freqTable
This produces the following:
ngram frequency
1 SOS 6
2 EOS 6
3 the 4
4 is 3
5 . 3
6 u.s 2
7 crazy 2
8 US 2
9 watch 2
10 of 2
11 t.v 2
12 TV 2
13 in 2
14 probably 2
15 This 1
16 where 1
17 our 1
18 politics 1
19 In 1
20 we 1
21 a 1
22 lot 1
23 aka 1
etc...
I would like to keep the terminal period on t.v. and u.s., and to eliminate the table entry for . with a frequency of 3.
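One way to see the desired result is to skip punctuation-aware tokenization and split on whitespace only. The following is a minimal base-R sketch, not quanteda's API: it assumes tokens are separated by spaces, and `toks` and `doc.freq` are names introduced here for illustration. Each token is counted at most once per document, which is what `docfreq()` reports.

```r
# Base-R sketch (an assumption for illustration, not quanteda's tokenizer):
# splitting on whitespace only leaves "u.s." and "t.v." intact.
acro.test <- c("SOS This is the u.s. where our politics is crazy EOS",
               "SOS In the US we watch a lot of t.v. aka TV EOS",
               "SOS TV is an important part of life in the US EOS",
               "SOS folks outside the u.s. probably don't watch so much t.v. EOS",
               "SOS living in other countries is probably not any less crazy EOS",
               "SOS i enjoy my sanity when it comes to visit EOS")

# One character vector of tokens per document.
toks <- strsplit(acro.test, "\\s+")

# Document frequency: count each token at most once per document.
doc.freq <- table(unlist(lapply(toks, unique)))

ng.sorted <- sort(doc.freq, decreasing = TRUE)
freqTable <- data.frame(ngram = names(ng.sorted),
                        frequency = as.integer(ng.sorted),
                        stringsAsFactors = FALSE)

# Rows for the acronyms; a standalone "." never becomes a token here.
freqTable[freqTable$ngram %in% c("u.s.", "t.v.", "."), ]
```

Under these assumptions `u.s.` and `t.v.` each get a frequency of 2 and no separate `.` entry appears. The trade-off is that any other punctuation attached to a word stays attached as well. quanteda's own tokenizer may offer a whitespace-based mode; check the documentation for the installed version.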
I also don't understand why the period (.) is counted 3 times in this table while the u.s and t.v unigrams are counted correctly (2 each).
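The count of 3 is consistent with `docfreq()` reporting document frequency, that is, the number of documents containing a token rather than the number of times it occurs. The sketch below mimics the observed splitting in base R. The regular expression is an illustrative assumption, not quanteda's actual tokenizer, and `spaced`, `total.count`, and `doc.count` are names introduced here.

```r
acro.test <- c("SOS This is the u.s. where our politics is crazy EOS",
               "SOS In the US we watch a lot of t.v. aka TV EOS",
               "SOS TV is an important part of life in the US EOS",
               "SOS folks outside the u.s. probably don't watch so much t.v. EOS",
               "SOS living in other countries is probably not any less crazy EOS",
               "SOS i enjoy my sanity when it comes to visit EOS")

# Detach a period that is followed by whitespace or end of string,
# so "u.s." becomes the two tokens "u.s" and "." (assumed behaviour).
spaced <- gsub("\\.(?=\\s|$)", " .", acro.test, perl = TRUE)
toks <- strsplit(spaced, "\\s+")

# Total occurrences of "." across the whole corpus.
total.count <- sum(unlist(toks) == ".")

# Number of documents that contain "." at least once.
doc.count <- sum(vapply(toks, function(d) "." %in% d, logical(1)))

total.count  # 4
doc.count    # 3
```

The fourth document contains both `u.s.` and `t.v.`, so it contributes two periods to the total but only one to the document frequency. The periods appear in documents 1, 2, and 4, which gives 3.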
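Perfect. That's exactly what I was looking for. Great edit of the title. Thanks for such a thorough reply, and for all the great work on this package.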