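Creating bigrams in quanteda using a dictionary
I am trying to remove typos from the data in my text analysis, so I am using the dictionary feature of the quanteda package. It works for unigrams, but it gives unexpected output for bigrams. I am not sure how to handle typos so that they don't sneak into my bigrams and trigrams.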
ZTestCorp1 <- c("The new law included a capital gains tax, and an inheritance tax.",
"New York City has raised a taxes: an income tax and a sales tax.")
ZcObj <- corpus(ZTestCorp1)
mydict <- dictionary(list("the"="the", "new"="new", "law"="law",
"capital"="capital", "gains"="gains", "tax"="tax",
"inheritance"="inheritance", "city"="city"))
Zdfm1 <- dfm(ZcObj, ngrams=2, concatenator=" ",
what = "fastestword",
toLower=TRUE, removeNumbers=TRUE,
removePunct=TRUE, removeSeparators=TRUE,
removeTwitter=TRUE, stem=FALSE,
ignoredFeatures=NULL,
language="english",
dictionary=mydict, valuetype="fixed")
wordsFreq1 <- colSums(sort(Zdfm1))
Current output:
> wordsFreq1
the new law capital gains tax inheritance city
0 0 0 0 0 0 0 0
Without the dictionary, the output is as follows:
> wordsFreq
         tax and         the new         new law    law included      included a       a capital
               2               1               1               1               1               1
   capital gains       gains tax          and an  an inheritance inheritance tax        new york
               1               1               1               1               1               1
       york city        city has      has raised        raised a         a taxes        taxes an
               1               1               1               1               1               1
       an income      income tax           and a         a sales       sales tax
               1               1               1               1               1
Expected bigrams:
The new
new law
law capital
capital gains
gains tax
tax inheritance
inheritance city
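P.S. I assumed that tokenization into n-grams happens after dictionary matching, but judging from the results that does not appear to be the case.
On another note, I tried to create my dictionary object as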
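To make the intended order of operations concrete: the bigrams wanted here come from filtering the tokens against the word list first, and only then pairing neighbouring tokens. Below is a minimal base-R sketch of that order on the two example sentences. It uses no quanteda calls; `tokenize` and `make_bigrams` are hypothetical helpers written only for illustration, and the tokenization is deliberately crude (lowercase, strip punctuation, split on whitespace).

```r
# Whitelist of correctly spelled words (same words as the dictionary above)
keep <- c("the", "new", "law", "capital", "gains", "tax", "inheritance", "city")

texts <- c("The new law included a capital gains tax, and an inheritance tax.",
           "New York City has raised a taxes: an income tax and a sales tax.")

# Hypothetical helper: lowercase, strip punctuation, split on whitespace
tokenize <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]", "", x)
  strsplit(trimws(x), "[[:space:]]+")[[1]]
}

# Hypothetical helper: join each token with its successor
make_bigrams <- function(tokens) {
  if (length(tokens) < 2) return(character(0))
  paste(head(tokens, -1), tail(tokens, -1))
}

# Filter FIRST, then build bigrams, separately for each document
filtered_bigrams <- lapply(texts, function(txt) {
  toks <- tokenize(txt)
  make_bigrams(toks[toks %in% keep])
})

print(filtered_bigrams)
```

For the first sentence this yields "the new", "new law", "law capital", "capital gains", "gains tax", "tax inheritance", "inheritance tax". Because the pairing is done per document, no bigram spans two documents.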
mydict <- dictionary(list(mydict=c("the", "new", "law", "capital", "gains",
"tax", "inheritance", "city")))
But that did not work, so I had to use the approach above, which I don't think is efficient.
UPDATE: Output based on Ken's solution:
> (myDfm1a <- dfm(ZcObj, verbose = FALSE, ngrams=2,
+ keptFeatures = c("the", "new", "law", "capital", "gains", "tax", "inheritance", "city")))
Document-feature matrix of: 2 documents, 14 features.
2 x 14 sparse Matrix of class "dfmSparse" features
docs the_new new_law law_included a_capital capital_gains gains_tax tax_and an_inheritance
text1 1 1 1 1 1 1 1 1
text2 0 0 0 0 0 0 1 0
features
docs inheritance_tax new_york york_city city_has income_tax sales_tax
text1 1 0 0 0 0 0
text2 0 1 1 1 1 1
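Thank you for your generosity and the detailed explanation. I am getting this error. Any ideas? `> (toksDict <- selectFeatures(toks, mydict, selection = "keep"))` `Error in UseMethod("selectFeatures") : no applicable method for 'selectFeatures' applied to an object of class "c('tokenizedTexts', 'list')"` – PeterV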
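That is probably because the methods for `selectFeatures()` were only extended in the latest (GitHub) version of quanteda, and you are using the CRAN version. Install from GitHub following https://github.com/kbenoit/quanteda; the version as of today is 0.9.1-7. (The CRAN version will be updated in January 2016.) –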
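Thanks @Ken. This is great! I will install the latest CRAN package. In fact, I like the second solution you provided, because it takes stop words into account. That matters to me because I am working on a word prediction project. However, I am curious how it managed to pull in **New York**. I don't think "New York" is a stop word. I got this when I used the ngrams = 2 option. – PeterV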
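On the **New York** question: in the UPDATE output, every retained bigram contains at least one of the listed words (new_york contains "new", york_city and city_has contain "city"), while bigrams such as a_taxes or included_a are absent. That pattern is consistent with the filter being applied to bigrams that have already been formed, keeping any bigram in which some component word matches. This is an inference from the output shown, not a statement about quanteda's internals. A base-R sketch of that rule, with `has_kept_word` as a hypothetical helper and a hand-picked subset of the bigrams:

```r
# Word list passed as keptFeatures in the UPDATE
keep <- c("the", "new", "law", "capital", "gains", "tax", "inheritance", "city")

# A few bigrams from the example texts, already joined with "_"
bigrams <- c("the_new", "law_included", "included_a",
             "new_york", "york_city", "a_taxes")

# Hypothetical helper: TRUE if any whole word inside the bigram is in `keep`
has_kept_word <- function(bigram, keep, sep = "_") {
  any(strsplit(bigram, sep, fixed = TRUE)[[1]] %in% keep)
}

kept <- bigrams[vapply(bigrams, has_kept_word, logical(1), keep = keep)]
print(kept)
```

Under this rule new_york survives because of "new" and york_city because of "city", whereas a_taxes is dropped because "taxes" is not the whole word "tax".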