2015-12-26

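Creating bigrams in quanteda using a dictionary

I am trying to remove typos from my text data analysis, so I am using the dictionary function of the quanteda package. It works fine for unigrams, but it gives unexpected output for bigrams. I am not sure how to handle typos so that they do not sneak into my bigrams and trigrams.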

ZTestCorp1 <- c("The new law included a capital gains tax, and an inheritance tax.", 
       "New York City has raised a taxes: an income tax and a sales tax.") 

ZcObj <- corpus(ZTestCorp1) 

mydict <- dictionary(list("the"="the", "new"="new", "law"="law", 
         "capital"="capital", "gains"="gains", "tax"="tax", 
         "inheritance"="inheritance", "city"="city")) 

Zdfm1 <- dfm(ZcObj, ngrams=2, concatenator=" ", 
     what = "fastestword", 
     toLower=TRUE, removeNumbers=TRUE, 
     removePunct=TRUE, removeSeparators=TRUE, 
     removeTwitter=TRUE, stem=FALSE, 
     ignoredFeatures=NULL, 
     language="english", 
     dictionary=mydict, valuetype="fixed") 

wordsFreq1 <- colSums(sort(Zdfm1)) 

Current output

> wordsFreq1 
        the         new         law     capital       gains         tax inheritance        city 
          0           0           0           0           0           0           0           0 

Without the dictionary, the output is as follows:

> wordsFreq 
        tax and         the new         new law    law included      included a       a capital 
              2               1               1               1               1               1 
  capital gains       gains tax          and an  an inheritance inheritance tax        new york 
              1               1               1               1               1               1 
      york city        city has      has raised        raised a         a taxes        taxes an 
              1               1               1               1               1               1 
      an income      income tax           and a         a sales       sales tax 
              1               1               1               1               1 

Expected bigrams

The new 
new law 
law capital 
capital gains 
gains tax 
tax inheritance 
inheritance city 

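P.S. I assumed that tokenization is done after the dictionary match, but judging by the results I see, that does not appear to be the case.

On another note, I tried to create my dictionary object as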

mydict <- dictionary(list(mydict=c("the", "new", "law", "capital", "gains", 
         "tax", "inheritance", "city"))) 

but that did not work, so I had to use the approach above, which I think is not efficient.

UPDATE: output based on Ken's solution:

> (myDfm1a <- dfm(ZcObj, verbose = FALSE, ngrams=2, 
+    keptFeatures = c("the", "new", "law", "capital", "gains", "tax", "inheritance", "city"))) 
Document-feature matrix of: 2 documents, 14 features. 
2 x 14 sparse Matrix of class "dfmSparse" 
       features 
docs    the_new new_law law_included a_capital capital_gains gains_tax tax_and an_inheritance 
  text1       1       1            1         1             1         1       1              1 
  text2       0       0            0         0             0         0       1              0 
       features 
docs    inheritance_tax new_york york_city city_has income_tax sales_tax 
  text1               1        0         0        0          0         0 
  text2               0        1         1        1          1         1 

Answer

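Updated 2017-12-21 for newer versions of quanteda

Glad to see you are working with the package! I think there are two issues in what you are struggling with. The first is how to apply feature selection before forming ngrams. The second is how to define a feature selection in general (using quanteda).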

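First issue: how to apply feature selection before forming ngrams. Here you have defined a dictionary to do this. (As I will show below, that is not necessary here.) You want to remove all terms that are not in your selection list and then form bigrams. quanteda does not do this by default, because the result is not a standard "bigram", in which words are collocated according to a window defined strictly by adjacency. In your expected results, for instance, law capital is not a pair of adjacent terms, which is the usual definition of a bigram.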
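
To make the distinction concrete, here is a minimal base-R sketch. It is not how quanteda works internally, and `adjacent_bigrams()` is a hypothetical helper introduced only for illustration. It pairs neighbouring tokens from the first part of the first example sentence, once before and once after filtering.

```r
# Hypothetical helper: pair each token with its immediate right-hand neighbour.
adjacent_bigrams <- function(tokens) {
  if (length(tokens) < 2) return(character(0))
  paste(head(tokens, -1), tail(tokens, -1))
}

tokens <- c("the", "new", "law", "included", "a", "capital", "gains", "tax")
keep <- c("the", "new", "law", "capital", "gains", "tax")

# Standard bigrams: every pair was adjacent in the original text.
standard <- adjacent_bigrams(tokens)

# Select first, then pair: "law" and "capital" become neighbours only
# because "included" and "a" were dropped in between.
filtered <- adjacent_bigrams(tokens[tokens %in% keep])

"law capital" %in% standard  # FALSE
"law capital" %in% filtered  # TRUE
```

The second result is the kind of "bigram" in your expected output: it only exists because the words between its two halves were removed.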

We can override this behaviour, however, by building the document-feature matrix more "manually".

First, tokenize the texts.

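# tokenize the original, dropping punctuation and numbers, then lower-case 
# (remove_punct / remove_numbers are the argument names in newer quanteda releases) 
toks <- tokens_tolower(tokens(ZcObj, remove_punct = TRUE, remove_numbers = TRUE)) 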
toks 
## tokens object from 2 documents. 
## text1 : 
## [1] "the"   "new"   "law"   "included" "a"   "capital"  "gains"  "tax"   "and"   "an"   "inheritance" "tax"   
## 
## text2 : 
## [1] "new" "york" "city" "has" "raised" "a"  "taxes" "an"  "income" "tax" "and" "a"  "sales" "tax" 

Now we apply your dictionary, mydict, to the tokenized texts using tokens_select():

(toksDict <- tokens_select(toks, mydict, selection = "keep")) 
## tokens object from 2 documents. 
## text1 : 
## [1] "the"   "new"   "law"   "capital"  "gains"  "tax"   "inheritance" "tax"   
## 
## text2 : 
## [1] "new" "city" "tax" "tax" 

From this selected set of tokens we can now form bigrams (or we could feed toksDict directly to dfm()):

(toks2 <- tokens_ngrams(toksDict, n = 2, concatenator = " ")) 
## tokens object from 2 documents. 
## text1 : 
## [1] "the new"   "new law"   "law capital"  "capital gains" "gains tax"  "tax inheritance" "inheritance tax" 
## 
## text2 : 
## [1] "new city" "city tax" "tax tax" 

# now create the dfm 
(myDfm2 <- dfm(toks2)) 
## Document-feature matrix of: 2 documents, 10 features. 
## 2 x 10 sparse Matrix of class "dfm" 
##        features 
## docs    the new new law law capital capital gains gains tax tax inheritance inheritance tax new city city tax tax tax 
##   text1       1       1           1             1         1               1               1        0        0       0 
##   text2       0       0           0             0         0               0               0        1        1       1 
topfeatures(myDfm2) 
##         the new         new law     law capital   capital gains       gains tax tax inheritance inheritance tax        new city        city tax         tax tax 
##               1               1               1               1               1               1               1               1               1               1 

The feature list is now very close to what you wanted.
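
If bigrams that bridge a removed word (such as law capital) are not wanted, one option is to leave a placeholder where each dropped token was. The sketch below rests on two assumptions about recent quanteda releases: that `tokens_select()` accepts `padding = TRUE`, and that `tokens_ngrams()` does not build n-grams across the resulting empty pads. Check both against your installed version. The names `txt`, `keep`, `toks_pad`, `toks_adj` and `bigrams_adj` are illustrative.

```r
library(quanteda)

txt <- c("The new law included a capital gains tax, and an inheritance tax.",
         "New York City has raised a taxes: an income tax and a sales tax.")
keep <- c("the", "new", "law", "capital", "gains", "tax", "inheritance", "city")

toks <- tokens_tolower(tokens(txt, remove_punct = TRUE, remove_numbers = TRUE))

# padding = TRUE leaves "" where a token was removed, so positions are preserved
toks_pad <- tokens_select(toks, pattern = keep, selection = "keep",
                          valuetype = "fixed", padding = TRUE)

# Assumption: n-grams are only formed from tokens that were adjacent originally,
# because pads are skipped rather than joined over
toks_adj <- tokens_ngrams(toks_pad, n = 2, concatenator = " ")
bigrams_adj <- as.list(toks_adj)
bigrams_adj$text1
```

This trades the "expected" list from the question for bigrams that follow the usual adjacency definition, which may suit a word-prediction task better.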

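The second issue is why your dictionary approach seems inefficient. This is because you are creating a dictionary to perform feature selection but not really using it as a dictionary. A dictionary in which each key maps only to itself as its value is not doing the work of a dictionary. Simply supply a character vector of the tokens to select instead and it works fine, for example: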

(myDfm1 <- dfm(ZcObj, verbose = FALSE, 
       keptFeatures = c("the", "new", "law", "capital", "gains", "tax", "inheritance", "city"))) 
## Document-feature matrix of: 2 documents, 8 features. 
## 2 x 8 sparse Matrix of class "dfm" 
##  features 
## docs the new law capital gains tax inheritance city 
## text1 1 1 1  1  1 2   1 0 
## text2 0 1 0  0  0 2   0 1 
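
The `keptFeatures` argument belongs to older quanteda releases. As a hedged sketch for more recent releases, assuming feature selection is done there with `dfm_select()` on an existing dfm, the same selection could look like this (the names `txt`, `keep`, `dfm_all` and `myDfm1_new` are illustrative):

```r
library(quanteda)

txt <- c("The new law included a capital gains tax, and an inheritance tax.",
         "New York City has raised a taxes: an income tax and a sales tax.")
keep <- c("the", "new", "law", "capital", "gains", "tax", "inheritance", "city")

# Build a unigram dfm first (dfm() lower-cases tokens by default) ...
dfm_all <- dfm(tokens(txt, remove_punct = TRUE))

# ... then keep only the listed features, matched as exact strings
myDfm1_new <- dfm_select(dfm_all, pattern = keep, selection = "keep",
                         valuetype = "fixed")
featnames(myDfm1_new)
```

With exact matching, "taxes" in the second text is not counted under "tax".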

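Thank you for your generous and detailed explanation. I am getting this error. Any ideas? `> (toksDict <- selectFeatures(toks, mydict, selection = "keep")) Error in UseMethod("selectFeatures") : no applicable method for 'selectFeatures' applied to an object of class "c('tokenizedTexts', 'list')"` – PeterV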

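Probably because the method for `selectFeatures()` was only extended in the latest (GitHub) version of quanteda, and you are using the CRAN version. Install from GitHub as described at https://github.com/kbenoit/quanteda; as of today the version is 0.9.1-7. (The CRAN version will be updated in January 2016.) –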

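Thanks @Ken. This is great! I will install the latest CRAN package. In fact, I like the second solution you provided, because it takes stop words into account. That matters to me, as I am working on a word prediction project. However, I am curious how it managed to pull in **New York**. I do not think New York is a stop word. I got this when I used the ngrams = 2 option. – PeterV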