How to apply a custom function to a quanteda corpus

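I'm trying to migrate a script from tm to quanteda. According to the quanteda documentation, the package follows a philosophy of applying changes "downstream" so that the original corpus stays unchanged. OK.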

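I previously wrote a script to find spelling mistakes in our tm corpus, and my team helped me build a manual lookup. So I have a csv file with 2 columns: the first column is the misspelled term and the second column is the correct version of that term.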

Previously, using the tm package, I did this:

# Write a custom function to pass to tm_map
# "spellingdoc" is the 2 column csv
library(stringr)
library(stringi)
library(tm)
stringi_spelling_update <- content_transformer(function(x, lut = spellingdoc) {
    stri_replace_all_regex(str = x,
            pattern = paste0("\\b", lut[,1], "\\b"),
            replacement = lut[,2],
            vectorize_all = FALSE)
})

Then, within my tm corpus transformations, I did this:

mycorpus <- tm_map(mycorpus, function(i) stringi_spelling_update(i, spellingdoc))

What is the equivalent way to apply this custom function to my quanteda corpus?


2017-08-30 Doug Fir

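It's impossible to tell from your example whether this will work, since it leaves some parts out, but in general:

If you want to access the texts in a quanteda corpus, you can use texts(), and to replace those texts, texts()<-.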

So in your case, assuming mycorpus is a tm corpus, you could do this:

library("quanteda") 
stringi_spelling_update2 <- function(x, lut = spellingdoc) { 
    stringi::stri_replace_all_regex(str = x, 
            pattern = paste0("\\b", lut[,1], "\\b"), 
            replacement = lut[,2], 
            vectorize_all = FALSE) 
} 

myquantedacorpus <- corpus(mycorpus) 
texts(myquantedacorpus) <- stringi_spelling_update2(texts(myquantedacorpus), spellingdoc)
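
To check what the replacement function does before assigning anything back to a corpus, it can be run on a plain character vector. The following is only a minimal sketch: the lookup table and the two sample documents are made up for illustration and stand in for the real csv and corpus texts.

```r
library(stringi)

# Hypothetical lookup table standing in for the 2 column csv
spellingdoc <- data.frame(
  wrong = c("teh", "recieve"),
  right = c("the", "receive"),
  stringsAsFactors = FALSE
)

# Same function as in the answer: with vectorize_all = FALSE, every
# pattern/replacement pair is applied in turn to every element of x
stringi_spelling_update2 <- function(x, lut = spellingdoc) {
  stri_replace_all_regex(
    str = x,
    pattern = paste0("\\b", lut[, 1], "\\b"),
    replacement = lut[, 2],
    vectorize_all = FALSE
  )
}

# Made-up documents; in practice this would be the character vector of corpus texts
docs <- c("I did not recieve teh parcel.", "Nothing to fix here.")
fixed <- stringi_spelling_update2(docs, spellingdoc)
print(fixed)
```

The result has one element per input document, in the same order, which is what a positional replacement of the corpus texts relies on.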


2017-08-30 16:05:30

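Hi @Ken, mycorpus is actually a quanteda corpus. I'm just learning this package. I think your second sentence is what I was looking for? For this particular problem, though, I noticed the dictionary feature you provide for dfm(), so I used that instead. Still, it's good to know that if I need to apply a custom function to every document, I can use `texts(mycorpus) <- myCustomFunction(myCorpus)` (although if I stick to the quanteda philosophy of not changing the corpus, I should avoid doing this).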

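Cleaning the texts in a corpus is still consistent with **quanteda**'s non-destructive workflow principles if the corpus contains misspellings you will never be interested in (for example, the product of OCR errors). What we want to discourage is people applying stemmers to, or removing stopwords from, the corpus itself.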
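
For the "downstream" route discussed in these comments, one option is to leave the corpus untouched and correct the spellings at the token level. This is only a sketch under assumptions, not necessarily what either commenter did: it assumes a quanteda version that provides tokens_replace(), and the lookup table and sample text are made up for illustration.

```r
library(quanteda)

# Hypothetical lookup table standing in for the 2 column csv
spellingdoc <- data.frame(
  wrong = c("teh", "recieve"),
  right = c("the", "receive"),
  stringsAsFactors = FALSE
)

# Made-up one-document corpus; the corpus object itself is never modified
corp <- corpus(c(d1 = "I did not recieve teh parcel."))

# Tokenize, then swap each misspelled token for its correction
toks <- tokens(corp)
toks_fixed <- tokens_replace(
  toks,
  pattern = spellingdoc$wrong,
  replacement = spellingdoc$right,
  valuetype = "fixed"
)

print(as.list(toks_fixed))
```

Because the correction happens on the tokens, any dfm built from toks_fixed sees the corrected terms while corp keeps the original text.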

I think I found an indirect answer here.

texts(myCorpus) <- myFunction(myCorpus)


2017-08-30 08:49:26
