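So, after a while, I ended up with this silly piece of code. Note: this is NOT a fully automated replacement process, because every match should be verified by a human, and each search needs its own fine-tuned agrep max.distance parameter. I am quite sure there are ways to make it better and faster, but this helps get the job done.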
##########
# Manual renaming with partial matches
##########
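# a) Take a look at the desired column of factor variables
# (if the column is a factor, convert it first, otherwise step d) produces NA
#  for any replacement string that is not an existing level:
#  MYDATA$names <- as.character(MYDATA$names))
sort(unique(MYDATA$names)) # take a look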
# ****
Sensthreshold <- 0.2 # sensitivity of agrep, usually 0.1-0.2 get it right
Searchstring <- "Invesstment LLC" # what should I search?
# ****
# User-defined function: returns the unique strings in the column that are similar to the query
Searcher <- function(input, similarity = 0.1) {
  unique(agrep(input,
               MYDATA$names, # <-- define your column here
               ignore.case = TRUE, value = TRUE,
               max.distance = similarity))
}
# b) Make a search of desired string
Searcher(Searchstring, Sensthreshold) # using user-def function
### PLEASE INSPECT THE OUTPUT OF THE SEARCH
### Did it get it right?
#=============================================================================#
## ACTION! This changes your dataframe!
## Please make backup before proceeding
## Please execute this code as a whole to avoid errors
# c) Make a vector of cell indexes after checking the output
vector_of_cells <- agrep(Searchstring,
                         MYDATA$names, ignore.case = TRUE,
                         max.distance = Sensthreshold)
# d) Apply the changes
MYDATA$names[vector_of_cells] <- Searchstring # <--- CHANGING STRING
# e) Check result
unique(agrep(Searchstring, MYDATA$names,
             ignore.case = TRUE, value = TRUE, max.distance = Sensthreshold))
#=============================================================================#
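Steps c) and d) can also be wrapped in a small function that returns a modified copy instead of changing the data frame in place. This is only a sketch: `replace_similar`, `max_dist`, and the example data are hypothetical names that are not part of the code above, and the output still has to be checked by eye.

```r
# Hypothetical helper (not part of the original code): replace every entry
# that agrep() considers close to `target` with `target` itself.
replace_similar <- function(x, target, max_dist = 0.1) {
  x <- as.character(x)  # a factor would yield NA for a string that is not a level
  hits <- agrep(target, x, ignore.case = TRUE, max.distance = max_dist)
  x[hits] <- target
  x
}

# Made-up example data
company_names <- c("Investment LLC", "Invesstment LLC", "Investmnt LLC", "Acme Corp")
cleaned <- replace_similar(company_names, "Investment LLC", max_dist = 0.2)
print(unique(cleaned))
```

Because the function returns a new vector, the original column stays untouched until you assign the result back, e.g. `MYDATA$names <- replace_similar(MYDATA$names, Searchstring, Sensthreshold)`.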
Can you describe what you have tried so far? For example, why doesn't base::agrep work for you? – Calimo
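Dear @Calimo, base::agrep works very well for finding similar strings, but I can't get it to replace the strings row by row. I tried a few loops, including while loops, but had no luck. The algorithm should be as follows: 1) R finds a string in the vector; 2) it compares that string with the other strings; 3) every string that is similar to it (given some distance measure) must be replaced with that string. – user16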
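The three steps described above could be sketched as a greedy loop around agrep. This is an illustration under assumptions, not a vetted deduplication method: `unify_similar`, `max_dist`, and the sample data are hypothetical. The first spelling encountered becomes the reference, so a misspelled entry that appears first would win, and because agrep matches approximate substrings, short entries can absorb longer ones.

```r
# Hypothetical sketch of the three steps; all names here are illustrative.
unify_similar <- function(x, max_dist = 0.1) {
  x <- as.character(x)
  done <- rep(FALSE, length(x))  # entries already assigned to a reference string
  for (i in seq_along(x)) {
    # skip entries that are already handled, missing, or empty
    if (done[i] || is.na(x[i]) || !nzchar(x[i])) next
    # steps 1-2: take x[i] as the reference and compare it with all entries
    hits <- agrep(x[i], x, ignore.case = TRUE, max.distance = max_dist)
    hits <- hits[!done[hits]]
    # step 3: overwrite every similar, still unassigned entry with the reference
    x[hits] <- x[i]
    done[hits] <- TRUE
    done[i] <- TRUE
  }
  x
}

# Made-up example data
company_names <- c("Investment LLC", "Invesstment LLC", "Acme Corp",
                   "Acme Corpp", "Investmnt LLC")
unified <- unify_similar(company_names, max_dist = 0.2)
print(unified)
```

Sorting the input so that the known-good spellings come first (or checking each group by hand, as in the manual workflow) reduces the risk of propagating a typo.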
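Please post the code you already have so that we can take it from there. By the way, do I understand correctly from your comment that it would be fine to pick the misspelled "Invezzstment LLC"? – Calimo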