Partial string matching and replacement in R

I have a data frame like this:

> myDataFrame 
      company 
1 Investment LLC 
2 Hyperloop LLC 
3 Invezzstment LLC 
4 Investment_LLC 
5 Haiperloop LLC 
6 Inwestment LLC

I need to match all of these fuzzy strings in the data frame, so that the final result looks like this:

> myDataFrame 
      company 
1 Investment LLC 
2 Hyperloop LLC 
3 Investment LLC 
4 Investment LLC 
5 Hyperloop LLC 
6 Investment LLC

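So, in effect, I have to solve a partial match-and-replace task for a categorical variable. R and its packages have many great functions for string matching, but I am stuck finding a solution for this kind of match-and-replace. I don't care which occurrence replaces the others; for example, "Investment LLC" and "Invezzstment LLC" are equally good. They just need to be consistent.

Is there a single all-in-one function or loop for this?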

Source

2016-05-18 user16

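Can you describe what you have tried so far? For example, why doesn't base::agrep work for you? – Calimo

Dear @Calimo, base::agrep works well for finding similar strings, but I can't get it to replace the strings row by row. I tried some for and while loops, but with no luck. The algorithm should be as follows: 1) R finds a string in the vector, 2) compares it with the other strings, 3) every string similar to it (given some distance measure) must be replaced with that string. – user16

Please post the code you already have so we can take it from there. By the way, do I understand correctly from your comment that picking the misspelled "Invezzstment LLC" would be OK? – Calimo

So, after some time, I ended up with this dumb code. Note: this is not a fully automated replacement process, because each match should be verified by a human, and the agrep max.distance parameter needs fine-tuning each time. I am sure there is a way to make it better and faster, but this helps get the job done.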
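The three steps user16 lists in the comments could be sketched as a greedy loop. This is only an illustration, not code from the thread: unify_greedy is a hypothetical name, the 0.2 threshold is an assumption that needs tuning per data set, and the result depends on row order because the first spelling encountered wins.

```r
# Hypothetical helper: walk through the vector and overwrite every
# not-yet-processed element that approximately matches the current one.
unify_greedy <- function(x, max_distance = 0.2) {
  x <- as.character(x)
  done <- is.na(x) | !nzchar(x)   # agrep cannot use NA or "" as a pattern
  for (i in seq_along(x)) {
    if (done[i]) next
    # 1) take x[i], 2) compare it with all strings (literal pattern by default)
    hits <- agrep(x[i], x, max.distance = max_distance)
    # 3) replace the similar, still unprocessed strings with x[i]
    hits <- union(i, hits[!done[hits]])
    x[hits] <- x[i]
    done[hits] <- TRUE
  }
  x
}

companies <- c("Investment LLC", "Hyperloop LLC", "Invezzstment LLC",
               "Investment_LLC", "Haiperloop LLC", "Inwestment LLC")
unified <- unify_greedy(companies)
print(unified)
```

Every element triggers at most one agrep call over the whole vector, so no full distance matrix is built, but the loop can still be slow on long vectors with many distinct spellings.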

    ##########
    # Manual renaming with partial matches
    ##########

    # a) Take a look at the desired column of factor variables
    sort(unique(MYDATA$names))

    # ****
    Sensthreshold <- 0.2               # sensitivity of agrep; 0.1-0.2 usually gets it right
    Searchstring <- "Invesstment LLC"  # string to search for (also used as the replacement)
    # ****

    # User-defined function: returns the distinct strings in the column
    # that approximately match the query
    Searcher <- function(input, similarity = 0.1) {
      unique(agrep(input,
                   MYDATA$names,       # <-- define your column here
                   ignore.case = TRUE, value = TRUE,
                   max.distance = similarity))
    }

    # b) Search for the desired string
    Searcher(Searchstring, Sensthreshold)
    ### PLEASE INSPECT THE OUTPUT OF THE SEARCH
    ### Did it get it right?

    # ===========================================================================#
    ## ACTION! This changes your data frame!
    ## Please make a backup before proceeding
    ## Please execute this code as a whole to avoid errors

    # c) Build a vector of row indexes after checking the output
    vector_of_cells <- agrep(Searchstring,
                             MYDATA$names, ignore.case = TRUE,
                             max.distance = Sensthreshold)
    # d) Apply the changes
    # (convert a factor column first; assigning an unknown level would give NA)
    MYDATA$names <- as.character(MYDATA$names)
    MYDATA$names[vector_of_cells] <- Searchstring  # <--- CHANGING STRING
    # e) Check the result
    unique(agrep(Searchstring, MYDATA$names,
                 ignore.case = TRUE, value = TRUE, max.distance = Sensthreshold))
    # ===========================================================================#
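Since every match is meant to be checked by a person, it may help to build a table of proposed changes first and apply it only after inspection. The sketch below is an illustration under assumptions: preview_replacement and apply_replacement are hypothetical names, and it works on a plain vector rather than on MYDATA.

```r
# Hypothetical helper: list the rows that would change, without changing them.
preview_replacement <- function(x, target, max_distance = 0.2) {
  hits <- agrep(target, x, ignore.case = TRUE, max.distance = max_distance)
  data.frame(row = hits,
             old = as.character(x[hits]),
             new = rep(target, length(hits)),
             stringsAsFactors = FALSE)
}

# Hypothetical helper: apply a (possibly hand-edited) preview table.
apply_replacement <- function(x, preview) {
  x <- as.character(x)          # avoids NA when x is a factor
  x[preview$row] <- preview$new
  x
}

companies <- c("Investment LLC", "Hyperloop LLC", "Invezzstment LLC",
               "Investment_LLC", "Haiperloop LLC", "Inwestment LLC")
preview <- preview_replacement(companies, "Investment LLC")
print(preview)                  # inspect, drop unwanted rows, then apply
cleaned <- apply_replacement(companies, preview)
```

Rows that turn out to be false matches can be removed from the preview table before calling apply_replacement, so the original vector is never modified by the search itself.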

Source

2016-05-26 08:04:56 user16

If you have a vector of the correct spellings, agrep makes this fairly easy:

myDataFrame$company <- sapply(myDataFrame$company, 
           function(val){agrep(val, 
                c('Investment LLC', 'Hyperloop LLC'), 
                value = TRUE)}) 

myDataFrame 
#   company 
# 1 Investment LLC 
# 2 Hyperloop LLC 
# 3 Investment LLC 
# 4 Investment LLC 
# 5 Hyperloop LLC 
# 6 Investment LLC
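One caveat with the sapply call: agrep can return zero or several matches for a value, in which case sapply returns a list instead of a character vector. A slightly more defensive variant, given here as a sketch with a hypothetical helper name (match_one), keeps the original value whenever the match is not unique.

```r
# Known-good spellings, as in the answer above.
correct <- c("Investment LLC", "Hyperloop LLC")

# Hypothetical helper: return the single approximate match for `val`,
# or `val` itself when agrep finds no candidate or more than one.
match_one <- function(val, choices, max_distance = 0.1) {
  hit <- agrep(val, choices, value = TRUE, max.distance = max_distance)
  if (length(hit) == 1L) hit else val
}

# "Acme Corp" is an invented extra row that matches neither spelling.
companies <- c("Investment LLC", "Hyperloop LLC", "Invezzstment LLC",
               "Investment_LLC", "Haiperloop LLC", "Inwestment LLC",
               "Acme Corp")
cleaned <- vapply(companies, match_one, character(1),
                  choices = correct, USE.NAMES = FALSE)
print(cleaned)
```

vapply guarantees a character vector of the same length as the input, so the result can be assigned back to the data frame column without surprises.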

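If you don't have such a vector, you can probably build one with a clever application of adist, or even just with table if the correct spelling occurs more often than the others, which it likely will (though it doesn't here).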
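A rough sketch of the table idea, under stated assumptions: the sample data below is invented so that the correct spellings repeat, min_count and max_edits are arbitrary thresholds, and at least one spelling is assumed to reach min_count.

```r
# Invented sample in which the correct spellings occur more than once.
companies <- c("Investment LLC", "Investment LLC", "Investment LLC",
               "Invezzstment LLC", "Hyperloop LLC", "Hyperloop LLC",
               "Haiperloop LLC")

# Treat spellings seen at least `min_count` times as correct (assumption).
counts <- table(companies)
min_count <- 2L
canonical <- names(counts)[as.vector(counts) >= min_count]

# Edit distance from every row to every canonical spelling:
# a length(companies) x length(canonical) matrix, not n x n.
d <- adist(companies, canonical)
nearest <- apply(d, 1, which.min)
best <- d[cbind(seq_along(companies), nearest)]

# Only replace when the nearest canonical spelling is close enough.
max_edits <- 3
cleaned <- ifelse(best <= max_edits, canonical[nearest], companies)
print(cleaned)
```

Because the distance matrix has only one column per canonical spelling, this stays small even for long vectors, as long as the list of frequent spellings is short.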

Source

2016-05-18 07:53:33 alistaire

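Thanks for your reply, alistaire! I don't have a vector of correctly spelled entities. I tried your suggestion about the adist function, but with n = 59396 records R cannot compute the string distance matrix: the resulting matrix object would exceed 26.3 Gb. table is a good idea, I will try it. – user16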
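The 26.3 Gb figure is what a full 59396 x 59396 matrix of doubles costs (59396^2 * 8 bytes is about 26.3 GiB). If many rows are exact duplicates, one option is to compute distances between the unique spellings only and cluster those. The following is a sketch under assumptions, not a tested solution for the real data: canonicalize is a hypothetical name, single-linkage clustering and the cut height of 2.5 edits are arbitrary choices, and it does not help if nearly every row is distinct.

```r
# Hypothetical helper: group unique spellings by edit distance and replace
# each one with the most frequent spelling in its group.
canonicalize <- function(x, cut_height = 2.5) {
  x <- as.character(x)
  counts <- table(x)
  u <- names(counts)                 # unique spellings only
  if (length(u) < 2L) return(x)      # hclust needs at least two items
  d <- adist(u)                      # length(u)^2 entries instead of length(x)^2
  groups <- cutree(hclust(as.dist(d), method = "single"), h = cut_height)
  # Representative per group: the spelling with the highest count
  # (the first one in sort order on ties).
  best <- vapply(split(seq_along(u), groups),
                 function(i) u[i][which.max(counts[i])], character(1))
  unname(best[as.character(groups)])[match(x, u)]
}

companies <- c("Investment LLC", "Hyperloop LLC", "Invezzstment LLC",
               "Investment_LLC", "Haiperloop LLC", "Inwestment LLC")
cleaned <- canonicalize(companies)
print(cleaned)
```

Which spelling ends up as the representative is arbitrary when all counts are equal, which is acceptable here since the question only asks for consistency. Single linkage can chain unrelated names together through intermediate spellings, so the groups should still be inspected before the result is trusted.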
