2015-05-22 45 views
1

I have a list of owner names in all caps that I want to convert to proper capitalization. The strings mix personal names with company names:

owner1<-c("DXXXXX JOSEPH V JR","MIRNA NXXXXX","ADRIAN TXXXX", 
      "CUTLER PXXXXXXXXX LLC","GVM PXXXXXXXXX LLC", 
      "EARLENA RXXXXXXX","NATHANIEL TXXXXX","DXXXXXX DONNA", 
      "LXXXX ELAINE E TR","SXXXXXX KIMBERLY") 

Desired output:

                   owner1
 1:   Dxxxxx Joseph V. Jr
 2:          Mirna Nxxxxx
 3:          Adrian Txxxx
 4: Cutler Pxxxxxxxxx LLC
 5:    GVM Pxxxxxxxxx LLC
 6:      Earlena Rxxxxxxx
 7:      Nathaniel Txxxxx
 8:         Dxxxxxx Donna
 9:    Lxxxx Elaine E. TR
10:      Sxxxxxx Kimberly

A big first step is a version of the .simpleCap function mentioned in ?chartr:

.simpleCap <- function(x) { 
    s <- strsplit(tolower(x), " ")[[1]] 
    paste(toupper(substring(s, 1, 1)), substring(s, 2), 
      sep = "", collapse = " ") 
} 

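This takes care of a big chunk of the problem, but it fails on observations 4, 5 and 9. I could add separate handling for key phrases (LLC, TR, etc.), but that still leaves cases like observation 5.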
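To make the failure concrete, here is a minimal sketch that runs .simpleCap over three of the sample strings. The variable name `sample_names` and the choice of subset are only for illustration.

```r
# .simpleCap as shown above (it handles a single string at a time)
.simpleCap <- function(x) {
  s <- strsplit(tolower(x), " ")[[1]]
  paste(toupper(substring(s, 1, 1)), substring(s, 2),
        sep = "", collapse = " ")
}

# Illustrative subset: a plain name, a company, and a trust
sample_names <- c("MIRNA NXXXXX", "GVM PXXXXXXXXX LLC", "LXXXX ELAINE E TR")

# Loop with vapply because .simpleCap is not vectorized
result <- vapply(sample_names, .simpleCap, character(1), USE.NAMES = FALSE)
print(result)
# "Mirna Nxxxxx" "Gvm Pxxxxxxxxx Llc" "Lxxxx Elaine E Tr"
```

The plain name comes out right, while GVM, LLC and TR are all lower-cased after their first letter.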

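Here is what I have so far (the function got a wonderful speed-up from following @eipi10's solution below, which vectorizes .simpleCap so that the whole function can be applied to a vector):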

to.proper <- function(strings) {
    # vectorized version of .simpleCap;
    #   I've also built in that I know `strings` is all caps
    res <- gsub("\\b([A-Z])([A-Z]+)*", "\\U\\1\\L\\2", strings, perl = TRUE)
    # In my data, some Irish/Scottish names separated the MC prefix
    #   Also, re-capitalize following a hyphen
    res <- gsub("\\bMc\\s", "Mc", gsub("(-.)", "\\U\\1", res, perl = TRUE))
    for (init in c("[A-Z]", "Inc", "Assoc", "Co",
                   "Jr", "Sr", "Tr", "Bros")) {
        # Add a period after common abbreviations
        res <- gsub(paste0("\\b(", init, ")\\b"), "\\1.", res)
    }
    for (abbr in c("[B-DF-HJ-NP-TV-XZ][b-df-hj-np-tv-xz]{2,}",
                   "Pa", "Ii", "Iii", "Iv", "Lp", "Tj",
                   "Xiv", "Ll", "Yml", "Us")) {
        # Re-capitalize any string of >=3 consonants (excluding
        #   Y for such names as LYNN and WYNN), as well as
        #   some other common phrases that need upper-casing
        res <- gsub(paste0("\\b(", abbr, ")\\b"), "\\U\\1", res, perl = TRUE)
    }
    # Re-capitalize post-Mc letters, e.g. in Mcmahon
    gsub("\\bMc([a-z])", "Mc\\U\\1", res, perl = TRUE)
}

Any ideas for leaving potentially unpredictable abbreviations alone during this process (especially uncommon ones like the one in observation 5)?

+1

I think you may need some list of suffixes, such as 'LLC' and 'TR', to leave out of the match so that they are not re-cased. – akrun
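One way the suffix-list idea could look, as a rough sketch only: `proper_except` and `keep_upper` are hypothetical names, the list of suffixes is an assumption, and the input is assumed to be space-separated upper-case words.

```r
# Hypothetical helper: title-case every word unless it appears in keep_upper
proper_except <- function(strings, keep_upper = c("LLC", "TR", "LP", "FBO")) {
  vapply(strsplit(strings, " ", fixed = TRUE), function(words) {
    # First letter upper, remainder lower
    cased <- paste0(toupper(substring(words, 1, 1)),
                    tolower(substring(words, 2)))
    # Restore the words that should stay fully capitalized
    keep <- words %in% keep_upper
    cased[keep] <- words[keep]
    paste(cased, collapse = " ")
  }, character(1))
}

out <- proper_except(c("CUTLER PXXXXXXXXX LLC",
                       "LXXXX ELAINE E TR",
                       "GVM PXXXXXXXXX LLC"))
print(out)
# "Cutler Pxxxxxxxxx LLC" "Lxxxx Elaine E TR" "Gvm Pxxxxxxxxx LLC"
```

This keeps LLC and TR intact, but an acronym that is not on the list, such as GVM, is still turned into Gvm.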

+1

In addition to @akrun's suggestion, have you tried stri_trans_totitle() from the stringi package? – lawyeR

+0

@lawyeR That should give the same problem. I tried it :-) – akrun

Answers

2

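Here is a function that uses regular expressions to convert strings to title case (adapted from @BenBolker's answer to a question I asked on SO a while back).

The function is written so that you can pass an argument, exceptions, to handle special cases such as GVM. I'm not sure this is flexible enough for your needs, since you have to hard-code the exceptions, but I thought I'd post it and see whether anyone can suggest improvements.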

dat = data.frame(owner1 = c("DXXXXX JOSEPH V JR","MIRNA NXXXXX","ADRIAN TXXXX", 
            "CUTLER PXXXXXXXXX LLC","GVM PXXXXXXXXX LLC", 
            "EARLENA RXXXXXXX","NATHANIEL TXXXXX","DXXXXXX DONNA", 
            "LXXXX ELAINE E TR","SXXXXXX KIMBERLY")) 

# Convert a string to title case 
tc = function(strings, exceptions="\\b(gvm)\\b") { 

    # Convert to title case, excluding terminal LLC, TR, etc. 
    title.case = gsub("\\b([a-zA-Z])([a-zA-Z]+)*( LLC| TR| FBO| LP)?", 
        "\\U\\1\\L\\2\\U\\3", strings, perl=TRUE) 

    # Add a period after initials (presumed to be any lone capital letter) 
    title.case = gsub(" ([A-Z]) ", " \\1\\. ", title.case) 

    # Deal with exceptions 
    title.case = gsub(exceptions, "\\U\\1", title.case, perl=TRUE, ignore.case=TRUE) 

    return(title.case) 
} 

dat$title.case = tc(dat$owner1) 

                  owner1            title.case
1     DXXXXX JOSEPH V JR   Dxxxxx Joseph V. Jr
2           MIRNA NXXXXX          Mirna Nxxxxx
3           ADRIAN TXXXX          Adrian Txxxx
4  CUTLER PXXXXXXXXX LLC Cutler Pxxxxxxxxx LLC
5     GVM PXXXXXXXXX LLC    GVM Pxxxxxxxxx LLC
6       EARLENA RXXXXXXX      Earlena Rxxxxxxx
7       NATHANIEL TXXXXX      Nathaniel Txxxxx
8          DXXXXXX DONNA         Dxxxxxx Donna
9      LXXXX ELAINE E TR    Lxxxx Elaine E. TR
10      SXXXXXX KIMBERLY      Sxxxxxx Kimberly
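If more acronyms turn up, the exceptions pattern can carry several alternatives inside its single capture group. The sketch below restates tc compactly so it runs on its own. It assumes the suffix alternatives each start with a space (" LLC", " TR", and so on), and the IBM example string is made up for illustration.

```r
# Compact restatement of tc; suffixes are matched with their leading space
tc <- function(strings, exceptions = "\\b(gvm)\\b") {
  title.case <- gsub("\\b([a-zA-Z])([a-zA-Z]+)*( LLC| TR| FBO| LP)?",
                     "\\U\\1\\L\\2\\U\\3", strings, perl = TRUE)
  # Period after a lone capital letter surrounded by spaces
  title.case <- gsub(" ([A-Z]) ", " \\1. ", title.case)
  # Upper-case anything matched by the exceptions pattern
  gsub(exceptions, "\\U\\1", title.case, perl = TRUE, ignore.case = TRUE)
}

x <- c("GVM PXXXXXXXXX LLC", "IBM HOLDINGS LP", "DXXXXX JOSEPH V JR")

default_out <- tc(x)                                     # only GVM is protected
custom_out  <- tc(x, exceptions = "\\b(gvm|ibm)\\b")     # GVM and IBM protected
print(custom_out)
# "GVM Pxxxxxxxxx LLC" "IBM Holdings LP" "Dxxxxx Joseph V. Jr"
```

With the default pattern the second string comes back as "Ibm Holdings LP", so each new acronym still has to be added by hand.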
+0

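Big props for the vectorized version of the '.simpleCap' function I was using; it sped up my code a lot. I ended up settling on a function close to the one you showed. Mine is more tailored; to generalize it, I would probably pass 'exceptions' and 'initialize' arguments. – MichaelChirico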

+0

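I'm also using the following to find out which 2-letter consonant strings are around and deal with them one by one: regmatches(string,regexpr("\\b[B-DF-HJ-NP (unfortunately necessary because of the large number of abbreviations such as Jr, Sr, Co, Sc (school), Ch (church), etc., as well as some Vietnamese names such as Ng) – MichaelChirico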
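The snippet in that comment is cut off, so the following is only a guess at the idea rather than the commenter's code: tabulate every stand-alone two-letter consonant token so each can be reviewed by hand. The sample strings and the exact pattern are assumptions.

```r
# Made-up, already title-cased names (not from the original data)
res <- c("Ng Van Txxxx", "Dxxxxx Joseph V. Jr", "Ch Of Sxxxx",
         "Mirna Nxxxxx", "Sxxxx Tom Jr")

# Assumed pattern: capital consonant + one lowercase consonant as a whole word
# (Y is treated as a vowel, following the question's convention)
pattern <- "\\b[B-DF-HJ-NP-TV-XZ][b-df-hj-np-tv-xz]\\b"

# gregexpr locates every match in each string; regmatches extracts them
hits <- unlist(regmatches(res, gregexpr(pattern, res, perl = TRUE)))

# Frequency table, most common first, for case-by-case review
counts <- sort(table(hits), decreasing = TRUE)
print(counts)
```

Here the table lists Jr twice and Ch and Ng once each, which shows why a blanket upper-casing rule for two-letter consonant strings would be unsafe.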