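I have a list of owner names, all in upper case, that mixes personal names with company names, and I'd like to convert them to proper capitalization: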
owner1 <- c("DXXXXX JOSEPH V JR", "MIRNA NXXXXX", "ADRIAN TXXXX",
            "CUTLER PXXXXXXXXX LLC", "GVM PXXXXXXXXX LLC",
            "EARLENA RXXXXXXX", "NATHANIEL TXXXXX", "DXXXXXX DONNA",
            "LXXXX ELAINE E TR", "SXXXXXX KIMBERLY")
Desired output:
owner1
1: Dxxxxx Joseph V. Jr
2: Mirna Nxxxxx
3: Adrian Txxxx
4: Cutler Pxxxxxxxxx LLC
5: GVM Pxxxxxxxxx LLC
6: Earlena Rxxxxxxx
7: Nathaniel Txxxxx
8: Dxxxxxx Donna
9: Lxxxx Elaine E. TR
10: Sxxxxxx Kimberly
A great first step is a version of the .simpleCap function mentioned in ?chartr:
.simpleCap <- function(x) {
  s <- strsplit(tolower(x), " ")[[1]]
  paste(toupper(substring(s, 1, 1)), substring(s, 2),
        sep = "", collapse = " ")
}
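This takes care of a big chunk of the problem, but it fails on 4, 5 and 9. I could add special handling for key phrases (LLC, TR, etc.) separately, but that still leaves cases like observation 5 (GVM).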
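To make the failure concrete, here is a quick check on three of the sample names. `owner_sample` and `simple_result` are just illustrative names, and `vapply` is used because `.simpleCap` handles only one string at a time:

```r
# Same .simpleCap as above, repeated so the snippet runs on its own
.simpleCap <- function(x) {
  s <- strsplit(tolower(x), " ")[[1]]
  paste(toupper(substring(s, 1, 1)), substring(s, 2),
        sep = "", collapse = " ")
}

# Illustrative subset: observations 4, 5 and 9
owner_sample <- c("CUTLER PXXXXXXXXX LLC", "GVM PXXXXXXXXX LLC",
                  "LXXXX ELAINE E TR")

# Apply one string at a time, since .simpleCap is not vectorized
simple_result <- vapply(owner_sample, .simpleCap, character(1),
                        USE.NAMES = FALSE)
simple_result
# "Cutler Pxxxxxxxxx Llc" "Gvm Pxxxxxxxxx Llc" "Lxxxx Elaine E Tr"
```

LLC, GVM and TR are all title-cased like ordinary words, which is exactly what I want to avoid.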
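Here's what I have so far (I've sped the function up tremendously by following @eipi10's solution below, which vectorizes the .simpleCap function so that the whole thing can be applied to a vector):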
to.proper <- function(strings) {
  # Vectorized version of .simpleCap;
  #   this also relies on knowing `strings` is all caps
  res <- gsub("\\b([A-Z])([A-Z]+)*", "\\U\\1\\L\\2", strings, perl = TRUE)
  # In my data, some Irish/Scottish names have the MC prefix separated;
  #   also, re-capitalize following a hyphen
  res <- gsub("\\bMc\\s", "Mc", gsub("(-.)", "\\U\\1", res, perl = TRUE))
  for (init in c("[A-Z]", "Inc", "Assoc", "Co",
                 "Jr", "Sr", "Tr", "Bros")) {
    # Add a period after common abbreviations
    res <- gsub(paste0("\\b(", init, ")\\b"), "\\1.", res)
  }
  for (abbr in c("[B-DF-HJ-NP-TV-XZ][b-df-hj-np-tv-xz]{2,}",
                 "Pa", "Ii", "Iii", "Iv", "Lp", "Tj",
                 "Xiv", "Ll", "Yml", "Us")) {
    # Re-capitalize any string of >=3 consonants (excluding
    #   Y for such names as LYNN and WYNN), as well as
    #   some other common phrases that need upper-casing
    res <- gsub(paste0("\\b(", abbr, ")\\b"), "\\U\\1", res, perl = TRUE)
  }
  # Re-capitalize post-Mc letters, e.g. in Mcmahon
  gsub("\\bMc([a-z])", "Mc\\U\\1", res, perl = TRUE)
}
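For anyone unfamiliar with the trick, the vectorization rests on the `\\U` and `\\L` case-conversion escapes that `gsub` accepts in the replacement string when `perl = TRUE`. A minimal illustration of just that first step, on two of the sample names (`x` and `title_cased` are only illustrative names):

```r
x <- c("MIRNA NXXXXX", "GVM PXXXXXXXXX LLC")

# Group 1 is the first letter of each word, group 2 the rest of the word.
# \\U upper-cases what follows it, \\L lower-cases what follows it.
title_cased <- gsub("\\b([A-Z])([A-Z]+)*", "\\U\\1\\L\\2", x, perl = TRUE)
title_cased
# "Mirna Nxxxxx" "Gvm Pxxxxxxxxx Llc"
```

Because `gsub` works on the whole character vector at once, no `sapply` loop over the names is needed.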
Any ideas on how to leave potentially unpredictable abbreviations alone in this process (especially uncommon ones like the one in observation 5)?
I think you may need some list of suffixes such as 'LLC, TR' to leave out of the match rather than capitalizing them – akrun
In addition to @akrun's suggestion, have you tried stri_trans_totitle() from the stringi package? – lawyeR
@lawyeR That should give the same problem too. I tried it :-) – akrun
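A rough sketch of the suffix-list idea from the first comment, offered as one possible reading of it rather than a tested solution. `to_proper_keep` and `keep_upper` are hypothetical names, and the tokens are assumed to be plain letters with no regex metacharacters:

```r
# Hypothetical helper: title-case every word, then restore a hand-kept
# list of tokens that should stay fully upper-case.
to_proper_keep <- function(strings, keep_upper = c("LLC", "TR", "GVM")) {
  # Title-case all words (assumes the input is all caps)
  res <- gsub("\\b([A-Z])([A-Z]+)*", "\\U\\1\\L\\2", strings, perl = TRUE)
  for (tok in keep_upper) {
    # Title-cased form of the token, e.g. "LLC" -> "Llc"
    cased <- paste0(substring(tok, 1, 1), tolower(substring(tok, 2)))
    # Put the upper-case token back wherever it appears as a whole word
    res <- gsub(paste0("\\b", cased, "\\b"), tok, res, perl = TRUE)
  }
  res
}

to_proper_keep(c("GVM PXXXXXXXXX LLC", "LXXXX ELAINE E TR", "MIRNA NXXXXX"))
# "GVM Pxxxxxxxxx LLC" "Lxxxx Elaine E TR" "Mirna Nxxxxx"
```

The obvious limitation is that the list has to be maintained by hand: GVM is only preserved here because it was added to `keep_upper`, so this does not by itself solve the problem of abbreviations that cannot be predicted in advance.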