2012-10-24

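I would like to ask a follow-up question to this issue, because another problem has come up: I found subjects (e.g. Cultural Studies) that belong to more than one category (Arts & Humanities and Social Sciences), so there are overlaps that have to be taken into account. How do I correctly eliminate duplicates within overlapping strings in R?

I have a long list of categories, for example this machine-readable sample: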

AB <- c("Science","Arts & Humanities","Arts & Humanities; Social Sciences","Science","Arts & Humanities; Arts & Humanities; Social Sciences","Science","Science; Social Sciences","Social Sciences; Science") 

So it looks like this:

> AB 
[1] "Science"            "Arts & Humanities" 
[3] "Arts & Humanities; Social Sciences"     "Science" 
[5] "Arts & Humanities; Arts & Humanities; Social Sciences" "Science" 
[7] "Science; Social Sciences"        "Social Sciences; Science" 

I would like to modify these terms and eliminate the duplicates to get this result:

[1] "Science"         "Arts & Humanities" 
[3] "Arts & Humanities; Social Sciences"   "Science" 
[5] "Arts & Humanities; Social Sciences"   "Science" 
[7] "Science; Social Sciences"     "Science; Social Sciences" 

So I am looking for another loop that eliminates the duplicates in #5. I tried strsplit() together with unique(), but that did not work:

> unique(strsplit(AB, "; *")) 
[[1]] 
[1] "Science" 

[[2]] 
[1] "Arts & Humanities" 

[[3]] 
[1] "Arts & Humanities" "Social Sciences" 

[[4]] 
[1] "Arts & Humanities" "Arts & Humanities" "Social Sciences" 

[[5]] 
[1] "Social Sciences" "Science" 
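
A note on why this attempt falls short: unique() applied to the list returned by strsplit() compares whole list elements with one another, so it drops repeated vectors (the second "Science") but leaves duplicates inside a single vector untouched. A minimal sketch of applying unique() within each element instead, using a shortened sample vector (the names AB_small, parts and per_element are made up for illustration):

```r
# Shortened sample; AB_small is a hypothetical stand-in for the full AB vector
AB_small <- c("Arts & Humanities; Arts & Humanities; Social Sciences",
              "Social Sciences; Science")

# Split each string into its categories (one character vector per string)
parts <- strsplit(AB_small, "; *")

# De-duplicate inside each vector rather than across the list
per_element <- lapply(parts, unique)
per_element
```

This removes the repeat in the first element but does not yet put "Social Sciences; Science" into a common order or paste the pieces back together.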

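So I would like to ask you once more, please: how can I achieve the correct output shown above? Thank you very much in advance for your consideration!

Answers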

I think it has to do with trailing or leading white space. Applying this to AB will take care of that for you:

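fun <- function(text.var){ 
    # Split the string into its individual categories
    x <- unlist(strsplit(text.var, ";")) 
    # Helper that strips leading and trailing whitespace
    Trim <- function(x) gsub("^\\s+|\\s+$", "", x) 
    # Trim, drop duplicates, sort alphabetically, and re-join
    paste(sort(unique(Trim(x))), collapse="; ") 
} 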

sapply(AB, fun, USE.NAMES = FALSE) 

This yields:

> sapply(AB, fun, USE.NAMES = FALSE) 
[1] "Science"       "Arts & Humanities"     
[3] "Arts & Humanities; Social Sciences" "Science"       
[5] "Arts & Humanities; Social Sciences" "Science"       
[7] "Science; Social Sciences"   "Science; Social Sciences"  
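
The same steps (split, trim, de-duplicate, sort, re-join) can also be written with base R's trimws(), which has been available since R 3.2.0 and makes a hand-written Trim helper unnecessary. This is only a sketch of an alternative, not the code from the answer above; the function name dedupe_categories and the shortened vector AB_small are made up for illustration.

```r
# Hypothetical helper: split on ";", trim, de-duplicate, sort, re-join
dedupe_categories <- function(x) {
  vapply(strsplit(x, ";", fixed = TRUE), function(parts) {
    paste(sort(unique(trimws(parts))), collapse = "; ")
  }, character(1))
}

# Shortened sample taken from the question's AB vector
AB_small <- c("Science",
              "Arts & Humanities; Arts & Humanities; Social Sciences",
              "Social Sciences; Science")

result <- dedupe_categories(AB_small)
result
```

Because strsplit() is vectorised and vapply() returns a plain character vector, no sapply(..., USE.NAMES = FALSE) wrapper is needed here. Sorting is what makes "Social Sciences; Science" and "Science; Social Sciences" come out identical.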

There is also trim() in the gdata package. –

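Thank you very much for your reply, @Tyler Rinker! Unfortunately, this only gives me the error "Error in unique(Trim(x)) : could not find function "Trim"". Do I have to install the gdata package first? – user1496104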

Sorry. I had not defined it as a function. Try it now. –