2016-10-13 54 views

Remove rows with the same object in a data frame

I have a data frame with about 8 million rows that looks like this:

Trevor Brown Chris Coghlan Starlin Castro Kelby Tomlinson Brandon Crawford Brandon Crawford Kelby Tomlinson Brandon Crawford 

Buster Posey Chris Coghlan Starlin Castro Kelby Tomlinson Brandon Crawford Brandon Crawford Kelby Tomlinson Brandon Crawford 

. 
. 
. 
. 

Trevor Brown Brandon Crawford Starlin Castro Kelby Tomlinson Brandon Crawford Brandon Crawford Kelby Tomlinson Brandon Crawford 

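Many rows contain repeated names, and I want to remove them. Vectorizing each row and then checking it for duplicates is proving difficult: the data frame has 8 million rows, so it takes a very long time. Is there a faster way to do this?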

Is each row a single string? – akrun

16 strings per row. It is an 8 x 8 million data frame, with eight full names in each row. – James

You could try `apply` and `unique`. – parksw3

Answers

0

Based on what I could gather from the question and the comments, I came up with this solution.

library(gtools)  # provides permutations()
a <- LETTERS[1:8]
# Every ordering of the 8 letters, each letter used exactly once per row
data <- permutations(n = 8, r = 8, v = a)
tail(data)

#   [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] 
# [40315,] "H" "G" "F" "E" "D" "A" "B" "C" 
# [40316,] "H" "G" "F" "E" "D" "A" "C" "B" 
# [40317,] "H" "G" "F" "E" "D" "B" "A" "C" 
# [40318,] "H" "G" "F" "E" "D" "B" "C" "A" 
# [40319,] "H" "G" "F" "E" "D" "C" "A" "B" 
# [40320,] "H" "G" "F" "E" "D" "C" "B" "A" 

Does this solve the problem? (It creates all 8! = 40320 permutations, so no row contains the same letter twice.)
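
If the goal is instead to drop every existing row in which a name occurs more than once, one possible sketch is to compare the columns pairwise, so that each comparison runs vectorized over all rows at once. This assumes the data is an 8-column character matrix or data frame with one full name per cell (as described in the comments) and no missing values; `has_dup_in_row` is a hypothetical helper written for illustration, not a function from any package.

```r
# Hypothetical helper: TRUE for rows where any two columns hold the same value.
# Assumes at least two columns and no NA values.
has_dup_in_row <- function(x) {
  x <- as.matrix(x)
  dup <- logical(nrow(x))
  pairs <- combn(ncol(x), 2)  # all column pairs (28 pairs for 8 columns)
  for (k in seq_len(ncol(pairs))) {
    # One vectorized comparison across every row per column pair
    dup <- dup | (x[, pairs[1, k]] == x[, pairs[2, k]])
  }
  dup
}

# Small made-up example with 3 columns instead of 8
m <- matrix(c("A", "B", "C",
              "A", "B", "A",
              "C", "C", "B"), ncol = 3, byrow = TRUE)

keep <- !has_dup_in_row(m)
m[keep, , drop = FALSE]  # only the first row has no repeated value
```

The row-wise equivalent, in the spirit of the `apply` suggestion in the comments, would be `apply(m, 1, anyDuplicated) == 0`. It is shorter, but it calls an R function once per row, which is likely to be slow on 8 million rows.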

0
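# Start with an empty column, then fill it row by row
df$unique_names <- ""

for (i in seq_len(nrow(df))) {
    # Split on spaces, drop repeated words, and glue the rest back together
    words <- strsplit(df$names[i], " ", fixed = TRUE)[[1]]
    df$unique_names[i] <- paste(unique(words), collapse = " ")
}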

df$unique_names 
[1] "Trevor Brown Chris Coghlan Starlin Castro Kelby Tomlinson Brandon Crawford" 
[2] "Buster Posey Chris Coghlan Starlin Castro Kelby Tomlinson Brandon Crawford" 

Data

df <- data.frame(names = c(
    "Trevor Brown Chris Coghlan Starlin Castro Kelby Tomlinson Brandon Crawford Brandon Crawford Kelby Tomlinson Brandon Crawford",
    "Buster Posey Chris Coghlan Starlin Castro Kelby Tomlinson Brandon Crawford Brandon Crawford Kelby Tomlinson Brandon Crawford"
), stringsAsFactors = FALSE)
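
One caveat with splitting on single spaces is that duplicates are detected per word rather than per full name, so two different players who share a last name would lose a word. A minimal sketch that deduplicates whole names instead is shown here; it assumes every name consists of exactly two words (first name, last name), and `unique_full_names` is a hypothetical helper introduced only for this illustration.

```r
# Hypothetical helper: drop repeated full names from one space-separated string.
# Assumes every name is exactly two words (first name, last name).
unique_full_names <- function(s) {
  words <- strsplit(s, " ", fixed = TRUE)[[1]]
  words <- words[nzchar(words)]  # ignore empty pieces from doubled spaces
  # Glue consecutive word pairs (1st+2nd, 3rd+4th, ...) back into full names
  full <- paste(words[c(TRUE, FALSE)], words[c(FALSE, TRUE)])
  paste(unique(full), collapse = " ")
}

# Made-up input in which two different names share the word "Brown"
x <- c("Trevor Brown Chris Brown Trevor Brown",
       "Buster Posey Buster Posey")

res <- vapply(x, unique_full_names, character(1), USE.NAMES = FALSE)
res
```

`vapply` is used instead of an explicit loop mainly for brevity and a guaranteed character result; it still processes one string at a time, so for 8 million rows it may be worth splitting the names into separate columns first.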