2017-10-08

I have a data frame like the one shown below, and I want to combine rows that contain similar values into a single row.

   New_ment1_1     New_ment1_2  New_ment1_3    New_ment1_4
1  application     android      ios            NA
2  donald trump    agreement    climate        united states
3  donald trump    agreement    paris          united states
4  donald trump    agreement    united states  NA
5  donald trump    climate      emission       united states
6  donald trump    entertainer  host           president
7  hen             chicken      mustard        wimp
8  husband         pamela       private lives  NA
9  pan             chicken      hen            wimp
10 sex             associate    pamela         partner
11 united kingdom  chicken      hen            wimp
12 united states   agreement    paris          NA

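What I would like to get is something like the following.

For example, row 1 should stay as it is, because no other row in the data frame is similar to it.

If you look at rows 2, 3, 4, 5 and 12, they should be combined into a single row like

united states donald trump paris climate agreement emission 

and rows 7, 9 and 11 should be merged into

united kingdom chicken hen wimp mustard 

The values can be in any order.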

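It is not quite clear what you mean by "similar". Also, what have you tried so far? – useR

By similar I mean that, out of the 4 words in a row, if two rows have two words in common, I want to merge them. –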

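Answer

Assume the data frame DF is as shown reproducibly in the Note at the end.

Convert it to a character matrix m. Say that two rows are similar if they have more than one element in common, and define is_similar to take two row indices and return TRUE or FALSE accordingly. Then apply it to every pair of rows using outer. Interpret the resulting matrix as the adjacency matrix of a graph and compute its connected components. Split DF by component into a list spl, each element of which is a data frame holding the rows of DF that make up one connected component, and reduce each element to its unique non-NA values, giving the list L. Finally, convert L back to a character matrix.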

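library(igraph)

# Character matrix of the data; each row is compared with every other row.
m <- as.matrix(DF)
n <- nrow(m)

# TRUE if rows i and j share more than one non-NA value.
is_similar <- function(i, j) length(intersect(na.omit(m[i, ]), na.omit(m[j, ]))) > 1
smat <- outer(1:n, 1:n, Vectorize(is_similar))

# Treat smat as an adjacency matrix and label each row with its connected component.
# (graph.adjacency() is the older name of graph_from_adjacency_matrix().)
adj <- graph.adjacency(smat)
cl <- components(adj)$membership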

str(split(1:n, cl)) 
## List of 6 
## $ 1: int 1 
## $ 2: int [1:5] 2 3 4 5 12 
## $ 3: int 6 
## $ 4: int [1:3] 7 9 11 
## $ 5: int 8 
## $ 6: int 10 
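
Note that the grouping is transitive: two rows can land in the same group without being similar to each other, as long as a chain of similar rows links them. The following is a minimal base-R sketch of that point using rows 3, 5 and 12 of the data. The names n_shared, counts and label are illustrative, and the label-propagation loop is only a simple stand-in for components(), not a description of how igraph computes them.

```r
# Rows 3, 5 and 12 of the example data, as a character matrix.
m <- rbind(
  "3"  = c("donald_trump", "agreement", "paris", "united_states"),
  "5"  = c("donald_trump", "climate", "emission", "united_states"),
  "12" = c("united_states", "agreement", "paris", NA)
)

# Number of non-NA values that rows i and j have in common.
n_shared <- function(i, j) length(intersect(na.omit(m[i, ]), na.omit(m[j, ])))

idx <- seq_len(nrow(m))
counts <- outer(idx, idx, Vectorize(n_shared))
dimnames(counts) <- list(rownames(m), rownames(m))
counts  # rows 5 and 12 share only one value, so they are not directly similar

# Adjacency under the "more than one shared value" rule.
adj <- counts > 1
diag(adj) <- TRUE  # every row belongs to its own group

# Give each row the smallest label among its neighbours until nothing changes.
label <- idx
repeat {
  new_label <- sapply(idx, function(i) min(label[adj[i, ]]))
  if (identical(new_label, label)) break
  label <- new_label
}
split(rownames(m), label)  # one group: row 3 links rows 5 and 12
```

Whether this chaining is wanted depends on the use case. If only directly similar rows should be merged, connected components are not the right tool.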

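# Split the rows of DF by component, then keep the unique non-NA values of each group.
spl <- split(DF, cl)
L <- lapply(spl, function(x) na.omit(unique(unlist(x))))

# cbind() on ts objects pads the shorter series with NA, which yields a rectangular matrix.
t(do.call("cbind", lapply(L, ts)))

giving: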

  [,1]           [,2]            [,3]             [,4]        [,5]      [,6]
1 "application"  "android"       "ios"            NA          NA        NA
2 "donald_trump" "united_states" "agreement"      "climate"   "paris"   "emission"
3 "donald_trump" "entertainer"   "host"           "president" NA        NA
4 "hen"          "pan"           "united_kingdom" "chicken"   "mustard" "wimp"
5 "husband"      "pamela"        "private_lives"  NA          NA        NA
6 "sex"          "associate"     "pamela"         "partner"   NA        NA
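
The NA padding in this matrix comes from binding time series of different lengths. If that feels too indirect, one possible alternative is to stretch every group to the length of the longest one by indexing past its end. This is a sketch on a small made-up list (L here is an illustrative stand-in for the list built in the answer, and width and res are hypothetical names):

```r
# Illustrative list of groups with different lengths.
L <- list(
  "1" = c("application", "android", "ios"),
  "2" = c("hen", "pan", "united_kingdom", "chicken", "mustard", "wimp")
)

# Indexing past the end of a vector yields NA, so x[seq_len(width)]
# pads each group on the right up to the longest group's length.
width <- max(lengths(L))
res <- do.call(rbind, lapply(L, function(x) x[seq_len(width)]))
res
```

Using rbind on the padded vectors keeps one row per group and carries the list names over as row names.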

Note: The input in reproducible form:

Lines <- " 
New_ment1_1 New_ment1_2  New_ment1_3   New_ment1_4 
1 application  android   ios      NA 
2 donald_trump agreement  climate    united_states 
3 donald_trump agreement  paris    united_states 
4 donald_trump agreement united_states    NA 
5 donald_trump  climate  emission    united_states 
6 donald_trump entertainer  host     president 
7 hen    chicken  mustard     wimp 
8 husband   pamela  private_lives    NA 
9 pan    chicken   hen      wimp 
10 sex   associate  pamela     partner 
11 united_kingdom chicken   hen      wimp 
12 united_states agreement  paris      NA" 

DF <- read.table(text = Lines, header = TRUE, as.is = TRUE) 
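
The input above writes multi-word values with underscores (for example donald_trump), presumably so that read.table can split the columns on whitespace. If the original spacing is wanted in the final result, the underscores can be turned back into spaces afterwards. A minimal sketch on a made-up result matrix (res and out are illustrative names):

```r
# Illustrative result matrix containing underscores and an NA padding cell.
res <- rbind(
  c("donald_trump", "united_states", "agreement"),
  c("husband", "private_lives", NA)
)

# Assigning into out[] keeps the matrix shape; NA cells stay NA.
out <- res
out[] <- gsub("_", " ", res, fixed = TRUE)
out
```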

Update: Fixed the definition of similarity.
