2017-12-02
2

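I asked a question here, how can I group based on similarity in strings, which turned out to be very hard to solve. I came across a good idea that I would like to try. How can I run a function once on every pair?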

Here are my idea and my data (the same data as in that question):

df <-structure(list(label = structure(c(5L, 6L, 7L, 8L, 3L, 1L, 2L, 
    9L, 10L, 4L), .Label = c(" holand", " holandindia", " Holandnorway", 
    " USAargentinabrazil", "Afghanestan ", "Afghanestankabol", "Afghanestankabolindia", 
    "indiaAfghanestan ", "USA", "USAargentina "), class = "factor"), 
     value = structure(c(5L, 4L, 1L, 9L, 7L, 10L, 6L, 3L, 2L, 
     8L), .Label = c("1941029507", "2367321518", "2849255881", 
     "2913128511", "2927576083", "4550996370", "457707181.9", 
     "637943892.6", "796495286.2", "89291651.19"), class = "factor")), .Names = c("label", 
    "value"), class = "data.frame", row.names = c(NA, -10L)) 

1- I count the number of characters in each string in each row.
2- I run adist on each pair.

If the adist output for a pair is close to 1, the two strings belong to the same group; otherwise they belong to two different groups.
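The two steps and the grouping rule can be sketched roughly as follows. This is only an illustration: the four labels are a hypothetical, whitespace-trimmed subset of the data, and `max_dist` is an assumed cutoff rather than a value stated in the question.

```r
# Hypothetical toy labels (whitespace already trimmed), not the full data set
labels <- c("Afghanestan", "Afghanestankabol", "holand", "holandindia")

# Step 1: number of characters in each string
n_chars <- nchar(labels)

# Step 2: edit distance for every pair (symmetric matrix, zeros on the diagonal)
d <- adist(labels)
dimnames(d) <- list(labels, labels)

# Grouping rule: 'max_dist' is an assumed cutoff, not taken from the question
max_dist <- 5
same_group <- d <= max_dist   # TRUE where a pair would share a group
```

With this cutoff, "Afghanestan" and "Afghanestankabol" fall into the same group, while "Afghanestan" and "holand" do not.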

To solve this, I need to know how to run adist on all the strings in the first column of my data.

So my questions are the following:

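1- Is there a function that does the opposite of adist?
2- How can I run adist on all combinations, going from the longest string to the shortest? For example: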

adist("Afghanestankabolindia","Afghanestan") 
adist("Afghanestankabolindia","Afghanestankabol") 
adist("Afghanestankabolindia","indiaAfghanestan") 
adist("Afghanestankabolindia","Holandnorway") 
adist("Afghanestankabolindia","holand") 
adist("Afghanestankabolindia","holandindia") 
. 
. 
. 

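The tricky part is that each pair should be computed only once, measured from a reference string to the other string. For example, it should compute the distance for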

Afghanestankabolindia and Afghanestan 

and not for

Afghanestan and Afghanestankabolindia 

In other words, the reference is always the longer string of the pair.

Answers

1

I'm not sure about your desired output format, but I think this does what you want:

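# Sort the labels longest-first so V1 (the reference) is always the longer string
ref = as.character(df$label)
all_combs = as.data.frame(t(combn(ref[order(nchar(ref), decreasing = TRUE)], 2)), stringsAsFactors = FALSE)
# Edit distance for each (reference, other) pair
all_combs$val = mapply(adist, all_combs$V1, all_combs$V2)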

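First we create all combinations, sorting the ref vector so that the first element of each pair is always the longer one (i.e. the reference). Then we use mapply to compute adist for every combination.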
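To make the ordering step concrete, here is a minimal sketch on three hypothetical strings that are deliberately not in length order. The names `x`, `sorted`, `pairs` and `d_vals` are illustrative only and do not appear in the answer.

```r
# Three hypothetical strings, deliberately not in length order
x <- c("holand", "Afghanestankabolindia", "Afghanestan")

# Longest first: combn() keeps the input order inside each pair,
# so the first element of every pair is the longer string
sorted <- x[order(nchar(x), decreasing = TRUE)]
pairs <- t(combn(sorted, 2))   # one row per unordered pair

# Element-wise adist over the two columns; each pair is visited once
d_vals <- mapply(adist, pairs[, 1], pairs[, 2], USE.NAMES = FALSE)
```

Because `combn` never emits both (a, b) and (b, a), sorting beforehand is enough to guarantee that the reference comes first in each pair.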

Output for the question's data:

                      V1                 V2 val
 1 Afghanestankabolindia USAargentinabrazil  15
 2 Afghanestankabolindia   indiaAfghanestan  15
 3 Afghanestankabolindia   Afghanestankabol   5
 4 Afghanestankabolindia       Holandnorway  17
 5 Afghanestankabolindia       USAargentina  17
 6 Afghanestankabolindia        Afghanestan  10
 7 Afghanestankabolindia        holandindia  13
 8 Afghanestankabolindia             holand  16
 9 Afghanestankabolindia                USA  21
10    USAargentinabrazil   indiaAfghanestan  16
11    USAargentinabrazil   Afghanestankabol  13
12    USAargentinabrazil       Holandnorway  14
13    USAargentinabrazil       USAargentina   7
14    USAargentinabrazil        Afghanestan  15
15    USAargentinabrazil        holandindia  13
16    USAargentinabrazil             holand  16
17    USAargentinabrazil                USA  16
18      indiaAfghanestan   Afghanestankabol  10
19      indiaAfghanestan       Holandnorway  14
...
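As a possible alternative, `adist` called with a single vector returns the full pairwise distance matrix, so the same table can be read off its upper triangle. The sketch below uses a hypothetical three-label subset. It also assumes the leading and trailing spaces seen in some labels of the data are noise, and removes them with `trimws` before measuring, since `adist` counts spaces as characters.

```r
# Hypothetical subset of the labels, including stray leading/trailing spaces
raw <- c(" holand", "Afghanestankabolindia", "Afghanestan ")

# Assumption: the padding is noise, so strip it before measuring
ref <- trimws(raw)
ref <- ref[order(nchar(ref), decreasing = TRUE)]   # longest first

d <- adist(ref)                              # symmetric distance matrix
idx <- which(upper.tri(d), arr.ind = TRUE)   # row < col: each pair once

# Row index < column index means V1 is never shorter than V2
res <- data.frame(V1 = ref[idx[, "row"]], V2 = ref[idx[, "col"]],
                  val = d[idx], stringsAsFactors = FALSE)
```

This computes every distance twice internally (the matrix is symmetric), so it trades some redundant work for a single vectorised call.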

Hope this helps!

+1

Thank you very much, I like your answer and have accepted it.