2017-10-16

R: best match comparison

I have a long-format data frame that looks like this:

df <- structure(list(Date =c("2011-01", "2011-08", "2012-03", "2011-01", "2011-08", "2011-01", "2011-08", "2011-01", "2011-08", 
        "2011-01", "2011-08", "2012-03", "2011-01", "2011-08", "2011-01", "2011-08", "2011-01", "2011-08", 
        "2011-01", "2011-08", "2012-03", "2011-01", "2011-08", "2011-01", "2011-08", "2011-01", "2011-08"), 
    Part=c("A", "A", "A", "A", "A", "A", "A", "A", "A", 
      "B", "B", "B", "B", "B", "B", "B", "B", "B", 
      "C", "C", "C", "C", "C", "C", "C", "C", "C"), 
    method=c("Type1","Type1","Type1","Type2","Type2","Type3","Type3","Type4","Type4", 
       "Type1","Type1","Type1","Type2","Type2","Type3","Type3","Type4","Type4", 
       "Type1","Type1","Type1","Type2","Type2","Type3","Type3","Type4","Type4"), 
    value= c(4L, 46L, 43L, 9L, 8L, 46L, 63L, 84L, 2L, 5L, 78L, 2L, 89L, 2L, 6L, 62L, 25L, 46L, 3L, 4L, 7L, 24L, 13L, 21L, 19L, 8L, 3L)), 
    class= "data.frame", row.names=c(NA, -27L)) 

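I would like to create another column called BestMethod. It should list the method(s) whose value is closest to the Type3 value for the same Part and Date.

For example, for Part A in 2011-01, Types 1, 2 and 3 were applied, and Type 1 was closest to Type 3, so BestMethod would be Type1. Otherwise, if not all 3 types were applied, I would put NA.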

(In Excel, it might look something like this:

=INDEX(C2:F2, MATCH(MIN(ABS(C2:F2-B2)), ABS(C2:F2-B2),0)) 

and then something like this:

=IF(B2="", "NA", INDEX($C$1:$F$1,1,(MATCH(H2,C2:F2,0))))) 

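Then I would like to create another column called FinalMethod, holding the type that is listed most often as BestMethod. It should be repeated across all dates for each Part.

For example: for Part A in 2011-01 and 2011-02, Type 1 is the better match, but in 2011-03 Type 2 is the better match. In that case I would want Type1 as the FinalMethod for all dates of that Part.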
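The "most often listed" rule can be sketched in base R with `table()`. This is only an illustration: the vector `best` is hypothetical data mirroring the example above (Type1 wins on two dates, Type2 on one), not values from the real data frame.

```r
# Hypothetical BestMethod results for one Part across three dates
best <- c("Type1", "Type1", "Type2")

# Frequency of each method among the per-date winners
counts <- table(best)

# Keep every method that reaches the maximum count (ties are preserved)
final <- names(counts)[counts == max(counts)]
final
```

If two methods win equally often, `final` holds both names, so a rule for breaking or combining ties still has to be chosen.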

I have tried the following:

which(abs(x-your.number)==min(abs(x-your.number))) 

but I am having trouble referencing the right data values and running it across every row.
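For reference, here is how that idiom behaves on a single Part/Date group. The numbers are the Part A, 2011-01 values from the sample data; the names `candidates` and `reference` are just illustrative stand-ins for `x` and `your.number`.

```r
# Values for Part A, 2011-01, taken from the sample data
candidates <- c(Type1 = 4, Type2 = 9, Type4 = 84)
reference  <- 46  # the Type3 value for the same Part and Date

# Absolute distance of every candidate from the reference
distance <- abs(candidates - reference)

# which() returns every position that attains the minimum, so ties are kept
closest <- names(candidates)[which(distance == min(distance))]
closest
```

The open problem is then only how to repeat this per Part and Date, which is what a grouped operation does.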

Thanks.

Desired output:

df <- structure(list(Date =c("2011-01", "2011-08", "2012-03", "2011-01", "2011-08", "2011-01", "2011-08", "2011-01", "2011-08", 
        "2011-01", "2011-08", "2012-03", "2011-01", "2011-08", "2011-01", "2011-08", "2011-01", "2011-08", 
        "2011-01", "2011-08", "2012-03", "2011-01", "2011-08", "2011-01", "2011-08", "2011-01", "2011-08"), 
    Part=c("A", "A", "A", "A", "A", "A", "A", "A", "A", 
      "B", "B", "B", "B", "B", "B", "B", "B", "B", 
      "C", "C", "C", "C", "C", "C", "C", "C", "C"), 
    method=c("Type1","Type1","Type1","Type2","Type2","Type3","Type3","Type4","Type4", 
       "Type1","Type1","Type1","Type2","Type2","Type3","Type3","Type4","Type4", 
       "Type1","Type1","Type1","Type2","Type2","Type3","Type3","Type4","Type4"), 
    value= c(4L, 46L, 43L, 9L, 8L, 46L, 63L, 84L, 2L, 5L, 78L, 2L, 89L, 2L, 6L, 62L, 25L, 46L, 3L, 4L, 7L, 24L, 13L, 21L, 19L, 8L, 3L), 
    BestModel=c("Type2", "Type1", "NA", "Type2", "Type1", "Type2", "Type1", "Type2", "Type1", 
       "Type1", "Type1Type4", "NA", "Type1", "Type1Type4", "Type1", "Type1Type4","Type1", "Type1Type4", 
       "Type2", "Type2", "NA", "Type2", "Type2", "Type2", "Type2", "Type2", "Type2"), 
    FinalModel= c("Type1Type2", "Type1Type2","Type1Type2", "Type1Type2","Type1Type2", "Type1Type2","Type1Type2","Type1Type2","Type1Type2", 
        "Type1", "Type1", "Type1", "Type1", "Type1", "Type1","Type1", "Type1", "Type1", 
        "Type2", "Type2","Type2", "Type2", "Type2", "Type2","Type2", "Type2", "Type2")), 
    class= "data.frame", row.names=c(NA, -27L)) 
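The question is unclear to me. What does "closest to Type3" mean? Type3 applies only to the dates 2013-08 and 2013-09 for A, B and C, while the other two types do not. In the example, only Type1 exists for the 2011-01 date. Could you make the example clearer? – missuse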

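Hi! Thanks for noticing this. I changed the dates so that they overlap. Where there is no Type3, I would like to default to NA. Example: for Part A, if Type1 is closest to Type3 on 2011-01-01, print Type1 under the BestMethod column. If there is no Type3, print NA under BestMethod. Second part: for Part A, if the total number of Type1 across all dates is greater than the total number of Type2, print Type1 under FinalMethod. Thanks. – flightless13wings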

Can you add the desired output? – Marcelo

1 Answer

Not a very elegant solution, but it works:

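library(dplyr) 
library(tidyr) 

# For each Part/Date, keep the non-Type3 method(s) closest to the Type3 value 
temp = df %>% 
    group_by(Part, Date) %>% 
    # copy the group's Type3 value onto every row of that group 
    mutate(value.x = ifelse(method == "Type3", value, NA)) %>% 
    fill(value.x, .direction = "up") %>% 
    fill(value.x) %>% 
    mutate(difference = abs(value.x - value)) %>% 
    filter(method != "Type3") %>% 
    # groups without a Type3 have NA differences and drop out here 
    filter(difference == min(difference)) 

# One row per Part/Date; tied methods are pasted together 
BestMethod = temp %>% 
    summarize(BestMethod = paste(method, collapse = " ")) 

# Method(s) that win most often within each Part 
FinalMethod = temp %>% 
    group_by(Part, method) %>% 
    summarize(count = n()) %>% 
    filter(count == max(count)) %>% 
    rename(FinalMethod = method) 

# Part/Date groups without a match get NA for BestMethod 
df %>% 
    full_join(BestMethod, by = c("Part", "Date")) %>% 
    full_join(FinalMethod, by = "Part") %>% 
    select(-count) %>% 
    arrange(Part, Date) 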

Result:

      Date Part method value  BestMethod FinalMethod
1  2011-01    A  Type1     4       Type2       Type1
2  2011-01    A  Type1     4       Type2       Type2
3  2011-01    A  Type2     9       Type2       Type1
4  2011-01    A  Type2     9       Type2       Type2
5  2011-01    A  Type3    46       Type2       Type1
6  2011-01    A  Type3    46       Type2       Type2
7  2011-01    A  Type4    84       Type2       Type1
8  2011-01    A  Type4    84       Type2       Type2
9  2011-08    A  Type1    46       Type1       Type1
10 2011-08    A  Type1    46       Type1       Type2
11 2011-08    A  Type2     8       Type1       Type1
12 2011-08    A  Type2     8       Type1       Type2
13 2011-08    A  Type3    63       Type1       Type1
14 2011-08    A  Type3    63       Type1       Type2
15 2011-08    A  Type4     2       Type1       Type1
16 2011-08    A  Type4     2       Type1       Type2
17 2012-03    A  Type1    43        <NA>       Type1
18 2012-03    A  Type1    43        <NA>       Type2
19 2011-01    B  Type1     5       Type1       Type1
20 2011-01    B  Type2    89       Type1       Type1
21 2011-01    B  Type3     6       Type1       Type1
22 2011-01    B  Type4    25       Type1       Type1
23 2011-08    B  Type1    78 Type1 Type4       Type1
24 2011-08    B  Type2     2 Type1 Type4       Type1
25 2011-08    B  Type3    62 Type1 Type4       Type1
26 2011-08    B  Type4    46 Type1 Type4       Type1
27 2012-03    B  Type1     2        <NA>       Type1
28 2011-01    C  Type1     3       Type2       Type2
29 2011-01    C  Type2    24       Type2       Type2
30 2011-01    C  Type3    21       Type2       Type2
31 2011-01    C  Type4     8       Type2       Type2
32 2011-08    C  Type1     4       Type2       Type2
33 2011-08    C  Type2    13       Type2       Type2
34 2011-08    C  Type3    19       Type2       Type2
35 2011-08    C  Type4     3       Type2       Type2
36 2012-03    C  Type1     7        <NA>       Type2
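Note that the Part A rows appear twice in this result because Type1 and Type2 tie for FinalMethod, so the join matches each row to both. If one row per original observation is preferred, the tied methods could be pasted into a single string before joining. A minimal sketch, assuming a data frame shaped like `temp` (`temp_demo` below is a hypothetical stand-in, not the real intermediate result):

```r
library(dplyr)

# Hypothetical stand-in for `temp`: one row per winning method per Part/Date
temp_demo <- data.frame(
  Part   = c("A", "A", "B", "B", "B"),
  Date   = c("2011-01", "2011-08", "2011-01", "2011-08", "2011-08"),
  method = c("Type2", "Type1", "Type1", "Type1", "Type4")
)

final_method <- temp_demo %>%
  count(Part, method, name = "count") %>%   # number of wins per Part and method
  group_by(Part) %>%
  filter(count == max(count)) %>%           # keep the top method(s) in each Part
  summarize(FinalMethod = paste(method, collapse = " "))  # one row per Part

final_method
```

Because `final_method` has exactly one row per Part, joining it back by Part does not multiply rows.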
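Thank you for the solution. Is there a way to do this for more than 3 types of methods? My actual data has four types of methods, and I am not sure how to do this without using ifelse statements. – flightless13wings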

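@flightless13wings In that case, would you be comparing against "Type3" or "Type4"? Apart from the reference Type, the code should not need to change for your case. If my code does not work for you, you should post the actual data. – useR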
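To make the "only the reference Type changes" point concrete, the reference can be passed as an argument. This is a rough sketch rather than the answer's own code: `best_by_reference()`, `ref` and `demo` are hypothetical names, and `match()` is swapped in for the `ifelse()` + `fill()` step.

```r
library(dplyr)

# Hypothetical helper: returns the closest method(s) to `ref` per Part/Date
best_by_reference <- function(data, ref = "Type3") {
  data %>%
    group_by(Part, Date) %>%
    # match() gives NA when `ref` is absent, so ref_value is NA for that group
    mutate(ref_value = value[match(ref, method)]) %>%
    filter(method != ref) %>%
    mutate(difference = abs(value - ref_value)) %>%
    # groups without a reference value are dropped here
    filter(!is.na(difference), difference == min(difference)) %>%
    summarize(BestMethod = paste(method, collapse = " "), .groups = "drop")
}

# Small demo using the Part A values from the question's sample data
demo <- data.frame(
  Date   = c(rep("2011-01", 4), "2012-03"),
  Part   = "A",
  method = c("Type1", "Type2", "Type3", "Type4", "Type1"),
  value  = c(4, 9, 46, 84, 43)
)

res_type3 <- best_by_reference(demo)                 # compare against Type3
res_type4 <- best_by_reference(demo, ref = "Type4")  # compare against Type4
```

Any number of method types works unchanged, since every non-reference row in the group is compared against the single reference value.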

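I edited the data to include a fourth method. The logic is the same as in your solution; I am now comparing Type1, 2 and 4 against Type3, and then selecting the final method. I did not post the actual file because it is too large and contains some confidential information. Hopefully this sample data mimics mine better. Thanks again for your help. – flightless13wings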