2016-02-16 130 views
0

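R: using vectorization to check whether columns exist in a data frame

I have defined a function, CheckFullCohorts, that checks whether a data frame contains several columns and adds any that are missing. It is called like this: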

test <- data.frame(age.16.20 = rep("x", 5), lorem = rep("y", 5)) 

test <- CheckFullCohorts(test) 

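Question: how can I make the hard-coded part of the function (the repeated df <- foo(...) calls) more flexible by using a vector of column names to check? This is the current definition: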

CheckFullCohorts <- function(df) { 
    # Checks if year/cohort df contains all necessary columns 
    # Args: 
    # df: year/cohort df 

    # Return: 
    # df: df, corrected if necessary 

    foo <- function(mydf, mystring) {
        # Add the column, filled with 0, if it is not already present
        if (!(mystring %in% names(mydf))) {
            mydf[mystring] <- 0
        }
        mydf
    }

    df <- foo(df, "age.16.20") 
    df <- foo(df, "age.21.24") 
    df <- foo(df, "age.25.49") 
    df <- foo(df, "age.50.57") 
    df <- foo(df, "age.58.65") 
    df <- foo(df, "age.66.70") 

    df 
} 

The call to CheckFullCohorts(test) above shows how I use this function.

I have tried the following:

CheckFullCohorts <- function(df, col.list) { 
    # Checks if year/cohort df contains all necessary columns 
    # Args: 
    # df: year/cohort df 
    # col.list: named list of columns 

    # Return: 
    # df: df, corrected if necessary 

    foo <- function(mydf, mystring) {
        if (!(mystring %in% names(mydf))) {
            mydf[mystring] <- 0
        }
        mydf
    }

    df <- sapply(df, foo, mystring = col.list) 

    df 
} 

...but I get the wrong result:

test <- data.frame(age.16.20 = rep("x", 5), lorem = rep("y", 5)) 
test <- CheckFullCohorts(test, c("age.16.20", "age.20.25")) 

Warning messages: 
1: In if (!(mystring %in% names(mydf))) { : 
    the condition has length > 1 and only the first element will be used 
2: In `[<-.factor`(`*tmp*`, mystring, value = 0) : 
    invalid factor level, NA generated 
3: In if (!(mystring %in% names(mydf))) { : 
    the condition has length > 1 and only the first element will be used 
4: In `[<-.factor`(`*tmp*`, mystring, value = 0) : 
    invalid factor level, NA generated 
> test 
      age.16.20 lorem 
      "x"  "y" 
      "x"  "y" 
      "x"  "y" 
      "x"  "y" 
      "x"  "y" 
age.16.20 NA  NA 
age.20.25 NA  NA 
+0

How about passing a character vector 'S' to 'CheckFullCohorts', and then replacing the relevant lines with 'for (s in S) { df <- foo(df, s) }'?
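The loop suggested in this comment might look like the sketch below. It assumes the required names arrive as a character vector; the argument name col.names is an illustrative choice and not part of the original function.

```r
# Sketch: loop over the required column names and add each missing one.
# `col.names` is a hypothetical argument name chosen for this example.
CheckFullCohorts <- function(df, col.names) {
  for (s in col.names) {
    if (!(s %in% names(df))) {
      df[s] <- 0  # new column; the single 0 is recycled to nrow(df)
    }
  }
  df
}

test <- data.frame(age.16.20 = rep("x", 5), lorem = rep("y", 5))
test <- CheckFullCohorts(test, c("age.16.20", "age.21.24"))
names(test)
```

Each pass through the loop tests a single name, so the condition inside if() always has length one.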

+0

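Sure, that would work. Does that mean this is one of the cases where a loop is more efficient than a vectorized solution? If so, I would still like to know what I am doing wrong with 'sapply'.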

+1

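Whether the loop is efficient depends on whether the data frame is copied on each iteration, and I don't know whether that is the case here. But talk of loops being inefficient is often exaggerated: is this step the bottleneck in your code? If not, it is not where you should spend energy optimizing. As for 'sapply', good question. I tend to use 'plyr' for this kind of thing, because its interface makes more sense to me. PS: @Roland's answer below also works, and it doesn't even need a function!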
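On the open 'sapply' question: a data frame is a list of columns, so sapply(df, foo, mystring = col.list) calls foo once per existing column. Each time, mydf is bound to a single column vector and mystring to the whole vector of names. That is consistent with both warnings in the question: the condition in if() has length two, and the assignment targets a factor column rather than the data frame. The sketch below makes the iteration visible; the object names are illustrative only.

```r
test <- data.frame(age.16.20 = rep("x", 5), lorem = rep("y", 5))

# sapply() walks over the columns of the data frame,
# not over the names we want to check.
seen <- sapply(test, function(column) class(column)[1])
names(seen)  # one call per existing column

# To iterate over the required names instead, pass that vector
# as the first argument.
required <- c("age.16.20", "age.20.25")
is.missing <- sapply(required, function(s) !(s %in% names(test)))
is.missing
```

Note that in R 4.2.0 and later, an if() condition of length greater than one is an error rather than a warning.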

Answers

2

You can easily vectorize this like so:

test <- data.frame(age.16.20 = rep("x", 5), lorem = rep("y", 5)) 
musthaves <- c("age.16.20", "age.21.24", "age.25.49", 
       "age.50.57", "age.58.65", "age.66.70") 

test[musthaves[!(musthaves %in% names(test))]] <- 0 
#   age.16.20 lorem age.21.24 age.25.49 age.50.57 age.58.65 age.66.70
# 1         x     y         0         0         0         0         0
# 2         x     y         0         0         0         0         0
# 3         x     y         0         0         0         0         0
# 4         x     y         0         0         0         0         0
# 5         x     y         0         0         0         0         0

However, NA values will usually be more appropriate than 0.
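If the check is needed in several places, the one-liner above can be wrapped in a small function that lets the caller choose between NA and 0. This is a sketch and not part of the original answer: the name AddMissingColumns and the fill argument are hypothetical, and setdiff() is used as an equivalent way to pick out the absent names.

```r
# Hypothetical helper: add every column in `musthaves` that `df` lacks,
# filled with `fill` (NA by default, 0 if zeros are wanted).
AddMissingColumns <- function(df, musthaves, fill = NA) {
  missing.cols <- setdiff(musthaves, names(df))
  if (length(missing.cols) > 0) {
    df[missing.cols] <- fill
  }
  df
}

test <- data.frame(age.16.20 = rep("x", 5), lorem = rep("y", 5))
musthaves <- c("age.16.20", "age.21.24", "age.25.49")

with.na   <- AddMissingColumns(test, musthaves)            # new columns are NA
with.zero <- AddMissingColumns(test, musthaves, fill = 0)  # new columns are 0
```

Existing columns are left untouched, and a data frame that already has every required column is returned unchanged.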

+0

'NA' vs. '0' is a good point.

+0

Wow, that is really elegant. In general I agree with the NA comment, but in this particular case 0 is what I am after.