R / Using vectorization to check whether columns exist in a df

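I have defined a function, CheckFullCohorts (listed further down), that checks whether a data frame contains a set of columns and adds any that are missing. Example data and call: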

test <- data.frame(age.16.20 = rep("x", 5), lorem = rep("y", 5)) 

test <- CheckFullCohorts(test)

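Question: how can I make the hard-coded part of the function (the df <- foo(...) calls) more flexible by passing in a vector of column names to check?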

CheckFullCohorts <- function(df) {
    # Checks if year/cohort df contains all necessary columns
    # Args:
    #   df: year/cohort df
    #
    # Return:
    #   df: df, corrected if necessary

    # Adds column `mystring`, filled with 0, if mydf does not have it yet
    foo <- function(mydf, mystring) {
        if (!(mystring %in% names(mydf))) {
            mydf[mystring] <- 0
        }
        mydf
    }

    df <- foo(df, "age.16.20")
    df <- foo(df, "age.21.24")
    df <- foo(df, "age.25.49")
    df <- foo(df, "age.50.57")
    df <- foo(df, "age.58.65")
    df <- foo(df, "age.66.70")

    df
}

The function is used as in the example call above.

I have tried:

CheckFullCohorts <- function(df, col.list) {
    # Checks if year/cohort df contains all necessary columns
    # Args:
    #   df: year/cohort df
    #   col.list: named list of columns
    #
    # Return:
    #   df: df, corrected if necessary

    foo <- function(mydf, mystring) {
        if (!(mystring %in% names(mydf))) {
            mydf[mystring] <- 0
        }
        mydf
    }

    df <- sapply(df, foo, mystring = col.list)

    df
}

...but I get a wrong result:

test <- data.frame(age.16.20 = rep("x", 5), lorem = rep("y", 5)) 
test <- CheckFullCohorts(test, c("age.16.20", "age.20.25")) 

Warning messages: 
1: In if (!(mystring %in% names(mydf))) { : 
    the condition has length > 1 and only the first element will be used 
2: In `[<-.factor`(`*tmp*`, mystring, value = 0) : 
    invalid factor level, NA generated 
3: In if (!(mystring %in% names(mydf))) { : 
    the condition has length > 1 and only the first element will be used 
4: In `[<-.factor`(`*tmp*`, mystring, value = 0) : 
    invalid factor level, NA generated 
> test 
      age.16.20 lorem 
      "x"  "y" 
      "x"  "y" 
      "x"  "y" 
      "x"  "y" 
      "x"  "y" 
age.16.20 NA  NA 
age.20.25 NA  NA

Source

2016-02-16 Timm S.

How about passing a character vector 'S' to 'CheckFullCohorts' and then replacing the relevant lines with 'for (s in S) { df <- foo(df, s) }'?
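The loop suggested in this comment could look roughly like the sketch below. The function name `CheckFullCohortsLoop` and the parameter name `cols` are hypothetical, chosen here only for illustration; `foo` is the helper from the question.

```r
# Hypothetical loop-based variant of CheckFullCohorts.
# `cols` is assumed to be a character vector of required column names.
CheckFullCohortsLoop <- function(df, cols) {
  # Helper from the question: add one column of zeros if it is absent
  foo <- function(mydf, mystring) {
    if (!(mystring %in% names(mydf))) {
      mydf[mystring] <- 0
    }
    mydf
  }

  # Call the helper once per required column name
  for (s in cols) {
    df <- foo(df, s)
  }
  df
}

test <- data.frame(age.16.20 = rep("x", 5), lorem = rep("y", 5))
test <- CheckFullCohortsLoop(test, c("age.16.20", "age.21.24"))
names(test)
```

Only "age.21.24" is added here, because "age.16.20" already exists in the data frame.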

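Sure, that would work. Does this mean this is one of the cases where a loop is more efficient than a vectorized solution? If so, I would still like to know what I did wrong with 'sapply'.

Whether the loop is efficient depends on whether the data frame is copied on each iteration, and I don't know whether that is the case. But claims about loops being inefficient are often exaggerated: is this step the bottleneck in your code? If not, it is not where you should spend energy optimizing. As for 'sapply', good question. I tend to use 'plyr' for these things; its interface makes more sense to me. PS: @Roland's answer below also works and does not even need a function!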
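On the open 'sapply' question, a likely explanation (a sketch, not something stated in the thread) is that `sapply()` treats a data frame as a list of columns. It therefore calls `foo` once per column, and each call receives a plain vector rather than the data frame. That vector has no names, so the `%in%` test returns one FALSE per requested name. This would account for the "condition has length > 1" warning, which is an error from R 4.2.0 onward. The assignment then appends named elements to the column vector, which matches the extra rows in the output above. The snippet below only demonstrates these building blocks; the variable names are made up for illustration.

```r
test <- data.frame(age.16.20 = rep("x", 5), lorem = rep("y", 5))

# sapply() iterates over the columns; the function never sees a data frame
seen <- sapply(test, function(column) is.data.frame(column))
seen

# Each column is an unnamed vector, so names() is NULL for every column
col.names.seen <- lapply(test, names)

# Testing two names against NULL gives a logical vector of length 2,
# which is not a valid single condition for if()
membership <- c("age.16.20", "age.20.25") %in% NULL
membership
```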

You can easily vectorize this like so:

test <- data.frame(age.16.20 = rep("x", 5), lorem = rep("y", 5)) 
musthaves <- c("age.16.20", "age.21.24", "age.25.49", 
       "age.50.57", "age.58.65", "age.66.70") 

test[musthaves[!(musthaves %in% names(test))]] <- 0 
#   age.16.20 lorem age.21.24 age.25.49 age.50.57 age.58.65 age.66.70
# 1         x     y         0         0         0         0         0
# 2         x     y         0         0         0         0         0
# 3         x     y         0         0         0         0         0
# 4         x     y         0         0         0         0         0
# 5         x     y         0         0         0         0         0

However, NA values will usually be more appropriate than 0.
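To connect this back to the question, the one-liner can be wrapped in a function that takes the required names as an argument and lets the caller choose between NA and 0. This is a minimal sketch: the name `AddMissingCols` and the `fill` parameter are hypothetical, and `setdiff()` is used here as an equivalent of the `%in%` subsetting above.

```r
# Hypothetical helper: add every column named in `cols` that `df` lacks.
# `fill` is the value for the new columns (NA by default, 0 on request).
AddMissingCols <- function(df, cols, fill = NA) {
  missing.cols <- setdiff(cols, names(df))
  # Only assign when something is actually missing
  if (length(missing.cols) > 0) {
    df[missing.cols] <- fill
  }
  df
}

test <- data.frame(age.16.20 = rep("x", 5), lorem = rep("y", 5))
musthaves <- c("age.16.20", "age.21.24", "age.25.49")

with.na   <- AddMissingCols(test, musthaves)            # new columns hold NA
with.zero <- AddMissingCols(test, musthaves, fill = 0)  # new columns hold 0
```

Existing columns are left untouched, so calling the helper a second time on its own result changes nothing.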

Source

2016-02-16 15:33:53 Roland

"NA" versus "0" is a good point.

Wow, this is really elegant. In general I agree with the NA comment; in this particular case, 0 is what I want.
