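R: using vectorization to check whether columns exist in a data frame
I have defined the function below (CheckFullCohorts) to check whether a data frame contains a number of columns and, if not, to add them. I call it like this: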
test <- data.frame(age.16.20 = rep("x", 5), lorem = rep("y", 5))
test <- CheckFullCohorts(test)
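Question: how can I make the hard-coded part of the function (the df <- foo(...) calls) more flexible by passing in a vector of column names to check? The current definition is: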
CheckFullCohorts <- function(df) {
  # Checks if year/cohort df contains all necessary columns
  # Args:
  #   df: year/cohort df
  # Return:
  #   df: df, corrected if necessary
  foo <- function(mydf, mystring) {
    if(!(mystring %in% names(mydf))) {
      mydf[mystring] <- 0
    }
    mydf
  }
  df <- foo(df, "age.16.20")
  df <- foo(df, "age.21.24")
  df <- foo(df, "age.25.49")
  df <- foo(df, "age.50.57")
  df <- foo(df, "age.58.65")
  df <- foo(df, "age.66.70")
  df
}
I would use the function as in the call shown at the top. To take the column names as an argument, I have tried:
CheckFullCohorts <- function(df, col.list) {
  # Checks if year/cohort df contains all necessary columns
  # Args:
  #   df: year/cohort df
  #   col.list: named list of columns
  # Return:
  #   df: df, corrected if necessary
  foo <- function(mydf, mystring) {
    if(!(mystring %in% names(mydf))) {
      mydf[mystring] <- 0
    }
    mydf
  }
  df <- sapply(df, foo, mystring = col.list)
  df
}
...but I get the wrong result:
test <- data.frame(age.16.20 = rep("x", 5), lorem = rep("y", 5))
test <- CheckFullCohorts(test, c("age.16.20", "age.20.25"))
Warning messages:
1: In if (!(mystring %in% names(mydf))) { :
the condition has length > 1 and only the first element will be used
2: In `[<-.factor`(`*tmp*`, mystring, value = 0) :
invalid factor level, NA generated
3: In if (!(mystring %in% names(mydf))) { :
the condition has length > 1 and only the first element will be used
4: In `[<-.factor`(`*tmp*`, mystring, value = 0) :
invalid factor level, NA generated
> test
          age.16.20 lorem
          "x"       "y"
          "x"       "y"
          "x"       "y"
          "x"       "y"
          "x"       "y"
age.16.20 NA        NA
age.20.25 NA        NA
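The warnings and the odd matrix above are consistent with how sapply walks a data frame: it treats the data frame as a list of columns, so foo is called once per column and receives a plain vector as mydf, never the whole data frame. A minimal sketch to check this, assuming the same test data (the variable names gets.data.frame and column.has.names are only for illustration):

```r
test <- data.frame(age.16.20 = rep("x", 5), lorem = rep("y", 5))

# sapply() iterates over the columns of a data frame, so the function
# it applies never sees the data frame itself.
gets.data.frame <- sapply(test, is.data.frame)

# A bare column has no names, so `mystring %in% names(mydf)` is FALSE
# for every requested name and foo() always tries to append elements.
column.has.names <- sapply(test, function(column) !is.null(names(column)))

print(gets.data.frame)   # FALSE for both columns
print(column.has.names)  # FALSE for both columns
```

Because col.list has two elements, the if() condition inside foo also has length two, which is what the first warning reports; newer versions of R are stricter about this and may stop with an error instead.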
How about passing a vector of strings 'S' to 'CheckFullCohorts' and then replacing the relevant lines with 'for (s in S) {df <- foo(df, s)}'?
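The loop suggested here could look like the sketch below. The argument name col.names is an assumption, and the column names in the example call are only illustrative:

```r
CheckFullCohorts <- function(df, col.names) {
  # Adds every column listed in col.names that df does not have yet,
  # filled with 0. Existing columns are left untouched.
  foo <- function(mydf, mystring) {
    if (!(mystring %in% names(mydf))) {
      mydf[mystring] <- 0
    }
    mydf
  }
  # One call to foo() per requested column name.
  for (s in col.names) {
    df <- foo(df, s)
  }
  df
}

test <- data.frame(age.16.20 = rep("x", 5), lorem = rep("y", 5))
test <- CheckFullCohorts(test, c("age.16.20", "age.21.24"))
print(names(test))  # "age.16.20" "lorem" "age.21.24"
```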
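Sure, that would work. Does that mean this is one of the cases where a loop is more efficient than a vectorized solution? If so, I would still like to know what I did wrong with 'sapply'.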
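Whether the loop is efficient depends on whether the data frame is copied on each iteration, and I don't know whether that is the case. But talk about loops being inefficient is often exaggerated: is this step the bottleneck in your code? If not, it isn't where you should spend energy optimizing. As for 'sapply', good question - I tend to use 'plyr' for these things, the interface makes more sense to me. PS, @Roland's answer below also works and doesn't need a function!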
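For comparison, one loop-free way to add the missing columns is to compute them with setdiff and assign them in a single step. This is only a sketch and not necessarily what the answer mentioned above does; the function name AddMissingColumns and the fill argument are hypothetical:

```r
AddMissingColumns <- function(df, col.names, fill = 0) {
  # Names that were requested but are not yet columns of df.
  missing.cols <- setdiff(col.names, names(df))
  # Assign all missing columns at once; skip when nothing is missing.
  if (length(missing.cols) > 0) {
    df[missing.cols] <- fill
  }
  df
}

test <- data.frame(age.16.20 = rep("x", 5), lorem = rep("y", 5))
out <- AddMissingColumns(test, c("age.16.20", "age.21.24", "age.25.49"))
print(names(out))  # "age.16.20" "lorem" "age.21.24" "age.25.49"
```

The membership check happens once for the whole vector of names, so no if() is ever evaluated on a condition longer than one element.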