I use prodNA from the missForest package. My function is as follows:
fn.df.add.NA <- function(df, var.name, prop.of.missing) {
  # Copy the selected variable into a one-column buffer data frame
  df.buf <- subset(df, select = c(var.name))
  # Change a random share of the original values to NA; calling
  # missForest::prodNA directly avoids attaching and detaching the package
  df.buf <- missForest::prodNA(x = df.buf, noNA = prop.of.missing)
  df.col.order <- colnames(x = df) # save the column order
  df <- subset(df, select = -c(which(colnames(df) == var.name))) # drop the variable with no NAs
  df <- cbind(df, df.buf) # add the column with NAs
  df <- subset(df, select = df.col.order) # restore the original column order
  return(df)
}
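It changes a random set of observations in the chosen variable to NA, according to the given proportion.
Because prodNA applies NAs across all columns of a data.frame, I use a "buffer" data frame that holds only the target column, so that the function returns the same structure as the input data.frame. Perhaps some readers can suggest a more elegant way.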
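One possible alternative that avoids the buffer is to sample row positions and assign NA to the single column directly in base R. This is only a sketch: add_na_to_column is a hypothetical helper name, not part of missForest, and it assumes the number of NAs should be rounded down with floor(), which may differ slightly from prodNA for some proportions.

```r
# Hypothetical helper (not from missForest): set a random share of one column to NA.
add_na_to_column <- function(df, var.name, prop.of.missing) {
  stopifnot(var.name %in% colnames(df),
            prop.of.missing >= 0, prop.of.missing <= 1)
  n.missing <- floor(nrow(df) * prop.of.missing)  # number of rows to blank out (assumed rounding)
  idx <- sample.int(nrow(df), size = n.missing)   # random row positions, without repeats
  df[idx, var.name] <- NA                         # only the named column is modified
  df                                              # column order is unchanged
}

set.seed(1)
demo <- data.frame(a = runif(100), b = runif(100))
demo <- add_na_to_column(demo, "a", 0.1)
colSums(is.na(demo))  # NA count per column
```

Because the assignment happens in place, there is no need to save and restore the column order.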
In any case, you can run this test:
set.seed(1)
df <- data.frame(a = as.numeric(runif(n = 100, min = 1, max = 100)),
                 b = as.numeric(runif(n = 100, min = 201, max = 300)),
                 c = as.numeric(runif(n = 100, min = 301, max = 400)))
summary(df)
       a                b               c
 Min.   : 2.326   Min.   :202.3   Min.   :303.8
 1st Qu.:32.985   1st Qu.:229.2   1st Qu.:319.8
 Median :49.293   Median :252.3   Median :338.4
 Mean   :52.267   Mean   :252.2   Mean   :344.1
 3rd Qu.:76.952   3rd Qu.:273.3   3rd Qu.:364.0
 Max.   :99.199   Max.   :299.3   Max.   :398.2
df <- fn.df.add.NA(df = df, var.name = "a", prop.of.missing = .1)
df <- fn.df.add.NA(df = df, var.name = "b", prop.of.missing = .2)
df <- fn.df.add.NA(df = df, var.name = "c", prop.of.missing = .3)
summary(df)
       a                b               c
 Min.   : 2.326   Min.   :202.3   Min.   :303.8
 1st Qu.:30.628   1st Qu.:229.2   1st Qu.:319.2
 Median :48.202   Median :252.3   Median :342.2
 Mean   :50.247   Mean   :252.5   Mean   :345.4
 3rd Qu.:71.504   3rd Qu.:273.3   3rd Qu.:369.3
 Max.   :99.199   Max.   :299.3   Max.   :396.2
 NA's   :10       NA's   :20      NA's   :30
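The summary reports NA counts; to read the realized proportion per column directly, colMeans(is.na(...)) can be used. The sketch below builds its own stand-in data frame (named check here purely for illustration) with known NA counts, so it runs without missForest.

```r
# Stand-in data: 100 rows with 10, 20 and 30 NAs placed at random positions
set.seed(1)
check <- data.frame(a = runif(100), b = runif(100), c = runif(100))
check$a[sample.int(100, 10)] <- NA
check$b[sample.int(100, 20)] <- NA
check$c[sample.int(100, 30)] <- NA

# is.na() gives a logical matrix; column means are the share of NAs per column
na.share <- colMeans(is.na(check))
na.share
```

For this stand-in data the shares are 0.1, 0.2 and 0.3, matching the counts of 10, 20 and 30 out of 100 rows.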
Thanks to Chris for editing the code.
See also this graphical representation: http://stackoverflow.com/a/28368161/3871924 – agenis