Fill NA values with the trailing row value times a growth rate?

What is a good way to fill NA values with the previous value times (1 + growth)?

df <- data.frame(year=0:6, 
       price1=c(1.1, 2.1, 3.2, 4.8, NA, NA, NA), 
       price2=c(1.1, 2.1, 3.2, NA, NA, NA, NA)) 
growth <- .02

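In this case, I would like the missing values in price1 to be filled with 4.8*1.02, 4.8*1.02^2 and 4.8*1.02^3. Likewise, I would like the missing values in price2 to be filled with 3.2*1.02, 3.2*1.02^2, 3.2*1.02^3 and 3.2*1.02^4.

I have tried this, but I think it needs to be set up to repeat somehow (apply?):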

library(dplyr) 
df %>% mutate(price1=ifelse(is.na(price1), 
      lag(price1)*(1+growth), price1))

I am not using dplyr for anything else (yet), so something from base R, plyr or similar would be appreciated.
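For readers following along, here is a minimal base-R sketch of why a single vectorised pass fills only the first missing value. The helper `lag1` is a hypothetical stand-in for dplyr's `lag()`, written here only so the example runs without extra packages.

```r
# Hypothetical base-R stand-in for dplyr::lag(): shift the vector right by one.
lag1 <- function(x) c(NA, head(x, -1))

growth <- 0.02
price1 <- c(1.1, 2.1, 3.2, 4.8, NA, NA, NA)

# One vectorised pass, mirroring the attempt above.
one_pass <- ifelse(is.na(price1), lag1(price1) * (1 + growth), price1)
print(one_pass)

# Only position 5 has a non-missing predecessor in the ORIGINAL vector,
# so positions 6 and 7 are still NA after this pass.
```

The lagged vector is built from the original column, not from the values being filled in, which is why the fill has to be repeated or computed cumulatively.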

Source

2015-06-25 Adam Smith

It looks like dplyr can't handle accessing newly assigned lag values. This solution should work even when the NAs are in the middle of a column.

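# Note: apply() returns a matrix, so df is a matrix after this call.
df <- apply(
  df, 2, function(x) {
    if (sum(is.na(x)) == 0) return(x)
    ## updated with optimized portion from @josilber
    r <- rle(is.na(x))
    na.loc <- which(r$values)
    # index of the last non-NA value before each run of NAs, repeated once per NA
    b <- rep(cumsum(r$lengths)[na.loc - 1], r$lengths[na.loc])
    lastValIs <- 1:length(x)
    lastValIs[is.na(x)] <- b
    # grow each NA from its last observed value by the number of steps since then
    x[is.na(x)] <-
      sapply(which(is.na(x)), function(i) {
        x[lastValIs[i]] * (1 + growth)^(i - lastValIs[i])
      })
    x
  })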

Source

2015-06-25 15:16:32 cr1msonB1ade

Thanks! Handling the NAs in the middle is a nice bonus.

You could try a function like this:

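# Walk back up column n of the global df until a non-NA value is found,
# multiplying by (1 + growth) once for each step back.
test <- function(x, n) {
  if (!is.na(df[x, n])) return(df[x, n])
  test(x - 1, n) * (1 + growth)
}

a <- seq_len(nrow(df))

# Column 2 is price1.
unlist(lapply(a, FUN = function(i) test(i, 2)))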

[1] 1.100000 2.100000 3.200000 4.800000 4.896000 4.993920 5.093798

Source

2015-06-25 15:10:08 Batanichek

Assuming there are only trailing NAs:

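NAgrow <- function(x, growth = 0.02) {
    isna <- is.na(x)
    # last observed (non-NA) value
    lastval <- tail(x[!isna], 1)
    # seq_len() yields an empty sequence when there are no NAs
    x[isna] <- lastval * (1 + growth)^seq_len(sum(isna))
    return(x)
}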

If there are interior NA values as well, this gets a little trickier.

Apply to all columns except the first:

df[-1] <- lapply(df[-1],NAgrow) 

## year price1 price2 
## 1 0 1.100000 1.100000 
## 2 1 2.100000 2.100000 
## 3 2 3.200000 3.200000 
## 4 3 4.800000 3.264000 
## 5 4 4.896000 3.329280 
## 6 5 4.993920 3.395866 
## 7 6 5.093798 3.463783
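To make the interior-NA caveat concrete, here is a small sketch on a made-up vector (the input `c(1, NA, 3, NA)` is an assumption for illustration, not data from the question). It restates the function so the block runs on its own.

```r
# Restated trailing-NA fill so this block is self-contained.
NAgrow <- function(x, growth = 0.02) {
  isna <- is.na(x)
  lastval <- tail(x[!isna], 1)                       # last observed value overall
  x[isna] <- lastval * (1 + growth)^seq_len(sum(isna))
  x
}

x <- c(1, NA, 3, NA)   # hypothetical input with one interior NA
res <- NAgrow(x)
print(res)

# The interior NA at position 2 is grown from 3 (the last observed value),
# not from 1 (the value that precedes it), which is not the intended behaviour.
```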

Source

2015-06-25 15:12:58

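And for the dplyr-inclined: 'df %>% mutate_each(funs(NAgrow), -year)' – Frank

@Ben Bolker - thanks again for your help. This works for me, but you are also right that it would cause problems with NAs in the middle.

The following rle-based solution works with NAs in any position and does not rely on a loop to fill in the missing values: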

NAgrow.rle <- function(x) { 
    if (is.na(x[1])) stop("Can't have NA at beginning") 
    r <- rle(is.na(x)) 
    na.loc <- which(r$values) 
    b <- rep(cumsum(r$lengths)[na.loc-1], r$lengths[na.loc]) 
    x[is.na(x)] <- ave(x[b], b, FUN=function(y) y[1]*(1+growth)^seq_along(y)) 
    x 
} 
df[,-1] <- lapply(df[,-1], NAgrow.rle) 
# year price1 price2 
# 1 0 1.100000 1.100000 
# 2 1 2.100000 2.100000 
# 3 2 3.200000 3.200000 
# 4 3 4.800000 3.264000 
# 5 4 4.896000 3.329280 
# 6 5 4.993920 3.395866 
# 7 6 5.093798 3.463783
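The run-length step is the least obvious part, so here is a short walk-through on a made-up vector (the input is an assumption for illustration). It shows how `b` ends up holding, for every NA, the index of the last observed value before it.

```r
growth <- 0.02
x <- c(1.1, NA, NA, 3.2, NA)            # hypothetical input

r <- rle(is.na(x))
print(r$lengths)                        # run lengths: 1 2 1 1
print(r$values)                         # FALSE TRUE FALSE TRUE

na.loc <- which(r$values)               # which runs are NA runs
ends <- cumsum(r$lengths)               # last index of each run
# The run before each NA run ends at the last observed value;
# repeat that index once per NA in the run.
b <- rep(ends[na.loc - 1], r$lengths[na.loc])
print(b)

# Within each group of NAs, grow the anchor value by 1, 2, ... steps.
filled <- x
filled[is.na(x)] <- ave(x[b], b, FUN = function(y) y[1] * (1 + growth)^seq_along(y))
print(filled)
```

Here `b` is `1 1 4`: the first two NAs grow from `x[1]` and the last one grows from `x[4]`.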

I'll throw in two additional solutions that use for loops, one in base R and one in Rcpp:

NAgrow.for <- function(x) { 
    for (i in which(is.na(x))) { 
    x[i] <- x[i-1] * (1+growth) 
    } 
    x 
} 

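library(Rcpp)
cppFunction(
"NumericVector NAgrowRcpp(NumericVector x, double growth) {
    const int n = x.size();
    // clone() makes a deep copy, so the input vector is not modified in place
    NumericVector y = clone(x);
    for (int i = 1; i < n; ++i) {
        if (R_IsNA(x[i])) {
            // grow from the previous value, which may itself have just been filled
            y[i] = (1.0 + growth) * y[i-1];
        }
    }
    return y;
}")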
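As a sketch only, the base R loop above can be given the same leading-NA guard that the rle version has, with the growth rate passed as an argument instead of read from the global environment. The name `NAgrow_loop` and the `growth` argument are assumptions for illustration.

```r
# Hypothetical variant of the base R loop: growth is an argument and a
# leading NA is rejected, because there is no earlier value to grow from.
NAgrow_loop <- function(x, growth = 0.02) {
  if (length(x) > 0 && is.na(x[1])) stop("Can't have NA at beginning")
  # which() is evaluated once, in increasing order, so x[i - 1] has always
  # been filled by the time position i is reached.
  for (i in which(is.na(x))) {
    x[i] <- x[i - 1] * (1 + growth)
  }
  x
}

filled <- NAgrow_loop(c(1, NA, NA, 4, NA))
print(filled)
```

Each NA is grown from its immediate predecessor, so interior runs are handled the same way as trailing ones.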

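The rle-based solutions (crimson and josilber.rle) take about twice as long as the simple for-loop solution (josilber.for), and, as expected, the Rcpp solution is the fastest, running in about 0.002 seconds.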

set.seed(144) 
big.df <- data.frame(ID=1:100000, 
        price1=sample(c(1:10, NA), 100000, replace=TRUE), 
        price2=sample(c(1:10, NA), 100000, replace=TRUE)) 
crimson <- function(df) apply(df[,-1], 2, function(x){ 
    if(sum(is.na(x)) == 0){return(x)} 
    ## updated with optimized portion from @josilber 
    r <- rle(is.na(x)) 
    na.loc <- which(r$values) 
    b <- rep(cumsum(r$lengths)[na.loc-1], r$lengths[na.loc]) 
    lastValIs <- 1:length(x) 
    lastValIs[is.na(x)] <- b 
    x[is.na(x)] <- 
    sapply(which(is.na(x)), function(i){ 
     return(x[lastValIs[i]]*(1 + growth)^(i - lastValIs[i])) 
    }) 
    return(x) 
}) 
ggrothendieck <- function(df) { 
    growthfun <- function(x, y) if (is.na(y)) (1+growth)*x else y 
    lapply(df[,-1], Reduce, f = growthfun, acc = TRUE) 
} 
josilber.rle <- function(df) lapply(df[,-1], NAgrow.rle) 
josilber.for <- function(df) lapply(df[,-1], NAgrow.for) 
josilber.rcpp <- function(df) lapply(df[,-1], NAgrowRcpp, growth=growth) 
library(microbenchmark) 
microbenchmark(crimson(big.df), ggrothendieck(big.df), josilber.rle(big.df), josilber.for(big.df), josilber.rcpp(big.df)) 
# Unit: milliseconds
#                   expr        min         lq       mean     median         uq         max neval
#        crimson(big.df)  98.447546 131.063713 161.494366 152.477661 183.175840  379.643222   100
#  ggrothendieck(big.df) 437.015693 667.760401 822.530745 817.864707 925.974019 1607.352929   100
#   josilber.rle(big.df)  59.678527 115.220519 132.874030 127.476340 151.665657  262.003756   100
#   josilber.for(big.df)  21.076516  57.479169  73.860913  72.959536  84.846912  178.412591   100
#  josilber.rcpp(big.df)   1.248793   1.894723   2.373469   2.190545   2.697246    5.646878   100
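The benchmark measures speed only. A quick sanity check that the base R approaches produce the same values might look like the sketch below; the `fill_*` names are hypothetical, the growth rate is passed as an argument, and the test vector is random data generated in the same style as `big.df`.

```r
fill_rle <- function(x, growth) {
  r <- rle(is.na(x))
  na.loc <- which(r$values)
  b <- rep(cumsum(r$lengths)[na.loc - 1], r$lengths[na.loc])
  x[is.na(x)] <- ave(x[b], b, FUN = function(y) y[1] * (1 + growth)^seq_along(y))
  x
}

fill_loop <- function(x, growth) {
  for (i in which(is.na(x))) x[i] <- x[i - 1] * (1 + growth)
  x
}

fill_reduce <- function(x, growth) {
  step <- function(prev, cur) if (is.na(cur)) (1 + growth) * prev else cur
  Reduce(step, x, accumulate = TRUE)
}

set.seed(144)
x <- as.numeric(sample(c(1:10, NA), 1000, replace = TRUE))
x[1] <- 1          # all three functions need a non-NA starting value

a <- fill_rle(x, 0.02)
b <- fill_loop(x, 0.02)
d <- fill_reduce(x, 0.02)

# all.equal() compares with a numeric tolerance, which matters because the
# rle version uses powers while the other two multiply step by step.
print(isTRUE(all.equal(a, b)))
print(isTRUE(all.equal(b, d)))
```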

Source

2015-06-25 15:55:08 josliber

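This is great! I didn't know about the 'rle' function, and this is a nice application of it. So it looks like the inefficiency in my code came mainly from 'max(which(!is.na(x)))', right? I don't think it is necessarily the looping, since the 'ave' function is essentially running over the same vector (i.e. looping) as my 'sapply'. Does that sound right? – cr1msonB1ade

To test my previous comment, I used your 'b' values and changed my function to include the following two lines: 'lastValIs <- 1:length(x)' and 'lastValIs[is.na(x)] <- b'. Then, instead of computing the 'max(which())' value, I just index into 'lastValIs'. Using the 'rbenchmark' package, my version without the 'ave' call is actually about 30% faster. Let me know if you see something different. – cr1msonB1ade

Very thorough. I will have to give Rcpp another go.

A compact base R solution can be obtained using Reduce: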

growthfun <- function(x, y) if (is.na(y)) (1+growth)*x else y 
replace(df, TRUE, lapply(df, Reduce, f = growthfun, acc = TRUE))

giving:

year price1 price2 
1 0 1.100000 1.100000 
2 1 2.100000 2.100000 
3 2 3.200000 3.200000 
4 3 4.800000 3.264000 
5 4 4.896000 3.329280 
6 5 4.993920 3.395866 
7 6 5.093798 3.463783
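For anyone unfamiliar with `accumulate = TRUE`, here is a short sketch of what Reduce does in this answer: it carries the previous result forward and keeps every intermediate value. The running-sum line is only a warm-up illustration and is not part of the original answer.

```r
# Warm-up: with accumulate = TRUE, Reduce returns every intermediate result.
running_sum <- Reduce(`+`, 1:4, accumulate = TRUE)
print(running_sum)

growth <- 0.02
# x is the previous (already filled) value, y is the current raw value.
growthfun <- function(x, y) if (is.na(y)) (1 + growth) * x else y

price2 <- c(1.1, 2.1, 3.2, NA, NA, NA, NA)
filled <- Reduce(growthfun, price2, accumulate = TRUE)
print(filled)
```

Because the first argument of `growthfun` is the accumulated value, each NA is grown from the value just filled in, which is exactly what a single `lag()` pass cannot do.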

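Note: The data in the question has no non-trailing NA values, but if there were some, we could use na.fill from zoo to first replace the trailing NAs with a special value, such as NaN, and then look for that value instead of NA: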

library(zoo) 

DF <- as.data.frame(na.fill(df, c(NA, NA, NaN))) 
growthfun <- function(x, y) if (is.nan(y)) (1+growth)*x else y 
replace(DF, TRUE, lapply(DF, Reduce, f = growthfun, acc = TRUE))
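If zoo is not available, the same intent (grow only the trailing NAs and leave interior ones untouched) can be sketched in base R. This is a different technique from the na.fill approach above, and the function name `fill_trailing` is hypothetical.

```r
# Hypothetical base-R helper: fill only the NAs after the last observed value.
fill_trailing <- function(x, growth = 0.02) {
  ok <- which(!is.na(x))
  if (length(ok) == 0L) return(x)        # nothing to grow from
  last <- max(ok)                        # position of the last observed value
  n_tail <- length(x) - last             # number of trailing NAs
  if (n_tail > 0L) {
    x[last + seq_len(n_tail)] <- x[last] * (1 + growth)^seq_len(n_tail)
  }
  x
}

res <- fill_trailing(c(1, NA, 3, NA, NA))
print(res)   # the interior NA at position 2 is left as NA
```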

Source

2015-06-25 16:15:14
