2017-07-06

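R: Estimating weighted quantiles by group, with assignment

I am trying to compute, for every observation within each group, the quantile (0 to 100) of a continuous variable (call it 'value'), and to store that quantile in a new variable.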

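In other words, each row is an observation and each observation belongs to one group. Every group has more than two observations. Within each group I need to estimate the distribution of value using the sampling weights in my data, determine the percentile at which each observation falls in that distribution, and then add that percentile to the data frame as a column.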

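As far as I know, the survey package has svyby() and svyquantile(), but the latter returns the value at a specified quantile, not the quantile of a given observed value.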

# Load survey package 
library(survey) 

# Set seed for replication 
set.seed(123) 

# Create data with value, group, weight 
dat <- data.frame(value = 1:6, 
        group = rep(1:3,2), 
        weight = abs(rnorm(6))) 

# Declare survey design (weights must be given as a formula) 
d <- survey::svydesign(id = ~1, data = dat, weights = ~weight) 

# Do something to calculate the quantile and add it to the data 
???? 

This is similar to the following question, except that it is not done by group: Compute quantiles incorporating Sample Design (Survey package)

https://stackoverflow.com/questions/32167390/compute-quantiles-incorporating-sample-design-survey-package/32173435#32173435 or https://stackoverflow.com/questions/24587499#24590340

Sorry, 'quantile_by_stype' is by subgroup, isn't it? I'm confused about why using svyby or subsetting to get the subgroups you want isn't enough. Thanks

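@AnthonyDamico Those do seem to compute quantiles by subgroup, but (a) once that is done they don't add the values back onto the original observations. I ended up using a very hacky approach, which I've added as an answer. If there is a way to speed the process up, I'd be happy to revise it! – user3614648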

Answer

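Here is the solution I put together. The sequence of statements inside mutate() below can be modified to convert the sampling weights into whichever quantiles are of interest. Although this could be done in base R, I use the dplyr package because dplyr::bind_rows() conveniently fills in NA for columns that are missing from one of the two data frames being combined.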

# Load dplyr (also provides the %>% pipe used below) 
library(dplyr) 

# Set seed for replication 
set.seed(123) 

# Create data with value, group, weight 
dat <- data.frame(value = 1:6, 
        group = rep(1:3,2), 
        weight = abs(rnorm(6))) 

# Initialize list for storing group results 
# Setting the length of the list is quicker than 
# creating an empty list and growing it 
quantile_list <- vector("list", length(unique(dat$group))) 

# Initialize variable to indicate initial iteration 
iteration <- 0 

# estimate the decile of each respondent 
# in a large for-loop 

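for(g in unique(dat$group)) { 

    # Keep only observations for the current group. The loop variable is 
    # named g because filter(group == group) would compare the column 
    # with itself and keep every row 
    temp <- dat %>% dplyr::filter(group == g) 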

    # Create subset with missing values 
    temp_missing <- temp %>% dplyr::filter(is.na(value)) 

    # Create subset without missing values 
    temp_nonmissing <- temp %>% dplyr::filter(!is.na(value)) 

    # Sort observations with value on value, calculate cumulative 
    # sum of sampling weights, create variable indicating the decile 
    # of responses. 1 = lowest, 10 = highest 
    temp_nonmissing <- temp_nonmissing %>% 
          dplyr::arrange(value) %>% 
          dplyr::mutate(cumulative_weight = cumsum(weight), 
              cumulative_weight_prop = cumulative_weight/sum(weight), 
              decile = dplyr::case_when(cumulative_weight_prop < 0.10 ~ 1, 
              cumulative_weight_prop >= 0.10 & cumulative_weight_prop < 0.20 ~ 2, 
              cumulative_weight_prop >= 0.20 & cumulative_weight_prop < 0.30 ~ 3, 
              cumulative_weight_prop >= 0.30 & cumulative_weight_prop < 0.40 ~ 4, 
              cumulative_weight_prop >= 0.40 & cumulative_weight_prop < 0.50 ~ 5, 
              cumulative_weight_prop >= 0.50 & cumulative_weight_prop < 0.60 ~ 6, 
              cumulative_weight_prop >= 0.60 & cumulative_weight_prop < 0.70 ~ 7, 
              cumulative_weight_prop >= 0.70 & cumulative_weight_prop < 0.80 ~ 8, 
              cumulative_weight_prop >= 0.80 & cumulative_weight_prop < 0.90 ~ 9 , 
              cumulative_weight_prop >= 0.90 ~ 10)) 

    # Increment the iteration of the for loop 
    iteration <- iteration + 1 

    # Join the data with missing values and the data without 
    # missing values on the value variable into 
    # a single data frame 
    quantile_list[[iteration]] <- dplyr::bind_rows(temp_nonmissing, temp_missing) 
    } 

# Convert the list of data frames into a single dataframe 
out <- dplyr::bind_rows(quantile_list) 

# Show outcome 
head(out)
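
The same cumulative-weight idea can also be written without the explicit loop. What follows is only a minimal sketch, not a survey-package method: it assumes `value` has no missing entries, it defines an observation's percentile as the share of its group's weight at or below it (ties in `value` are not merged), and the column name `pctile` is made up for illustration.

```r
library(dplyr)

# Same toy data as above
set.seed(123)
dat <- data.frame(value  = 1:6,
                  group  = rep(1:3, 2),
                  weight = abs(rnorm(6)))

out <- dat %>%
  group_by(group) %>%
  # sort by value within each group so that cumsum() runs in value order
  arrange(value, .by_group = TRUE) %>%
  mutate(
    # share of the group's total weight at or below this observation
    cumulative_weight_prop = cumsum(weight) / sum(weight),
    # assumed definition: weighted percentile on a 0-100 scale
    pctile = 100 * cumulative_weight_prop
  ) %>%
  ungroup()

head(out)
```

Because the grouping is handled by group_by(), no list of per-group results has to be built and bound together; deciles or other bins could be derived from `cumulative_weight_prop` afterwards, as in the case_when() call above.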