2016-06-17

How can I create new observations from the sums of new groups? I have the following data frame:

gender age population 
H  0-4 5 
H  5-9 5 
H  10-14 10 
H  15-19 15 
H  20-24 15 
H  25-29 10 
M  0-4 0 
M  5-9 5 
M  10-14 5 
M  15-19 15 
M  20-24 10 
M  25-29 15 

I need to regroup the age ranges to get the following data frame:

gender age population 
H  0-14 20 
H  15-19 15 
H  20-29 25 
M  0-14 10 
M  15-19 15 
M  20-29 25 

I prefer dplyr, so I would appreciate a way to do this with that package.

Answers


Split the age string with tidyr::separate(), then bin the lower bound with cut():

library(dplyr) 
library(tidyr) 

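df1 %>% 
    # split "0-4" into integer columns age1 = 0 (lower bound) and age2 = 4 (upper bound)
    separate(age, into = c("age1", "age2"), sep = "-", convert = TRUE) %>% 
    # bin on the lower bound: [0,14], (14,19], (19,29]
    mutate(age = cut(age1, 
        breaks = c(0, 14, 19, 29), 
        labels = c("0-14", "15-19", "20-29"), 
        include.lowest = TRUE)) %>% 
    group_by(gender, age) %>% 
    summarise(population = sum(population)) 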

# output 
# gender age population 
# (fctr) (fctr)  (int) 
# 1  H 0-14   20 
# 2  H 15-19   15 
# 3  H 20-29   25 
# 4  M 0-14   10 
# 5  M 15-19   15 
# 6  M 20-29   25 
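
Why include.lowest = TRUE matters here can be seen in isolation. This is a minimal sketch, not part of the original answer; the vector name lower is made up and simply holds the lower bounds of the age bands from the question.

```r
# lower bounds of the six original age bands
lower <- c(0, 5, 10, 15, 20, 25)

# by default cut() intervals are open on the left, so 0 falls outside (0,14]
no_lowest <- cut(lower, breaks = c(0, 14, 19, 29))

# include.lowest = TRUE closes the first interval: [0,14], (14,19], (19,29]
with_lowest <- cut(lower, breaks = c(0, 14, 19, 29), include.lowest = TRUE)

print(no_lowest)
print(with_lowest)
```

Without include.lowest, the 0-4 band would come out as NA instead of landing in the 0-14 group.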

A data.table solution, where dat is the table:

library(data.table) 
dat <- as.data.table(dat) 
# lower bound of each age band; as.character() guards against age being a factor
dat[ , mn := as.numeric(sapply(strsplit(as.character(age), "-"), "[[", 1))] 
dat[ , age := cut(mn, breaks = c(0, 14, 19, 29), 
       include.lowest = TRUE, 
       labels = c("0-14", "15-19", "20-29"))] 
dat[ , list(population = sum(population)), by = list(gender, age)] 
# gender age population 
# 1:  H 0-14   20 
# 2:  H 15-19   15 
# 3:  H 20-29   25 
# 4:  M 0-14   10 
# 5:  M 15-19   15 
# 6:  M 20-29   25
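
Since the question asks for dplyr, another option that avoids parsing the bounds at all is an explicit lookup from each original band to its new band. This is only a sketch under assumptions: age_map and result are names invented here, df1 is rebuilt from the table in the question, and the mapping has to be extended by hand if more bands exist.

```r
library(dplyr)

# sample data rebuilt from the question's table
df1 <- data.frame(
  gender = rep(c("H", "M"), each = 6),
  age = rep(c("0-4", "5-9", "10-14", "15-19", "20-24", "25-29"), times = 2),
  population = c(5, 5, 10, 15, 15, 10, 0, 5, 5, 15, 10, 15),
  stringsAsFactors = FALSE
)

# hypothetical lookup: original age band -> regrouped band
age_map <- c("0-4" = "0-14", "5-9" = "0-14", "10-14" = "0-14",
             "15-19" = "15-19",
             "20-24" = "20-29", "25-29" = "20-29")

result <- df1 %>%
  # as.character() so the lookup is by name even if age is a factor
  mutate(age = unname(age_map[as.character(age)])) %>%
  group_by(gender, age) %>%
  summarise(population = sum(population)) %>%
  ungroup()

print(result)
```

The lookup is more typing than cut(), but it makes the regrouping explicit and does not depend on the bands being numeric ranges.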