2016-08-03 139 views
0

Summarize a data frame by date and group

I want to summarize a dataset by several different factors. Here is an example of my data:

household<-c("household1","household1","household1","household2","household2","household2","household3","household3","household3") 
date<-c(sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by="day"), 9)) 
value<-c(1:9) 
type<-c("income","water","energy","income","water","energy","income","water","energy") 
df<-data.frame(household,date,value,type) 

    household  date value type 
1 household1 1999-05-10 100 income 
2 household1 1999-05-25 200 water 
3 household1 1999-10-12 300 energy 
4 household2 1999-02-02 400 income 
5 household2 1999-08-20 500 water 
6 household2 1999-02-19 600 energy 
7 household3 1999-07-01 700 income 
8 household3 1999-10-13 800 water 
9 household3 1999-01-01 900 energy 

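I want to summarize the data by month. Ideally, the final dataset would have 12 rows per household (one per month) and one column per category (water, energy, income), each holding the sum of that category's values for the month.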

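I started by adding a column with a short date. My plan was to filter by type, build a separate data frame of summed values for each transaction type, and then merge those data frames into the summarized df. I tried summarizing with ddply, but it aggregated too much and I could not keep the household-level information.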

ddply(df,.(shortdate),summarize,mean_value=mean(value)) 
    shortdate mean_value 
1  14/07 15.88235 
2  14/09 5.00000 
3  14/10 5.00000 
4  14/11 21.81818 
5  14/12 20.00000 
6  15/01 10.00000 
7  15/02 12.50000 
8  15/04 5.00000 

Any help would be greatly appreciated!

+0

Yes, I was just lazy and did not print the full df example. –

+0

Yes, ideally I would have 12 rows per household (unless you can recommend a better way). That matches another df I have from a different source. –

Answers

3

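It sounds like what you are looking for is a pivot table. I like to use reshape::cast for this kind of table. If a given household/year/month combination has more than one value for a given type, the values are summed; if there is only one value, that value is returned. The `sum` argument is not strictly required and is only there to handle such duplicates. If your data is clean, I think you should not need it.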

hh <- c("hh1", "hh1", "hh1", "hh2", "hh2", "hh2", "hh3", "hh3", "hh3") 
date <- c(sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by="day"), 9)) 
value <- c(1:9) 
type <- c("income", "water", "energy", "income", "water", "energy", "income", "water", "energy") 
df <- data.frame(hh, date, value, type) 

# Load lubridate library, add month and year 
library(lubridate) 
df$month <- month(df$date) 
df$year <- year(df$date) 

# Load reshape library, run cast from reshape, creates pivot table 
library(reshape) 
dfNew <- cast(df, hh+year+month~type, value = "value", sum) 

> dfNew 
    hh year month energy income water 
1 hh1 1999  4  3  0  0 
2 hh1 1999 10  0  1  0 
3 hh1 1999 11  0  0  2 
4 hh2 1999  2  0  4  0 
5 hh2 1999  3  6  0  0 
6 hh2 1999  6  0  0  5 
7 hh3 1999  1  9  0  0 
8 hh3 1999  4  0  7  0 
9 hh3 1999  8  0  0  8 
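
If you would rather avoid extra packages, roughly the same pivot can be sketched in base R. This is only an illustration and not part of the answer above: the toy data is made up (two hh1 energy entries fall in the same month so the summing is visible), and the `value.` prefix on the result columns is simply what `reshape()` produces.

```r
# Toy data with fixed dates (an assumption for reproducibility; the
# original example draws random dates with sample()).
df <- data.frame(
  hh    = c("hh1", "hh1", "hh1", "hh2"),
  date  = as.Date(c("1999-04-03", "1999-04-20", "1999-10-12", "1999-02-02")),
  value = c(3, 4, 1, 4),
  type  = c("energy", "energy", "income", "income")
)
df$year  <- as.integer(format(df$date, "%Y"))
df$month <- as.integer(format(df$date, "%m"))

# Step 1: sum values per household / year / month / type (long format).
long <- aggregate(value ~ hh + year + month + type, data = df, FUN = sum)

# Step 2: spread the types into columns (wide format).
wide <- reshape(long, idvar = c("hh", "year", "month"),
                timevar = "type", direction = "wide")

# Combinations with no entries come out as NA; show them as 0 instead.
wide[is.na(wide)] <- 0
wide
```

The two hh1 energy entries from April are collapsed into a single cell, which is the case the `sum` argument in `cast` is meant to handle.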
+1

If I am right that your question is really about a pivot table, you may want to say so in the question or tag it accordingly. – JMT2080AD

+0

Yes, this is indeed a pivot table! Thanks for pointing that out. It works perfectly, and I have edited the tags. –

2

Try this:

df$ym<-zoo::as.yearmon(as.Date(df$date), "%y/%m") 
library(dplyr) 
df %>% group_by(ym,type) %>% 
    summarise(mean_value=mean(value)) 

Source: local data frame [9 x 3] 
Groups: ym [?] 

      ym type mean_value 
    <S3: yearmon> <fctr>  <dbl> 
1  jan 1999 income   1 
2  jun 1999 energy   3 
3  jul 1999 energy   6 
4  jul 1999 water   2 
5  ago 1999 income   4 
6  set 1999 energy   9 
7  set 1999 income   7 
8  nov 1999 water   5 
9  dez 1999 water   8 

Edit: wide format:

# dfr is assumed to hold the summarised result of the pipeline above 
reshape2::dcast(dfr, ym ~ type, value.var = "mean_value") 

     ym energy income water 
1 jan 1999  NA  1 NA 
2 jun 1999  3  NA NA 
3 jul 1999  6  NA  2 
4 ago 1999  NA  4 NA 
5 set 1999  9  7 NA 
6 nov 1999  NA  NA  5 
7 dez 1999  NA  NA  8 
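
Note that the grouping above does not include the household, while the question asks for per-household sums. A minimal sketch of one way to keep it, assuming dplyr 1.0 or later and tidyr 1.1 or later (for `.groups` and `pivot_wider()` with a scalar `values_fill`); the toy data and the `ym`, `total` and `monthly` names are made up for illustration:

```r
library(dplyr)
library(tidyr)

# Toy data with fixed dates (an assumption; the question uses random dates).
df <- data.frame(
  household = c("household1", "household1", "household1", "household2"),
  date      = as.Date(c("1999-05-10", "1999-05-25", "1999-10-12", "1999-02-02")),
  value     = c(100, 200, 300, 400),
  type      = c("income", "water", "energy", "income")
)

monthly <- df %>%
  mutate(ym = format(date, "%Y-%m")) %>%            # year-month key as text
  group_by(household, ym, type) %>%                 # keep the household level
  summarise(total = sum(value), .groups = "drop") %>%
  pivot_wider(names_from = type, values_from = total, values_fill = 0)

monthly
```

Here `sum` replaces `mean` because the question asks for monthly totals, and `values_fill = 0` turns missing household/month/type combinations into zeros rather than NA.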
0

If I understand your requirement correctly (from the description in the question), this is what you are looking for:

library(dplyr) 
library(tidyr) 

df %>% mutate(date = lubridate::month(date)) %>% 
    complete(household, date = 1:12) %>% 
    spread(type, value) %>% group_by(household, date) %>% 
    mutate(Total = sum(energy, income, water, na.rm = T)) %>% 
    select(household, Month = date, energy:water, Total) 

#Source: local data frame [36 x 6] 
#Groups: household, Month [36] 
# 
# household Month energy income water Total 
#  <fctr> <dbl> <dbl> <dbl> <dbl> <dbl> 
#1 household1  1  NA  NA NA  0 
#2 household1  2  NA  NA NA  0 
#3 household1  3  NA  NA 200 200 
#4 household1  4  NA  NA NA  0 
#5 household1  5  NA  NA NA  0 
#6 household1  6  NA  NA NA  0 
#7 household1  7  NA  NA NA  0 
#8 household1  8  NA  NA NA  0 
#9 household1  9 300  NA NA 300 
#10 household1 10  NA  NA NA  0 
# ... with 26 more rows 

Note: I used the same df you provided in the question. The only change I made is to the value column, where I used seq(100, 900, 100).

If I got this wrong, let me know and I will delete my answer. If it is correct, I will add an explanation.
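
For what it is worth, the "12 rows per household" step can also be sketched in base R: build every household/month combination and left-join the monthly sums onto it. This is an illustration under assumptions, not the answer's own code: the `wide` table below is a made-up, already-pivoted input, and `grid` and `full` are hypothetical names.

```r
# Made-up monthly sums for two households (already one column per type).
wide <- data.frame(
  household = c("household1", "household2"),
  month     = c(3L, 9L),
  water     = c(200, 0),
  energy    = c(0, 300)
)

# Every household / month combination: 12 rows per household.
grid <- expand.grid(household = unique(wide$household),
                    month = 1:12,
                    stringsAsFactors = FALSE)

# Left join: keep all grid rows, attach sums where they exist.
full <- merge(grid, wide, by = c("household", "month"), all.x = TRUE)

# Months with no entries come out as NA; treat them as 0 before totalling.
full[is.na(full)] <- 0
full$Total <- full$water + full$energy
full
```

Filling with 0 is a design choice: keep the NA values instead if you need to distinguish "no entries this month" from "entries summing to zero".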