2010-12-02 98 views
4

Aggregating or summarizing to get ratios

The following is a toy problem that illustrates my question.

I have a data frame containing a bunch of employees; for each employee it has the name, salary, gender, and state.

aggregate(salary ~ state, data, FUN = mean) # Returns the average salary per state 
aggregate(salary ~ state + gender, data, FUN = mean) # Avg salary per state/gender 

What I really need is the fraction of total salary earned by women in each state.

aggregate(salary ~ state + gender, data, FUN = sum) 

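returns the total salary earned by women (and men) in each state, but what I really need is salary_w/salary_total at the state level. I could write a for loop and so on, but I am wondering whether there is some way to do this with aggregate.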

+0

Related question [here](http://stackoverflow.com/questions/4337170/calculating-subtotals-in-r) – Chase 2010-12-03 02:57:05

Answers

3

Perhaps reshape or reshape2 will help with your task.

Here is a sample script:

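library(reshape2) # from CRAN 

# sample data: 2 states x 2 genders, two observations per cell 
d <- data.frame(expand.grid(state = gl(2, 2), gender = gl(2, 1, labels = c("Men", "Women"))), 
    salary = runif(8)) 

# one row per state, one column per gender, holding the summed salaries 
d2 <- dcast(d, state ~ gender, fun.aggregate = sum, value.var = "salary") 
d2$frac <- d2$Women / (d2$Men + d2$Women) 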
8

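Another approach is to use plyr. ddply() takes a data.frame as input and returns a data.frame as output. The second argument specifies how to split the data frame. The third argument is the function to apply to each piece; here we use summarise to create a new data.frame from the existing one.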

library(plyr) 

#Using the sample data from kohske's answer above 

> ddply(d, .(state), summarise, ratio = sum(salary[gender == "Women"])/sum(salary)) 
    state  ratio 
1  1 0.5789860 
2  2 0.4530224 
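The split-apply-combine idea described above can also be sketched in base R, without plyr. This is only an illustration: the data frame `emp`, its values, and the gender codes "W" and "M" are hypothetical and not from the original post.

```r
# Hypothetical toy data (not from the original post)
emp <- data.frame(
  state  = c("CA", "CA", "CA", "NY", "NY", "NY"),
  gender = c("W", "M", "W", "W", "M", "M"),
  salary = c(50, 100, 50, 30, 40, 30)
)

# Split the rows by state, then compute the women's share within each piece
pieces <- split(emp, emp$state)
ratio  <- sapply(pieces, function(g) sum(g$salary[g$gender == "W"]) / sum(g$salary))
ratio
```

Here `split()` plays the role of the second argument to ddply(), and the anonymous function plays the role of the summarise step; the result is a named vector rather than a data.frame.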
+0

Cool solution! – kohske 2010-12-03 02:51:14

1

It is generally inadvisable to name your dataset "data", so I will alter the problem slightly and name the dataset "dat1".

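 # women's total salary per state; $x extracts the sums, so this returns a vector 
 with(subset(dat1, gender == "Female"), aggregate(salary, list(state = state), sum))$x / 
 # total salary per state; the division uses R's element-wise arithmetic 
 # (assumes every state has at least one woman, so both vectors line up) 
     with(dat1, aggregate(salary, list(state = state), sum))$x 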

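I suspect you are also using attach(); there are good reasons to reconsider that decision, regardless of what you may have read in Crawley.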
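Dividing one aggregate() result by another relies on both results listing the same states in the same order. A minimal sketch that avoids this assumption (and avoids attach()) is to merge the two aggregates by state first. The data frame `emp`, its values, and the gender codes are hypothetical.

```r
# Hypothetical toy data (not from the original post)
emp <- data.frame(
  state  = c("CA", "CA", "CA", "NY", "NY", "NY"),
  gender = c("W", "M", "W", "W", "M", "M"),
  salary = c(50, 100, 50, 30, 40, 30)
)

# Total salary per state for women only, and for everyone
women <- aggregate(salary ~ state, data = subset(emp, gender == "W"), FUN = sum)
total <- aggregate(salary ~ state, data = emp, FUN = sum)

# merge() lines the rows up by state, so the division cannot misalign;
# a state with no women would simply be absent from the result
both <- merge(women, total, by = "state", suffixes = c("_w", "_total"))
both$ratio <- both$salary_w / both$salary_total
both
```

Passing `data =` to aggregate() and subset() keeps every column reference explicit, which is the main reason to prefer this over attach().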

2

The ave function is well suited to problems like this.

Data$ratio <- ave(Data$salary, Data$state, Data$gender, FUN=sum)/
       ave(Data$salary, Data$state, FUN=sum) 
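Note that ave() returns one value per row, so the ratio column holds each row's own gender share within its state. A small sketch on hypothetical data (the data frame `emp` and its values are assumptions, not from the original post) shows how to reduce that to one women's share per state:

```r
# Hypothetical toy data (not from the original post)
emp <- data.frame(
  state  = c("CA", "CA", "CA", "NY", "NY", "NY"),
  gender = c("W", "M", "W", "W", "M", "M"),
  salary = c(50, 100, 50, 30, 40, 30)
)

# Per-row share: (sum for this row's state and gender) / (sum for this row's state)
emp$ratio <- ave(emp$salary, emp$state, emp$gender, FUN = sum) /
             ave(emp$salary, emp$state, FUN = sum)

# Keep one row per state for women only
women_share <- unique(emp[emp$gender == "W", c("state", "ratio")])
women_share
```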
2

Another solution is to use xtabs and prop.table:

prop.table(xtabs(salary ~ state + gender,data),margin=1) 
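To see what this one-liner produces, here is a small worked sketch on hypothetical data (the data frame `emp`, its values, and the gender codes are assumptions). xtabs() sums salary into a state-by-gender table, and prop.table() with margin = 1 divides each row by its row total:

```r
# Hypothetical toy data (not from the original post)
emp <- data.frame(
  state  = c("CA", "CA", "CA", "NY", "NY", "NY"),
  gender = c("W", "M", "W", "W", "M", "M"),
  salary = c(50, 100, 50, 30, 40, 30)
)

# Summed salary per state (rows) and gender (columns)
tab <- xtabs(salary ~ state + gender, data = emp)

# Row-wise proportions: each state's row sums to 1
shares <- prop.table(tab, margin = 1)

# The women's column is the ratio the question asks for
shares[, "W"]
```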
1

Since you want results on a per-state basis, tapply may be what you want.

To illustrate, let's generate some arbitrary data to play with:

set.seed(349) # For replication 
n <- 20000  # Sample size 
gender <- sample(c('M', 'W'), size = n, replace = TRUE) # Random selection of gender 
state <- c('AL','AK','AZ','AR','CA','CO','CT','DE','DC','FL','GA','HI', 
      'ID','IL','IN','IA','KS','KY','LA','ME','MD','MA','MI','MN', 
      'MS','MO','MT','NE','NV','NH','NJ','NM','NY','NC','ND','OH', 
      'OK','OR','PA','RI','SC','SD','TN','TX','UT','VT','VA','WA', 
      'WV','WI','WY')  # All US states 
state <- sample(state, size = n, replace = TRUE) # Random selection of the states 

state_index <- tapply(state, state)  # Just for the data generation part ... 
gender_index <- tapply(gender, gender) 

# Generate salaries 
salary <- runif(length(unique(state)))[state_index]  # Make states different 
salary <- salary + c(.02, -.02)[gender_index]   # Make gender different 
salary <- salary + log(50) + rnorm(n)     # Add mean and error term 
salary <- exp(salary)         # The variable of interest 

You asked what the sum of women's salaries in each state is relative to the sum of all salaries in each state:

salary_w <- tapply(salary[gender == 'W'], state[gender == 'W'], sum) 
salary_total <- tapply(salary, state, sum) 

Or, if it is in a data frame:

salary_w <- with(myData, tapply(salary[gender == 'W'], state[gender == 'W'], sum)) 
salary_total <- with(myData, tapply(salary, state, sum)) 

Then the answer is:

> salary_w/salary_total 
       AK        AL        AR        AZ        CA        CO        CT        DC 
0.4667424 0.4877013 0.4554831 0.4959573 0.5382478 0.5544388 0.5398104 0.4750799 
       DE        FL        GA        HI        IA        ID        IL        IN 
0.4684846 0.5365707 0.5457726 0.4788805 0.5409347 0.4596598 0.4765021 0.4873932 
       KS        KY        LA        MA        MD        ME        MI        MN 
0.5228247 0.4955802 0.5604342 0.5249406 0.4890297 0.4939574 0.4882687 0.5611435 
       MO        MS        MT        NC        ND        NE        NH        NJ 
0.5090843 0.5342312 0.5492702 0.4928284 0.5180169 0.5696885 0.4519603 0.4673822 
       NM        NV        NY        OH        OK        OR        PA        RI 
0.4391634 0.4380065 0.5366625 0.5362918 0.5613301 0.4583937 0.5022793 0.4523672 
       SC        SD        TN        TX        UT        VA        VT        WA 
0.4862358 0.4895377 0.5048047 0.4443220 0.4881062 0.4880047 0.5338397 0.5136393 
       WI        WV        WY 
0.4787588 0.5495602 0.5029816
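The division salary_w/salary_total lines up only when every state contains at least one woman, as it does in the simulated data. A hedged variant, shown here on hypothetical data (the data frame `emp` and its values are assumptions), zeroes out the men's salaries instead of subsetting, so both tapply() calls group over the same set of states:

```r
# Hypothetical toy data; TX deliberately has no women
emp <- data.frame(
  state  = c("CA", "CA", "NY", "NY", "TX"),
  gender = c("W", "M", "W", "M", "M"),
  salary = c(60, 40, 30, 70, 50)
)

# Multiplying by the logical (gender == "W") turns men's salaries into 0,
# so every state appears in the result, even one with no women
salary_w     <- with(emp, tapply(salary * (gender == "W"), state, sum))
salary_total <- with(emp, tapply(salary, state, sum))

ratio <- salary_w / salary_total
ratio
```

A state without women then gets a ratio of 0 rather than silently shifting the other values out of alignment.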