2012-12-14
1

How can I simplify this R script?

I have a large data frame in R that looks roughly like this:

    name     amount  date1       date2       days_out   year
    JEAN      318.5  1971-02-16  1972-11-27   650 days  1971
    GREGORY  1518.5  <NA>        <NA>          NA days  1971
    JOHN      318.5  <NA>        <NA>          NA days  1971
    EDWARD    318.5  <NA>        <NA>          NA days  1971
    WALTER    518.5  1971-07-06  1975-03-14  1347 days  1971
    BARRY    1518.5  1971-11-09  1972-02-09    92 days  1971
    LARRY     518.5  1971-09-08  1972-02-09   154 days  1971
    HARRY     318.5  1971-09-16  1972-02-09   146 days  1971
    GARRY    1018.5  1971-10-26  1972-02-09   106 days  1971

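If someone's days_out is less than 60, they get a 90% discount. For 60-90 days, a 70% discount, and so on. I need to find the total discounted amount for each year. My thoroughly embarrassing workaround was to write a Python script that writes an R script for each relevant year, which reads like this: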

tmp <- members[members$year==1971, ] 
tmp90 <- tmp[tmp$days_out <= 60 & tmp$days_out > 0 & !is.na(tmp$days_out), ] 
tmp70 <- tmp[tmp$days_out <= 90 & tmp$days_out > 60 & !is.na(tmp$days_out), ] 
tmp50 <- tmp[tmp$days_out <= 120 & tmp$days_out > 90 & !is.na(tmp$days_out), ] 
tmp30 <- tmp[tmp$days_out <= 180 & tmp$days_out >120 & !is.na(tmp$days_out), ] 
tmp00 <- tmp[tmp$days_out > 180 | is.na(tmp$days_out), ] 
details.1971 <- c(1971, nrow(tmp), 
    nrow(tmp90), sum(tmp90$amount), sum(tmp90$amount) * .9, 
    nrow(tmp70), sum(tmp70$amount), sum(tmp70$amount) * .7, 
    nrow(tmp50), sum(tmp50$amount), sum(tmp50$amount) * .5, 
    nrow(tmp30), sum(tmp30$amount), sum(tmp90$amount) * .9,  # probably meant sum(tmp30$amount) * .3
    nrow(tmp00), sum(tmp00$amount)) 
membership.for.chart <- rbind(membership.for.chart,details.1971) 

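It works fine. The tmp frames and vectors get overwritten, which is fine. But I know I have completely defeated everything that makes R elegant and efficient. I first launched R a month ago, and I think I have come a long way. But I would really like to know: how should I have gone about this?

Answers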

2

Wow, you wrote a Python script that generates R scripts? Consider my eyebrows raised...

Hopefully this will help you get started:

#Import your data; add dummy column to separate 'days' suffix into its own column 
dat <- read.table(text = " name amount date1  date2 days_out dummy year 
    JEAN 318.5 1971-02-16 1972-11-27 650 days 1971 
GREGORY 1518.5  <NA>  <NA> NA days 1971 
    JOHN 318.5  <NA>  <NA> NA days 1971 
    EDWARD 318.5  <NA>  <NA> NA days 1971 
    WALTER 518.5 1971-07-06 1975-03-14 1347 days 1971 
    BARRY 1518.5 1971-11-09 1972-02-09 92 days 1971 
    LARRY 518.5 1971-09-08 1972-02-09 154 days 1971 
    HARRY 318.5 1971-09-16 1972-02-09 146 days 1971 
    GARRY 1018.5 1971-10-26 1972-02-09 106 days 1971",header = TRUE,sep = "") 

#Repeat 3 times 
df <- rbind(dat,dat,dat) 

#Create new year variable 
df$year <- rep(1971:1973,each = nrow(dat)) 

#Breaks for discount levels 
ct <- c(0,60,90,120,180,Inf) 

#Cut into a factor 
df$fac <- cut(df$days_out,ct) 

#Create discount amounts for each row 
df$discount <- c(0.9,0.7,0.5,0.9,1)[df$fac] 
df$discount[is.na(df$discount)] <- 1 

#Calc adj amount 
df$amount_adj <- with(df,amount * discount) 

#I use plyr a lot, but there are many, many 
# alternatives 
library(plyr) 
ddply(df,.(year),summarise, 
      amt = sum(amount_adj), 
      total = length(year), 
      d60 = length(which(fac == "(0,60]"))) 

I only calculated a few of your summary values in that last ddply command. I assume you can extend it yourself.
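As one possible way to extend the summary, the per-bracket counts and sums can also be computed in base R without plyr. This is only a sketch: the `members` data below is made up for illustration, and the bracket labels (`d60`, `d90`, and so on) are hypothetical names, not anything from the original data.

```r
# Hypothetical sample data (made-up values, not the asker's real table)
members <- data.frame(
  year     = c(1971, 1971, 1971, 1972, 1972),
  amount   = c(100, 200, 300, 400, 500),
  days_out = c(30, 75, NA, 150, 400)
)

# Same breaks as above; the labels are illustrative names for each bracket
breaks <- c(0, 60, 90, 120, 180, Inf)
labels <- c("d60", "d90", "d120", "d180", "over180")
members$bracket <- cut(members$days_out, breaks, labels = labels)

# Number of rows per year and bracket
# (rows whose days_out is NA fall into no bracket and are not counted)
counts <- table(members$year, members$bracket)

# Sum of amount per year and bracket (rows with an NA bracket are dropped)
sums <- xtabs(amount ~ year + bracket, data = members)

counts
sums
```

Note that, unlike the `tmp00` frame in the question, rows with a missing days_out are left out of every bracket here; they would need to be handled separately if they should count as "no discount".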

+1

First I banged my head against the wall and threw R in a Nutshell across the room, though. – Amanda

2

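You can use the cut function or the findInterval function. The exact code will depend on the internals of the object, which the console output does not communicate unambiguously. If days_out is a difftime object, then something like this might work:

disc_amt <- with(tmp, amount * c(.9, .7, .5, .9, 1)[
           findInterval(days_out, c(0, 60, 90, 120, 180, Inf))])

You should post the output of dput() for the tmp object, or perhaps dput(head(tmp, 20)) if it is really large, so that testing can proceed. (The actual discounts do not seem to be ordered the way I would have expected.)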
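One detail worth checking when choosing between the two functions is how each treats values that sit exactly on a break. The sketch below uses made-up day counts to show the default behaviour: cut closes intervals on the right, matching the `<= 60 & > 0` style tests in the question, while findInterval closes them on the left unless `left.open = TRUE` is passed.

```r
breaks <- c(0, 60, 90, 120, 180, Inf)
days   <- c(30, 60, 61, 90)   # made-up values, two of them exactly on a break

# cut() defaults to right-closed intervals: (0,60], (60,90], ...
idx_cut <- as.integer(cut(days, breaks))

# findInterval() defaults to left-closed intervals: [0,60), [60,90), ...
idx_fi <- findInterval(days, breaks)

# left.open = TRUE switches findInterval() to (0,60], (60,90], ...
idx_fi_open <- findInterval(days, breaks, left.open = TRUE)

idx_cut      # 1 1 2 2
idx_fi       # 1 2 2 3
idx_fi_open  # 1 1 2 2
```

So a member with exactly 60 days lands in the first bracket with cut and in the second with the default findInterval, which changes the multiplier that gets looked up.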