2017-10-08

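I have experience using the inner_join function in R on data frames whose column values match between the two data frames. However, I have one data frame containing the monthly mean stock price of each stock for 2007-2014, and another data frame containing each stock's financial ratios for 2007-2014, which also shows the month in which each company's fiscal year ends. The problem is that a company's financial ratios are not reported until it files its 10-K, 3 months later. I therefore want to match each company's financial ratios with the appropriate stock price from 3 months later. How can I merge the two data frames on a column that needs to be shifted forward by 3 months, without using a for loop?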

RatioDF:

Symbol Month Year 10KRatio1 10KRatio2 ... 10KRatioN 
FLWS 6 2007 100  200 ... 1000 
ACAD 12 2007 500  600 ... 2000 

StockPriceDF:

Company Year Month MeanPrice 
FLWS 2007 1  6.32 
    .  . .  . 
    .  . .  . 
    .  . .  . 
FLWS 2007 9  10.995 
    .  . .  . 
    .  . .  . 
    .  . .  . 
FLWS 2014 12 17.92 
    .  . .  . 
ACAD 2007 1  7.5 
    .  . .  . 
    .  . .  . 
    .  . .  . 
ACAD 2008 3  8.64 
    .  . .  . 
    .  . .  . 

DesiredDF:

Symbol Month Year 10KRatio1 10KRatio2 ... 10KRatioN MeanPrice 
FLWS 9 2007 100  200   1000  10.995 
ACAD 3 2008 500  600   2000  8.64 

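I thought of using a for loop that checks whether the month in RatioDF is 10-12 and then matches it with months 1-3 of the following year for the appropriate Symbol/Company, but I think the computation would take too long, because there are many stocks and many monthly prices over these years.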

Answer

Possible solutions using lubridate together with data.table or dplyr.

1) data.table:

# load packages 
library(lubridate) 
library(data.table) 

# convert both data frames to data.tables and add a 'date' variable 
setDT(d1)[, date := as.IDate(sprintf('%s-%02d-01',Year,Month))][] 

# idem + subtract 3 months with lubridate's %m-% operator 
setDT(d2)[, date := as.IDate(sprintf('%s-%02d-01',Year,Month)) %m-% months(3)][] 

# join d1 with d2 and update d1 by reference 
# (the 'i.' prefix makes explicit that MeanPrice is taken from d2) 
d1[d2, on = .(Symbol = Company, date), MeanPrice := i.MeanPrice][] 

which gives:

Symbol Month Year 10KRatio1 10KRatio2  date MeanPrice 
1: FLWS  6 2007  100  200 2007-06-01 10.995 
2: ACAD 12 2007  500  600 2007-12-01  8.640 

An alternative way of joining could be:

d1[d2[, .(Company, date, MeanPrice)], on = .(Symbol = Company, date), nomatch = 0L][] 

2) dplyr:

# load packages 
library(lubridate) 
library(dplyr) 

# add a 'date' variable to 'd1' 
# add a 'date' variable to 'd2' and subtract 3 months 
# from it with lubridate's %m-% operator 
# select only 'Company', 'date' and 'MeanPrice' from 'd2' 
# join 'd1' with 'd2' 

d1 %>% 
    mutate(date = as.Date(sprintf('%s-%02d-01',Year,Month))) %>% 
    left_join(., d2 %>% 
       mutate(date = as.Date(sprintf('%s-%02d-01',Year,Month)) %m-% months(3)) %>% 
       select(Company, date, MeanPrice), 
      by = c('Symbol' = 'Company', 'date')) 

which gives the same result:

Symbol Month Year 10KRatio1 10KRatio2  date MeanPrice 
1 FLWS  6 2007  100  200 2007-06-01 10.995 
2 ACAD 12 2007  500  600 2007-12-01  8.640 
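
Both results above keep the fiscal-year-end month (6 and 12), whereas DesiredDF in the question shows the month the price is taken from (9/2007 and 3/2008). If that layout is wanted, one option is to shift the ratio rows forward with `%m+%` instead of shifting the price rows back. The following is only a sketch of that variation, not part of the solutions above; the toy data is repeated so the snippet runs on its own, and the name `desired` is made up for illustration:

```r
library(dplyr)
library(lubridate)

# same toy data as used in this answer, repeated so the snippet is self-contained
d1 <- data.frame(Symbol = c("FLWS", "ACAD"),
                 Month = c(6L, 12L),
                 Year = c(2007L, 2007L),
                 `10KRatio1` = c(100L, 500L),
                 `10KRatio2` = c(200L, 600L),
                 check.names = FALSE, stringsAsFactors = FALSE)
d2 <- data.frame(Company = c("FLWS", "FLWS", "FLWS", "ACAD", "ACAD"),
                 Year = c(2007L, 2007L, 2014L, 2007L, 2008L),
                 Month = c(1L, 9L, 12L, 1L, 3L),
                 MeanPrice = c(6.32, 10.995, 17.92, 7.5, 8.64),
                 stringsAsFactors = FALSE)

desired <- d1 %>%
  # first day of the fiscal-year-end month, moved 3 months forward
  mutate(date = as.Date(sprintf('%s-%02d-01', Year, Month)) %m+% months(3),
         # overwrite Month/Year with the month the price is taken from
         Month = as.integer(month(date)),
         Year = as.integer(year(date))) %>%
  select(-date) %>%
  # plain equi-join on company, year and month
  left_join(d2, by = c('Symbol' = 'Company', 'Year', 'Month'))

desired
```

On the toy data this pairs FLWS with the 9/2007 price (10.995) and ACAD with the 3/2008 price (8.64), and the year rollover for fiscal years ending in months 10-12 is handled by the date arithmetic rather than by a loop.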

Used data:

d1 <- structure(list(Symbol = c("FLWS", "ACAD"), 
        Month = c(6L, 12L), 
        Year = c(2007L, 2007L), 
        `10KRatio1` = c(100L, 500L), 
        `10KRatio2` = c(200L, 600L)), 
       .Names = c("Symbol", "Month", "Year", "10KRatio1", "10KRatio2"), class = "data.frame", row.names = c(NA, -2L)) 

d2 <- structure(list(Company = c("FLWS", "FLWS", "FLWS", "ACAD", "ACAD"), 
        Year = c(2007L, 2007L, 2014L, 2007L, 2008L), 
        Month = c(1L, 9L, 12L, 1L, 3L), 
        MeanPrice = c(6.32, 10.995, 17.92, 7.5, 8.64)), 
       .Names = c("Company", "Year", "Month", "MeanPrice"), class = "data.frame", row.names = c(NA, -5L))
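
For completeness, the 3-month shift can also be done without any extra packages by turning each year/month pair into a running month counter. This is just a base R sketch under the assumption that `Year` and `Month` are integers, as in the data above; the column name `idx` and the object `res` are hypothetical names chosen for illustration:

```r
# same toy data as above, repeated so the snippet is self-contained
d1 <- data.frame(Symbol = c("FLWS", "ACAD"),
                 Month = c(6L, 12L),
                 Year = c(2007L, 2007L),
                 `10KRatio1` = c(100L, 500L),
                 `10KRatio2` = c(200L, 600L),
                 check.names = FALSE, stringsAsFactors = FALSE)
d2 <- data.frame(Company = c("FLWS", "FLWS", "FLWS", "ACAD", "ACAD"),
                 Year = c(2007L, 2007L, 2014L, 2007L, 2008L),
                 Month = c(1L, 9L, 12L, 1L, 3L),
                 MeanPrice = c(6.32, 10.995, 17.92, 7.5, 8.64),
                 stringsAsFactors = FALSE)

# running month counter: Year * 12 + (Month - 1);
# adding 3 on the ratio side moves it 3 months ahead, across year ends too
d1$idx <- d1$Year * 12L + (d1$Month - 1L) + 3L
d2$idx <- d2$Year * 12L + (d2$Month - 1L)

# left join of the ratios with the prices on company and month counter
res <- merge(d1, d2[, c("Company", "idx", "MeanPrice")],
             by.x = c("Symbol", "idx"), by.y = c("Company", "idx"),
             all.x = TRUE)

# turn the counter back into the month/year the price belongs to
res$Month <- res$idx %% 12L + 1L
res$Year <- res$idx %/% 12L
res$idx <- NULL

res
```

Note that `merge` sorts the result by the join keys by default, so the row order may differ from that of `d1`.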