2016-10-26 306 views

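R: How to create a new data frame from two other data frames

I have two data frames containing related NFL data. One has player names and receiving targets by week (the player DF):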

           Player  Tm Position  1 2  3  4 5  6
1      A.J. Green CIN       WR 13 8 11 12 8 10
2 Aaron Burbridge SFO       WR  0 1  0  2 0  0
3 Aaron Ripkowski GNB       RB  0 0  0  0 0  1
4  Adam Humphries TAM       WR  5 8 12  4 2  0
5    Adam Thielen MIN       WR  5 5  4  3 8  0
6 Adrian Peterson MIN       RB  2 3  0  0 0  0

The other data frame summarizes receiving targets by team for each week (the team DF):

       Tm   `1`   `2`   `3`   `4`   `5`   `6`
   <fctr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1     ARI    37    35    50    45    26    35
2     ATL    38    34    30    37    28    41
3     BAL    32    45    40    51    47    48
4     BUF    22    30    20    33    20    26
5     CAR    31    39    36    47    28    46
6     CHI    28    29    45    36    41    49
7     CIN    30    54    28    31    39    31
8     CLE    26    33    38    38    35    42
9     DAL    43    30    24    32    24    27
10    DEN    26    32    35    31    34    47
# ... with 22 more rows

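What I am trying to do is create another data frame that contains each player's share of his team's targets by week. So I need to match each player's team in the "Tm" column of the player DF against the team DF, week column by week column (1-6).

I have figured out how to do this by merging the two and then creating new columns, but as I add more data (weeks) I have to write more code: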

a <- merge(playertgt, teamtgt, by = "Tm")  # merges the two
a$Wk1 <- a$`1.x` / a$`1.y`
a$Wk2 <- a$`2.x` / a$`2.y`
a$Wk3 <- a$`3.x` / a$`3.y`

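So what I am looking for is a good way to do this that updates automatically, that does not require creating a df with a bunch of columns I don't need, and that picks up new weeks as I add them to my source data.

Apologies if this has been answered elsewhere, but I have been looking for a good way to do this and could not find one. Thanks in advance for your help!

Obviously, I am just using dplyr's ends_with for convenience, to select the columns after the merge is done: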

library(dplyr)
## Do a left outer join to match each player with total team targets
a <- left_join(playertgt, teamtgt, by = "Tm")
## Compute percentage over all weeks selecting player columns ending with ".x"
## and dividing by corresponding team columns ending with ".y"
tgt.pct <- select(a, ends_with(".x")) / select(a, ends_with(".y"))
## set the column names to week + number
colnames(tgt.pct) <- paste0("week", seq_len(ncol(teamtgt) - 1))
## construct the output data frame adding back the player and team columns
tgt.pct <- data.frame(Player = playertgt$Player, Tm = playertgt$Tm, tgt.pct)

Answers


You can do this with dplyr. A base-R way of doing the same column selection, using grepl, is:

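## merge() sorts its result by the key, so the rows of `a` need not follow playertgt's order
a <- merge(playertgt, teamtgt, by = "Tm", all.x = TRUE)
## escape the dot so that only the literal ".x" / ".y" suffixes match
tgt.pct <- subset(a, select = grepl("\\.x$", colnames(a))) / subset(a, select = grepl("\\.y$", colnames(a)))
colnames(tgt.pct) <- paste0("week", seq_len(ncol(teamtgt) - 1))
## take Player and Tm from the merged frame so they stay aligned with the percentages
tgt.pct <- data.frame(Player = a$Player, Tm = a$Tm, tgt.pct)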
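
A variation that sidesteps the ".x"/".y" suffixes entirely, offered as a sketch and not as part of the answer above: look up each player's team row with match() and divide the week columns directly. The data frames `players` and `teams` and their values are made up for illustration, and the sketch assumes both tables use identical week column names.

```r
# Hypothetical miniature data, made up for illustration only
players <- data.frame(
  Player = c("P1", "P2", "P3"),
  Tm     = c("CIN", "MIN", "CIN"),
  X1     = c(13, 5, 2),
  X2     = c(8, 5, 3),
  stringsAsFactors = FALSE
)
teams <- data.frame(
  Tm = c("CIN", "DEN"),
  X1 = c(30, 26),
  X2 = c(54, 32),
  stringsAsFactors = FALSE
)

# Week columns are whatever the team table has besides the key,
# so new weeks are picked up without touching the code
weeks <- setdiff(names(teams), "Tm")

# For every player row, the row index of that player's team (NA if absent)
idx <- match(players$Tm, teams$Tm)

# Element-wise division of two equally shaped data frames
pct <- players[weeks] / teams[idx, weeks]
names(pct) <- paste0("week", seq_along(weeks))

result <- data.frame(players[c("Player", "Tm")], pct, row.names = NULL)
print(result)
```

Because nothing is merged, the rows keep the order of the player table, and players whose team is missing from the team table get NA in every week column.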

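Data: with the limited data posted, only A.J. Green will have his target percentages computed, because his is the only team that appears in both tables: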

playertgt <- structure(list(Player = structure(1:6, .Label = c("A.J. Green", 
"Aaron Burbridge", "Aaron Ripkowski", "Adam Humphries", "Adam Thielen", 
"Adrian Peterson"), class = "factor"), Tm = structure(c(1L, 4L, 
2L, 5L, 3L, 3L), .Label = c("CIN", "GNB", "MIN", "SFO", "TAM" 
), class = "factor"), Position = structure(c(2L, 2L, 1L, 2L, 
2L, 1L), .Label = c("RB", "WR"), class = "factor"), X1 = c(13L, 
0L, 0L, 5L, 5L, 2L), X2 = c(8L, 1L, 0L, 8L, 5L, 3L), X3 = c(11L, 
0L, 0L, 12L, 4L, 0L), X4 = c(12L, 2L, 0L, 4L, 3L, 0L), X5 = c(8L, 
0L, 0L, 2L, 8L, 0L), X6 = c(10L, 0L, 1L, 0L, 0L, 0L)), .Names = c("Player", 
"Tm", "Position", "X1", "X2", "X3", "X4", "X5", "X6"), class = "data.frame", row.names = c(NA, 
-6L)) 
##           Player  Tm Position X1 X2 X3 X4 X5 X6
##1      A.J. Green CIN       WR 13  8 11 12  8 10
##2 Aaron Burbridge SFO       WR  0  1  0  2  0  0
##3 Aaron Ripkowski GNB       RB  0  0  0  0  0  1
##4  Adam Humphries TAM       WR  5  8 12  4  2  0
##5    Adam Thielen MIN       WR  5  5  4  3  8  0
##6 Adrian Peterson MIN       RB  2  3  0  0  0  0

teamtgt <- structure(list(Tm = structure(1:10, .Label = c("ARI", "ATL", 
"BAL", "BUF", "CAR", "CHI", "CIN", "CLE", "DAL", "DEN"), class = "factor"), 
    X1 = c(37L, 38L, 32L, 22L, 31L, 28L, 30L, 26L, 43L, 26L), 
    X2 = c(35L, 34L, 45L, 30L, 39L, 29L, 54L, 33L, 30L, 32L), 
    X3 = c(50L, 30L, 40L, 20L, 36L, 45L, 28L, 38L, 24L, 35L), 
    X4 = c(45L, 37L, 51L, 33L, 47L, 36L, 31L, 38L, 32L, 31L), 
    X5 = c(26L, 28L, 47L, 20L, 28L, 41L, 39L, 35L, 24L, 34L), 
    X6 = c(35L, 41L, 48L, 26L, 46L, 49L, 31L, 42L, 27L, 47L)), .Names = c("Tm", 
"X1", "X2", "X3", "X4", "X5", "X6"), class = "data.frame", row.names = c(NA, 
-10L)) 
##    Tm X1 X2 X3 X4 X5 X6
##1  ARI 37 35 50 45 26 35
##2  ATL 38 34 30 37 28 41
##3  BAL 32 45 40 51 47 48
##4  BUF 22 30 20 33 20 26
##5  CAR 31 39 36 47 28 46
##6  CHI 28 29 45 36 41 49
##7  CIN 30 54 28 31 39 31
##8  CLE 26 33 38 38 35 42
##9  DAL 43 30 24 32 24 27
##10 DEN 26 32 35 31 34 47

Result:

##           Player  Tm     week1     week2     week3     week4     week5     week6
##1      A.J. Green CIN 0.4333333 0.1481481 0.3928571 0.3870968 0.2051282 0.3225806
##2 Aaron Burbridge SFO        NA        NA        NA        NA        NA        NA
##3 Aaron Ripkowski GNB        NA        NA        NA        NA        NA        NA
##4  Adam Humphries TAM        NA        NA        NA        NA        NA        NA
##5    Adam Thielen MIN        NA        NA        NA        NA        NA        NA
##6 Adrian Peterson MIN        NA        NA        NA        NA        NA        NA
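
The NA rows come from teams that are absent from the ten team rows posted. A quick way to list such teams before computing anything is sketched below; the two vectors simply repeat the team codes from the sample data, and the variable names are chosen for illustration.

```r
# Team codes of the six sample players, in row order
player_teams <- c("CIN", "SFO", "GNB", "TAM", "MIN", "MIN")

# Team codes present in the ten-row team sample
team_keys <- c("ARI", "ATL", "BAL", "BUF", "CAR",
               "CHI", "CIN", "CLE", "DAL", "DEN")

# Teams that have players but no totals: their percentages will be NA
missing_teams <- setdiff(player_teams, team_keys)
print(missing_teams)
```

With the full 32-team table this check would presumably come back empty.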

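It would be nice if you provided some data next time; that makes life a lot easier.

I think the key point is your data structure. You should convert your data to long format (the keyword, I think, is tidy data). I made up some data and hope I have understood your question correctly.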

library(tidyr) 
library(dplyr) 


## The values are random, so a rerun will print different numbers.
## data.frame() renames the columns '1' and '2' to X1 and X2.
player_df = data.frame(team = c('ARI', 'BAL', 'BAL', 'CLE', 'CLE'),
                       player = c('A', 'B', 'C', 'D', 'F'),
                       '1' = floor(runif(5, min=1, max=2)*10),
                       '2' = floor(runif(5, min=1, max=2)*10))
> player_df
  team player X1 X2
1  ARI      A 15 10
2  BAL      B 16 15
3  BAL      C 13 11
4  CLE      D 14 19
5  CLE      F 12 14

team_df = data.frame(team = c('ARI', 'BAL', 'CLE'),
                     '1' = floor(runif(3, min=10, max=20)*20),
                     '2' = floor(runif(3, min=10, max=20)*20))
> team_df
  team  X1  X2
1  ARI 281 205
2  BAL 362 309
3  CLE 323 238

Now convert both data frames to long format:

player_df = gather(player_df, week, player_value, -team, -player)
team_df = gather(team_df, week, team_value, -team)

> player_df
   team player week player_value
1   ARI      A   X1           15
2   BAL      B   X1           16
3   BAL      C   X1           13
4   CLE      D   X1           14
5   CLE      F   X1           12
6   ARI      A   X2           10
7   BAL      B   X2           15
8   BAL      C   X2           11
9   CLE      D   X2           19
10  CLE      F   X2           14
> team_df
  team week team_value
1  ARI   X1        281
2  BAL   X1        362
3  CLE   X1        323
4  ARI   X2        205
5  BAL   X2        309
6  CLE   X2        238
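
In current tidyr, gather() is marked as superseded in favour of pivot_longer(). A minimal sketch of the same reshaping step, assuming tidyr 1.0.0 or later; `player_wide` is a hypothetical two-row stand-in for the wide player table.

```r
library(tidyr)

# Hypothetical wide table in the same shape as the wide player_df
player_wide <- data.frame(
  team   = c("ARI", "BAL"),
  player = c("A", "B"),
  X1     = c(15, 16),
  X2     = c(10, 15)
)

# Every column except team and player is treated as a week column
player_long <- pivot_longer(
  player_wide,
  cols      = -c(team, player),
  names_to  = "week",
  values_to = "player_value"
)
print(player_long)
```

Unlike gather(), pivot_longer() keeps all weeks of one player together, so the rows come out ordered by player and not by week. The join that follows does not depend on row order.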

Now join (or merge) them together. By default, inner_join joins on all column names the two tables have in common, here team and week.

join_db = inner_join(player_df, team_df)
> join_db
   team player week player_value team_value
1   ARI      A   X1           15        281
2   BAL      B   X1           16        362
3   BAL      C   X1           13        362
4   CLE      D   X1           14        323
5   CLE      F   X1           12        323
6   ARI      A   X2           10        205
7   BAL      B   X2           15        309
8   BAL      C   X2           11        309
9   CLE      D   X2           19        238
10  CLE      F   X2           14        238

I think you can do a lot more with the data in this format.
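
One way to finish the job from the joined long table, sketched here as an assumption about what the asker wants and not as part of the answer above: compute the share with mutate() and, if a wide table is preferred, spread the weeks back into columns. `join_long` repeats a few rows of the joined table shown above; the column name `pct` is made up.

```r
library(dplyr)
library(tidyr)

# A few rows in the same shape as join_db
join_long <- data.frame(
  team         = c("ARI", "BAL", "BAL", "ARI", "BAL", "BAL"),
  player       = c("A", "B", "C", "A", "B", "C"),
  week         = c("X1", "X1", "X1", "X2", "X2", "X2"),
  player_value = c(15, 16, 13, 10, 15, 11),
  team_value   = c(281, 362, 362, 205, 309, 309)
)

# Share of the team's targets, one row per player and week
pct_long <- join_long %>%
  mutate(pct = player_value / team_value) %>%
  select(team, player, week, pct)

# Optional: back to one row per player with one column per week
pct_wide <- spread(pct_long, week, pct)
print(pct_wide)
```

New weeks only add rows to the long table, so this code does not change when more week columns appear in the source data.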

HTH

Stefan
