2016-08-17 90 views

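Reshaping a data frame in R based on the number of unique values

I am working with job application data in which each previous job is a row in an Excel file. I want to transform the dataset so that there is a set of columns for each past employer (1, 2, 3, 4, and so on).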

I think this problem is best explained with an example. How do I get from the start data frame to the desired data frame?

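I tried some melting and casting, but I got stuck: I do not want one column per unique company name, but rather a number of columns based on how many employers each id has.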

id <- c(1000,1000,1002,1007,1007,1007,1007,1009) 
employers <-c("Ikea","Subway","DISH","DISH","Ikea","Starbucks","Google","Google") 
start_date <- c("2/1/2013","5/1/2000","4/1/2012","3/1/2014","8/15/2011","4/15/2008","2/1/2004","3/15/2010") 
start <- data.frame(cbind(id,employers,start_date)) 
colnames(start) <- c("id","employers","start_date") 

start 

unique_id <- c(1000,1002,1007,1009) 
emp1 <- c("Ikea","DISH","DISH","Google") 
emp2 <- c("Subway",NA,"Ikea",NA) 
emp3 <- c(NA,NA,"Starbucks",NA) 
emp4 <- c(NA, NA,"Google",NA) 
emp1_start <- c("2/1/2013","4/1/2012","3/1/2014","3/15/2010") 
emp2_start <- c("5/1/2000",NA,"8/15/2011",NA) 
emp3_start <- c(NA,NA,"4/15/2008",NA) 
emp4_start <- c(NA,NA,"2/1/2004",NA) 
desired <- data.frame(cbind(unique_id,emp1,emp2,emp3,emp4,emp1_start,emp2_start,emp3_start,emp4_start)) 

desired 

`start$time <- with(start, ave(as.character(id), id, FUN = seq_along)); reshape(start, direction = "wide", idvar = "id", sep = "")`, from the other answer. – thelatemail


You forgot to rename the columns :-) (just kidding ... your programming-fu easily beats mine). – r2evans


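Thanks @thelatemail for finding the duplicate and for posting an answer using my example. Creating the timevar as suggested works well with my actual data, which is much larger and more complex. – andrea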
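
For readers who want to run the base R approach from thelatemail's comment, here is a minimal sketch. It rebuilds the question's example with character columns (an assumption; the original used `cbind`) and names the counter column `time` so that it matches the default `timevar` of `reshape()`. The resulting columns are named `employers1`, `start_date1`, and so on, so they would still need renaming to match the desired layout exactly.

```r
# Rebuild the question's example data (characters rather than factors).
start <- data.frame(
  id = c(1000, 1000, 1002, 1007, 1007, 1007, 1007, 1009),
  employers = c("Ikea", "Subway", "DISH", "DISH",
                "Ikea", "Starbucks", "Google", "Google"),
  start_date = c("2/1/2013", "5/1/2000", "4/1/2012", "3/1/2014",
                 "8/15/2011", "4/15/2008", "2/1/2004", "3/15/2010"),
  stringsAsFactors = FALSE
)

# Number the rows within each id (1, 2, 3, ...) in their original order.
start$time <- ave(seq_along(start$id), start$id, FUN = seq_along)

# Spread every column other than id and time across the within-id counter.
wide <- reshape(start, direction = "wide", idvar = "id",
                timevar = "time", sep = "")

print(wide)
```

Because `reshape()` widens every column that is neither the `idvar` nor the `timevar`, the same call should carry over to data with more per-row fields, as long as each row belongs to exactly one id.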

Answers


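Using your data (deliberately left as factors, which is what data.frame() produced by default before R 4.0.0; passing stringsAsFactors = FALSE avoids that):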

start <- data.frame(
      id=c( "1000",  "1000",  "1002",  "1007", 
        "1007",  "1007",  "1007",  "1009"), 
    employers=c( "Ikea", "Subway",  "DISH",  "DISH", 
        "Ikea", "Starbucks", "Google", "Google"), 
    start_date=c("2/1/2013", "5/1/2000", "4/1/2012", "3/1/2014", 
       "8/15/2011", "4/15/2008", "2/1/2004", "3/15/2010") 
) 

Would this work?

library(dplyr) 
library(tidyr) 

a <- start %>% 
    select(-start_date) %>% 
    group_by(id) %>% 
    mutate(emp = sprintf("emp%s", seq_len(n()))) %>% 
    ungroup() %>% 
    spread(emp, employers) 

b <- start %>% 
    select(-employers) %>% 
    group_by(id) %>% 
    mutate(emp = sprintf("emp%s_start", seq_len(n()))) %>% 
    ungroup() %>% 
    spread(emp, start_date) 

left_join(a, b, by = "id") 
# # A tibble: 4 x 9 
#  id emp1 emp2  emp3 emp4 emp1_start emp2_start emp3_start emp4_start 
# <fctr> <fctr> <fctr> <fctr> <fctr>  <fctr>  <fctr>  <fctr>  <fctr> 
# 1 1000 Ikea Subway  NA  NA 2/1/2013 5/1/2000   NA   NA 
# 2 1002 DISH  NA  NA  NA 4/1/2012   NA   NA   NA 
# 3 1007 DISH Ikea Starbucks Google 3/1/2014 8/15/2011 4/15/2008 2/1/2004 
# 4 1009 Google  NA  NA  NA 3/15/2010   NA   NA   NA 

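Thanks @r2evans. I will keep this in mind for the future. It works great for my simple example, but it is a bit cumbersome for my actual data, which also has multiple rows for past schools with their associated dates, GPA, etc., so the select() part is not straightforward. – andrea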
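
One possible way around the select() step, sketched here under the assumption that tidyr 1.0.0 or later is installed: `pivot_wider()` accepts several columns in `values_from`, so the employer and date columns can be widened in a single pass. The counter column name `job` is hypothetical, and the output columns use tidyr's default naming (`employers_1`, `start_date_1`, ...) rather than the names in the question. This is not the method from the answer above, just an illustration of the same idea with a newer function.

```r
library(dplyr)
library(tidyr)

# Rebuild the question's example data (characters rather than factors).
start <- data.frame(
  id = c(1000, 1000, 1002, 1007, 1007, 1007, 1007, 1009),
  employers = c("Ikea", "Subway", "DISH", "DISH",
                "Ikea", "Starbucks", "Google", "Google"),
  start_date = c("2/1/2013", "5/1/2000", "4/1/2012", "3/1/2014",
                 "8/15/2011", "4/15/2008", "2/1/2004", "3/15/2010"),
  stringsAsFactors = FALSE
)

wide <- start %>%
  group_by(id) %>%
  mutate(job = row_number()) %>%   # within-id counter: 1, 2, 3, ...
  ungroup() %>%
  pivot_wider(
    names_from = job,                        # one set of columns per counter value
    values_from = c(employers, start_date)   # widen both value columns at once
  )

print(wide)
```

Additional per-row fields (for example a GPA column) could be added to `values_from` in the same way.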