2014-02-21

Passing the current value of a ddply split on to a function

Here is some sample data of names that I want to encode by gender over time:

names_to_encode <- structure(list(names = structure(c(2L, 2L, 1L, 1L, 3L, 3L), .Label = c("jane", "john", "madison"), class = "factor"), year = c(1890, 1990, 1890, 1990, 1890, 2012)), .Names = c("names", "year"), row.names = c(NA, -6L), class = "data.frame") 

Here is a minimal subset of the Social Security data, limited to just those names in 1890 and 1990:

ssa_demo <- structure(list(name = c("jane", "jane", "john", "john", "madison", "madison"), year = c(1890L, 1990L, 1890L, 1990L, 1890L, 1990L), female = c(372, 771, 56, 81, 0, 1407), male = c(0, 8, 8502, 29066, 14, 145)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L), .Names = c("name", "year", "female", "male")) 

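I have defined a function that subsets the Social Security data given a year or a range of years. In other words, it decides whether a name was male or female over a given time period by calculating the proportion of male and female births with that name. Here is the function along with a helper function: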

require(plyr) 
require(dplyr) 

select_ssa <- function(years) { 

    # If we get only one year (1890) convert it to a range of years (1890-1890) 
    if (length(years) == 1) years <- c(years, years) 

    # Calculate the male and female proportions for the given range of years 
    ssa_select <- ssa_demo %.% 
    filter(year >= years[1], year <= years[2]) %.% 
    group_by(name) %.% 
    summarise(female = sum(female), 
       male = sum(male)) %.% 
    mutate(proportion_male = round((male/(male + female)), digits = 4), 
      proportion_female = round((female/(male + female)), digits = 4)) %.% 
    mutate(gender = sapply(proportion_female, male_or_female)) 

    return(ssa_select) 
} 

# Helper function to determine whether a name is male or female in a given year 
male_or_female <- function(proportion_female) { 
    if (proportion_female > 0.5) { 
    return("female") 
    } else if(proportion_female == 0.5000) { 
    return("either") 
    } else { 
    return("male") 
    } 
} 

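What I would like to do now is use plyr, specifically ddply, to split the data to be encoded by year, and merge each of those pieces with the value returned by the select_ssa function. This is my code.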

ddply(names_to_encode, .(year), merge, y = select_ssa(year), by.x = "names", by.y = "name", all.x = TRUE) 

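This command works just fine if I hard-code a value such as 1890 as the argument when calling select_ssa(year). But when I try to pass it the current value of year that ddply is working with, I get an error message: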

Error in filter_impl(.data, dots(...), environment()) : 
    (list) object cannot be coerced to type 'integer' 

How can I pass on the current value of year that ddply is working with?

Answer

I think you are making things complicated by trying to do a join inside ddply. If I were using dplyr, I would probably do something more like this:

names_to_encode <- structure(list(name = structure(c(2L, 2L, 1L, 1L, 3L, 3L), .Label = c("jane", "john", "madison"), class = "factor"), year = c(1890, 1990, 1890, 1990, 1890, 2012)), .Names = c("name", "year"), row.names = c(NA, -6L), class = "data.frame") 

ssa_demo <- structure(list(name = c("jane", "jane", "john", "john", "madison", "madison"), year = c(1890L, 1990L, 1890L, 1990L, 1890L, 1990L), female = c(372, 771, 56, 81, 0, 1407), male = c(0, 8, 8502, 29066, 14, 145)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L), .Names = c("name", "year", "female", "male")) 

names_to_encode$name <- as.character(names_to_encode$name) 
names_to_encode$year <- as.integer(names_to_encode$year) 

tmp <- left_join(ssa_demo,names_to_encode) %.% 
     group_by(year,name) %.% 
     summarise(female = sum(female), 
       male = sum(male)) %.% 
     mutate(proportion_male = round((male/(male + female)), digits = 4), 
      proportion_female = round((female/(male + female)), digits = 4)) %.% 
     mutate(gender = ifelse(proportion_female == 0.5,"either", 
         ifelse(proportion_female > 0.5,"female","male"))) 

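Note that dplyr 0.1.1 is still a bit picky about the types of the join columns, so I had to convert them. I think I saw some activity on GitHub suggesting that this is fixed in the dev version, or at least that it is something they are working on.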
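As for the literal question, here is a minimal sketch of one way to hand the per-piece year to a function with ddply. As far as I can tell, extra arguments such as `y = select_ssa(year)` are evaluated once in the calling environment rather than once per piece, so `year` never refers to the current split. Wrapping the merge in an anonymous function lets you read the year from the piece itself. Note that `select_ssa_year()` is a hypothetical base R stand-in for `select_ssa()` that handles a single year only, and the data frames are rebuilt with `data.frame()` from the same values as above.

```r
library(plyr)

# Same values as the sample data in the question
names_to_encode <- data.frame(
  names = c("john", "john", "jane", "jane", "madison", "madison"),
  year  = c(1890, 1990, 1890, 1990, 1890, 2012),
  stringsAsFactors = FALSE
)

ssa_demo <- data.frame(
  name   = c("jane", "jane", "john", "john", "madison", "madison"),
  year   = c(1890L, 1990L, 1890L, 1990L, 1890L, 1990L),
  female = c(372, 771, 56, 81, 0, 1407),
  male   = c(0, 8, 8502, 29066, 14, 145),
  stringsAsFactors = FALSE
)

# Hypothetical single-year stand-in for select_ssa() (an assumption, not
# the asker's function): subset one year and classify each name
select_ssa_year <- function(yr) {
  out <- ssa_demo[ssa_demo$year == yr, c("name", "female", "male")]
  out$proportion_female <- round(out$female / (out$male + out$female), 4)
  out$gender <- ifelse(out$proportion_female == 0.5, "either",
                       ifelse(out$proportion_female > 0.5, "female", "male"))
  out
}

# The anonymous function receives each piece, so the current year is
# available as piece$year[1]
encoded <- ddply(names_to_encode, .(year), function(piece) {
  merge(piece, select_ssa_year(piece$year[1]),
        by.x = "names", by.y = "name", all.x = TRUE)
})
```

Because of `all.x = TRUE`, a year with no Social Security rows (2012 here) keeps its names and gets NA in the merged columns.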

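This works well for these data sets. My difficulty is that I am writing this for an R package, so I cannot assume that the name column is called 'name' and the year column is called 'year' in the user's data. In a previous question, I learned that dplyr does not let you specify which columns to join on. Should I force users to rename their columns? –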
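For what it is worth, here is a sketch of how the column-name problem might be handled, assuming a dplyr release newer than the 0.1.1 discussed in this thread, where `left_join()` accepts a named `by` vector that maps columns of the left table to columns of the right one. The wrapper `join_ssa()`, the `users` data frame, and its column names are hypothetical and only there for illustration.

```r
library(dplyr)

ssa_demo <- data.frame(
  name   = c("jane", "jane", "john", "john", "madison", "madison"),
  year   = c(1890L, 1990L, 1890L, 1990L, 1890L, 1990L),
  female = c(372, 771, 56, 81, 0, 1407),
  male   = c(0, 8, 8502, 29066, 14, 145),
  stringsAsFactors = FALSE
)

# Hypothetical user data whose columns are not called 'name' and 'year'
users <- data.frame(
  first_name = c("jane", "john"),
  birth_year = c(1890L, 1990L),
  stringsAsFactors = FALSE
)

# Hypothetical wrapper: the caller supplies the column names as strings.
# setNames() builds c(first_name = "name", birth_year = "year"), which
# maps the user's columns (vector names) to the ssa_demo columns (values).
join_ssa <- function(data, name_col, year_col) {
  by <- setNames(c("name", "year"), c(name_col, year_col))
  left_join(data, ssa_demo, by = by)
}

joined <- join_ssa(users, "first_name", "birth_year")
```

The result keeps the user's own column names, so nothing has to be renamed on their side.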

@LincolnMullen You can use `regroup` to do the grouping programmatically in dplyr, if that helps. See [here](http://stackoverflow.com/q/21815060/324364). – joran
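In case it helps anyone reading this with a recent dplyr: I do not believe `regroup` is available any more, and the sketch below assumes dplyr 1.0 or later, where grouping by column names held in a character vector can be written with `across()` and `all_of()`. The helper `summarise_by()` is hypothetical and only illustrates the pattern.

```r
library(dplyr)

ssa_demo <- data.frame(
  name   = c("jane", "jane", "john", "john", "madison", "madison"),
  year   = c(1890L, 1990L, 1890L, 1990L, 1890L, 1990L),
  female = c(372, 771, 56, 81, 0, 1407),
  male   = c(0, 8, 8502, 29066, 14, 145),
  stringsAsFactors = FALSE
)

# Hypothetical helper: group_cols is a character vector of column names,
# so the grouping columns can be chosen at run time
summarise_by <- function(data, group_cols) {
  data %>%
    group_by(across(all_of(group_cols))) %>%
    summarise(female = sum(female), male = sum(male), .groups = "drop")
}

by_name <- summarise_by(ssa_demo, "name")
by_name_year <- summarise_by(ssa_demo, c("name", "year"))
```

Grouping by `"name"` alone sums births across both years, while `c("name", "year")` leaves one row per name and year.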