2017-07-13
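Error when running Poisson regression with a binary outcome

I am trying to run a Poisson regression to predict a common binary outcome.

This is my first attempt at using dput. If I have used it incorrectly, please let me know so that I can correct it.

Example data: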

df <- structure(list(id = 1:30, sex = structure(c(1L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 
2L, 2L, 2L, 1L, 2L, 1L, 2L, 1L, 1L), .Label = c("Female", "Male" 
), class = "factor"), migStat = structure(c(1L, 2L, 1L, 1L, 1L, 
1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 
1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L), .Label = c("Australian-born", 
"Migrant"), class = "factor"), mhAreaBi = structure(c(1L, 1L, 
1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 2L, 
1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L), .Label = c("Metropolitan", 
"Regional"), class = "factor"), empStatBi = structure(c(2L, 2L, 
1L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 
2L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Student/employed", 
"Unemployed"), class = "factor"), pensBenBi = structure(c(1L, 
2L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 
1L, 2L, 1L, 1L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 2L), .Label = c("No benefit", 
"In receipt of pension benefit"), class = "factor"), maritStatBi = structure(c(2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L), .Label = c("Married (including de facto)", 
"Not married"), class = "factor"), cto = structure(c(1L, 2L, 
2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 
2L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 2L, 2L), .Label = c("No", 
"Yes"), class = "factor")), .Names = c("id", "sex", "migStat", 
"mhAreaBi", "empStatBi", "pensBenBi", "maritStatBi", "cto"), row.names = c(NA, 
-30L), class = "data.frame") 

When I run the regression using glm in R, I receive an error:

fit <- glm(cto ~ sex + migStat + mhAreaBi + empStatBi + pensBenBi + maritStatBi, df, family = poisson) 

Error in if (any(y < 0)) stop("negative values not allowed for the 'Poisson' family") : 
    missing value where TRUE/FALSE needed 
In addition: Warning message: 
In Ops.factor(y, 0) : ‘<’ not meaningful for factors 

The same error has been briefly explained in this thread:

Because the "<" operator is not defined for factors the result that is passed to if is of length 0. Setting the factor variable on the RHS and using the integer values on the LHS succeeds.
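
A small sketch of what appears to happen inside the Poisson family's check, using a hypothetical three-element factor `y` rather than the data above. As far as I can tell, in current R the comparison returns NA values (together with the warning shown above) rather than a zero-length result, and `if` then fails on the NA condition, which matches the wording of the error message.

```r
# Hypothetical factor standing in for the outcome column
y <- factor(c("No", "Yes", "Yes"))

# '<' is not meaningful for factors: R warns and returns NA for each element
cmp <- suppressWarnings(y < 0)
print(cmp)

# any() over a vector of NAs is itself NA
print(any(cmp))

# if (NA) is an error; capture its message instead of stopping
res <- tryCatch(
  if (any(cmp)) "negative" else "ok",
  error = function(e) conditionMessage(e)
)
print(res)
```

With a numeric 0/1 outcome the comparison `y < 0` returns ordinary TRUE/FALSE values, so the check passes.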

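The error does not occur when the outcome is converted to an integer; however, this:

  1. seems to defeat the purpose of predicting a binary outcome (unless a numeric variable with a range of 0-1 is treated the same as a factor variable with two levels); and
  2. seems unnecessary (at least according to this post, which uses geeglm from geepack to predict a binary outcome [unfortunately, when I adapted that code to my own dataset, I received the same error])

Questions:

Can I get a further explanation of the error?

If I convert the outcome to an integer with a range of 0-1, will glm treat it the same as a binary variable? If not, is there a method better suited to regression analysis of a common binary outcome?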

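'as.numeric(df$cto == "Yes")' will give you 0s and 1s, which work fine in 'glm'. However, you would normally use logistic regression for a binary outcome like this, and Poisson regression for count or rate variables, where the outcome can take any non-negative integer value. Are you sure Poisson is a good choice for your analysis? – Marius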

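@Marius I appreciate the tip! At university I was taught that logistic regression is used for binary outcomes and Poisson regression for count data. Recently a statistician at my university told me that logistic regression is appropriate when the binary outcome is rare, but runs into trouble when the outcome is common, and that Poisson regression is preferable in those cases. Here is a link about it on CV - [link](https://stats.stackexchange.com/questions/18595/poisson-regression-to-estimate-relative-risk-for-binary-outcomes) –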
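
One way to read that advice, sketched on hypothetical toy data (the `toy` data frame and its `exposed` and `event` columns are made up for illustration and are not from the question): with a common outcome, the odds ratio from logistic regression moves away from the risk ratio, while a Poisson fit with its default log link on the 0/1 outcome gives exponentiated coefficients that can be read as risk ratios. The sketch does not adjust the standard errors, which such analyses usually do with a robust (sandwich) estimator.

```r
# Hypothetical toy data: 10 exposed (6 events) and 10 unexposed (3 events)
toy <- data.frame(
  exposed = rep(c(1, 0), each = 10),
  event   = c(rep(c(1, 0), times = c(6, 4)),
              rep(c(1, 0), times = c(3, 7)))
)

# Poisson regression (log link) on the numeric 0/1 outcome
pois_fit <- glm(event ~ exposed, data = toy, family = poisson)

# Logistic regression on the same data for comparison
logit_fit <- glm(event ~ exposed, data = toy, family = binomial)

# Exponentiated slope: risk ratio (Poisson) versus odds ratio (logistic)
risk_ratio <- unname(exp(coef(pois_fit)["exposed"]))
odds_ratio <- unname(exp(coef(logit_fit)["exposed"]))

print(risk_ratio)  # about 2, i.e. 0.6 / 0.3
print(odds_ratio)  # about 3.5, i.e. (0.6 / 0.4) / (0.3 / 0.7)
```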

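Regarding your point 1, the statement after "unless" is correct: converting to a 0/1 binary variable (i.e. a dummy variable) is exactly what you want to do. In the 'geeglm' example you linked, the outcome is coded as 'TRUE' and 'FALSE' (that is, '1' and '0'), which is why they did not convert it in that post and were still able to run the regression. – paqmo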

Answer

I think the best option here is:

df$cto_binary <- as.numeric(df$cto == "Yes") 
fit <- glm(cto_binary ~ sex + migStat + mhAreaBi + empStatBi + pensBenBi + maritStatBi, 
      df, family = poisson) 

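This way your code shows explicitly which value counts as the 1/success of your binary outcome, and you don't get tripped up by the ordering of the factor levels. Note that in R, as.numeric(c(FALSE, TRUE)) gives c(0, 1), so you always know what you will get from a logical comparison.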
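
To illustrate the point about level ordering, here is a minimal sketch using a hypothetical four-element factor (not the question's data). Converting the factor directly returns the internal level codes, which depend on how the levels are ordered, whereas the logical comparison gives the same 0/1 coding either way.

```r
# Hypothetical outcome with levels in the same order as in the question
cto <- factor(c("No", "Yes", "Yes", "No"), levels = c("No", "Yes"))

# Direct conversion returns level codes (1 and 2), not 0 and 1
codes <- as.numeric(cto)
print(codes)

# Same values, but with the level order reversed
cto_rev <- factor(cto, levels = c("Yes", "No"))
codes_rev <- as.numeric(cto_rev)
print(codes_rev)  # the codes flip

# The explicit comparison does not depend on level order
bin <- as.numeric(cto == "Yes")
bin_rev <- as.numeric(cto_rev == "Yes")
print(bin)
print(bin_rev)
```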