计算两个数据集的置信区间

我想计算置信区间（99％的水平，而不是95）为可变两岁dataframes，infert_control和infert_patient的平均值，其中：

infert_control = subset(infert$age, infert$case == 0) 
infert_patient = subset(infert$age, infert$case == 1)

infert是一个内置的R数据集，对于那些不熟悉它的人，这里是：病例0命名为对照组患者，病例1实际的。

> infert 
    education age parity induced case spontaneous stratum pooled.stratum 
1  0-5yrs 26  6  1 1   2  1    3 
2  0-5yrs 42  1  1 1   0  2    1 
3  0-5yrs 39  6  2 1   0  3    4 
4  0-5yrs 34  4  2 1   0  4    2 
5  6-11yrs 35  3  1 1   1  5    32 
6  6-11yrs 36  4  2 1   1  6    36 
7  6-11yrs 23  1  0 1   0  7    6 
8  6-11yrs 32  2  0 1   0  8    22 
9  6-11yrs 21  1  0 1   1  9    5 
10 6-11yrs 28  2  0 1   0  10    19 
11 6-11yrs 29  2  1 1   0  11    20 
... 
239 12+ yrs 38  6  0 0   2  74    63 
240 12+ yrs 26  2  1 0   1  75    49 
241 12+ yrs 31  1  1 0   0  76    45 
242 12+ yrs 31  2  0 0   1  77    53 
243 12+ yrs 25  1  0 0   1  78    41 
244 12+ yrs 31  1  0 0   1  79    45 
245 12+ yrs 34  1  0 0   0  80    47 
246 12+ yrs 35  2  2 0   0  81    54 
247 12+ yrs 29  1  0 0   1  82    43 
248 12+ yrs 23  1  0 0   1  83    40

什么是解决此问题的正确方法？

我已经计算age列的两个infert_control和infert_patient，加上每个子集的标准偏差的平均值。

来源

2014-01-19 Dimitris Sfounis

这是一个统计问题。但你可以报告正确的'分位数' – rawr

也许't.test'可以帮到你吗？ – Fernando

@Fernando由于本练习的第二部分需要't.test（）'，所以我认为我不应该在这部分中使用它，这是第一部分。这整个事情是一个介绍到R的任务。 –

你可以很容易地计算手动置信区间：

infert_control <- subset(infert$age, infert$case == 0) 

# calculate needed values 
m <- mean(infert_control) 
s <- sd(infert_control) 
n <- length(infert_control) 

# calculate error for normal distribution (choose you distribution here, e.g. qt for t-distribution) 
a <- 0.995 # 99% CI => 0.5% on both sides 
error <- qnorm(a)*s/sqrt(n) 

# calculate CI 
ci_lower <- m-error 
ci_upper <- m+error

参见http://en.wikipedia.org/wiki/Confidence_interval（对不起，维基百科的链接，但它有一个很好的解释和说明你的公式）

来源

2014-01-19 18:19:11 sgibb

非常感谢您的帮助。让我研究你的答案。 –

...或小功能：

cifun <- function(data, ALPHA){ 
    c(mean(data) - qnorm(1-ALPHA/2) * sd(data)/sqrt(length(data)), 
    mean(data) + qnorm(1-ALPHA/2) * sd(data)/sqrt(length(data))) 
} 

cifun(infert_control, 0.01)

来源

2014-01-19 18:27:02 holzben

您可以使用引导这个：

library(boot) 
set.seed(42) 
boot_mean <- boot(infert_control, function(x, i) mean(x[i]), R=1e4) 
quantile(boot_mean$t, probs=c(0.005, 0.995)) 
#  0.5% 99.5% 
# 30.47273 32.58182

或者，如果你不想使用图书馆：

set.seed(42) 
R <- 1e4 
boot_mean <- colMeans(
       matrix(
        sample(infert_control, R * length(infert_control), TRUE), 
        ncol=R)) 
quantile(boot_mean, probs=c(0.005, 0.995)) 
# 0.5% 99.5% 
#30.42424 32.55152

来源

2014-01-19 18:33:52 Roland

我很抱歉，但我不熟悉bootstrap库。 –

那么，请阅读文档或手动执行（请参阅我的编辑）。 – Roland

所以很多答案...

随机抽样的平均值具有t分布，不正常，尽管t -> N为df -> Inf。

cl <- function(data,p) { 
    n <- length(data) 
    cl <- qt(p/2,n-1,lower.tail=F)*sd(data)/sqrt(n) 
    m <- mean(data) 
    return(c(lower=m-cl,upper=m+cl)) 
} 
cl.control <- cl(infert_control,0.01) 
cl.control 
# lower upper 
# 30.42493 32.55689 

cl.patient <- cl(infert_patient,0.01) 
cl.patient 
# lower upper 
# 30.00221 33.05803 

aggregate(age~case,data=infert,cl,p=0.01) # much better way... 
# case age.lower age.upper 
# 1 0 30.42493 32.55689 
# 2 1 30.00221 33.05803

此外，分位数函数（e.q. qt(...)和qnorm(...)）默认返回下尾，所以你的限制将被逆转，除非你设置lower.tail=F

来源

2014-01-19 18:45:57 jlhoward

非常感谢@jhoward。看起来这将帮助我完成练习的第二部分。 –

计算两个数据集的置信区间

回答

相关问题