2010-06-09 65 views
8

我正在尝试找到合适的显示来说明学校内部和学校之间的各种属性。对于每个班级,只有15-30个数据点(学生)。ggplot2 boxplot中多个异常值抖动

现在我倾向于无晶须的盒状图,只显示1.,2。和3.四分位数+数据点多于然后例如1人口SD +/-样本中值。

这个我可以做。

但是 - 我需要向一些老师展示这张图,以便衡量他们最喜欢的东西。我想比较我的图和普通的箱子图。但是,如果仅存在一个异常值,则普通盒图看起来是相同的,或者例如5个异常值相同。在这种情况下,这将是一个破坏交易的行为。

例如

test <-structure(list(value = c(3, 5, 3, 3, 6, 4, 5, 4, 6, 4, 6, 4, 
4, 6, 5, 3, 3, 4, 4, 4, 3, 4, 4, 4, 3, 4, 5, 6, 6, 4, 3, 5, 4, 
6, 5, 6, 4, 5, 5, 3, 4, 4, 6, 4, 4, 5, 5, 3, 4, 5, 8, 8, 8, 8, 
9, 6, 6, 7, 6, 9), places = structure(c(1L, 2L, 1L, 1L, 1L, 2L, 
1L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 
2L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 
2L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 
1L, 2L, 2L, 1L, 2L, 1L), .Label = c("a", "b"), class = "factor")), .Names = c("value", 
"places"), row.names = c(NA, -60L), class = "data.frame") 

ggplot(test, aes(x=places,y=value))+geom_boxplot() 

这里有两个异常值(“a”,9) - 但只显示一个“点”。

所以我的问题:如何抖动异常值。而且 - 你会为这类数据提出什么样的显示?

回答

8

可以redifine的funcition

GeomBoxplot$draw<-function (., data, ..., outlier.colour = "black", outlier.shape = 16, 
    outlier.size = 2, outlier.jitter=0) 
{ 
    defaults <- with(data, data.frame(x = x, xmin = xmin, xmax = xmax, 
     colour = colour, size = size, linetype = 1, group = 1, 
     alpha = 1, fill = alpha(fill, alpha), stringsAsFactors = FALSE)) 
    defaults2 <- defaults[c(1, 1), ] 
     if (!is.null(data$outliers) && length(data$outliers[[1]] >= 
     1)) { 
      pp<-position_jitter(width=outlier.jitter,height=0) 
      p<-pp$adjust(data.frame(x=data$x[rep(1, length(data$outliers[[1]]))], y=data$outliers[[1]]),.scale) 
     outliers_grob <- GeomPoint$draw(data.frame(x=p$x, y = p$y, colour = I(outlier.colour), 
      shape = outlier.shape, alpha = 1, size = outlier.size, 
      fill = NA), ...) 
    } 
    else { 
     outliers_grob <- NULL 
    } 
    with(data, ggname(.$my_name(), grobTree(outliers_grob, GeomPath$draw(data.frame(y = c(upper, 
     ymax), defaults2), ...), GeomPath$draw(data.frame(y = c(lower, 
     ymin), defaults2), ...), GeomRect$draw(data.frame(ymax = upper, 
     ymin = lower, defaults), ...), GeomRect$draw(data.frame(ymax = middle, 
     ymin = middle, defaults), ...)))) 
} 

ggplot(test, aes(x=places,y=value))+geom_boxplot(outlier.jitter=0.05) 

这是临时的解决方案。当然,就OOP而言,您应该创建一个GeomBoxplot的子类并覆盖该函数。这很容易,因为ggplot2很好。添加例如子类定义的

=== ===

GeomBoxplotJitterOutlier <- proto(GeomBoxplot, { 
    draw <- function (., data, ..., outlier.colour = "black", outlier.shape = 16, 
    outlier.size = 2, outlier.jitter=0) { 
# copy the body of function 'draw' above and paste here. 
} 

    objname <- "boxplot_jitter_outlier" 
    desc <- "Box and whiskers plot with jittered outlier" 
    guide_geom <- function(.) "boxplot_jitter_outlier" 

}) 
geom_boxplot_jitter_outlier <- GeomBoxplotJitterOutlier$build_accessor() 

那么你可以与你的子类来完成:

ggplot(test, aes(x=places,y=value))+geom_boxplot_jitter_outlier(outlier.jitter=0.05) 
+0

这看起来不错! - 谢谢。我将如何去创建一个子类?在hadleys的书中我找不到这个参考 - 而且我不喜欢OOP :-) – Andreas 2010-06-10 13:27:41

+0

我举了一个例子。 – kohske 2010-06-11 00:03:35

+0

'build_accessor'现在必须由'new'替代(截至2011年6月),'GeomBoxplot'需要使用'ggplot2 :::'访问,否则这仍然可以正常工作。非常感谢! – krlmlr 2014-07-31 02:40:21

1

这是否让你找到你想要的东西?抖动开始位置的限制不是自动的,但它是一个开始。

g = ggplot(test, aes(x = places,y = value)) 

g + geom_boxplot(outlier.colour = rgb(0,0,0,0)) + geom_point(data = test[test$value > 8,], position = position_jitter(width = .4)) 
+0

谢谢greg。但是我认为你的解决方案会导致许多数据点。两个实际点+异常值的boxplot点。如果我可以告诉ggplot不要绘制异常值,这将是一个解决方案(结合stat_boxplot和我猜测的一些魔法),但geom_boxplot(outlier.color = NULL)不起作用。 – Andreas 2010-06-10 00:40:38

+0

通过上面添加的outlier.colour参数,我认为它会起作用。 – Greg 2010-06-10 01:04:40

+0

Offcause - 四个零代替三个而非常重要。使用position_jitter(w = .1,h = 0)) - 它看起来很好 - 但很明显,ggplot仍然为离群值点留下空间。 这不是美丽的 - 但我肯定的作品:-) – Andreas 2010-06-10 13:25:48

1

鉴于小的数据点,你想绘制所有的点不仅是异常值。这将有助于找出箱内积分的分布情况。

您可以使用geom_jitter来做到这一点,但请注意,box_plot已经为异常值绘制点,所以为了不显示它们两次,您需要关闭箱线图的异常值显示geom_boxplot(outlier.shape = NA)

library("ggplot2") 

test <-structure(list(value = c(3, 5, 3, 3, 6, 4, 5, 4, 6, 4, 6, 4, 4, 6, 5, 3, 3, 4, 4, 4, 3, 4, 4, 4, 3, 4, 5, 6, 6, 4, 3, 5\ 
, 4, 6, 5, 6, 4, 5, 5, 3, 4, 4, 6, 4, 4, 5, 5, 3, 4, 5, 8, 8, 8, 8, 9, 6, 6, 7, 6, 9), places = structure(c(1L, 2L, 1L, 1L, 1L\ 
, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, \ 
1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 2L, 1L), .Label = c("a", "b"), class =\ 
"factor")), .Names = c("value", "places"), row.names = c(NA, -60L), class = "data.frame") 

# adding a level that you will use latter for giving colors 
l <- rep(c(10,20,30,40,50,60), 10) 
test$levels<-l 

# [1] 
# original plot 
ggplot(test, aes(x=places,y=value))+geom_boxplot() 

# [2] 
# plot with outlier from boxplot and the points jittered to see 
# distribution (outliers and the same point from position jitter would be 
# counted twice for each different height) 
dev.new() 
ggplot(data=test, aes(x=places, y=value)) + geom_boxplot() + geom_jitter(position=position_jitter(width=0.1, height=0)) 

# [3] 
# make wider the jitter to avoid overplotting because there are a lot 
# of points with the same value, also remove the outliers from boxplot 
# (they are plotted with the geom_jitter anyway) 
dev.new() 
ggplot(data=test, aes(x=places, y=value)) + geom_boxplot(outlier.shape = NA) + 
    geom_jitter(position=position_jitter(width=0.3, height=0)) 

# [4] 
# adding colors to the points to see if there is a sub-pattern in the distribution 
dev.new() 
ggplot(data=test, aes(x=places, y=value)) + geom_boxplot(outlier.shape = NA) + 
    geom_jitter(position=position_jitter(width=0.3, height=0), aes(colour=levels)) 

# [5] 
# adding a bit of vertical jittering 
# jittering (a good option for a less discrete datasets) 
dev.new() 
ggplot(data=test, aes(x=places, y=value)) + geom_boxplot(outlier.shape = NA) + 
    geom_jitter(position=position_jitter(width=0.3, height=0.05), aes(colour=levels)) 

# [6] 
# finally remember that position_jitter makes a jittering of a 40% of 
# the resolution of the data, so if you forget the height=0 you will 
# have a total different picture 
dev.new() 
ggplot(data=test, aes(x=places, y=value)) + geom_boxplot(outlier.shape = NA) + 
    geom_jitter(position=position_jitter(width=0.2)) 

enter image description here

6

这似乎是公认的答案不工作了,因为GGPLOT2已经更新。 的净多搜索后我发现下面的:http://comments.gmane.org/gmane.comp.lang.r.ggplot2/3616 -Look在温斯顿畅的

他计算的异常值分别使用ddply reply-,然后将它们用

geom_dotplot() 

已经上禁用的异常输出plotts所述geom_boxplot():

geom_boxplot(outlier.colour = NA) 

下面是从URL的完整代码上面提到:

# This returns a data frame with the outliers only 
find_outliers <- function(y, coef = 1.5) { 
    qs <- c(0, 0.25, 0.5, 0.75, 1) 
    stats <- as.numeric(quantile(y, qs)) 
    iqr <- diff(stats[c(2, 4)]) 

    outliers <- y < (stats[2] - coef * iqr) | y > (stats[4] + coef * iqr) 

    return(y[outliers]) 
} 


library(MASS) # Use the birthwt data set from MASS 

# Find the outliers for each level of 'smoke' 
library(plyr) 
outlier_data <- ddply(birthwt, .(smoke), summarise, lwt = find_outliers(lwt)) 


# This draws an ordinary box plot 
ggplot(birthwt, aes(x = factor(smoke), y = lwt)) + geom_boxplot() 


# This draws the outliers using geom_dotplot 
ggplot(birthwt, aes(x = factor(smoke), y = lwt)) + 
    geom_boxplot(outlier.colour = NA) + 
#also consider: 
# geom_jitter(alpha = 0.5, size = 2)+ 
    geom_dotplot(data = outlier_data, binaxis = "y", 
       stackdir = "center", binwidth = 4) 
0

代码居住不再工作。对于当前版本的ggplot2,我使用了以下类:

DrawGeomBoxplotJitterOutlier <- function(data, panel_params, coord, ..., 
             outlier.jitter.width=NULL, 
             outlier.jitter.height=0, 
             outlier.colour = NULL, 
             outlier.fill = NULL, 
             outlier.shape = 19, 
             outlier.size = 1.5, 
             outlier.stroke = 0.5, 
             outlier.alpha = NULL) { 
    boxplot_grob <- ggplot2::GeomBoxplot$draw_group(data, panel_params, coord, ...) 
    point_grob <- grep("geom_point.*", names(boxplot_grob$children)) 
    if (length(point_grob) == 0) 
    return(boxplot_grob) 

    ifnotnull <- function(x, y) ifelse(is.null(x), y, x) 

    if (is.null(outlier.jitter.width)) { 
    outlier.jitter.width <- (data$xmax - data$xmin)/2 
    } 

    x <- data$x[1] 
    y <- data$outliers[[1]] 
    if (outlier.jitter.width > 0 & length(y) > 1) { 
    x <- jitter(rep(x, length(y)), amount=outlier.jitter.width) 
    } 

    if (outlier.jitter.height > 0 & length(y) > 1) { 
    y <- jitter(y, amount=outlier.jitter.height) 
    } 

    outliers <- data.frame(
    x = x, y = y, 
    colour = ifnotnull(outlier.colour, data$colour[1]), 
    fill = ifnotnull(outlier.fill, data$fill[1]), 
    shape = ifnotnull(outlier.shape, data$shape[1]), 
    size = ifnotnull(outlier.size, data$size[1]), 
    stroke = ifnotnull(outlier.stroke, data$stroke[1]), 
    fill = NA, 
    alpha = ifnotnull(outlier.alpha, data$alpha[1]), 
    stringsAsFactors = FALSE 
) 
    boxplot_grob$children[[point_grob]] <- ggplot2::GeomPoint$draw_panel(outliers, panel_params, coord) 



    return(boxplot_grob) 
} 

GeomBoxplotJitterOutlier <- ggplot2::ggproto("GeomBoxplotJitterOutlier", 
              ggplot2::GeomBoxplot, 
              draw_group = DrawGeomBoxplotJitterOutlier) 

geom_boxplot_jitter_outlier <- function(mapping = NULL, data = NULL, 
             stat = "boxplot", position = "dodge", 
             ..., outlier.jitter.width=0, 
             outlier.jitter.height=NULL, 
             na.rm = FALSE, show.legend = NA, 
             inherit.aes = TRUE) { 
    ggplot2::layer(
    geom = GeomBoxplotJitterOutlier, mapping = mapping, data = data, 
    stat = stat, position = position, show.legend = show.legend, 
    inherit.aes = inherit.aes, params = list(na.rm = na.rm, 
     outlier.jitter.width=outlier.jitter.width, 
     outlier.jitter.height=outlier.jitter.height, ...)) 
}