dataviz

September 6, 2012

**Update Oct-23: Added a new parameter**

add

**to the function. Now multiple groups can be plotted in a single plot (see example in my comment)**

As a follow-up on my R implementation of Solomon’s watercolor plots, I made some improvements to the function. I fine-tuned the graphical parameters (the median smoother line now diminishes faster with increasing CIs, and the shaded watercolors look more pretty). Furthermore, the function is faster and has more features:

- You can define any standard regression function for the bootstrap procedure.
- vwReg(y ~ x, df, method=lm)
- vwReg(y ~ x + I(x^2), df, method=lm)

- Provide parameters for the fitting function.
- You can make the smoother’s span larger. Then it takes more points into account when doing the local fitting. Per default, the smoother fits a polynomial of degree two – that means as you increase span you will approach the overall quadratic fit: vwReg(y ~ x, df, span=2)
- You can also make the smoother’s span smaller, then it takes less points for local fitting. If it is too small, it will overfit and approach each single data point. The default span (.75) seemed to be the best choice for me for a variety of data sets: vwReg(y ~ x, df, span=0.5)
- Use a robust M-estimator for the smoother; see ?loess for details: vwReg(y ~ x, df, family=”symmetric”)

- Provide your own color scheme (or, for example, a black-and-white scheme). Examples see pictures below.
- Quantize the color ramp, so that regions for 1, 2, and 3 SD have the same color (an idea proposed by John Mashey).

At the end of this post is the source code for the R function.

Here are some variants of the watercolor plots – at the end, you can vote for your favorite (or write something into the comments). I am still fine-tuning the default parameters, and I am interested in your opinions what would be the best default.

**Plot 1: The current default**

**Plot 2: Using an M-estimator for bootstrap smoothers.** Usually you get wider confidence intervalls.

**Plot 3:Increasing the span of the smoothers**

**Plot 4: Decreasing the span of the smoothers**

**Plot 5: Changing the color scheme**, using a predefined ColorBrewer palette. You can see all available palettes by using this command: library(RColorBrewer); display.brewer.all()

**Plot 6: Using a custom-made palette**

**Plot 7: Using a custom-made palette;** with the parameter *bias* you can shift the color ramp to the “higher” colors:

**Plot 8: A black and white version of the plot**

**Plot 9: The anti-Tufte-plot**: Using as much ink as possible by reversing black and white (a.k.a. “*the Milky-Way Plot*“)

**Plot 10: The Northern Light Plot/ fMRI plot.** This plotting technique already has been used by a suspicious company (called

**Plot 11: The 1-2-3-SD plot.** You can use your own color schemes as well, e.g.: vwReg(y~x, df, bw=TRUE, quantize=”SD”)

Any comments or ideas? Or just a vote? If you produce some nice plots with your data, you can send it to me, and I will post a gallery of the most impressive “data art”!

Cheers,

Felix

Loading ...

# Copyright 2012 Felix Schönbrodt

# All rights reserved.

#

# FreeBSD License

#

# Redistribution and use in source and binary forms, with or without

# modification, are permitted provided that the following conditions are

# met:

#

# 1. Redistributions of source code must retain the above copyright

# notice, this list of conditions and the following disclaimer.

#

# 2. Redistributions in binary form must reproduce the above copyright

# notice, this list of conditions and the following disclaimer in the

# documentation and/or other materials provided with the distribution.

#

# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDER `AS IS'' AND ANY

# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE

# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR

# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR

# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,

# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,

# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR

# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF

# LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING

# NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS

# SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

#

# The views and conclusions contained in the software and documentation

# are those of the authors and should not be interpreted as representing

# official policies, either expressed or implied, of the copyright

# holder.

# Version history:

# 0.1: original code

# 0.1.1: changed license to FreeBSD; re-established compability to ggplot2 (new version 0.9.2)

## Visually weighted regression / Watercolor plots

## Idea: Solomon Hsiang, with additional ideas from many blog commenters

# B = number bootstrapped smoothers

# shade: plot the shaded confidence region?

# shade.alpha: should the CI shading fade out at the edges? (by reducing alpha; 0 = no alpha decrease, 0.1 = medium alpha decrease, 0.5 = strong alpha decrease)

# spag: plot spaghetti lines?

# spag.color: color of spaghetti lines

# mweight: should the median smoother be visually weighted?

# show.lm: should the linear regresison line be plotted?

# show.CI: should the 95% CI limits be plotted?

# show.median: should the median smoother be plotted?

# median.col: color of the median smoother

# shape: shape of points

# method: the fitting function for the spaghettis; default: loess

# bw = TRUE: define a default b&w-palette

# slices: number of slices in x and y direction for the shaded region. Higher numbers make a smoother plot, but takes longer to draw. I wouldn'T go beyond 500

# palette: provide a custom color palette for the watercolors

# ylim: restrict range of the watercoloring

# quantize: either "continuous", or "SD". In the latter case, we get three color regions for 1, 2, and 3 SD (an idea of John Mashey)

# add: if add == FALSE, a new ggplot is returned. If add == TRUE, only the elements are returned, which can be added to an existing ggplot (with the '+' operator)

# ...: further parameters passed to the fitting function, in the case of loess, for example, "span = .9", or "family = 'symmetric'"

vwReg <- function(formula, data, title="", B=1000, shade=TRUE, shade.alpha=.1, spag=FALSE, spag.color="darkblue", mweight=TRUE, show.lm=FALSE, show.median = TRUE, median.col = "white", shape = 21, show.CI=FALSE, method=loess, bw=FALSE, slices=200, palette=colorRampPalette(c("#FFEDA0", "#DD0000"), bias=2)(20), ylim=NULL, quantize = "continuous", add=FALSE, ...) {

IV <- all.vars(formula)[2]

DV <- all.vars(formula)[1]

data <- na.omit(data[order(data[, IV]), c(IV, DV)])

if (bw == TRUE) {

palette <- colorRampPalette(c("#EEEEEE", "#999999", "#333333"), bias=2)(20)

}

print("Computing boostrapped smoothers ...")

newx <- data.frame(seq(min(data[, IV]), max(data[, IV]), length=slices))

colnames(newx) <- IV

l0.boot <- matrix(NA, nrow=nrow(newx), ncol=B)

l0 <- method(formula, data)

for (i in 1:B) {

data2 <- data[sample(nrow(data), replace=TRUE), ]

data2 <- data2[order(data2[, IV]), ]

if (class(l0)=="loess") {

m1 <- method(formula, data2, control = loess.control(surface = "i", statistics="a", trace.hat="a"), ...)

} else {

m1 <- method(formula, data2, ...)

}

l0.boot[, i] <- predict(m1, newdata=newx)

}

# compute median and CI limits of bootstrap

library(plyr)

library(reshape2)

CI.boot <- adply(l0.boot, 1, function(x) quantile(x, prob=c(.025, .5, .975, pnorm(c(-3, -2, -1, 0, 1, 2, 3))), na.rm=TRUE))[, -1]

colnames(CI.boot)[1:10] <- c("LL", "M", "UL", paste0("SD", 1:7))

CI.boot$x <- newx[, 1]

CI.boot$width <- CI.boot$UL - CI.boot$LL

# scale the CI width to the range 0 to 1 and flip it (bigger numbers = narrower CI)

CI.boot$w2 <- (CI.boot$width - min(CI.boot$width))

CI.boot$w3 <- 1-(CI.boot$w2/max(CI.boot$w2))

# convert bootstrapped spaghettis to long format

b2 <- melt(l0.boot)

b2$x <- newx[,1]

colnames(b2) <- c("index", "B", "value", "x")

library(ggplot2)

library(RColorBrewer)

# Construct ggplot

# All plot elements are constructed as a list, so they can be added to an existing ggplot

# if add == FALSE: provide the basic ggplot object

p0 <- ggplot(data, aes_string(x=IV, y=DV)) + theme_bw()

# initialize elements with NULL (if they are defined, they are overwritten with something meaningful)

gg.tiles <- gg.poly <- gg.spag <- gg.median <- gg.CI1 <- gg.CI2 <- gg.lm <- gg.points <- gg.title <- NULL

if (shade == TRUE) {

quantize <- match.arg(quantize, c("continuous", "SD"))

if (quantize == "continuous") {

print("Computing density estimates for each vertical cut ...")

flush.console()

if (is.null(ylim)) {

min_value <- min(min(l0.boot, na.rm=TRUE), min(data[, DV], na.rm=TRUE))

max_value <- max(max(l0.boot, na.rm=TRUE), max(data[, DV], na.rm=TRUE))

ylim <- c(min_value, max_value)

}

# vertical cross-sectional density estimate

d2 <- ddply(b2[, c("x", "value")], .(x), function(df) {

res <- data.frame(density(df$value, na.rm=TRUE, n=slices, from=ylim[1], to=ylim[2])[c("x", "y")])

#res <- data.frame(density(df$value, na.rm=TRUE, n=slices)[c("x", "y")])

colnames(res) <- c("y", "dens")

return(res)

}, .progress="text")

maxdens <- max(d2$dens)

mindens <- min(d2$dens)

d2$dens.scaled <- (d2$dens - mindens)/maxdens

## Tile approach

d2$alpha.factor <- d2$dens.scaled^shade.alpha

gg.tiles <- list(geom_tile(data=d2, aes(x=x, y=y, fill=dens.scaled, alpha=alpha.factor)), scale_fill_gradientn("dens.scaled", colours=palette), scale_alpha_continuous(range=c(0.001, 1)))

}

if (quantize == "SD") {

## Polygon approach

SDs <- melt(CI.boot[, c("x", paste0("SD", 1:7))], id.vars="x")

count <- 0

d3 <- data.frame()

col <- c(1,2,3,3,2,1)

for (i in 1:6) {

seg1 <- SDs[SDs$variable == paste0("SD", i), ]

seg2 <- SDs[SDs$variable == paste0("SD", i+1), ]

seg <- rbind(seg1, seg2[nrow(seg2):1, ])

seg$group <- count

seg$col <- col[i]

count <- count + 1

d3 <- rbind(d3, seg)

}

gg.poly <- list(geom_polygon(data=d3, aes(x=x, y=value, color=NULL, fill=col, group=group)), scale_fill_gradientn("dens.scaled", colours=palette, values=seq(-1, 3, 1)))

}

}

print("Build ggplot figure ...")

flush.console()

if (spag==TRUE) {

gg.spag <- geom_path(data=b2, aes(x=x, y=value, group=B), size=0.7, alpha=10/B, color=spag.color)

}

if (show.median == TRUE) {

if (mweight == TRUE) {

gg.median <- geom_path(data=CI.boot, aes(x=x, y=M, alpha=w3^3), size=.6, linejoin="mitre", color=median.col)

} else {

gg.median <- geom_path(data=CI.boot, aes(x=x, y=M), size = 0.6, linejoin="mitre", color=median.col)

}

}

# Confidence limits

if (show.CI == TRUE) {

gg.CI1 <- geom_path(data=CI.boot, aes(x=x, y=UL), size=1, color="red")

gg.CI2 <- geom_path(data=CI.boot, aes(x=x, y=LL), size=1, color="red")

}

# plain linear regression line

if (show.lm==TRUE) {gg.lm <- geom_smooth(method="lm", color="darkgreen", se=FALSE)}

gg.points <- geom_point(data=data, aes_string(x=IV, y=DV), size=1, shape=shape, fill="white", color="black")

if (title != "") {

gg.title <- theme(title=title)

}

gg.elements <- list(gg.tiles, gg.poly, gg.spag, gg.median, gg.CI1, gg.CI2, gg.lm, gg.points, gg.title, theme(legend.position="none"))

if (add == FALSE) {

return(p0 + gg.elements)

} else {

return(gg.elements)

}

}

# All rights reserved.

#

# FreeBSD License

#

# Redistribution and use in source and binary forms, with or without

# modification, are permitted provided that the following conditions are

# met:

#

# 1. Redistributions of source code must retain the above copyright

# notice, this list of conditions and the following disclaimer.

#

# 2. Redistributions in binary form must reproduce the above copyright

# notice, this list of conditions and the following disclaimer in the

# documentation and/or other materials provided with the distribution.

#

# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDER `AS IS'' AND ANY

# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE

# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR

# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR

# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,

# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,

# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR

# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF

# LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING

# NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS

# SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

#

# The views and conclusions contained in the software and documentation

# are those of the authors and should not be interpreted as representing

# official policies, either expressed or implied, of the copyright

# holder.

# Version history:

# 0.1: original code

# 0.1.1: changed license to FreeBSD; re-established compability to ggplot2 (new version 0.9.2)

## Visually weighted regression / Watercolor plots

## Idea: Solomon Hsiang, with additional ideas from many blog commenters

# B = number bootstrapped smoothers

# shade: plot the shaded confidence region?

# shade.alpha: should the CI shading fade out at the edges? (by reducing alpha; 0 = no alpha decrease, 0.1 = medium alpha decrease, 0.5 = strong alpha decrease)

# spag: plot spaghetti lines?

# spag.color: color of spaghetti lines

# mweight: should the median smoother be visually weighted?

# show.lm: should the linear regresison line be plotted?

# show.CI: should the 95% CI limits be plotted?

# show.median: should the median smoother be plotted?

# median.col: color of the median smoother

# shape: shape of points

# method: the fitting function for the spaghettis; default: loess

# bw = TRUE: define a default b&w-palette

# slices: number of slices in x and y direction for the shaded region. Higher numbers make a smoother plot, but takes longer to draw. I wouldn'T go beyond 500

# palette: provide a custom color palette for the watercolors

# ylim: restrict range of the watercoloring

# quantize: either "continuous", or "SD". In the latter case, we get three color regions for 1, 2, and 3 SD (an idea of John Mashey)

# add: if add == FALSE, a new ggplot is returned. If add == TRUE, only the elements are returned, which can be added to an existing ggplot (with the '+' operator)

# ...: further parameters passed to the fitting function, in the case of loess, for example, "span = .9", or "family = 'symmetric'"

vwReg <- function(formula, data, title="", B=1000, shade=TRUE, shade.alpha=.1, spag=FALSE, spag.color="darkblue", mweight=TRUE, show.lm=FALSE, show.median = TRUE, median.col = "white", shape = 21, show.CI=FALSE, method=loess, bw=FALSE, slices=200, palette=colorRampPalette(c("#FFEDA0", "#DD0000"), bias=2)(20), ylim=NULL, quantize = "continuous", add=FALSE, ...) {

IV <- all.vars(formula)[2]

DV <- all.vars(formula)[1]

data <- na.omit(data[order(data[, IV]), c(IV, DV)])

if (bw == TRUE) {

palette <- colorRampPalette(c("#EEEEEE", "#999999", "#333333"), bias=2)(20)

}

print("Computing boostrapped smoothers ...")

newx <- data.frame(seq(min(data[, IV]), max(data[, IV]), length=slices))

colnames(newx) <- IV

l0.boot <- matrix(NA, nrow=nrow(newx), ncol=B)

l0 <- method(formula, data)

for (i in 1:B) {

data2 <- data[sample(nrow(data), replace=TRUE), ]

data2 <- data2[order(data2[, IV]), ]

if (class(l0)=="loess") {

m1 <- method(formula, data2, control = loess.control(surface = "i", statistics="a", trace.hat="a"), ...)

} else {

m1 <- method(formula, data2, ...)

}

l0.boot[, i] <- predict(m1, newdata=newx)

}

# compute median and CI limits of bootstrap

library(plyr)

library(reshape2)

CI.boot <- adply(l0.boot, 1, function(x) quantile(x, prob=c(.025, .5, .975, pnorm(c(-3, -2, -1, 0, 1, 2, 3))), na.rm=TRUE))[, -1]

colnames(CI.boot)[1:10] <- c("LL", "M", "UL", paste0("SD", 1:7))

CI.boot$x <- newx[, 1]

CI.boot$width <- CI.boot$UL - CI.boot$LL

# scale the CI width to the range 0 to 1 and flip it (bigger numbers = narrower CI)

CI.boot$w2 <- (CI.boot$width - min(CI.boot$width))

CI.boot$w3 <- 1-(CI.boot$w2/max(CI.boot$w2))

# convert bootstrapped spaghettis to long format

b2 <- melt(l0.boot)

b2$x <- newx[,1]

colnames(b2) <- c("index", "B", "value", "x")

library(ggplot2)

library(RColorBrewer)

# Construct ggplot

# All plot elements are constructed as a list, so they can be added to an existing ggplot

# if add == FALSE: provide the basic ggplot object

p0 <- ggplot(data, aes_string(x=IV, y=DV)) + theme_bw()

# initialize elements with NULL (if they are defined, they are overwritten with something meaningful)

gg.tiles <- gg.poly <- gg.spag <- gg.median <- gg.CI1 <- gg.CI2 <- gg.lm <- gg.points <- gg.title <- NULL

if (shade == TRUE) {

quantize <- match.arg(quantize, c("continuous", "SD"))

if (quantize == "continuous") {

print("Computing density estimates for each vertical cut ...")

flush.console()

if (is.null(ylim)) {

min_value <- min(min(l0.boot, na.rm=TRUE), min(data[, DV], na.rm=TRUE))

max_value <- max(max(l0.boot, na.rm=TRUE), max(data[, DV], na.rm=TRUE))

ylim <- c(min_value, max_value)

}

# vertical cross-sectional density estimate

d2 <- ddply(b2[, c("x", "value")], .(x), function(df) {

res <- data.frame(density(df$value, na.rm=TRUE, n=slices, from=ylim[1], to=ylim[2])[c("x", "y")])

#res <- data.frame(density(df$value, na.rm=TRUE, n=slices)[c("x", "y")])

colnames(res) <- c("y", "dens")

return(res)

}, .progress="text")

maxdens <- max(d2$dens)

mindens <- min(d2$dens)

d2$dens.scaled <- (d2$dens - mindens)/maxdens

## Tile approach

d2$alpha.factor <- d2$dens.scaled^shade.alpha

gg.tiles <- list(geom_tile(data=d2, aes(x=x, y=y, fill=dens.scaled, alpha=alpha.factor)), scale_fill_gradientn("dens.scaled", colours=palette), scale_alpha_continuous(range=c(0.001, 1)))

}

if (quantize == "SD") {

## Polygon approach

SDs <- melt(CI.boot[, c("x", paste0("SD", 1:7))], id.vars="x")

count <- 0

d3 <- data.frame()

col <- c(1,2,3,3,2,1)

for (i in 1:6) {

seg1 <- SDs[SDs$variable == paste0("SD", i), ]

seg2 <- SDs[SDs$variable == paste0("SD", i+1), ]

seg <- rbind(seg1, seg2[nrow(seg2):1, ])

seg$group <- count

seg$col <- col[i]

count <- count + 1

d3 <- rbind(d3, seg)

}

gg.poly <- list(geom_polygon(data=d3, aes(x=x, y=value, color=NULL, fill=col, group=group)), scale_fill_gradientn("dens.scaled", colours=palette, values=seq(-1, 3, 1)))

}

}

print("Build ggplot figure ...")

flush.console()

if (spag==TRUE) {

gg.spag <- geom_path(data=b2, aes(x=x, y=value, group=B), size=0.7, alpha=10/B, color=spag.color)

}

if (show.median == TRUE) {

if (mweight == TRUE) {

gg.median <- geom_path(data=CI.boot, aes(x=x, y=M, alpha=w3^3), size=.6, linejoin="mitre", color=median.col)

} else {

gg.median <- geom_path(data=CI.boot, aes(x=x, y=M), size = 0.6, linejoin="mitre", color=median.col)

}

}

# Confidence limits

if (show.CI == TRUE) {

gg.CI1 <- geom_path(data=CI.boot, aes(x=x, y=UL), size=1, color="red")

gg.CI2 <- geom_path(data=CI.boot, aes(x=x, y=LL), size=1, color="red")

}

# plain linear regression line

if (show.lm==TRUE) {gg.lm <- geom_smooth(method="lm", color="darkgreen", se=FALSE)}

gg.points <- geom_point(data=data, aes_string(x=IV, y=DV), size=1, shape=shape, fill="white", color="black")

if (title != "") {

gg.title <- theme(title=title)

}

gg.elements <- list(gg.tiles, gg.poly, gg.spag, gg.median, gg.CI1, gg.CI2, gg.lm, gg.points, gg.title, theme(legend.position="none"))

if (add == FALSE) {

return(p0 + gg.elements)

} else {

return(gg.elements)

}

}

| Trackback

September 5, 2012

Dear valued customer,

it is a well-known scientific truth that research results which are accompanied by a fancy, colorful fMRI scan, are perceived as more believable and more persuasive than simple bar graphs or text results (McCabe & Castel, 2007; Weisberg, Keil, Goodstein, Rawson, & Gray, 2008). Readers even agree more with fictitious and unsubstantiated claims, as long as you provide a colorful brain image, and it works even when the subject is a dead salmon.

What are the consequence of these troubling findings? The answer is clear. Everybody should be equipped with these powerful tools of research communication! We at *IRET* made it to our mission to provide the latest, cutting-edge tools for your research analysis. In this case we adopted a new technology called “visually weighted regression” or “watercolor plots” (see here, here, or here), and simply applied a new color scheme.

But now, let’s get some hands on it!

Imagine you invested a lot of effort in collecting the data of 41 participants. Now you find following pattern in 2 of your 87 variables:

You could show that plain scatterplot. But should you do it? Nay. Of course everybody would spot the outliers on the top right. But which is much more important: it is *b-o-r-i-n-g*!

What is the alternative? Reporting the correlation as text? “We found a correlation of *r* = .38 (*p* = .014)”. Yawn.

Or maybe: “We chose to use a correlation technique that is robust against outliers and violations of normality, the Spearman rank coefficient. It turned out that the correlation broke down and was not significant any more (*r* = .06, *p* = .708).”.

Don’t be silly! With that style of scientific reporting, there would be nothing to write home about. But you can be sure: we have the right tools for you. Finally, the power of pictures is not limited to brain research – now you can turn any data into a magical fMRI plot like that:

Isn’t that beautiful? We recommend to accompany the figure with an elaborated description: “For local fitting, we used spline smoothers from 10`000 bootstrap replications. For a robust estimation of vertical confidence densities, a re-descending M-estimator with Tukey’s biweight function was employed. As one can clearly see in the plot, there is significant confidence in the prediction of the x=0, y=0 region, as well as a minor hot spot in the x=15, y=60 region (also known as the *supra-dextral data region*).”

With the Magical Data Enhancer Tool (MDET) you can …

- … turn boring, marginally significant, or just crappy results into a stunning research experience
- … publish in scientific journal with higher impact factors
- … receive the media coverage that you and your research deserve
- … achieve higher acceptance rates from funding agencies
- … impress young women at the bar (you wouldn’t show a plain scatterplot, dude?!)

Q: But – isn’t that approach unethical?

A: No, it’s not at all. In contrast, we at IRES think that it is unethical that only some researchers are allowed to exploit the cognitive biases of their readers. We design our products with a great respect for humanity and we believe that every researcher who can afford our products should have the same powerful tools at hand.

Q: How much does you product cost?

A: The standard version of the *Magical Data Enhancer* ships for 12’998 $. We are aware that this is a significant investment. But, come on: You deserve it! Furthermore, we will soon publish a free trial version, including the full R code on this blog. So stay tuned!

Best regards,

Lexis “Lex” Brycenet (CEO & CTO Research Communication)

International Research Enhancement Technology (IRET)

August 30, 2012

**[Update 1: Sep 5, 2012: Explore the Magical Data Enhancer by IRES, using this visualization technique]**

**[Update 2: Sep 6, 2012: See new improved plots, and new R code!**

Solomon Hsiang proposed an appealing method for visually displaying the uncertainty in regressions (see his blog [1][2], and also the discussions on the Statistical Modeling, Causal Inference, and Social Science Blog [1][2]).

I implemented the method in R (using ggplot2), and used an additional method of determining the shading (especially concerning Andrew Gelman’s comment that traditional statistical summaries (such as 95% intervals) give too much weight to the edges. In the following I will show how to produce plots like that:

I used following procedure:

- Compute smoothers from 1000 bootstrap samples of the original sample (this results in a spaghetti plot)
- Calculate a density estimate for each vertical cut through the bootstrapped smoothers. The area under the density curve always is 1, so the ink is constant for each y-slice.
- Shade the figure according to these density estimates.

The basic scatter plot:

No we show the bootstrapped smoothers (a “spaghetti plot”). Each spaghetti has a low alpha. That means that overlapping spaghettis produce a darker color and already give weight to highly populated regions.

Here is the shading according to the smoother’s density:

Now, we can overplot the median smoother estimate for each x value (the “median smoother”):

Or, a visually weighted smoother:

Finally, we can add the plain linear regression line (which obviously does not refelct the data points very well):

At the end of this post is the function that produces all of these plots. The function returns a ggplot object, so you can modify it afterwards, e.g.:

`vwReg(y~x, df, shade=FALSE, spag=TRUE) + xlab("Implicit power motive") + ylab("Corrugator activity during preparation")[/cc]`

## Here are two plots with actual data I am working on:

The correlation of both variables is .22 (*p* = .003).

A) As a heat map (note: the vertical breaks at the left and right end occur due to single data points that get either sampled or not during the bootstrap):

B) As a spaghetti plot:

Finally, here's the code (sometimes the code box is collapsed - click the arrow on the top right of the box to open it). Comments and additions are welcome.

Send this to a friend