Visually weighted/ Watercolor Plots, new variants: Please vote!

Update Oct-23: Added a new parameter add to the function. Now multiple groups can be plotted in a single plot (see example in my comment)
As a follow-up on my R implementation of Solomon’s watercolor plots, I made some improvements to the function. I fine-tuned the graphical parameters (the median smoother line now diminishes faster with increasing CIs, and the shaded watercolors look more pretty). Furthermore, the function is faster and has more features:

  • You can define any standard regression function for the bootstrap procedure.
    • vwReg(y ~ x, df, method=lm)
    • vwReg(y ~ x + I(x^2), df, method=lm)
  • Provide parameters for the fitting function.
    • You can make the smoother’s span larger. Then it takes more points into account when doing the local fitting. Per default, the smoother fits a polynomial of degree two – that means as you increase span you will approach the overall quadratic fit: vwReg(y ~ x, df, span=2)
    • You can also make the smoother’s span smaller, then it takes less points for local fitting. If it is too small, it will overfit and approach each single data point. The default span (.75) seemed to be the best choice for me for a variety of data sets: vwReg(y ~ x, df, span=0.5)
    • Use a robust M-estimator for the smoother; see ?loess for details: vwReg(y ~ x, df, family=”symmetric”)
  • Provide your own color scheme (or, for example, a black-and-white scheme). Examples see pictures below.
  • Quantize the color ramp, so that regions for 1, 2, and 3 SD have the same color (an idea proposed by John Mashey).

At the end of this post is the source code for the R function.

Some picture – please vote!

Here are some variants of the watercolor plots – at the end, you can vote for your favorite (or write something into the comments). I am still fine-tuning the default parameters, and I am interested in your opinions what would be the best default.
Plot 1: The current default

 
Plot 2: Using an M-estimator for bootstrap smoothers. Usually you get wider confidence intervalls.

 
Plot 3:Increasing the span of the smoothers

 
Plot 4: Decreasing the span of the smoothers

Plot 5: Changing the color scheme, using a predefined ColorBrewer palette. You can see all available palettes by using this command: library(RColorBrewer); display.brewer.all()

 
Plot 6: Using a custom-made palette

 
Plot 7: Using a custom-made palette; with the parameter bias you can shift the color ramp to the “higher” colors:

 
Plot 8: A black and white version of the plot

 
Plot 9: The anti-Tufte-plot: Using as much ink as possible by reversing black and white (a.k.a. “the Milky-Way Plot“)

Plot 10: The Northern Light Plot/ fMRI plot. This plotting technique already has been used by a suspicious company (called IRET – never heard of that). I hurried to publish the R code under a FreeBSD license before they can patent it! Feel free to use, share, or change the code for whatever purpose you need. Isn’t that beautiful?

Plot 11: The 1-2-3-SD plot. You can use your own color schemes as well, e.g.: vwReg(y~x, df, bw=TRUE, quantize=”SD”)


 
Any comments or ideas? Or just a vote? If you produce some nice plots with your data, you can send it to me, and I will post a gallery of the most impressive “data art”!
Cheers,
Felix
[poll id=”1″]
[cc lang=”rsplus” escaped=”true”]
# Copyright 2012 Felix Schönbrodt
# All rights reserved.
#
# FreeBSD License
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are
# met:
#
# 1. Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDER `AS IS” AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
# LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
# NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
# SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
# The views and conclusions contained in the software and documentation
# are those of the authors and should not be interpreted as representing
# official policies, either expressed or implied, of the copyright
# holder.
# Version history:
# 0.1: original code
# 0.1.1: changed license to FreeBSD; re-established compability to ggplot2 (new version 0.9.2)
## Visually weighted regression / Watercolor plots
## Idea: Solomon Hsiang, with additional ideas from many blog commenters
# B = number bootstrapped smoothers
# shade: plot the shaded confidence region?
# shade.alpha: should the CI shading fade out at the edges? (by reducing alpha; 0 = no alpha decrease, 0.1 = medium alpha decrease, 0.5 = strong alpha decrease)
# spag: plot spaghetti lines?
# spag.color: color of spaghetti lines
# mweight: should the median smoother be visually weighted?
# show.lm: should the linear regresison line be plotted?
# show.CI: should the 95% CI limits be plotted?
# show.median: should the median smoother be plotted?
# median.col: color of the median smoother
# shape: shape of points
# method: the fitting function for the spaghettis; default: loess
# bw = TRUE: define a default b&w-palette
# slices: number of slices in x and y direction for the shaded region. Higher numbers make a smoother plot, but takes longer to draw. I wouldn’T go beyond 500
# palette: provide a custom color palette for the watercolors
# ylim: restrict range of the watercoloring
# quantize: either “continuous”, or “SD”. In the latter case, we get three color regions for 1, 2, and 3 SD (an idea of John Mashey)
# add: if add == FALSE, a new ggplot is returned. If add == TRUE, only the elements are returned, which can be added to an existing ggplot (with the ‘+’ operator)
# …: further parameters passed to the fitting function, in the case of loess, for example, “span = .9”, or “family = ‘symmetric'”
vwReg <- function(formula, data, title="", B=1000, shade=TRUE, shade.alpha=.1, spag=FALSE, spag.color="darkblue", mweight=TRUE, show.lm=FALSE, show.median = TRUE, median.col = "white", shape = 21, show.CI=FALSE, method=loess, bw=FALSE, slices=200, palette=colorRampPalette(c("#FFEDA0", "#DD0000"), bias=2)(20), ylim=NULL, quantize = "continuous", add=FALSE, ...) { IV <- all.vars(formula)[2] DV <- all.vars(formula)[1] data <- na.omit(data[order(data[, IV]), c(IV, DV)]) if (bw == TRUE) { palette <- colorRampPalette(c("#EEEEEE", "#999999", "#333333"), bias=2)(20) } print("Computing boostrapped smoothers ...") newx <- data.frame(seq(min(data[, IV]), max(data[, IV]), length=slices)) colnames(newx) <- IV l0.boot <- matrix(NA, nrow=nrow(newx), ncol=B) l0 <- method(formula, data) for (i in 1:B) { data2 <- data[sample(nrow(data), replace=TRUE), ] data2 <- data2[order(data2[, IV]), ] if (class(l0)=="loess") { m1 <- method(formula, data2, control = loess.control(surface = "i", statistics="a", trace.hat="a"), ...) } else { m1 <- method(formula, data2, ...) } l0.boot[, i] <- predict(m1, newdata=newx) } # compute median and CI limits of bootstrap library(plyr) library(reshape2) CI.boot <- adply(l0.boot, 1, function(x) quantile(x, prob=c(.025, .5, .975, pnorm(c(-3, -2, -1, 0, 1, 2, 3))), na.rm=TRUE))[, -1] colnames(CI.boot)[1:10] <- c("LL", "M", "UL", paste0("SD", 1:7)) CI.boot$x <- newx[, 1] CI.boot$width <- CI.boot$UL - CI.boot$LL # scale the CI width to the range 0 to 1 and flip it (bigger numbers = narrower CI) CI.boot$w2 <- (CI.boot$width - min(CI.boot$width)) CI.boot$w3 <- 1-(CI.boot$w2/max(CI.boot$w2)) # convert bootstrapped spaghettis to long format b2 <- melt(l0.boot) b2$x <- newx[,1] colnames(b2) <- c("index", "B", "value", "x") library(ggplot2) library(RColorBrewer) # Construct ggplot # All plot elements are constructed as a list, so they can be added to an existing ggplot # if add == FALSE: provide the basic ggplot object p0 <- ggplot(data, aes_string(x=IV, y=DV)) + theme_bw() # initialize elements with NULL (if they are defined, they are overwritten with something meaningful) gg.tiles <- gg.poly <- gg.spag <- gg.median <- gg.CI1 <- gg.CI2 <- gg.lm <- gg.points <- gg.title <- NULL if (shade == TRUE) { quantize <- match.arg(quantize, c("continuous", "SD")) if (quantize == "continuous") { print("Computing density estimates for each vertical cut ...") flush.console() if (is.null(ylim)) { min_value <- min(min(l0.boot, na.rm=TRUE), min(data[, DV], na.rm=TRUE)) max_value <- max(max(l0.boot, na.rm=TRUE), max(data[, DV], na.rm=TRUE)) ylim <- c(min_value, max_value) } # vertical cross-sectional density estimate d2 <- ddply(b2[, c("x", "value")], .(x), function(df) { res <- data.frame(density(df$value, na.rm=TRUE, n=slices, from=ylim[1], to=ylim[2])[c("x", "y")]) #res <- data.frame(density(df$value, na.rm=TRUE, n=slices)[c("x", "y")]) colnames(res) <- c("y", "dens") return(res) }, .progress="text") maxdens <- max(d2$dens) mindens <- min(d2$dens) d2$dens.scaled <- (d2$dens - mindens)/maxdens ## Tile approach d2$alpha.factor <- d2$dens.scaled^shade.alpha gg.tiles <- list(geom_tile(data=d2, aes(x=x, y=y, fill=dens.scaled, alpha=alpha.factor)), scale_fill_gradientn("dens.scaled", colours=palette), scale_alpha_continuous(range=c(0.001, 1))) } if (quantize == "SD") { ## Polygon approach SDs <- melt(CI.boot[, c("x", paste0("SD", 1:7))], id.vars="x") count <- 0 d3 <- data.frame() col <- c(1,2,3,3,2,1) for (i in 1:6) { seg1 <- SDs[SDs$variable == paste0("SD", i), ] seg2 <- SDs[SDs$variable == paste0("SD", i+1), ] seg <- rbind(seg1, seg2[nrow(seg2):1, ]) seg$group <- count seg$col <- col[i] count <- count + 1 d3 <- rbind(d3, seg) } gg.poly <- list(geom_polygon(data=d3, aes(x=x, y=value, color=NULL, fill=col, group=group)), scale_fill_gradientn("dens.scaled", colours=palette, values=seq(-1, 3, 1))) } } print("Build ggplot figure ...") flush.console() if (spag==TRUE) { gg.spag <- geom_path(data=b2, aes(x=x, y=value, group=B), size=0.7, alpha=10/B, color=spag.color) } if (show.median == TRUE) { if (mweight == TRUE) { gg.median <- geom_path(data=CI.boot, aes(x=x, y=M, alpha=w3^3), size=.6, linejoin="mitre", color=median.col) } else { gg.median <- geom_path(data=CI.boot, aes(x=x, y=M), size = 0.6, linejoin="mitre", color=median.col) } } # Confidence limits if (show.CI == TRUE) { gg.CI1 <- geom_path(data=CI.boot, aes(x=x, y=UL), size=1, color="red") gg.CI2 <- geom_path(data=CI.boot, aes(x=x, y=LL), size=1, color="red") } # plain linear regression line if (show.lm==TRUE) {gg.lm <- geom_smooth(method="lm", color="darkgreen", se=FALSE)} gg.points <- geom_point(data=data, aes_string(x=IV, y=DV), size=1, shape=shape, fill="white", color="black") if (title != "") { gg.title <- theme(title=title) } gg.elements <- list(gg.tiles, gg.poly, gg.spag, gg.median, gg.CI1, gg.CI2, gg.lm, gg.points, gg.title, theme(legend.position="none")) if (add == FALSE) { return(p0 + gg.elements) } else { return(gg.elements) } }[/cc]

43 thoughts on “Visually weighted/ Watercolor Plots, new variants: Please vote!”

  1. Pingback: Start
  2. Felix, these are great. Thanks for the terrific implementation.
    Going back to original idea of Visually-Weighted Regression, i.e. to focus the viewer on important parts of the graph, I would comment that some of these (very creative) schemes might distract the viewer from the central portion of the regression. Eg. When I look at Plots 6, 7, 10 and 11, I feel like my eyes are drawn toward the edges of the graph where there’s a lot of bright yellow/orange. That’s why we originally were fading some of the plots as the sample got thinner. I think that your plots which conserve ink (1, 8 and 9) do this quite nicely. But when the color for zero-mass is something other than the background color, I think it creates visual weight in places of the graph where there is not much information.

    1. Point very well taken. Although I love the colors, they are a little bit distracting me from the original idea. SUPERB job dear Felix, simply SUPERB!!!

    2. Solomon, I agree 100% with your remarks. The more colorful plots were more meant as a show-off for different color schemes. It’s fun to see how the overall impression changes with new colors!

  3. Not only are the plots awesome, but I love the “anti Tufte” name. Use as much ink as possible! Write in white space on bits of paper covered in black! Use all the ink!

  4. When I tried to run the code, I got following error:
    Error in get(x, envir = this, inherits = inh)(this, …) :
    unused argument(s) (range = c(0.001, 1))

    1. Hi, can you send me a data set to replicate that error (by email)? Then I can look for the error.

      1. I had the same problem.
        I think it’s from an older version of ggplot2. I changed “scale_alpha_continuous(range=c(0.001, 1))” to “scale_alpha_continuous(to=c(0.001, 1))” and it worked fine. I then updated everything and the original version worked.

        1. Graeme, you’re right, that was the problem. With the current code, you even need the most recent version of ggplot2 (0.9.2), which has been released a few days ago.

  5. When I do these graphs I get the following warning:
    Warning message:
    ‘opts’ is deprecated.
    Use ‘theme’ instead.
    See help(“Deprecated”)
    The graphs still are produced though and look good. Is that something to worry about or can I assume the plots are ok anyways?

    1. The plots are OK – you have a somewhat outdated version of the ggplot2 package. When you update to 0.9.2, the warnings disappear.

  6. Hi,
    first of all, great work man 😉 but I need to draw multiple regressions with different colors within one plot. Maybe this is possible by adding another function parameter, like newplot=TRUE/FALSE. Is this somehow possible, because I am not familiar with ggplot2?
    Regards …

    1. Hi Andre,
      sorry for my late reply. I was thinking quite a while on how to do this in a clever way, and I think I found a way.
      Now there’s a new parameter add. You can construct a first plot (with add = FALSE), construct a second plot (with add = TRUE), and then you can merge both with the ‘+’ operator. Here’s an example:
      # build a demo data set
      set.seed(1)
      x <- rnorm(200, 0.8, 1.2) e <- rnorm(200, 0, 3)*(abs(x)^1.5 + .5) + rnorm(200, 0, 4) e2 <- rnorm(200, 0, 5)*(abs(x)^1.5 + .8) + rnorm(200, 0, 5) y <- 8*x - x^3 + e y2 <- 20 + 3*x + 0.6*x^3 + e2 df <- data.frame(x, y, y2) p1 <- vwReg(y~x, df, spag=TRUE, shade=FALSE) p2 <- vwReg(y2~x, df, add=TRUE, spag=TRUE, shade=FALSE, spag.color="red", shape=3) p3 <- p1 + p2 p3

      vwReg: Two groups
      You can control the appearance of each group with the parameters of the function, and add as many plots as you like. You can also provide different data frames for each plot.
      It works well with spaghetti plots; if you do shaded plots, at the moment all groups have the same palette. I still have to find a solution for that.
      Cheers,
      Felix

      1. Hi Felix,
        and thank you very much, for the upgrade of your function 😉 I will try to use this visualization in my next publication …
        Regards,
        André

  7. Thanks for a great visualization function.
    When “show.CI=TRUE” is specified as an argument I get the message:
    Error in eval(expr, envir, enclos) : object ‘B’ not found
    It is generated by the “return(gg.elements)” at the end.
    It looks like the call to aes_string might be the problem.

    1. Fantastic visualization. Aside from the same error as Derek Jones when calling ‘CI=TRUE’, calling this line:
      p3 <- p1 + p2
      Also throws up the error:
      Error in p + o : non-numeric argument to binary operator
      In addition: Warning message:
      Incompatible methods ("+.gg", "Ops.data.frame") for "+"
      Anyone got any thoughts as to why?
      Thanks,
      Dave

      1. The show.CI bug should be removed now!
        Concerning the error message about “non-numeric argument”: which version of ggplot do you use? It should be >= 0.9.2. But I think that either p1 or p2 are not a ggplot-object (maybe one of them was overwritten with something else in your script)?

        1. Many thanks- CI bug is removed and the plots work perfectly. As for the other error, I am using 0.9.2.1, but I think your hunch was right as I re-ran the code in a new session and it worked fine. Apologies for not checking that.
          Thanks again for your sharing this superb method.
          Dave

        2. One other thought: it would be nice if this method worked with faceting too, is this something that would be easy to implement?
          Dave

  8. Dear Felix,
    Fantastic work- These look spectacular. Do you know of, or do you have any plans to, apply these visualization techniques to Kaplan-Meier survival curves? (e.g. survfit objects using the package survival). The package denstrip aims to do so but seems to have trouble with the changing (over time) and assymmetric nature of confidence intervals in survival curves. I think it would be fantastic to visualize survival curves in this manner also. Kindest regards,
    Rogier

  9. Hi Felix,
    Thank you for your great work and sharing it on the blog.
    The value of visually appealing, and informative, graphics for communicating information in publications and presentations is described well in your previous post on the watercolor plots. The real value of these appealing graphs for me is in motivating my psychology statistics students to get exited about using R.
    Once again thank you for your contribution to the community.
    Kind Regards,
    Conrad Zygmont

  10. Hi Felix,
    Great work, man!
    I have tried to use your function with method=lm (to get a straight line), but this message appeared:
    [1] “Computing boostrapped smoothers …”
    Error in vwReg(Y ~ X, data = mydata, method=lm, quantize = “SD”) :
    could not find function “method”
    Could you help me with this?
    Cheers,
    Daniele

      1. Thanks, for the quick response!
        I have tried this too. But the message is the same:
        > vwReg(y~x, data=mydata,method=”lm”,quantize=”SD”)
        [1] “Computing boostrapped smoothers …”
        Error in vwReg(BGR ~ WGR, data = football, method = “lm”, quantize = “SD”) :
        could not find function “method”
        Cheers,D.

        1. Interesting – it works for me (without the quotes – it’s method=lm). Did you try it with an empty environment and the current version of the function?

  11. Im interested in visualizing uncertainty for time series data. I can easily simulate hundreds of points along the x-axis. My experience from using the vwReg function is that (a) it’s very slow (bootstrapping) and (b) the density along the y-axis looks a lot different from what I expected when I look at box plots of the same data.
    I did not take a deep dive into the code, but have you considered simply custering the points along the y-axsis and use the cluster centers plus weights instead of smoothing?

  12. Beautyful job!
    I had a problem when trying to subset data for overplotting. When i use:
    df=data.frame(x=rnorm(50,15), y=rnorm(50, 15), s=rep(c(“a”,”b”),50))
    vwReg(x[s==”a”]~y[s==”b”], data=df, method=lm,
    slices=400, palette=colorRampPalette(c(“orange”, “green”, “yellow”),
    bias=5)(20), ylim=NULL, quantize = “SD”, add=F)
    [1] “Computing boostrapped smoothers …”
    Show Traceback
    Rerun with Debug
    Error in Summary.factor(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, :
    min not meaningful for factors

    1. Subsetting within the formula is not possible. You’d have to subset your data frame before:
      df2 <- data.frame(x = df$x[df$s=="a"], y = df$y[df$s=="b"]) vwReg(x~y, data=df, method=lm, slices=400, palette=colorRampPalette(c("orange", "green", "yellow"), bias=5)(20), quantize = "SD")

  13. Thanks for creating such a great visualization!
    I have a problem of visualizing my data with a requirement of hiding all the data points. Because sample size is very big and the data points are broadly located, the watercolor regression line looks very small in the figure and it is difficult to see the details clearly. May I ask how can I hide the date points and only show (zoom in) the regression line?
    Thanks in advance!

    1. It’s not a feature of the function, yet. But you can simply try to reomve the “+ geom_point(…)” part in the function!

Comments are closed.