[Update 1: Sep 5, 2012: Explore the Magical Data Enhancer by IRES, using this visualization technique]
[Update 2: Sep 6, 2012: See new improved plots, and new R code!
Solomon Hsiang proposed an appealing method for visually displaying the uncertainty in regressions (see his blog [1][2], and also the discussions on the Statistical Modeling, Causal Inference, and Social Science Blog [1][2]).
I implemented the method in R (using ggplot2), and used an additional method of determining the shading (especially concerning Andrew Gelman’s comment that traditional statistical summaries (such as 95% intervals) give too much weight to the edges. In the following I will show how to produce plots like that:
I used following procedure:
- Compute smoothers from 1000 bootstrap samples of the original sample (this results in a spaghetti plot)
- Calculate a density estimate for each vertical cut through the bootstrapped smoothers. The area under the density curve always is 1, so the ink is constant for each y-slice.
- Shade the figure according to these density estimates.
Now let’s construct some plots!
The basic scatter plot:
No we show the bootstrapped smoothers (a “spaghetti plot”). Each spaghetti has a low alpha. That means that overlapping spaghettis produce a darker color and already give weight to highly populated regions.
Here is the shading according to the smoother’s density:
Now, we can overplot the median smoother estimate for each x value (the “median smoother”):
Or, a visually weighted smoother:
Finally, we can add the plain linear regression line (which obviously does not refelct the data points very well):
At the end of this post is the function that produces all of these plots. The function returns a ggplot object, so you can modify it afterwards, e.g.:
vwReg(y~x, df, shade=FALSE, spag=TRUE) + xlab("Implicit power motive") + ylab("Corrugator activity during preparation")[/cc]
Here are two plots with actual data I am working on:
The correlation of both variables is .22 (p = .003).
A) As a heat map (note: the vertical breaks at the left and right end occur due to single data points that get either sampled or not during the bootstrap):
B) As a spaghetti plot:
Finally, here's the code (sometimes the code box is collapsed - click the arrow on the top right of the box to open it). Comments and additions are welcome.
[cc lang="rsplus" escaped="true"][Update: I removed the code, as an updated version has been published here (see at the end of the post)][/cc]
Great work! However, the actual code isn’t visible: just some comments and the name of the function.
There seems to be some bug in the syntax highlighter – sometimes it’s there, sometimes not. Now the code is visible – I hope it stays like that!
Simply wonderful ideas and implementation. I am most amazed with how brilliant people like you is willing to share its wisdom!
Thank you!
in order to add a second predictors it would require to have a tri-dimensional plot, correct?
Thanks for your friendly comment – this blog post stands on the shoulders of Solomon’s idea and many comments on Andrew Gelman’s blog. I think that’s the power of open source!
Concerning two predictors: yes, that would require a surface plane, or some sort moderation plot where two 2-dim plots are plotted at some selected values of one of the predictors (e.g. +/- 1 SD).
F*
I know that it is Salomon’s idea, but you brilliantly implemented in R and thanks to you I visited Salomon’s website and he is giving you credit for the implementation. It is amazing to see how academic networks emerge. Thank you, again FelixS
the code was available earlier today, now not 🙁
einar
Thanks for this, looks much more intuitive than simply plotting error bounds. A random suggestion which might make sense when the focus is more the visuals is to always include the first and last points in the bootstrap sampling so that all lines start and end at the same place. I know this would somewhat bias the estimates. Thoughts?
Interesting idea. But I think it would indeed introduce too much bias, particularly in data sets with few data points, as all smoothers would go through that one point …
The code’s visible for me.
Totally awesome! One of those functions you’re just dying to try out [roots around in cupboard to find some regression lines to try it out on]
Hi, the code has an error, the program can not find the function “melt”.
One line was missing: “library(reshape2)”. It’s included now (I always have that package in my environment, so I forgot to put the library call into the function). Should work now!
Hi,
I really like those figures! I tried to run the sample code in R and it seems to execute fine, but ggplot2 doesn’t seem to generate any actual figures – am I doing something wrong?
hard to say – it works for me. Do you start R from the console? Maybe you have to open a graphic device before you execute the function?
Hi, where to get the source code to reproduce the plots?
Oh, no I managed to get it. Somehow hidden at first.
What would be the best way of doing this in black and white (including grayscale) for publication plots?
See here! Cheers, F
Thanks for this very useful code! This is a lovely implementation, and very effective and displaying uncertainty.
I did have one suggestion.
Because the majority of the computational effort is in bootstrapping the smoothers, it would be ideal if this did not have to be done for each attempt to draw the graph. It would be nice if one could capture these estimates as one step, and display them as a second step. This would encourage tinkinering with the image, and testing various formats. The central idea here being that one might well want to play with the display without having to re-estimate the smoothers.
Many thanks,
Hi,
Thanks for this “reproducible research”, everything works well, apart from the spaghetti plot. All lines are darkblue, without any weight on the color.
I wonder if someone also have this issue
I have a Warning message:
“Removed 20842 rows containing missing values (geom_path).”
Thomas
Hi Thomas, sorry for my late reply. The spaghettis have a higher alpha transparency, and are not weighted by color. Overlapping lines produce a darker shading. The error message can be ignored (it’s due to the bootstrap resampling). HTH Felix
Hi Felix,
first of all. THX for sharing that. Its very very cool. If you dont mind to answer me a short question regarding labeling. Is there a way to skip y and x labels as well as the vertical and horizontal lines on the main plot area. I just tried the usual suspects but that doesnt work out:
vwReg(prestige~income,Prestige,family=”symmetric”,palette=brewer.pal(9,”YlOrRd”),xlab=” ,ylab=”,main=”)
vwReg(prestige ~ income, Prestige,family=”symmetric”,palette=brewer.pal(9,”YlOrRd”),xlab=NULL,ylab=NULL,main=NULL)
The function returns a ggplot-object. Hence, you can modify it using the ggplot syntax, such as:
p1 <- vwReg(y~x, df) p1 + xlab("") + ylab("YLAB")
Hi Felix,
Really nice work. I have a family of spaghetti lines (that is, I don’t have a regular sample of points but instead their, so to say, estimated bootstraped representaion) and want to create a shaded plot ‘à la watercolor’ from them. Any advice on how to proceed?
Hello Felix,
Great visualization. It runs well with my data, but I have a quick question about using this function to regress logistic growth. Instead of using loess, I’m attempting to use method=glm but receive the error: Error in method(formula, data) : ‘family’ not recognized.
Is it possible to indicate that the regression should take a logistic form instead of loess or lm?