Statistics

# Installation of WRS package (Wilcox’ Robust Statistics)

Some users had trouble installing the WRS package from R-Forge. Here’s a method that should work reliably:

```r
# first: install dependent packages
install.packages(c("MASS", "akima", "robustbase"))

# second: install suggested packages
install.packages(c("cobs", "robust", "mgcv", "scatterplot3d", "quantreg",
                   "rrcov", "lars", "pwr", "trimcluster", "parallel",
                   "mc2d", "psych", "Rfit"))

# third: install WRS from R-Forge
install.packages("WRS", repos="http://R-Forge.R-project.org", type="source")
```

WRS cannot be hosted on CRAN, as CRAN requires help files for every user-visible function, and these have not been written for WRS (yet). For the time being, this somewhat more complicated installation routine has to be used.

# The first CREDAM Award for creative data management goes to … the German government!

“If you torture the data long enough, it will confess.”
This aphorism, attributed to Ronald Coase, has sometimes been used in a disrespectful manner, as if it were wrong to do creative data analysis. This view is obviously misleading. In contrast, we at IRET have a much more positive and humanistic view of data management, and we have therefore made this aphorism our guiding principle in difficult times.

We at IRET have made it our mission to proliferate and foster creative ways of data analysis. Therefore, we proudly introduce an award in recognition of outstanding data creativity: the CREDAM Award. CREDAM is both an acronym (CREative DAta Management) and a statement: credam (Latin) means “I will believe”, or “I will trust”.

## This year’s CREDAM Award goes to …….. the German government!

A new report on poverty in Germany is going to be published soon. What does the data say?

| Year | Overall property in possession of rich households | Overall property in possession of the complete lower half |
|------|------|------|
| 1998 | 45% | 3% |
| 2008 | 53% | 1% |

Seems like a pretty clear picture, and in a previous version of the report the authors concluded (based on this and other data) that “income disparity increased” (see Süddeutsche Zeitung). But that is wrong! Why is it wrong? Well, that interpretation “does not reflect the opinion of the German government”.

Under pressure from the leader of the junior coalition partner, Philipp Rösler (whose party currently polls at 4% among German voters), this conclusion was re-interpreted. Now the report comes to the completely opposite conclusion: “income disparity decreases”!

### As this is a great example of creative data analysis, which liberates us from restrictive and anally retentive “scientific” procedures, we are happy to award the first CREDAM trophy to the German government, especially Philipp Rösler. Congratulations!

(Maybe we should think about adopting this strategy for scientific reports as well. Given highly flexible approaches to data analysis, conclusions should rather be based on a majority vote of all (co-)authors and reviewers, not on empirical evidence.)

# Improved evolution of correlations

As an update to this post: here’s an improved version of “The evolution of correlations”.

From the original post:

This is the evolution of a bivariate correlation between two questionnaire scales, “hope of power” and “fear of losing control”. Both scales were administered in an open online study. The video shows how the correlation evolves from r = .69*** (n=20) to r = .26*** (n=271). It does not stabilize until n = 150.

The data have not been rearranged – this is the random order in which participants dropped into the study. This was a rather extreme case of an unstable correlation – other scales in this study were stable right from the beginning. Maybe this video can serve as an anecdotal caveat for a careful interpretation of correlations with small n’s (and by ‘small’ I mean n < 100) …

The right panel now displays the correlation at each step. The horizontal green line is the final correlation that is approached; the curved dotted line shows the marginal correlation that would just reach significance at that sample size. As the empirical curve is always above this dotted line, the correlation is significantly different from zero at each step.
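The critical correlation for a given sample size is not shown in the post’s code, but it can be computed from the t distribution – a small sketch (assuming a two-tailed test at α = .05):

```r
# smallest |r| that reaches two-tailed significance at level alpha for sample size n
r.crit <- function(n, alpha = .05) {
  t.crit <- qt(1 - alpha/2, df = n - 2)
  t.crit / sqrt(n - 2 + t.crit^2)
}
r.crit(20)   # approx. .44
r.crit(271)  # approx. .12
```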

Here is the code that created the movie. It is not fully self-contained – the function plotReg plots the dual-panel display, and dat0, A, and B are parameters passed to this function; you can insert any other plotting function here. The function loops through the rows of a data frame and saves a plot for every step into a subfolder. Finally, it needs the command line version of ffmpeg, which glues the pictures together into a movie.

```r
makeMovie <- function(fname, dat0, A, B, fps=15) {
  # create a new directory for the pictures
  dir.create(fname)

  # create the picture sequence
  picName <- paste(fname, "/", fname, "_%03d.jpg", sep="")
  jpeg(picName, width=800, height=450, quality=95)
  for (i in 15:nrow(dat0)) {
    print(i)
    plotReg(A, B, i, keep=15)
  }
  dev.off()

  # delete any existing movie file
  unlink(paste(fname, ".avi", sep=""))

  # point the system to R's working directory
  system(paste("cd ", gsub(" ", "\\ ", getwd(), fixed=TRUE)))

  # show & execute the command line expression for ffmpeg to glue the pictures together
  cmd <- paste0("ffmpeg -r ", fps, " -i ", fname, "/", fname, "_%03d.jpg -sameq -r 25 ", fname, ".avi")
  print(cmd)
  system(cmd)
}
```
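A hypothetical call could look like this (not from the original post; dat0, A, and B stand for the study-specific objects that are not shown here):

```r
# hypothetical usage: writes corEvolution/corEvolution_001.jpg, ... and then
# glues them into corEvolution.avi at 15 input frames per second
makeMovie("corEvolution", dat0 = dat0, A = A, B = B, fps = 15)
```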

# Optimizing parameters for an oscillator – Video

Here’s a video of how the modFit function from the FME package optimizes parameters for an oscillation. A Nelder-Mead optimizer (R function optim) finds the best-fitting parameters for an undamped oscillator. The minimum was found after 72 iterations; the true parameter eta was -.05:
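The original script is not shown in the post, but here is a minimal sketch of how such a fit can be set up with FME and deSolve. The model equation, starting values, and noise level are my assumptions for illustration; only the true eta = -.05 is taken from the post:

```r
library(deSolve)
library(FME)

# undamped oscillator as a first-order system: x' = v, v' = eta * x (eta < 0 oscillates)
oscillator <- function(t, state, parms) {
  with(as.list(c(state, parms)), {
    list(c(v, eta * x))
  })
}

# simulate noisy observations from the true parameter eta = -.05 (illustrative values)
times <- seq(0, 100, by = 0.5)
set.seed(1)
out <- ode(y = c(x = 1, v = 0), times = times, func = oscillator, parms = c(eta = -0.05))
obs <- data.frame(time = times, x = out[, "x"] + rnorm(length(times), sd = 0.05))

# cost function: model-vs-data residuals for candidate parameter values
cost <- function(p) {
  sim <- ode(y = c(x = p[["x0"]], v = 0), times = times,
             func = oscillator, parms = c(eta = p[["eta"]]))
  modCost(model = sim, obs = obs)
}

# Nelder-Mead optimization, starting from wrong guesses
fit <- modFit(f = cost, p = c(eta = -0.2, x0 = 0.5), method = "Nelder-Mead")
fit$par  # should be close to eta = -0.05, x0 = 1
```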

More on estimating parameters of differential equations is coming later on this blog!

Things I’ve learned:

• ffmpeg does not like PNGs. They are internally converted to JPG at very low quality, and I could not find a way to improve this quality. Lesson learned: export high-quality JPGs from your R function.
• Use a standard frame rate for the output file (e.g., 24, 25, or 30 fps).
• My final ffmpeg command: ffmpeg -r 10 -i modFit%03d.jpg -r 25 -b:v 5000K modelFit.avi
  • -r 10: use 10 pictures per second as input
  • -i modFit%03d.jpg: defines the names of the input files: modFit001.jpg, modFit002.jpg, …
  • -r 25: set the frame rate of the output file to 25 fps
  • -b:v 5000K: set the bitrate of the video to a high value
  • modelFit.avi: name of the output file; ffmpeg infers the container format from the extension

# R-package: Wilcox’ Robust Statistics updated (WRS v0.20)

Rand Wilcox constantly updates the functions accompanying his books on robust statistics. Recently, they were updated to version 0.20. The functions are available in the WRS package for R – for installation, simply type:

```r
install.packages("WRS", repos="http://R-Forge.R-project.org")
```

In version 0.20, a number of functions dealing with ANCOVA have been added and some others improved. Unfortunately, help files exist for only very few of the functions. I recommend checking out the source code, as most functions have a comment section roughly explaining the parameters. Alternatively, consult Wilcox’ books for descriptions of the functions.
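For example, you can read such a comment section by printing a function’s source, i.e. typing its name without parentheses (yuen, Wilcox’ test for trimmed means, serves as the example here):

```r
library(WRS)
# prints the function's source code, including the comment block
# that explains its arguments
yuen
```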

References:
Wilcox, R. R. (2012). Introduction to Robust Estimation and Hypothesis Testing (3rd ed.). Academic Press.
Wilcox, R. R. (2012). Modern Statistics for the Social and Behavioral Sciences: A Practical Introduction. CRC Press.
Wilcox, R. R. (2010). Fundamentals of Modern Statistical Methods: Substantially Improving Power and Accuracy (2nd ed.). Springer.
Wilcox, R. R. (2009). Basic Statistics: Understanding Conventional Methods and Modern Insights. Oxford University Press.

# Visually weighted/ Watercolor Plots, new variants: Please vote!

Update Oct 23: Added a new parameter add to the function. Now multiple groups can be plotted in a single plot (see the example in my comment).

As a follow-up on my R implementation of Solomon’s watercolor plots, I made some improvements to the function. I fine-tuned the graphical parameters (the median smoother line now fades out faster with increasing CIs, and the shaded watercolors look prettier). Furthermore, the function is faster and has more features:

• You can define any standard regression function for the bootstrap procedure:
  • vwReg(y ~ x, df, method=lm)
  • vwReg(y ~ x + I(x^2), df, method=lm)
• Provide parameters for the fitting function:
  • You can make the smoother’s span larger; then it takes more points into account for the local fitting. By default, the smoother fits a polynomial of degree two – that means that as you increase span, you approach the overall quadratic fit: vwReg(y ~ x, df, span=2)
  • You can also make the smoother’s span smaller; then it takes fewer points for the local fitting. If it is too small, the smoother overfits and approaches each single data point. The default span (0.75) seemed to me to be the best choice for a variety of data sets: vwReg(y ~ x, df, span=0.5)
  • Use a robust M-estimator for the smoother; see ?loess for details: vwReg(y ~ x, df, family="symmetric")
• Provide your own color scheme (or, for example, a black-and-white scheme); see the pictures below for examples.
• Quantize the color ramp, so that the regions for 1, 2, and 3 SD each get a uniform color (an idea proposed by John Mashey). A combined usage sketch follows after this list.
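For illustration, here is a minimal sketch that combines some of these options. The data frame is made up for this example, and the calls assume the vwReg function from the end of this post:

```r
# made-up example data
set.seed(42)
df <- data.frame(x = rnorm(200))
df$y <- df$x^2 + rnorm(200)

# quadratic bootstrap fits with a predefined ColorBrewer palette
library(RColorBrewer)
vwReg(y ~ x + I(x^2), df, method=lm, palette=brewer.pal(9, "YlOrRd"))

# quantized 1-2-3-SD regions in black and white
vwReg(y ~ x, df, bw=TRUE, quantize="SD")

# overlay additional elements on an existing plot via add=TRUE
vwReg(y ~ x, df) + vwReg(y ~ x, df, add=TRUE, shade=FALSE, spag=TRUE)
```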

At the end of this post is the source code for the R function.

## Some pictures – please vote!

Here are some variants of the watercolor plots – at the end, you can vote for your favorite (or write something in the comments). I am still fine-tuning the default parameters, and I am interested in your opinions on what the best default would be.

Plot 1: The current default

Plot 2: Using an M-estimator for the bootstrap smoothers. Usually you get wider confidence intervals.

Plot 3: Increasing the span of the smoothers

Plot 4: Decreasing the span of the smoothers

Plot 5: Changing the color scheme, using a predefined ColorBrewer palette. You can see all available palettes by using this command: library(RColorBrewer); display.brewer.all()

Plot 6: Using a custom-made palette

Plot 7: Using a custom-made palette; with the parameter bias you can shift the color ramp to the “higher” colors:

Plot 8: A black and white version of the plot

Plot 9: The anti-Tufte-plot: Using as much ink as possible by reversing black and white (a.k.a. “the Milky-Way Plot“)

Plot 10: The Northern Light Plot/fMRI plot. This plotting technique has already been used by a suspicious company (called IRET – never heard of it). I hurried to publish the R code under a FreeBSD license before they could patent it! Feel free to use, share, or change the code for whatever purpose you need. Isn’t that beautiful?

Plot 11: The 1-2-3-SD plot. You can use your own color schemes as well, e.g.: vwReg(y~x, df, bw=TRUE, quantize="SD")

Any comments or ideas? Or just a vote? If you produce some nice plots with your data, you can send them to me, and I will post a gallery of the most impressive “data art”!

Cheers,

Felix

Which water color plot do you like most?


```r
# Copyright 2012 Felix Schönbrodt
# All rights reserved.
#
# FreeBSD License
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are
# met:
#
# 1. Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDER "AS IS" AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
# LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
# NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
# SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
# The views and conclusions contained in the software and documentation
# are those of the authors and should not be interpreted as representing
# official policies, either expressed or implied, of the copyright
# holder.

# Version history:
# 0.1: original code
# 0.1.1: changed license to FreeBSD; re-established compatibility with ggplot2 (new version 0.9.2)

## Visually weighted regression / Watercolor plots
## Idea: Solomon Hsiang, with additional ideas from many blog commenters

# B: number of bootstrapped smoothers
# shade: plot the shaded confidence region?
# shade.alpha: should the CI shading fade out at the edges? (by reducing alpha;
#    0 = no alpha decrease, 0.1 = medium alpha decrease, 0.5 = strong alpha decrease)
# spag: plot spaghetti lines?
# spag.color: color of the spaghetti lines
# mweight: should the median smoother be visually weighted?
# show.lm: should the linear regression line be plotted?
# show.CI: should the 95% CI limits be plotted?
# show.median: should the median smoother be plotted?
# median.col: color of the median smoother
# shape: shape of the points
# method: the fitting function for the spaghettis; default: loess
# bw: if TRUE, use a default black & white palette
# slices: number of slices in x and y direction for the shaded region.
#    Higher numbers make a smoother plot, but take longer to draw. I wouldn't go beyond 500.
# palette: provide a custom color palette for the watercolors
# ylim: restrict the range of the watercoloring
# quantize: either "continuous" or "SD". In the latter case, we get three color
#    regions for 1, 2, and 3 SD (an idea of John Mashey)
# add: if add == FALSE, a new ggplot is returned. If add == TRUE, only the plot
#    elements are returned, which can be added to an existing ggplot (with the '+' operator)
# ...: further parameters passed to the fitting function; in the case of loess,
#    for example, "span = .9" or "family = 'symmetric'"

vwReg <- function(formula, data, title="", B=1000, shade=TRUE, shade.alpha=.1,
  spag=FALSE, spag.color="darkblue", mweight=TRUE, show.lm=FALSE,
  show.median=TRUE, median.col="white", shape=21, show.CI=FALSE,
  method=loess, bw=FALSE, slices=200,
  palette=colorRampPalette(c("#FFEDA0", "#DD0000"), bias=2)(20),
  ylim=NULL, quantize="continuous", add=FALSE, ...) {

  IV <- all.vars(formula)[2]
  DV <- all.vars(formula)[1]
  data <- na.omit(data[order(data[, IV]), c(IV, DV)])

  if (bw == TRUE) {
    palette <- colorRampPalette(c("#EEEEEE", "#999999", "#333333"), bias=2)(20)
  }

  print("Computing bootstrapped smoothers ...")
  newx <- data.frame(seq(min(data[, IV]), max(data[, IV]), length=slices))
  colnames(newx) <- IV
  l0.boot <- matrix(NA, nrow=nrow(newx), ncol=B)

  l0 <- method(formula, data)
  for (i in 1:B) {
    data2 <- data[sample(nrow(data), replace=TRUE), ]
    data2 <- data2[order(data2[, IV]), ]
    if (class(l0) == "loess") {
      m1 <- method(formula, data2,
        control=loess.control(surface="i", statistics="a", trace.hat="a"), ...)
    } else {
      m1 <- method(formula, data2, ...)
    }
    l0.boot[, i] <- predict(m1, newdata=newx)
  }

  # compute median and CI limits of the bootstrap
  library(plyr)
  library(reshape2)
  CI.boot <- adply(l0.boot, 1, function(x)
    quantile(x, prob=c(.025, .5, .975, pnorm(c(-3, -2, -1, 0, 1, 2, 3))), na.rm=TRUE))[, -1]
  colnames(CI.boot)[1:10] <- c("LL", "M", "UL", paste0("SD", 1:7))
  CI.boot$x <- newx[, 1]
  CI.boot$width <- CI.boot$UL - CI.boot$LL

  # scale the CI width to the range 0 to 1 and flip it (bigger numbers = narrower CI)
  CI.boot$w2 <- (CI.boot$width - min(CI.boot$width))
  CI.boot$w3 <- 1 - (CI.boot$w2 / max(CI.boot$w2))

  # convert bootstrapped spaghettis to long format
  b2 <- melt(l0.boot)
  b2$x <- newx[, 1]
  colnames(b2) <- c("index", "B", "value", "x")

  library(ggplot2)
  library(RColorBrewer)

  # Construct the ggplot.
  # All plot elements are constructed as a list, so they can be added to an existing ggplot.

  # if add == FALSE: provide the basic ggplot object
  p0 <- ggplot(data, aes_string(x=IV, y=DV)) + theme_bw()

  # initialize elements with NULL (if they are defined, they are overwritten with
  # something meaningful)
  gg.tiles <- gg.poly <- gg.spag <- gg.median <- gg.CI1 <- gg.CI2 <- gg.lm <- gg.points <- gg.title <- NULL

  if (shade == TRUE) {
    quantize <- match.arg(quantize, c("continuous", "SD"))
    if (quantize == "continuous") {
      print("Computing density estimates for each vertical cut ...")
      flush.console()

      if (is.null(ylim)) {
        min_value <- min(min(l0.boot, na.rm=TRUE), min(data[, DV], na.rm=TRUE))
        max_value <- max(max(l0.boot, na.rm=TRUE), max(data[, DV], na.rm=TRUE))
        ylim <- c(min_value, max_value)
      }

      # vertical cross-sectional density estimate
      d2 <- ddply(b2[, c("x", "value")], .(x), function(df) {
        res <- data.frame(density(df$value, na.rm=TRUE, n=slices,
          from=ylim[1], to=ylim[2])[c("x", "y")])
        colnames(res) <- c("y", "dens")
        return(res)
      }, .progress="text")

      maxdens <- max(d2$dens)
      mindens <- min(d2$dens)
      d2$dens.scaled <- (d2$dens - mindens)/maxdens

      ## tile approach
      d2$alpha.factor <- d2$dens.scaled^shade.alpha
      gg.tiles <- list(
        geom_tile(data=d2, aes(x=x, y=y, fill=dens.scaled, alpha=alpha.factor)),
        scale_fill_gradientn("dens.scaled", colours=palette),
        scale_alpha_continuous(range=c(0.001, 1)))
    }
    if (quantize == "SD") {
      ## polygon approach
      SDs <- melt(CI.boot[, c("x", paste0("SD", 1:7))], id.vars="x")
      count <- 0
      d3 <- data.frame()
      col <- c(1, 2, 3, 3, 2, 1)
      for (i in 1:6) {
        seg1 <- SDs[SDs$variable == paste0("SD", i), ]
        seg2 <- SDs[SDs$variable == paste0("SD", i+1), ]
        seg <- rbind(seg1, seg2[nrow(seg2):1, ])
        seg$group <- count
        seg$col <- col[i]
        count <- count + 1
        d3 <- rbind(d3, seg)
      }
      gg.poly <- list(
        geom_polygon(data=d3, aes(x=x, y=value, color=NULL, fill=col, group=group)),
        scale_fill_gradientn("dens.scaled", colours=palette, values=seq(-1, 3, 1)))
    }
  }

  print("Build ggplot figure ...")
  flush.console()

  if (spag == TRUE) {
    gg.spag <- geom_path(data=b2, aes(x=x, y=value, group=B),
      size=0.7, alpha=10/B, color=spag.color)
  }

  if (show.median == TRUE) {
    if (mweight == TRUE) {
      gg.median <- geom_path(data=CI.boot, aes(x=x, y=M, alpha=w3^3),
        size=.6, linejoin="mitre", color=median.col)
    } else {
      gg.median <- geom_path(data=CI.boot, aes(x=x, y=M),
        size=0.6, linejoin="mitre", color=median.col)
    }
  }

  # confidence limits
  if (show.CI == TRUE) {
    gg.CI1 <- geom_path(data=CI.boot, aes(x=x, y=UL), size=1, color="red")
    gg.CI2 <- geom_path(data=CI.boot, aes(x=x, y=LL), size=1, color="red")
  }

  # plain linear regression line
  if (show.lm == TRUE) {
    gg.lm <- geom_smooth(method="lm", color="darkgreen", se=FALSE)
  }

  gg.points <- geom_point(data=data, aes_string(x=IV, y=DV),
    size=1, shape=shape, fill="white", color="black")

  if (title != "") {
    gg.title <- theme(title=title)
  }

  gg.elements <- list(gg.tiles, gg.poly, gg.spag, gg.median, gg.CI1, gg.CI2,
    gg.lm, gg.points, gg.title, theme(legend.position="none"))

  if (add == FALSE) {
    return(p0 + gg.elements)
  } else {
    return(gg.elements)
  }
}
```

# Amazing fMRI plots for everybody!

Dear valued customer,

it is a well-known scientific truth that research results which are accompanied by a fancy, colorful fMRI scan are perceived as more believable and more persuasive than simple bar graphs or text results (McCabe & Castel, 2007; Weisberg, Keil, Goodstein, Rawson, & Gray, 2008). Readers even agree more with fictitious and unsubstantiated claims, as long as you provide a colorful brain image – and it works even when the subject is a dead salmon.

## The power of brain images for everybody

What are the consequences of these troubling findings? The answer is clear: everybody should be equipped with these powerful tools of research communication! We at IRET have made it our mission to provide the latest, cutting-edge tools for your research analysis. In this case, we adopted a new technology called “visually weighted regression” or “watercolor plots” (see here, here, or here), and simply applied a new color scheme. But now, let’s get our hands on it!

## The example

Imagine you invested a lot of effort in collecting the data of 41 participants. Now you find the following pattern in 2 of your 87 variables:

You could show that plain scatterplot. But should you do it? Nay. Of course everybody would spot the outliers on the top right. But what is much more important: it is b-o-r-i-n-g! What is the alternative? Reporting the correlation as text? “We found a correlation of r = .38 (p = .014)”. Yawn. Or maybe: “We chose to use a correlation technique that is robust against outliers and violations of normality, the Spearman rank coefficient. It turned out that the correlation broke down and was not significant any more (r = .06, p = .708).” Don’t be silly! With that style of scientific reporting, there would be nothing to write home about. But you can be sure: we have the right tools for you.

Finally, the power of pictures is not limited to brain research – now you can turn any data into a magical fMRI plot like this one:

Isn’t that beautiful? We recommend accompanying the figure with an elaborate description: “For local fitting, we used spline smoothers from 10,000 bootstrap replications. For a robust estimation of vertical confidence densities, a re-descending M-estimator with Tukey’s biweight function was employed. As one can clearly see in the plot, there is significant confidence in the prediction of the x=0, y=0 region, as well as a minor hot spot in the x=15, y=60 region (also known as the supra-dextral data region).”

## Magical Data Enhancer Tool

With the Magical Data Enhancer Tool (MDET) you can …

• … turn boring, marginally significant, or just crappy results into a stunning research experience
• … publish in scientific journals with higher impact factors
• … receive the media coverage that you and your research deserve
• … achieve higher acceptance rates from funding agencies
• … impress young women at the bar (you wouldn’t show a plain scatterplot, dude?!)

## FAQ

Q: But – isn’t that approach unethical?
A: No, not at all. On the contrary, we at IRET think it is unethical that only some researchers are allowed to exploit the cognitive biases of their readers. We design our products with great respect for humanity, and we believe that every researcher who can afford our products should have the same powerful tools at hand.

Q: How much does your product cost?
A: The standard version of the Magical Data Enhancer ships for $12,998. We are aware that this is a significant investment. But come on: you deserve it! Furthermore, we will soon publish a free trial version, including the full R code, on this blog. So stay tuned!

Best regards,

Lexis “Lex” Brycenet (CEO & CTO Research Communication)
International Research Enhancement Technology (IRET)

# Visually weighted regression in R (à la Solomon Hsiang)

[Update 1: Sep 5, 2012: Explore the Magical Data Enhancer by IRET, using this visualization technique]

[Update 2: Sep 6, 2012: See new improved plots, and new R code!]

Solomon Hsiang proposed an appealing method for visually displaying the uncertainty in regressions (see his blog [1][2], and also the discussions on the Statistical Modeling, Causal Inference, and Social Science Blog [1][2]).

I implemented the method in R (using ggplot2) and used an additional method of determining the shading (especially concerning Andrew Gelman’s comment that traditional statistical summaries, such as 95% intervals, give too much weight to the edges). In the following I will show how to produce plots like this:

I used the following procedure:

1. Compute smoothers from 1000 bootstrap samples of the original sample (this results in a spaghetti plot)
2. Calculate a density estimate for each vertical cut through the bootstrapped smoothers. The area under each density curve is always 1, so the amount of ink is constant for every vertical slice.
3. Shade the figure according to these density estimates (a minimal sketch of steps 1 and 2 follows below).
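To make steps 1 and 2 concrete, here is a minimal stand-alone sketch (with made-up data; the full function at the end of the post does the same thing more carefully):

```r
set.seed(1)
df <- data.frame(x = rnorm(100))
df$y <- df$x + rnorm(100)

# step 1: loess smoothers from 1000 bootstrap samples, evaluated on a common x grid
newx <- seq(min(df$x), max(df$x), length.out = 100)
boot.fits <- replicate(1000, {
  df2 <- df[sample(nrow(df), replace = TRUE), ]
  predict(loess(y ~ x, df2), newdata = data.frame(x = newx))
})

# step 2: density estimate for one vertical cut (here: the 50th x slice);
# each density integrates to 1, so every slice receives the same amount of ink
d <- density(boot.fits[50, ], na.rm = TRUE)
plot(d, main = "Density of bootstrapped smoothers at one x slice")
```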

## Now let’s construct some plots!

The basic scatter plot:

Now we show the bootstrapped smoothers (a “spaghetti plot”). Each spaghetti line has a low alpha; overlapping lines therefore produce a darker color and already give more weight to highly populated regions.

Here is the shading according to the smoother’s density:

Now, we can overplot the median smoother estimate for each x value (the “median smoother”):

Or, a visually weighted smoother:

Finally, we can add the plain linear regression line (which obviously does not reflect the data points very well):

At the end of this post is the function that produces all of these plots. The function returns a ggplot object, so you can modify it afterwards, e.g.:

```r
vwReg(y~x, df, shade=FALSE, spag=TRUE) +
  xlab("Implicit power motive") +
  ylab("Corrugator activity during preparation")
```

## Here are two plots with actual data I am working on:

The correlation of both variables is .22 (p = .003).

A) As a heat map (note: the vertical breaks at the left and right ends occur due to single data points that either get sampled or not during the bootstrap):

B) As a spaghetti plot:

Finally, here’s the code. Comments and additions are welcome.

[Update: I removed the code, as an updated version has been published here (see the end of that post).]

# Shading regions of the normal: The Stanine scale

For the presentation of norm values, stanines (standard nine) are often used. These values mark a person’s relative position in comparison to the sample or to norm values.
According to Wikipedia:

The underlying basis for obtaining stanines is that a normal distribution is divided into nine intervals, each of which has a width of 0.5 standard deviations excluding the first and last, which are just the remainder (the tails of the distribution). The mean lies at the centre of the fifth interval.

For illustration purposes, I wanted to plot the regions of the stanine values in the standard normal distribution – here’s the result:

First: Calculate the stanine boundaries and draw the normal curve:

```r
# First: calculate stanine breaks (on a z scale)
stan.z <- c(-3, seq(-1.75, +1.75, length.out=8), 3)

# Second: get cumulative probabilities for these z values
stan.PR <- pnorm(stan.z)

# define a color ramp from blue to red (... or anything else ...)
c_ramp <- colorRamp(c("darkblue", "red"), space="Lab")

# draw the normal curve, without axes; reduce margins on left, top, and right
par(mar=c(2,0,0,0))
curve(dnorm(x,0,1), xlim=c(-3,3), ylim=c(-0.03, .45), xlab="", ylab="", axes=FALSE)
```

Next: Calculate the shaded regions and plot a polygon for each region:

```r
# Calculate polygons for each stanine region
# S.x = x values of polygon boundary points, S.y = y values
for (i in 1:(length(stan.z)-1)) {
  S.x <- c(stan.z[i], seq(stan.z[i], stan.z[i+1], 0.01), stan.z[i+1])
  S.y <- c(0, dnorm(seq(stan.z[i], stan.z[i+1], 0.01)), 0)
  polygon(S.x, S.y, col=rgb(c_ramp(i/9), max=255))
}
```

And finally: add some legends to the plot:

```r
# print stanine values in white
# font = 2 prints numbers in boldface
text(seq(-2, 2, by=.5), 0.015, label=1:9, col="white", font=2)

# print cumulative probabilities in black below the curve
text(seq(-1.75, 1.75, by=.5), -0.015,
     label=paste(round(stan.PR[-c(1, 10)], 2)*100, "%", sep=""),
     col="black", adj=.5, cex=.8)
text(0, -0.035, label="Percentage of sample <= this value", adj=0.5, cex=.8)
```

And finally, here’s a short script for shading only one region (e.g., the lower 2.5%):

```r
# draw the normal curve
curve(dnorm(x, 0, 1), xlim=c(-3, 3), main="Normal density")

# define the shaded region
from.z <- -3
to.z <- qnorm(.025)

S.x <- c(from.z, seq(from.z, to.z, 0.01), to.z)
S.y <- c(0, dnorm(seq(from.z, to.z, 0.01)), 0)
polygon(S.x, S.y, col="red")
```

# Comparing all quantiles of two distributions simultaneously

Summary: A new function in the WRS package compares many quantiles of two distributions simultaneously while controlling the overall alpha error.

When comparing data from two groups, approximately 99.6% of all psychological research compares the central tendency (that number is a subjective estimate).

In some cases, however, it would be sensible to compare other parts of the distributions. For example, in reaction time (RT) experiments, two groups may differ only in the fast RTs, but not in the slow ones. Measures of central tendency might obscure or miss this pattern, as the following example demonstrates.

Imagine RT distributions for two experimental conditions (“black” and “red”). Participants in the red condition have some very fast RTs:

```r
set.seed(1234)
RT1 <- rnorm(100, 350, 52)
RT2 <- c(rnorm(85, 375, 55), rnorm(15, 220, 25))
plot(density(RT1), xlim=c(100, 600))
lines(density(RT2), col=2)
```

A naïve (but common) approach would be to compare both distributions with a t test:

```
t.test(RT1, RT2)
######################
data:  RT1 and RT2
t = -0.3778, df = 168.715, p-value = 0.706
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -22.74478  15.43712
sample estimates:
mean of x mean of y
 341.8484  345.5022
```

Results show that both groups do not differ in their central tendency.

Now let’s do the same with a new method!

The function qcomhd from the WRS package compares user-defined quantiles of both distributions using a Harrell–Davis estimator in conjunction with a percentile bootstrap. The method seems to improve on other techniques: “Currently, when there are tied values, no other method has been found that performs reasonably well. Even with no tied values, method HD can provide a substantial gain in power when q ≤ .25 or q ≥ .75 compared to other techniques that have been proposed”. The method is described in the paper “Comparing two independent groups via the upper and lower quantiles” by Wilcox, Erceg-Hurn, Clark, and Carlson (2013).
You can use the function as soon as you install the latest version (v17) of the WRS package:

```r
install.packages("WRS", repos="http://R-Forge.R-project.org")
```

Let’s compare all percentiles from the 10th to the 90th:

```r
qcomhd(RT1, RT2, q = seq(.1, .9, by=.1))
```

The graphical output shows how groups differ in the requested quantiles, and the confidence intervals for each quantile:

The text output (see below) also shows that the groups differ significantly in the 10th, the 50th, and the 60th percentile. The column labeled ‘p.value’ shows the p value for the bootstrap test of a single quantile. As we run multiple tests (one for each quantile), the overall Type I error (defaulting to .05) is controlled by Hochberg’s method. Therefore, for each p value a critical p value is calculated that must be undercut (see column ‘p_crit’). The column ‘signif’ marks all tests that fulfill this condition:

```
    q  n1  n2    est.1    est.2 est.1.est.2    ci.low       ci.up      p_crit p.value signif
1 0.1 100 100 285.8276 218.4852   67.342399  41.04707 84.67980495 0.005555556   0.001      *
2 0.2 100 100 297.5061 264.7904   32.715724 -16.52601 68.80486452 0.025000000   0.217
3 0.3 100 100 310.8760 320.0196   -9.143593 -33.63576 32.95577465 0.050000000   0.589
4 0.4 100 100 322.5014 344.0439  -21.542475 -40.43463  0.03938696 0.010000000   0.054
5 0.5 100 100 331.4413 360.3548  -28.913498 -44.78068 -9.11259108 0.007142857   0.006      *
6 0.6 100 100 344.8502 374.7056  -29.855369 -46.88886 -9.69559705 0.006250000   0.005      *
7 0.7 100 100 363.6210 388.0228  -24.401872 -47.41493 -4.13498039 0.008333333   0.016
8 0.8 100 100 385.8985 406.3956  -20.497097 -47.09522  2.23935390 0.012500000   0.080
9 0.9 100 100 419.4520 444.7892  -25.337206 -55.84177 11.49107833 0.016666667   0.174
```
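For comparison, Hochberg’s correction can also be expressed through adjusted p values with base R’s p.adjust. This is not what qcomhd prints, but it leads to the same decisions – the three adjusted values below .05 correspond to the starred rows above:

```r
# p values of the nine quantile tests (column 'p.value' in the output above)
p <- c(.001, .217, .589, .054, .006, .005, .016, .080, .174)
round(p.adjust(p, method = "hochberg"), 3)
```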

To summarize, we see significant differences between the groups: the red group has significantly faster RTs in the lower quantiles, but longer RTs in its central tendency.

Recommendations for comparing groups:

1. Always plot the densities of both distributions.
2. Make a visual scan: Where do the groups differ? Is the central tendency a reasonable summary of the distributions and of the difference between both distributions?
3. If you are interested in the central tendency, consider a test for trimmed means, as in most cases this describes the central tendency better than the arithmetic mean.
4. If you are interested in comparing quantiles in the tails of the distributions, use the qcomhd function (see the sketch after this list).
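Here is a hedged sketch of points 3 and 4, reusing RT1 and RT2 from above (yuen is Wilcox’ test for trimmed means; both functions are in the WRS package):

```r
library(WRS)

# point 3: compare 20% trimmed means (Yuen's test)
yuen(RT1, RT2)

# point 4: compare quantiles in the tails of the distributions
qcomhd(RT1, RT2, q = c(.1, .25, .75, .9))
```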

### References

Wilcox, R. R., Erceg-Hurn, D. M., Clark, F., & Carlson, M. (in press). Comparing two independent groups via the lower and upper quantiles. Journal of Statistical Computation and Simulation. doi:10.1080/00949655.2012.754026