Felix Schönbrodt

Dr. Dipl.-Psych.


Interactive exploration of a prior’s impact

The probably most frequent criticism of Bayesian statistics sounds something like “It’s all subjective – with the ‘right’ prior, you can get any result you want.”.

In order to approach this criticism it has been suggested to do a sensitivity analysis (or robustness analysis), that demonstrates how the choice of priors affects the conclusions drawn from the Bayesian analysis. Ideally, it can be shown that for virtually any reasonable prior the conclusions remain the same. In this case, the critique “it’s all in the prior” can be refuted on empirical grounds.

In their recent paper “The p < .05 rule and the hidden costs of the free lunch in inference” Jeff Rouder and colleagues argue that in the case of the default Bayes factor for t tests the choice of the H1 prior distribution does not make a strong difference (see Figure 6, right panel). They come to the conclusion that “Prior scale does matter, and may change the Bayes factor by a factor of 2 or so, but it does not change the order of magnitude.” (p. 24).

The default Bayes factor for t tests (Rouder, Speckman, Sun, Morey, Iverson, 2009) assumes that effect sizes (expressed as Cohen’s d) are distributed as a Cauchy distribution (this is the prior distribution for H1). The spread of the Cauchy distribution can be changed with the scale parameter r. Depending on the specific research area, one can use a wider (large r‘s, e.g. r =1.5) or a thinner (small r’s, e.g. r = 0.5) Cauchy distribution. This corresponds to the prior belief that typically larger or smaller effect sizes can be expected.

For the two-sample t test, the BayesFactor package for R suggest three defaults for the scale parameter:

  • “medium” (r = sqrt(2)/2 = 0.71),
  • “wide” (r = 1), and
  • “ultra-wide” (r = sqrt(2) = 1.41).

Here’s a display for these distributions:


For a given effect size: How does the choice of the prior distribution change the resulting Bayes factor?

The following shiny app demonstrates how the choice of the prior influences the Bayes factor for a given effect size and sample size. Try moving the sliders! You can also provide arbitrary values for r (as comma-separated values; r must be > 0; reasonable ranges are between 0.2 and 2).

For a robustness analysis simply compare the lines at each vertical cut. An important line is the solid blue line at log(1), which indicates the same support for H1 and H0. All values above that line are in favor of the H1, all values below that line are in favor of H0.



As you will see, in most situations the Bayes factors for all r‘s are either above log(1), or below log(1). That means, regardless of the choice of the prior you will come to the same conclusion. There are very few cases where a data line for one r is below log(1) and the other is above log(1).  In this case, different r‘s would come to different conclusions. But in these ambiguous situations the evidence for H1 or for H0 is always in the “anectodal” region, which is a very weak evidence. With the default r’s, the ratio of the resulting Bayes factors is indeed maximally “a factor of 2 or so”.

To summarize, within a reasonable range of prior distributions it is not possible that one prior generates strong evidence for H1, while some other prior generates strong evidence for H0. In that sense, the conclusions drawn from a default Bayes factor are robust to the choice of (reasonable) priors.


Rouder, J. N., Morey, R. D., Verhagen, J., Province, J. M., & Wagenmakers, E. J. (submitted). The p < .05 rule and the hidden costs of the free lunch in inference. Retrieved from http://pcl.missouri.edu/biblio/author/29

Rouder, J. N., Speckman, P. L., Sun, D., Morey, R. D., & Iverson, G. (2009). Bayesian t tests for accepting and rejecting the null hypothesis. Psychonomic Bulletin & Review, 16, 225–237.

Related posts

Comments (2) | Trackback

A short taxonomy of Bayes factors

I started to familiarize myself with Bayesian statistics. In this post I’ll show some insights I had about Bayes factors (BF).

What are Bayes factors?

Bayes factors provide a numerical value that quantifies how well a hypothesis predicts the empirical data relative to a competing hypothesis. For example, if the BF is 4, this indicates: “This empirical data is 4 times more probable if H₁ were true than if H₀ were true.”
. Hence, evidence points towards H₁. A BF of 1 means that data are equally likely to be occured under both hypotheses. In this case, it would be impossible to decide between both.

More formally, the BF can be written as:

\(BF_{10} = \frac{p(D|H_1)}{p(D|H_0)}\)

where D is the data.

(for more formal introductions to BFs, see Wikipedia, Bayesian-Inference, or the classic paper by Kass and Raftery, 1995).

Hence, the BF is a ratio of probabilities, and is related to larger class of likelihood-ratio test.

Although many authors agree about the many theoretical advantages of BFs, until recently it was complicated and unclear how to compute a BF even for the simplest standard designs (Rouder, Morey, Speckman, & Province, 2012). Fortunately, over the last years default Bayes factors for several standard designs have been developed (Rouder et al., 2012; Rouder, Speckman, Sun, Morey, & Iverson, 2009; Morey & Rouder, 2011). For example, for a two-sample t test, a BF can be derived simply by plugging the t value and the sample sizes into a formula. The BF is easy to compute by the R package BayesFactor (Morey & Rouder, 2013), or by online calculators [1][2].

Flavors of Bayes factors

When I started to familiarize myself with BFs, I was sometimes confused, as the same number seemed to mean different things in different publications. And indeed, four types of Bayes factors can be distinguished. “Under the hood”, all four types are identical, but you have to be aware which type has been employed in the specific case.

The first distinction is, whether the BF indicates “\(H_0\) over \(H_1\)” (=\(BF_{01}\)), or “\(H_1\) over \(H_0\)” (=\(BF_{10}\)). A \(BF_{01}\) of 2 means “Data is 2 times more likely to be occured under \(H_0\) than under \(H_1\)”, while the same situation would be a \(BF_{10}\) of 0.5 (i.e., the reciprocal value 1/2). Intuitively, I prefer larger values to be “better”, and as I usually would like to have evidence for \(H_1\) instead of \(H_0\), I prefer the \(BF_{10}\).

The second distinction is, whether one reports the raw BF, or the natural logarithm of the BF. The logarithm has the advantage that the scale above 1 (evidence for \(H_1\)) is identical to the scale below 1 (evidence for \(H_0\)). In the previous example, a \(BF_{10}\) of 2 is equivalent to a \(BF_{01}\) of 0.5. Taking the log of both values leads to \(log(BF_{10})\) = 0.69 and \(log(BF_{01})\) = -0.69: Same value, reversed sign. This makes the log(BF) ideal for visualization, as the scale is linear in both directions. Following graphic shows the relationship between raw/log BF:

Figure 1

Figure 1


As you can see in the Table of Figure 1, different authors use different flavors. Notably, the website calculator gives “raw-” BF, while the BayesFactor package gives “log+” BF. (That caused me some headache … To be clear: The help page of ?ttest.tstat clearly states that it computes the “log(e) Bayes factor (against the point null)”. Once you understand the system, everything falls into place; but it took me some time to figure out how to understand and convert the different metrics.)

Figure 2 shows conversion paths of the different BF flavors:

Figure 2

Figure 2


For example, if you enter t = 2.51, n = 100 into the website, you get a JZS Bayes Factor = 0.6140779. The same numbers in the BayesFactor package give a BF of 0.4876335:

> ttest.tstat(t=2.51, n1=100, r=1)
[1] 0.4876335

# now apply the conversion formula 1/exp(BF)
> 1/(exp(0.4876335))
[1] 0.6140779

As you can see: After applying the conversion, it is exactly the same.

Related posts: Exploring the robustness of Bayes Factors: A convenient plotting function


Morey, R. D. & Rouder, J. N. (2011). Bayes factor approaches for testing interval null hypotheses. Psychological Methods, 16(4), 406–419. PMID: 21787084. doi:10.1037/a0024377

Morey, R. D. & Rouder, J. N. (2013). {BAYESFACTOR}: computation of bayes factors for common designs. R package version 0.9.4. Retrieved from http://CRAN.R- project.org/package=BayesFactor

Rouder, J. N., Morey, R. D., Speckman, P. L., & Province, J. M. (2012). Default bayes factors for ANOVA designs. Journal of Mathematical Psychology, 56(5), 356–374. doi:10.1016/j.jmp.2012.08.001

Rouder, J. N., Speckman, P. L., Sun, D., Morey, R. D., & Iverson, G. (2009). Bayesian t tests for accepting and rejecting the null hypothesis. Psychonomic Bulletin & Review, 16(2), 225–237.

Comments (5) | Trackback

New robust statistical functions in WRS package – Guest post by Rand Wilcox

Today a new version (0.23.1) of the WRS package (Wilcox’ Robust Statistics) has been released. This package is the companion to his rather exhaustive book on robust statistics, “Introduction to Robust Estimation and Hypothesis Testing” (Amazon Link de/us).

For a fail-safe installation of the package, follow this instruction.

As a guest post, Rand Wilcox describes the new functions of the newest WRS version – have fun!

“Hi everyone,

As you probably know, when standard assumptions are violated, classic methods for comparing groups and studying associations can have very poor power and yield highly misleading results. The better known methods for dealing with these problem (transforming the data or testing assumptions) are ineffective compared to more modern methods. Simply removing outliers among the dependent variable and applying standard techniques to the remaining data is disastrous.

Methods I derive to deal with these problems can be applied with R functions stored in an R package (WRS) maintained by Felix Schönbrodt. Felix asked me to briefly describe my recent efforts for a newsletter he posts. In case this might help some of you, a brief description of my recently developed methods and corresponding R functions are provided below. (The papers I cite illustrate that they can make a substantial difference compared to extant techniques.)

Comparing groups:

Sometimes it can be important and more informative to compare the tails (upper and lower quantiles) of two groups rather than a measure of location that is centrally located. Example: have been involved in a study aimed at determining whether intervention reduced depressive symptoms. But the typical individual was not very depressed prior to intervention and no difference is found using the more obvious techniques. Simply ignoring the less depressed individuals results in using the wrong standard error – a very serious problem. But comparing quantiles, it was found that the more depressed individuals benefitted the most from intervention.

The new method beats the shift function. See

Wilcox, R. R., Erceg-Hurn, D., Clark, F. & Carlson, M. (2013). Comparing two independent groups via the lower and upper quantiles. Journal of Statistical Computation and Simulation. DOI: 10.1080/00949655.2012.754026

Use the R function qcomhd.

For dependent groups must use another method. There are, in fact, two distinct ways of viewing the problem. See

Wilcox, R. R. & Erceg-Hurn, D. (2012). Comparing two dependent groups via quantiles. Journal of Applied Statistics, 39, 2655–2664.

Use the R function Dqcomhd.

When comparing two groups based on a Likert scale, use the function disc2com.

It performs a global test of P(X=x)=P(Y=x) for all x using a generalization of the Storer–Kim method for comparing binomials.

binband: a multiple comparison method for the individual cell probabilities.


tshdreg: This is a modification of the Theil–Sen estimator. When there are tied values among the dependent variable, this modification might result in substantially higher power. A paper (Wilcox & Clark, in press) provides details. The function tsreg now checks whether there are any tied values and prints a message suggesting that you might want to use tshdreg instead.

qhdsm: A quantile regression smoother. That is, plot the regression line when predicting some quantile without specifying a parametric form for the regression line. Multiple quantile regression lines can be plotted. The method can be more satisfactory than using the function qsmcobs (a spline-type method), which often creates waves and curvature that give an incorrect sense of the association. Another advantage of qhsdm is that it can be used with more than one predictor; qsmcobs is limited to one predictor only. The strategy behind qhdsm is to get an initial approximation of the regression line using a running interval smoother in conjunction with the Harrell–Davis quantile estimator and then smoothed again via LOESS.

It is surprising how often an association is found when dealing with the higher and lower quantiles of the dependent variable that are not detected by least squares and other robust estimators.

qhdsm2g: Plots regression lines for two groups using the function qhdsm.

rplot has been updated: setting the argument LP=TRUE gives a smoother regression line.

rplotCI. Same as rplot but includes lines indicating a confidence interval for the predicted Y values

rplotpbCI. Same as rplotCI, only use a bootstrap method to compute confidence intervals.

ANCOVA or comparing regression lines associated with independent groups:

ancJN: The function fits a robust regression line for each group and then determines whether the predicted Y values differ significantly at specified points. So it has connections to the classic Johnson-Neyman method. That is, the method provides an indication of where the regression lines cross. Both types of heteroscedasticity are allowed, which can result in improved power beyond the improved power stemming from a robust estimator. See

Wilcox, R. R. (2013). A heteroscedastic method for comparing regression lines at specified design points when using a robust regression estimator. Journal of Data Science, 11, 281–291

anctspb: Like ancJN but uses a percentile bootstrap method that might help when there are tied values among the dependent variable.

ancGLOB. A robust global ANCOVA method. Like the function ancova, it provides a flexible way of dealing with curvature and heteroscedasticity is allowed. But this function can reject in situations where ancova does not reject. The function returns a p-value and the hypothesis of identical regression lines is rejected if the p-value is less than or equal to a critical p-value. In essence, it can beat reliance on improved versions of the Bonferroni method. (Details are in a paper submitted for publication.) It does not dominate my original ANCOVA method (applied with the R function ancova) in terms of power, but have encountered situations where it makes a practical difference.

It determines a critical p-value via the R function ancGLOB_pv.

In essence, simulations are used. By default, the number of replications is iter=500. But suggest using iter=2000 or larger. Execution time can be reduced substantially with cpp=TRUE, which calls a C++ version of the function written by Xiao He. Here are the commands to install the C++ version:

install_github('WRScpp', 'mrxiaohe')

For a global test that two parametric regression lines are identical, see

Wilcox, R. R. & Clark, F. (in press). Heteroscedastic global tests that the regression parameters for two or more independent groups are identical. Communications in Statistics– Simulation and Computation.

ancGpar performs the robust method. The paper includes a different method when using least squares regression. It is based in part on the HC4 estimator, which deals with heteroscedasticity. But if there are outliers among the dependent variable, you are much better off using a robust estimator.

ANCOVA or comparing regression lines associated with dependent groups:

Dancova: ANCOVA for two dependent groups that provides a flexible way of dealing with curvature. Both types of heteroscedasticity are allowed. Roughly, approximate the regression lines with a running interval smoother and at specified design points compare the regression lines. This is an extension of the R function ancova to dependent groups. The function can do an analysis on either the marginal measures of location or a measure of location based on the difference scores. When using a robust estimator, the choice between these two approaches can be important. Defaults to using a trimmed mean.

Dancovamp: Like Dancova only designed to handle multiple covariates.

Danctspb: Compare regression lines of two dependent groups using a robust regression estimator. The default is to use Theil–Sen, but any estimator can be used via the argument regfun. So in contrast to Dancova, a parametric form for the regression line is made. As usual, can eliminate outliers among the independent variable by setting the argument xout=TRUE. When a parametric regression line provides a more accurate fit, can have more power compared to using a smoother. But when there is curvature that is not modeled well with a parametric fit, the reverse can happen.

Note: a version of ancGLOB for dependent groups is being studied.

Coefficient alpha

Rcoefalpha: computes a robust analog of coefficient alpha. Developed this method some years ago but just got around to writing an R function. See

Wilcox, R. R. (1992). Robust generalizations of classical test reliability and Cronbach’s alpha. British Journal of Mathematical and Statistical Psychology, 45, 239–254.

R. R. Wilcox

Have fun exploring these new methods!

Comments (2) | Trackback

Exploring the robustness of Bayes Factors: A convenient plotting function

One critique frequently heard about Bayesian statistics is the subjectivity of the assumed prior distribution. If one is cherry-picking a prior, of course the posterior can be tweaked, especially when only few data points are at hand. For example, see the Scholarpedia article on Bayesian statistics:

In the uncommon situation that the data are extensive and of simple structure, the prior assumptions will be unimportant and the assumed sampling model will be uncontroversial. More generally we would like to report that any conclusions are robust to reasonable changes in both prior and assumed model: this has been termed inference robustness

(David Spiegelhalter and Kenneth Rice (2009) Bayesian statistics. Scholarpedia, 4(8):5230.)

Therefore, it is suggested that …

In particular, audiences should ideally fully understand the contribution of the prior distribution to the conclusions. (ibid)

In the example of Bayes factors for t tests (Rouder, Speckman, Sun, Morey, & Iverson, 2009), the assumption that has to be defined a priori is the effect size δ expected under the H1. In the BayesFactor package for R, this can be adjusted via the r parameter. By default, it is set to 0.5, but it can be made wider (larger r’s, which means one expects larger effects) or narrower (r’s close to zero, which means one expects smaller effects in the population).

In their reanalysis of Bem’s ESP data, Wagenmakers, Wetzels, Borsboom, Kievit, and van der Maas (2011, PDF) proposed a robustness analysis for Bayes factors (BF), which simply shows the BF for a range of priors. If the conclusion is the same for a large range of priors, it could be judged to be robust (this is also called a “sensitivity analysis”).

I wrote an R function that can generate plots like this. Here’s a reproduction of Wagenmakers’ et al (2011) analysis of Bem’s data – it looks pretty identical:

## Bem data, two sided
# provide t values, sample sizes, and the location(s) of the red dot(s)
# set forH1 to FALSE in order to vertically flip the plot.
# Usually I prefer higher BF to be in favor of H1, but I flipped it in order to match Wagenmakers et al (2011)
     ts=c(2.51, 2.55, 2.23, 1.74, 1.92, 2.39, 2.03, 1.8, 1.31, 2.96),
     ns=c(100, 97, 100, 150, 100, 150, 99, 150, 200, 50),
     dots=1, forH1 = FALSE)


Bayes factor robustness analysis, two-sided (cf. Wagenmakers et al., 2011)

Bayes factor robustness analysis, two-sided (cf. Wagenmakers et al., 2011)


You can throw in as many t values and corresponding sample sizes as you want. Furthermore, the function can compute one-sided Bayes factors as described in Wagenmakers and Morey (2013). If this approach is applied to the Bem data, the plot looks as following – everything is shifted a bit into the H1 direction:

# Bem data one-sided
     ts=c(2.51, 2.55, 2.23, 1.74, 1.92, 2.39, 2.03, 1.8, 1.31, 2.96),
     ns=c(100, 97, 100, 150, 100, 150, 99, 150, 200, 50),
     dots=1, sides="one", forH1 = FALSE)


Bayes factor robustness analysis, one-sided (cf. Wagenmakers et al., 2011; Wagenmakers and Morey, 2013)

Bayes factor robustness analysis, one-sided (cf. Wagenmakers et al., 2011; Wagenmakers and Morey, 2013)

Finally, here’s the function:

## This source code is licensed under the FreeBSD license
## (c) 2013 Felix Schönbrodt

#' @title Plots a comparison of a sequence of priors for t test Bayes factors
#' @details
#' @param ts A vector of t values
#' @param ns A vector of corresponding sample sizes
#' @param rs The sequence of rs that should be tested. r should run up to 2 (higher values are implausible; E.-J. Wagenmakers, personal communication, Aug 22, 2013)
#' @param labels Names for the studies (displayed in the facet headings)
#' @param dots Values of r's which should be marked with a red dot
#' @param plot If TRUE, a ggplot is returned. If false, a data frame with the computed Bayes factors is returned
#' @param sides If set to "two" (default), a two-sided Bayes factor is computed. If set to "one", a one-sided Bayes factor is computed. In this case, it is assumed that positive t values correspond to results in the predicted direction and negative t values to results in the unpredicted direction. For details, see Wagenmakers, E. J., & Morey, R. D. (2013). Simple relation between one-sided and two-sided Bayesian point-null hypothesis tests.
#' @param nrow Number of rows of the faceted plot.
#' @param forH1 Defines the direction of the BF. If forH1 is TRUE, BF > 1 speak in favor of H1 (i.e., the quotient is defined as H1/H0). If forH1 is FALSE, it's the reverse direction.
#' @references
#' Rouder, J. N., Speckman, P. L., Sun, D., Morey, R. D., & Iverson, G. (2009). Bayesian t-tests for accepting and rejecting the null hypothesis. Psychonomic Bulletin and Review, 16, 225-237.
#' Wagenmakers, E.-J., & Morey, R. D. (2013). Simple relation between one-sided and two-sided Bayesian point-null hypothesis tests. Manuscript submitted for publication
#' Wagenmakers, E.-J., Wetzels, R., Borsboom, D., Kievit, R. & van der Maas, H. L. J. (2011). Yes, psychologists must change the way they analyze their data: Clarifications for Bem, Utts, & Johnson (2011)

BFrobustplot <- function(
    ts, ns, rs=seq(0, 2, length.out=200), dots=1, plot=TRUE,
    labels=c(), sides="two", nrow=2, xticks=3, forH1=TRUE)
    # compute one-sided p-values from ts and ns
    ps <- pt(ts, df=ns-1, lower.tail = FALSE)   # one-sided test
    # add the dots location to the sequences of r's
    rs <- c(rs, dots)
    res <- data.frame()
    for (r in rs) {
        # first: calculate two-sided BF
        B_e0 <- c()
        for (i in 1:length(ts))
            B_e0 <- c(B_e0, exp(ttest.tstat(t = ts[i], n1 = ns[i], rscale=r)$bf))
        # second: calculate one-sided BF
        B_r0 <- c()
        for (i in 1:length(ts)) {
            if (ts[i] > 0) {
                # correct direction
                B_r0 <- c(B_r0, (2 - 2*ps[i])*B_e0[i])
            } else {
                # wrong direction
                B_r0 <- c(B_r0, (1 - ps[i])*2*B_e0[i])
        res0 <- data.frame(t=ts, n=ns, BF_two=B_e0, BF_one=B_r0, r=r)
        if (length(labels) > 0) {
            res0$labels <- labels
            res0$heading <- factor(1:length(labels), labels=paste0(labels, "\n(t = ", ts, ", df = ", ns-1, ")"), ordered=TRUE)
        } else {
            res0$heading <- factor(1:length(ts), labels=paste0("t = ", ts, ", df = ", ns-1), ordered=TRUE)
        res <- rbind(res, res0)
    # define the measure to be plotted: one- or two-sided?
    res$BF <- res[, paste0("BF_", sides)]

    # Flip BF if requested
    if (forH1 == FALSE) {
        res$BF <- 1/res$BF
    if (plot==TRUE) {
        p1 <- ggplot(res, aes(x=r, y=log(BF))) + geom_line() + facet_wrap(~heading, nrow=nrow) + theme_bw() + ylab("log(BF)")
        p1 <- p1 + geom_hline(yintercept=c(c(-log(c(30, 10, 3)), log(c(3, 10, 30)))), linetype="dotted", color="darkgrey")
        p1 <- p1 + geom_hline(yintercept=log(1), linetype="dashed", color="darkgreen")
        # add the dots
        p1 <- p1 + geom_point(data=res[res$r %in% dots,], aes(x=r, y=log(BF)), color="red", size=2)
        # add annotation
        p1 <- p1 + annotate("text", x=max(rs)*1.8, y=-2.85, label=paste0("Strong~H[", ifelse(forH1==TRUE,0,1), "]"), hjust=1, vjust=.5, size=3, color="black", parse=TRUE)
        p1 <- p1 + annotate("text", x=max(rs)*1.8, y=-1.7 , label=paste0("Moderate~H[", ifelse(forH1==TRUE,0,1), "]"), hjust=1, vjust=.5, size=3, color="black", parse=TRUE)
        p1 <- p1 + annotate("text", x=max(rs)*1.8, y=-.55 , label=paste0("Anectodal~H[", ifelse(forH1==TRUE,0,1), "]"), hjust=1, vjust=.5, size=3, color="black", parse=TRUE)
        p1 <- p1 + annotate("text", x=max(rs)*1.8, y=2.86 , label=paste0("Strong~H[", ifelse(forH1==TRUE,1,0), "]"), hjust=1, vjust=.5, size=3, color="black", parse=TRUE)
        p1 <- p1 + annotate("text", x=max(rs)*1.8, y=1.7  , label=paste0("Moderate~H[", ifelse(forH1==TRUE,1,0), "]"), hjust=1, vjust=.5, size=3, color="black", parse=TRUE)
        p1 <- p1 + annotate("text", x=max(rs)*1.8, y=.55  , label=paste0("Anectodal~H[", ifelse(forH1==TRUE,1,0), "]"), hjust=1, vjust=.5, vjust=.5, size=3, color="black", parse=TRUE)
        # set scale ticks
        p1 <- p1 + scale_y_continuous(breaks=c(c(-log(c(30, 10, 3)), 0, log(c(3, 10, 30)))), labels=c("-log(30)", "-log(10)", "-log(3)", "log(1)", "log(3)", "log(10)", "log(30)"))
        p1 <- p1 + scale_x_continuous(breaks=seq(min(rs), max(rs), length.out=xticks))

    } else {



Rouder, J. N., Speckman, P. L., Sun, D., Morey, R. D., & Iverson, G. (2009). Bayesian t-tests for accepting and rejecting the null hypothesis. Psychonomic Bulletin and Review, 16, 225-237. [for a PDF, see bottom of this page]
Wagenmakers, E.-J., & Morey, R. D. (2013). Simple relation between one-sided and two-sided Bayesian point-null hypothesis tests. Manuscript submitted for publication (website)
Wagenmakers, E.-J., Wetzels, R., Borsboom, D., Kievit, R. & van der Maas, H. L. J. (2011). Yes, psychologists must change the way they analyze their data: Clarifications for Bem, Utts, & Johnson (2011) [PDF]

Comments (4) | Trackback

At what sample size do correlations stabilize?

Maybe you have encountered this situation: you run a large-scale study over the internet, and out of curiosity, you frequently  the correlation between two variables.

My experience with this practice is usually frustrating, as in small sample sizes (and we will see what “small” means in this context) correlations go up and down, change sign, move from “significant” to “non-significant” and back. As an example, see Figure 1 which shows the actual trajectory of a correlation, plotted against sample size (I also posted a video of this evolution).

It is simply the order how participants dropped into the study (i.e., data has not been rearranged). In this case, the correlation started really strong (r = .69) and continuously decayed until it’s final r of .26. The light gray lines show some exemplary bootstrapped alternative trajectories.


Figure 1: The evolution of a correlation.

In this particular case, at least the sign was stable (“There is a positive relationship in the population”, see also “Type-S errors”). Other trajectories in this data set, however, changed signs or their significance status. One correlation even changed from “negative significant” to “positive significant”!

Obviously, the estimate of a correlation stabilizes with increasing sample size. Now I wanted to know: At which sample size exactly can I expect a correlation to be stable? An informal query amongst colleagues revealed estimates between  n = 80 and n = 150.

Together with Marco Perugini, I did a systematic analysis of this question. The results of this simulation study are reported [PDF, 0.39 MB]. In this paper a “corridor of stability” (COS) has been utilized: Deviations from the true value are defined as tolerable as long as they stay within that corridor (see also Figure 1 for a COS of +/- .1). The point of stability (POS) is that sample size from which on a specific trajectory does not leave the COS anymore.

The point of stability depends on the effect size (How strong is the true correlation?), the width of the corridor of stability (How much deviation from the true value am I willing to accept?), and the confidence in the decision (How confident do I want to be that the trajectory does not leave the COS any more?). If you’re interested in the details: read the paper. It’s not long.

The bottom line is: For scenarios in psychology, correlations stabilize when n approaches 250. That means, estimates with n > 250 are not only significant, they also are fairly accurate (see also Kelley & Maxwell, 2003, and Maxwell, Kelley, & Rausch, 2008, for elaborated discussions on parameter accuracy).

Additional analyses (not reported in the publication)

Figure 2 shows the distribution of POS values, depending on the half-width of the COS and on effect size rho. The horizontal axis is cut at n = 300, although several POS were > 300. It can be seen that all distributions have a very long tail. This makes the estimation of the 95th quantile very unstable. Therefore we used a larger number of 100’000 bootstrap replications in each experimental condition in order to get fairly stable estimates for the extreme quantiles.



Figure 2: Distribution of POS values, depending on half-width of COS and effect size rho

Finally, Figure 3 shows the probability that a trajectory leaves the COS with increasing sample size.


Figure 3

The dotted lines mark the confidence levels of 80%, 90%, and 95% which were used in the publications. The n where the curves intersect these dotted lines indicate the values reported in Table 1 of the publication. For example, if the true correlation is .3 (which is already more than the average effect size in psychology) and you collect 100 participants, there’s still a chance of 50% that your correlation will leave the corridor between .21 and .39 (which are the boundaries for w=.1).

What is the conclusion? Significance tests determine the sign of a correlation. This conclusion can be made with much lower sample sizes. However, when we want to make an accurate conclusion about the size of an effect with some confidence (and we do not want to make a “Type M” error), we need much larger samples.

The full R source code for the simulations can be downloaded here.


Maxwell, S. E., Kelley, K., & Rausch, J. R. (2008). Sample size planning for statistical power and accuracy in parameter estimation. Annual Review of Psychology, 59, 537–563. doi:10.1146/annurev.psych.59.103006.093735 [PDF]
Schönbrodt, F. D., & Perugini, M. (2013). At what sample size do correlations stabilize? Journal of Research in Personality, 47, 609-612. doi:10.1016/j.jrp.2013.05.009 [PDF]
Of course, you would never do that in a “sequential sampling” style, where you stop as soon as the correlation reached significance. In contrary, I suppose that you’ve run an a-priori power analysis and collect exactly this amount of participants.
Schönbrodt, F. D., & Perugini, M. (in press). At what sample size do correlations stabilize? Journal of Research in Personality. doi:10.1016/j.jrp.2013.05.009
i.e., rho = .21, w = .1, confidence = 80%
Comments (6) | Trackback

Installation of WRS package (Wilcox’ Robust Statistics)

Update Feb 17, 2014: WRS moved to Github – This installation procedure has been updated and still is valid

Some users had trouble installing the WRS package from R-Forge. Here’s a method that should work automatically and fail-safe:

# first: install dependent packages
install.packages(c("MASS", "akima", "robustbase"))

# second: install suggested packages
install.packages(c("cobs", "robust", "mgcv", "scatterplot3d", "quantreg", "rrcov", "lars", "pwr", "trimcluster", "parallel", "mc2d", "psych", "Rfit"))

# third: install an additional package which provides some C functions
install_github("WRScpp", username="mrxiaohe")

# fourth: install WRS
install_github("WRS", username="nicebread", subdir="pkg")

WRS cannot be hosted on CRAN, as CRAN demands help files for every user-visible function. This has not been done for WRS (yet). For the time being, this somewhat more complicated installation routine has to be used.

Comments (10) | Trackback

The first CREDAM Award for creative data management goes to … the German government!

“If you torture the data long enough, it will confess.”
This aphorism, attributed to Ronald Coase, sometimes has been used in a disrespective manner, as if was wrong to do creative data analysis. This view obviously is misleading. In contrast, we at IRET have a much more positive and humanistic view of data management, and therefore we have made this aphorism to our leading guide in difficult times.

We at IRET have made it to our mission to proliferate and foster creative ways of data analysis. Therefore, we proudly introduce an award in recognition of outstanding data creativity: the CREDAM Award. CREDAM is both an acronym (CREative DAta Management), and a statement: credam (lat.) means “I will believe”, or “I will trust”.

This years CREDAM Award goes to …….. the German government!

A new report on poverty in Germany is going to be published soon. What does the data say?

Overall property in possession of rich households
Overall property in possession of complete lower half

Seems like a pretty clear picture, and in a previous version of the report, the authors concluded (based on this and other data), that “income disparity increased” (see Süddeutsche Zeitung). But that is wrong!! But why is it wrong? Well, that interpretation “does not reflect the opinion of the German government”.

On the pressure of the leader of the minor coalition partner, Philipp Rösler (which currently would be elected by 4% of Germans), this conclusion was re-interpreted. Now, the report comes to the completely opposite conclusion: “income disparity decreases!

As this is a great example of creative data analysis, which liberates us from restrictive and anally retentive “scientific” procedures, we are happy to award the first CREDAM trophy to the German government, especially Phillip Rösler. Congratulations!

(Maybe we should think about adopting this strategy for scientific reports as well. Given highly flexible approaches of data analysis, conclusions should rather be based on a majority vote of all (co-)authors and reviewers, not on empirical evidence.)

No comments | Trackback

Improved evolution of correlations

Update June 2013: A systematic analysis of the topic has been published:
Schönbrodt, F. D., & Perugini, M. (2013). At what sample size do correlations stabilize? Journal of Research in Personality, 47, 609-612. doi:10.1016/j.jrp.2013.05.009

Check also the supplementary website, where you can find the PDF of the paper.

As an update of this post: here’s an improved version of “The evolution of correlations”.

From the original post:
“This is the evolution of a bivariate correlation between two questionnaire scales, “hope of power” and “fear of losing control”. Both scales were administered in an open online study. The video shows how the correlation evolves from r = .69*** (n=20) to r = .26*** (n=271). It does not stabilize until n = 150.

Data has not been rearranged – it is the random order how participants dropped into the study. This had been a rather extreme case of an unstable correlation – other scales in this study were stable right from the beginning. Maybe this video could help as an anecdotal caveat for a careful interpretation of correlations with small n’s (and with ‘small’ I mean n < 100) …”

The right panel now displays the correlation in each step. The horizontal green line is the final correlation that is approached, the curved dotted line shows the marginal correlation that would be significant at that sample size. As the empirical curve always is above this dotted line, it is significantly different from zero in each step.

Evolution of correlations – improved from Felix Schönbrodt on Vimeo.


Here the code that created the movie. It’s not fully self-contained – the function plotReg plots the dual-panel display, dat0, A, and B are parameters passed to this function. You can insert any other function here. The function loops through the rows of a data frame and saves a plot at every step into a subfolder. Finally, the function needs the command line version of ffmpeg, which connects the pictures to a movie.

makeMovie <- function(fname, dat0, A, B, fps=15) {
    # create a new directory for the pictures
    # create the picture sequence
    picName <- paste(fname, "/", fname, "_%03d.jpg", sep="")
    jpeg(picName, width=800, height=450, quality=95)
    for (i in 15:nrow(dat0)) {
      plotReg(A, B, i, keep=15)
    # delete any existing movie file
    # point system to R's working directory
    system(paste("cd ", gsub(" ", "\\ ", getwd(), fixed=TRUE)))
    # show & execute the command line expression for ffmpeg to glue the pictures together
    print(paste(paste0("ffmpeg -r ", fps, " -i ", fname, "/", fname, "_%03d.jpg -sameq -r 25 ",  paste0(fname,".avi"))))
    system(paste(paste0("ffmpeg -r ", fps, " -i ", fname, "/", fname, "_%03d.jpg -sameq -r 25 ",  paste0(fname,".avi"))))
Comments (5) | Trackback

Optimizing parameters for an oscillator – Video

Here’s a video how the modFit function from the FME package optimizes parameters for an oscillation. A Nelder-Mead-optimizer (R function optim) finds the best fitting parameters for an undampened oscillator. Minimum was found after 72 iterations, true parameter eta was -.05:

Evolution of parameters in optimization process from Felix Schönbrodt on Vimeo.

More on estimating parameters of differential equations is coming later on this blog!

Things I’ve learned:

  • ffmpeg does not like pngs. They are internally converted to jpg in a very low quality and I could not find a way to improve this quality. Lesson learned: Export high quality jpgs from your R function
  • Use a standard frame rate for the output file (i.e., 24, 25, or 30 fps)
  • My final ffmpeg command:
    ffmpeg -r 10 -i modFit%03d.jpg -r 25 -b:v 5000K modelFit.avi
    • -r 10: Use 10 pictures / second as input
    • -i modFit%03d.jpg: defines the names of the input files, modFit001.jpg, modFit002.jpg, …
    • -r 25: Set framerate of output file to 25 fps
    • -b:v 5000K: set bitrate of video to a high value
    • modelFit.mp4: video name and encoding type (mp4)
No comments | Trackback

R-package: Wilcox’ Robust Statistics updated (WRS v0.20)

Rand Wilcox constantly updates the functions accompanying his books on robust statistics. Recently, they have been updated to version 20. The functions are available in the WRS package for R – for installation simply type
install.packages("WRS", repos="http://R-Forge.R-project.org", type="source")

Update: For installation of the WRS package, please use only the fail-safe installation procedure described here!

In version 0.20, a number of functions dealing with ANCOVA have been added and some others improved. Unfortunately, only very few help files exist for the functions. I would recommend to check out the source code, as most functions have a comment section roughly explaining the parameters. Alternatively, consult Wilcox’ books for descriptions of the functions.

Wilcox, R. R. (2012). Introduction to Robust Estimation and Hypothesis Testing, 3rd Ed. Academic Press.
Wilcox, R. (2012). Modern Statistics for the Social and Behavioral Sciences: A Practical Introduction. CRC Press.
Wilcox, R. R. (2010). Fundamentals of Modern Statistical Methods: Substantially Improving Power and Accuracy. Springer, 2nd Ed.
Wilcox, R. R. (2009). Basics Statistics: Understanding Conventional Methods and Modern Insights. New York: Oxford.

Comments (5) | Trackback