These days psychology really is exciting, and I do not mean the Förster case …
In May 2014 a special issue full of replication attempts was released – all open access, all raw data released! This is great work, powered by the Open Science Framework, and from my point of view a major leap forward.
One of the replication attempts, about whether “Cleanliness Influence Moral Judgments”, generated a lot of heat on social media. This culminated in a blog post by one of the replicators, Brent Donnellan, an independent analysis of the data by Chris Fraley, and finally a long post by Simone Schnall, the first author of the original study, which generated a lot of comments.
Here’s my personal summary, conclusions, and insights I gained from the debate (Much of this has been stated by several other commenters, so this is more the wisdom of the crowd than my own insights).
As long as we stick to the p < .05 ritual, 1 in 20 studies will produce false positive results if there is no effect in the population. Depending on the specific statistical test, the degree of violation of the assumptions of this test, and the amount of QRPs you apply, the actual Type I error rate can be both lower and higher than the nominal 5% (in practice, I’d bet on “higher”).
We know how to fix this – e.g., Bayesian statistics (Wagenmakers, Wetzels, Borsboom, & van der Maas, 2011), “revised standards” (i.e., use p < .005 as the magic threshold; Johnson, 2013), or a focus on accuracy in parameter estimation instead of NHST (Maxwell, Kelley, & Rausch, 2008; Schönbrodt & Perugini, 2013).
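The 1-in-20 figure, and the way a QRP can push the real rate above it, can be checked with a small simulation. The sketch below is only an illustration under stated assumptions: it uses a two-sample z test with known SD = 1 (a simplification, not the test used in any of the studies discussed), and it picks optional stopping ("test, add participants, test again") as one example of a QRP. All function names and sample sizes are made up for this example.

```python
import random
from statistics import NormalDist, fmean

STD_NORMAL = NormalDist()


def z_test_p(a, b):
    """Two-sided p value of a two-sample z test, assuming a known SD of 1."""
    se = (1 / len(a) + 1 / len(b)) ** 0.5
    z = (fmean(a) - fmean(b)) / se
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


def draw(n, rng):
    """n observations from a standard normal population (no true effect)."""
    return [rng.gauss(0, 1) for _ in range(n)]


def fixed_n_study(rng, n=20):
    """A single test at a sample size that was fixed in advance."""
    return z_test_p(draw(n, rng), draw(n, rng)) < 0.05


def peeking_study(rng, n_start=20, step=10, n_max=50):
    """Optional stopping: test, and if not significant, add data and retest."""
    a, b = draw(n_start, rng), draw(n_start, rng)
    while True:
        if z_test_p(a, b) < 0.05:
            return True
        if len(a) >= n_max:
            return False
        a += draw(step, rng)
        b += draw(step, rng)


def false_positive_rate(study, n_sims=4000, seed=1):
    """Share of simulated null studies that end with p < .05."""
    rng = random.Random(seed)
    return sum(study(rng) for _ in range(n_sims)) / n_sims


rate_fixed = false_positive_rate(fixed_n_study)
rate_peeking = false_positive_rate(peeking_study)
print(f"fixed n: {rate_fixed:.3f}, optional stopping: {rate_peeking:.3f}")
```

With a fixed n the rate should land near the nominal 5%, while the peeking procedure should come out clearly higher, even though there is no effect in the simulated population.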
A single study is hardly ever conclusive. Hence, a way to deal with uncertainty is to meta-analyze and to combine evidence. Let’s make a collaborative effort to increase knowledge, not (only) personal fame.
This has been emphasized several times, and, as far as I can tell from reading the papers of the special issue, no replicator made this implication. On the contrary, the discussions were generally worded very cautiously. Look in the original literature of the special issue, and you will hardly (or not at all) find evidence for “replication bullying”.
The nasty false-positive monster is lurking around for every one of us. And when, someday, one of my studies cannot be replicated, then, of course, I’ll be pissed off at first, but then …
It does *not* imply that I am a bad researcher.
It does *not* imply that the replicator is a “bully”.
But it means that we increased knowledge a little bit.
Or, as Dale Barr (@dalejbarr) tweeted: “If you publish a study demonstrating an effect, you don’t OWN that effect.”
As several others argued, it usually is very helpful, sensible, and fruitful to include the original authors in the replication effort. In an ideal world we make a collaborative effort to increase knowledge. For a very good example of a “best practice adversarial collaboration”, see the paper and procedure by Dora Matzke and colleagues (2013). Here, proponents and skeptics of an effect worked, together with an impartial referee, on a confirmatory study.
To summarize, involvement of original authors is nice and often fruitful, but definitely not necessary.
Remember the old times when “debate” was carried out with a time lag of at least 6 months (until your comment was printed in the journal, if you were lucky), and usually didn’t happen at all?
I *love* reading and participating in the active debate. This feels so much more like science than what we used to do. What happens here is not a shitstorm – it’s a debate! And the majority of comments really are directed at the issue and have a calm and constructive tone.
So, use the new media for post-publication-review, #HIBAR (Had I Been A Reviewer), and real exchange. The rules of the game seem to change, slowly.
(R.Kreckel, Habilitation vs. Tenure Track, Forschung und Lehre 01/2012, p.12)
If you support this, sign online (you can also do so anonymously) and forward the petition to your networks.
Anyone can sign, including your parents, friends, etc.!
In a recent post on the DataColada blog titled “We cannot afford to study effect size in the lab”, Uri Simonsohn argued the following: If we want accurate effect size (ES) estimates, we need large sample sizes (he suggests four-digit n’s). As this is hardly possible in the lab, we have to use other research tools, like online studies, archival data, or more within-subject designs.
While I agree with the main point of the post, I’d like to discuss and extend some of the conclusions. As the DataColada blog has no comments section, I’ll comment in my own blog …
“Does it make sense to push for effect size reporting when we run small samples? I don’t see how.”
“Properly powered studies teach you almost nothing about effect size.”
It is true that the ES estimate with n=20 will be utterly imprecise, and reporting this ES estimate could mislead readers who give too much importance to the point estimate and do not take the huge CI into account (maybe because it has not been reported).
Still, and here’s my disagreement, I’d argue that small-n studies should also report the point estimate (along with the CI), as a meta-analysis of many imprecise small-n estimates can still give an unbiased and precise cumulative estimate. This, of course, would require that all estimates are reported, not only the significant ones (van Assen, van Aert, Nuijten, & Wicherts, 2014).
Here’s an example – we simulate a population with a true Cohen’s d of 0.8. Then we look at three scenarios: a) a single small-n study with 20 participants, b) 400 participants, and c) 20 * n=20 studies which are meta-analyzed.
This single small-n study has g = 1.337 [0.64, 2.034] (g is an unbiased estimate of d). The point estimate is far off, but the true ES is in the CI.
This single large-n study has g = 0.76 [0.616, 0.903]. It is close to the true ES and has a quite narrow CI (i.e., high precision).
Now, here’s a meta-analysis of 20 studies with n = 20 each:
The meta-analysis reveals g = 0.775 [0.630; 0.920]. This has nearly exactly the same CI width as the n=400 study, and a slightly different ES estimate.
Here’s a plot of the results:
To summarize, a single small-n study teaches hardly anything about effect sizes, but many small-n studies do. Meta-analyses are only possible, however, if the ES is reported.
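The three scenarios can be re-created in a few lines. The following is a minimal sketch, not the original analysis: it assumes that the sample sizes are per group (which is consistent with the CI widths reported above), uses the usual large-sample variance approximation for Hedges' g, and pools the small studies with a simple fixed-effect (inverse-variance) model. The seed and helper names are arbitrary, so the numbers will differ from the ones quoted in the text.

```python
import random
from statistics import NormalDist, fmean, variance

Z95 = NormalDist().inv_cdf(0.975)
TRUE_D = 0.8


def hedges_g(a, b):
    """Return Hedges' g and its approximate sampling variance."""
    n1, n2 = len(a), len(b)
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    d = (fmean(a) - fmean(b)) / pooled_var ** 0.5
    j = 1 - 3 / (4 * (n1 + n2 - 2) - 1)  # small-sample bias correction
    var_d = (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))
    return j * d, j ** 2 * var_d


def run_study(n_per_group, rng):
    """Simulate one two-group study with a true standardized difference TRUE_D."""
    treatment = [rng.gauss(TRUE_D, 1) for _ in range(n_per_group)]
    control = [rng.gauss(0, 1) for _ in range(n_per_group)]
    return hedges_g(treatment, control)


def ci(estimate, var):
    """95% confidence interval based on a normal approximation."""
    half = Z95 * var ** 0.5
    return estimate - half, estimate + half


rng = random.Random(2014)

g_small, v_small = run_study(20, rng)    # a) one small study
g_large, v_large = run_study(400, rng)   # b) one large study

# c) twenty small studies, pooled with inverse-variance weights
studies = [run_study(20, rng) for _ in range(20)]
weights = [1 / v for _, v in studies]
g_meta = sum(w * g for w, (g, _) in zip(weights, studies)) / sum(weights)
v_meta = 1 / sum(weights)

for label, g, v in [("single n=20", g_small, v_small),
                    ("single n=400", g_large, v_large),
                    ("meta 20 x n=20", g_meta, v_meta)]:
    lo, hi = ci(g, v)
    print(f"{label:>15}: g = {g:.3f} [{lo:.3f}; {hi:.3f}]")
```

The pooled estimate should show a CI of roughly the same width as the single large study, because both rest on the same total number of participants.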
“But just how big an n do we need to study effect size? I am about to show that the answer has four-digits.”
In Uri’s post (and the linked R code) the precision issue is approached from the power side – if you increase power, you also increase precision. But you can also directly compute the necessary sample size for a desired precision. This is called the AIPE-framework (“accuracy in parameter estimation”) made popular by Ken Kelley, Scott Maxwell, and Joseph Rausch (Kelley & Rausch, 2006; Kelley & Maxwell, 2003; Maxwell, Kelley, & Rausch, 2008). The necessary functions are implemented in the MBESS package for R. If you want a CI width of .10 around an expected ES of 0.5, you need 3170 participants:
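As a rough cross-check of that number, the required sample size can be approximated without MBESS. This sketch is not the MBESS algorithm (which works with noncentral t distributions); it simply solves the large-sample variance formula for a standardized mean difference, Var(d) ≈ 2/n + d²/(4n) with n participants in each of two groups, for the n that yields the desired full CI width. Reading the result as participants per group is an assumption of this sketch, and the function name is made up.

```python
from math import ceil
from statistics import NormalDist


def n_per_group_for_ci_width(d, width, conf_level=0.95):
    """Approximate n per group so that the CI around d has the given full width."""
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    half_width = width / 2
    # Solve z * sqrt((2 + d^2 / 4) / n) = half_width for n.
    return ceil((2 + d ** 2 / 4) * (z / half_width) ** 2)


n_needed = n_per_group_for_ci_width(d=0.5, width=0.10)
print(n_needed)  # 3170
```

For d = 0.5 and a full width of .10, the approximation arrives at the same 3170 quoted above. For small samples the approximation and the exact method can be expected to diverge somewhat.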
The same point has been made from a Bayesian point of view in a blog post from John Kruschke: notice the sample size on the x-axis.
In our own analysis on how correlations evolve with increasing sample size (Schönbrodt & Perugini, 2013; see also blog post), we conclude that for typical effect sizes in psychology, you need 250 participants to get sufficiently accurate and stable estimates of the ES:
It’s certainly hard to give general guidelines on how much precision is sensible, but here are the considerations on which we based our stability analyses. We used a CI-like “corridor of stability” (see Figure) with a half-width of w = .10, w = .15, and w = .20 (everything in the “correlation-metric”).
w = .20 was chosen for the following reason: The average reported effect size in psychology is around r = .21 (Richard, Bond, & Stokes-Zoota, 2003). For this effect size, an accuracy of w = .20 would result in a CI that is “just significant” and does not include a reversal of the sign of the effect. Hence, with the typical effect sizes we are dealing with in psychology, a CI with a half-width > .20 would not make much sense.
w = .10 was chosen as it corresponds to a “small effect size” à la Cohen. This is arbitrary, of course, but at least some anchor. And w = .15 is just in between.
Using these numbers and an ES estimate of, say, r = .29, a just tolerable precision would be [.10; .46] (w = .20), a tolerable precision [.15; .42] (w = .15) and a moderate precision [.20; .38] (w = .10).
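Note that these intervals are not symmetric around .29. They are what one gets when the half-width w is applied on the Fisher-z scale and the bounds are transformed back to correlations. That reading is inferred from the numbers themselves, not stated in the text, so treat the following as a sketch under that assumption (the function name is made up):

```python
from math import atanh, tanh


def corridor(r, w):
    """Bounds of a corridor with half-width w on the Fisher-z scale around r."""
    z = atanh(r)  # Fisher's r-to-z transformation
    return tanh(z - w), tanh(z + w)


intervals = {w: corridor(0.29, w) for w in (0.20, 0.15, 0.10)}
for w, (lo, hi) in intervals.items():
    print(f"w = {w:.2f}: [{lo:.2f}; {hi:.2f}]")
```

Rounded to two decimals, this gives the three intervals listed above.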
If we use this lower threshold of “just tolerable precision”, we would need around 200-250 participants per group for a two-sample group difference. While I am not sure whether we really need four-digit samples for typical scenarios, I am sure that we need at least three-digit samples when we want to talk about “precision”.
Regardless of the specific level of precision and method used, however, one thing is clear: Accuracy does not come cheap. We need far fewer participants for a hypothesis test (Is there a non-zero effect or not?) than for an accurate estimate.
Unfortunately, increasing sample size brings diminishing returns on precision: As you can see in the dotted lines in Figure 2, the CI levels off, and you need disproportionately large sample sizes to squeeze out the last tiny percentages of a shrinking CI. If you follow Pareto’s principle, you should stop at some point. In scientific progress, accuracy will probably be achieved by meta-analyzing several studies (which also gives you an estimate of the ES variability and possible moderators) rather than by doing one mega-study.
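The diminishing returns follow from the standard error shrinking only with the square root of n. As a small illustration (a sketch using the standard Fisher-z confidence interval for a correlation, with r = .21 taken from the average effect size mentioned above and arbitrarily chosen sample sizes):

```python
from math import atanh, tanh
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.975)


def ci_width_r(r, n):
    """Full width of the 95% CI for a correlation r at sample size n."""
    z = atanh(r)
    half = Z95 / (n - 3) ** 0.5  # standard error of Fisher's z is 1/sqrt(n - 3)
    return tanh(z + half) - tanh(z - half)


sample_sizes = [50, 100, 250, 500, 1000, 2000]
widths = [ci_width_r(0.21, n) for n in sample_sizes]
for n, w in zip(sample_sizes, widths):
    print(f"n = {n:>4}: CI width = {w:.3f}")
```

Going from 50 to 100 participants shrinks the interval far more than going from 500 to 1000, although the second step costs ten times as many additional participants.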
Hence: Always report your ES estimate, even in small-n studies.
Kelley, K., & Maxwell, S. E. (2003). Sample size for multiple regression: Obtaining regression coefficients that are accurate, not simply significant. Psychological Methods, 8, 305–321. doi:10.1037/1082-989X.8.3.305
Kelley, K., & Rausch, J. R. (2006). Sample size planning for the standardized mean difference: Accuracy in parameter estimation via narrow confidence intervals. Psychological Methods, 11, 363–385. doi:10.1037/1082-989X.11.4.363
Maxwell, S. E., Kelley, K., & Rausch, J. R. (2008). Sample size planning for statistical power and accuracy in parameter estimation. Annual Review of Psychology, 59, 537–563. doi:10.1146/annurev.psych.59.103006.093735
van Assen, M. A. L. M., van Aert, R. C. M., Nuijten, M. B., & Wicherts, J. M. (2014). Why publishing everything is more effective than selective publishing of statistically significant results. PLoS ONE, 9, e84896. doi:10.1371/journal.pone.0084896