These days psychology really is exciting, and I do not mean the Förster case …
Last week a special issue full of replication attempts was released – all open access, all raw data released! This is great work, powered by the Open Science Framework, and from my point of view a major leap forward.
One of the replication attempts, about whether “Cleanliness Influence Moral Judgments”, generated a lot of heat on social media. This culminated in a blog post by one of the replicators, Brent Donnellan, an independent analysis of the data by Chris Fraley, and finally a long post by Simone Schnall, the first author of the original study, which generated a lot of comments.
Here’s my personal summary, conclusions, and insights I gained from the debate (Much of this has been stated by several other commenters, so this is more the wisdom of the crowd than my own insights).
1. Accept uncertainty in scientific findings.
As long as we stick to the p < .05 ritual, at least 1 in 20 studies will produce false positive results. No need to invoke ceiling effects, QRP, or other bad things: 1 in 20 valid studies which were perfectly planned by very skilled researchers will produce false positives!
Update 2014-5-27: As @johnmyleswhite and @hardsci noticed on Twitter, this statement was wrong (or at least: only correct under specific conditions). They really got me. (Reminder to myself: think twice before you ever write anything about p values again!) Here’s an updated version (maybe my friends on Twitter want to check):
As long as we stick to the p < .05 ritual, 1 in 20 studies will produce false positive results if there is no effect in the population. Depending on the specific statistical test, the degree of violation of the assumptions of this test, and the amount of QRPs you apply, the actual Type I error rate can be both lower and higher than the nominal 5% (in practice, I’d bet on “higher”). If, in contrast, there is an effect in the population, there cannot be false positives, and low p values correctly indicate evidence against the H0. To summarize, my original statement only applies if you assume that in psychology we only investigate null effects. Although I’m quite sceptical about effect sizes in psychology, this might be too harsh ;-).
(Well, that sounds less catchy than my original wording. Damn.)
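The updated statement can be checked with a quick simulation. This is only a sketch under simplifying assumptions that are mine, not taken from any of the studies discussed here: two groups of 50 with normally distributed scores and a known SD of 1 (so a plain z-test is exact), an effect of d = 0.5 for the "true effect" scenario, and optional stopping (peeking at the data after every 10 participants per group) as a stand-in for one QRP. All function names are made up for this illustration.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used to turn z into a p value


def z_test_p(group_a, group_b, sd=1.0):
    """Two-sided p value of a two-sample z-test with known population SD."""
    n_a, n_b = len(group_a), len(group_b)
    se = sd * (1 / n_a + 1 / n_b) ** 0.5
    z = (sum(group_a) / n_a - sum(group_b) / n_b) / se
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


def significant_rate(effect, n_per_group, n_studies, rng,
                     peek_every=None, alpha=0.05):
    """Share of simulated studies that reach p < alpha.

    effect     : true standardized mean difference (0.0 means H0 is true)
    peek_every : if set, test after every `peek_every` participants per
                 group and count the study as significant if ANY look
                 gives p < alpha (optional stopping, a QRP)
    """
    hits = 0
    for _ in range(n_studies):
        a = [rng.gauss(effect, 1.0) for _ in range(n_per_group)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        if peek_every is None:
            looks = [n_per_group]  # one test at the planned sample size
        else:
            looks = range(peek_every, n_per_group + 1, peek_every)
        if any(z_test_p(a[:n], b[:n]) < alpha for n in looks):
            hits += 1
    return hits / n_studies


rng = random.Random(2014)  # fixed seed so the run is reproducible
null_rate = significant_rate(0.0, 50, 5000, rng)                   # H0 true
power = significant_rate(0.5, 50, 5000, rng)                       # real effect
peeking_rate = significant_rate(0.0, 50, 5000, rng, peek_every=10)  # H0 + QRP

print(f"H0 true, one planned test:    {null_rate:.3f}")
print(f"d = 0.5, one planned test:    {power:.3f}")
print(f"H0 true, peeking every 10:    {peeking_rate:.3f}")
```

With no effect in the population and a single planned test, the share of significant results should land close to the nominal 5%. With a real effect, the significant results are correct detections rather than false positives, and their share is simply the power of the design. With peeking, the false positive rate under H0 climbs clearly above 5%, which is why I'd bet on "higher" in practice.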
We know how to fix this – e.g., Bayesian statistics (Wagenmakers, Wetzels, Borsboom, & van der Maas, 2011), “revised standards” (i.e., use p < .005 as the magic threshold; Johnson, 2013), or a focus on accuracy in parameter estimation instead of NHST (Maxwell, Kelley, & Rausch, 2008; Schönbrodt & Perugini, 2013).
A single study is hardly ever conclusive. Hence, a way to deal with uncertainty is to meta-analyze and to combine evidence. Let’s make a collaborative effort to increase knowledge, not (only) personal fame.
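To make "combine evidence" a bit more concrete, here is a minimal sketch of the simplest way to pool studies: a fixed-effect meta-analysis with inverse-variance weights. The three effect sizes and standard errors are invented for illustration and do not come from any study mentioned in this post; the function name is hypothetical as well. A real meta-analysis would also consider heterogeneity (random-effects models) and publication bias.

```python
from math import sqrt

# (effect size d, standard error) per study.
# INVENTED numbers for illustration only, not data from any real study.
studies = [(0.60, 0.30), (0.05, 0.10), (0.10, 0.15)]


def pool_fixed_effect(studies):
    """Inverse-variance weighted mean effect and its standard error."""
    weights = [1 / se ** 2 for _, se in studies]  # precise studies count more
    pooled = sum(w * d for w, (d, _) in zip(weights, studies)) / sum(weights)
    pooled_se = sqrt(1 / sum(weights))
    return pooled, pooled_se


pooled, pooled_se = pool_fixed_effect(studies)
ci_low = pooled - 1.96 * pooled_se   # approximate 95% confidence interval
ci_high = pooled + 1.96 * pooled_se
print(f"pooled d = {pooled:.3f}, 95% CI [{ci_low:.3f}, {ci_high:.3f}]")
```

The point of the weighting is that a small, noisy study with a large effect gets pulled toward what the larger, more precise studies say, and the pooled standard error is smaller than that of any single study.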
2. A failed replication does not imply fraud, QRP, or lack of competence of the original author.
This has been emphasized several times, and, as far as I have read the papers of the special issue, no replicator made this implication. On the contrary, the discussions were generally worded very cautiously. Look in the original literature of the special issue, and you will hardly (or not at all) find evidence for “replication bullying”.
The nasty false-positive monster is lurking for every one of us. And when, someday, one of my studies cannot be replicated, then of course I’ll be pissed off at first, but then …
It does *not* imply that I am a bad researcher.
It does *not* imply that the replicator is a “bully”.
But it means that we increased knowledge a little bit.
3. Transparency fosters progress.
Without open data on the OSF, we couldn’t even discuss ceiling effects!
4. Do I have to ask Galileo before I throw a stone from a tower? No.
Or, as Dale Barr (@dalejbarr) tweeted: “If you publish a study demonstrating an effect, you don’t OWN that effect.”
As several others argued, it usually is very helpful, sensible, and fruitful to include the original authors in the replication effort. In an ideal world we make a collaborative effort to increase knowledge. For a very good example of a “best practice adversarial collaboration”, see the paper and procedure by Dora Matzke and colleagues (2013). Here, proponents and skeptics of an effect worked, together with an impartial referee, on a confirmatory study.
To summarize, involvement of original authors is nice and often fruitful, but definitely not necessary.
5. Celebrate the debate.
Remember the old times when “debate” was carried out with a time lag of at least 6 months (until your comment was printed in the journal, if you were lucky), and usually didn’t happen at all?
I *love* reading and participating in the active debate. This feels so much more like science than what we used to do. What happens here is not a shitstorm – it’s a debate! And the majority of comments really are directed at the issue and have a calm and constructive tone.
So, use the new media for post-publication review, #HIBAR (Had I Been A Reviewer), and real exchange. The rules of the game seem to change, slowly.