These days psychology really is exciting, and I do not mean the Förster case …
In May 2014, a special issue full of replication attempts was released – all open access, with all raw data shared! This is great work, powered by the Open Science Framework, and from my point of view a major leap forward.
One of the replication attempts, about whether “Cleanliness Influences Moral Judgments”, generated a lot of heat on social media. This culminated in a blog post by one of the replicators, Brent Donnellan, an independent analysis of the data by Chris Fraley, and finally a long post by Simone Schnall, the first author of the original study, which generated a lot of comments.
Here’s my personal summary of the conclusions and insights I gained from the debate (much of this has already been stated by other commenters, so it is more the wisdom of the crowd than my own insight).
1. Accept uncertainty in scientific findings.
As long as we stick to the p < .05 ritual, 1 in 20 studies will produce false positive results if there is no effect in the population. Depending on the specific statistical test, the degree of violation of its assumptions, and the number of QRPs you apply, the actual Type I error rate can be either lower or higher than the nominal 5% (in practice, I’d bet on “higher”).
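To make this concrete, here is a minimal Python simulation (my own toy example, not taken from any of the papers discussed): two groups are drawn from the *same* population, and one common QRP – peeking at the data and re-testing as participants come in – pushes the false positive rate well above the nominal 5%. The sample sizes and number of peeks are arbitrary choices for illustration.

```python
# Toy sketch: how optional stopping (one common QRP) inflates the nominal
# 5% Type I error rate. All numbers are arbitrary illustration choices.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_sims, alpha = 10_000, .05

def peeking_run(peek_points=(20, 30, 40, 50, 60, 70, 80, 90, 100)):
    """Two groups from the SAME population (true effect = 0).
    Returns True if ANY interim t-test reaches p < alpha."""
    n_max = peek_points[-1]
    a = rng.normal(size=n_max)
    b = rng.normal(size=n_max)
    return any(stats.ttest_ind(a[:n], b[:n]).pvalue < alpha for n in peek_points)

def fixed_run(n=100):
    """Single test at the final sample size, as the .05 ritual assumes."""
    return stats.ttest_ind(rng.normal(size=n), rng.normal(size=n)).pvalue < alpha

print("Fixed-n false positive rate:          ", np.mean([fixed_run() for _ in range(n_sims)]))
print("Optional-stopping false positive rate:", np.mean([peeking_run() for _ in range(n_sims)]))
# The fixed design stays near .05; with repeated peeking the rate is clearly
# higher (typically well above .10 in this setup).
```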
We know how to fix this – e.g., Bayesian statistics (Wagenmakers, Wetzels, Borsboom, & van der Maas, 2011), “revised standards” (i.e., using p < .005 as the magic threshold; Johnson, 2013), or a focus on accuracy in parameter estimation instead of NHST (Maxwell, Kelley, & Rausch, 2008; Schönbrodt & Perugini, 2013).
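As an illustration of the last idea – accuracy in parameter estimation rather than NHST – here is a small hedged sketch: it asks how many participants are needed so that the confidence interval for a correlation is acceptably narrow. The target width and the assumed population correlation are made-up numbers for illustration, not recommendations from the cited papers.

```python
# Sketch of "accuracy in parameter estimation": pick n so that the CI for a
# correlation is narrow, instead of asking whether p crosses a threshold.
import numpy as np
from scipy import stats

def n_for_correlation_precision(target_half_width=.10, rho=.30, conf=.95):
    """Smallest n whose expected CI for r (via the Fisher-z approximation)
    has roughly the requested half-width around an assumed correlation rho."""
    z_crit = stats.norm.ppf(1 - (1 - conf) / 2)
    z_r = np.arctanh(rho)
    for n in range(10, 100_000):
        se = 1 / np.sqrt(n - 3)
        lo, hi = np.tanh(z_r - z_crit * se), np.tanh(z_r + z_crit * se)
        if (hi - lo) / 2 <= target_half_width:
            return n
    return None

print(n_for_correlation_precision(.10, rho=.30))  # roughly n ≈ 320 for a ±.10 interval
```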
A single study is hardly ever conclusive. Hence, one way to deal with uncertainty is to meta-analyze and combine evidence. Let’s make a collaborative effort to increase knowledge, not (only) personal fame.
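Just to illustrate the logic of combining evidence, here is a toy fixed-effect meta-analysis; the three effect sizes and standard errors are invented for the example.

```python
# Toy fixed-effect, inverse-variance meta-analysis of standardized effects.
# The (d, SE) pairs are invented for illustration only.
import numpy as np
from scipy import stats

effects = np.array([0.45, 0.10, 0.05])   # hypothetical effect sizes (Cohen's d)
ses     = np.array([0.20, 0.12, 0.10])   # their standard errors

weights   = 1 / ses**2                   # more precise studies count more
d_pooled  = np.sum(weights * effects) / np.sum(weights)
se_pooled = np.sqrt(1 / np.sum(weights))
ci = d_pooled + np.array([-1, 1]) * stats.norm.ppf(.975) * se_pooled

print(f"Pooled d = {d_pooled:.2f}, 95% CI [{ci[0]:.2f}, {ci[1]:.2f}]")
# A single flashy study can look compelling; the pooled estimate tells a calmer story.
```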
2. A failed replication does not imply fraud, QRP, or lack of competence of the original author.
This has been emphasized several times, and, as far as I have read the papers of the special issue, no replicator made this implication. On the contrary, the discussions were generally worded very cautiously. Look at the original literature of the special issue, and you will hardly (if at all) find evidence of “replication bullying”.
The nasty false-positive monster is lurking around for every one of us. And when, someday, one of my studies cannot be replicated, then, of course, I’ll be pissed off at first, but then …
It does *not* imply that I am a bad researcher.
It does *not* imply that the replicator is a “bully”.
But it means that we increased knowledge a little bit.
3. Transparency fosters progress.
Without the open data on the OSF, we couldn’t even discuss ceiling effects [1][2][3][4]!
See also my related post: Reanalyzing the Schnall/Johnson “cleanliness” data sets: New insights from Bayesian and robust approaches
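To see why the ceiling issue matters at all, here is a minimal simulation sketch (my own illustration, not the actual Schnall or replication data): when many ratings pile up at the top of a bounded scale, the observable group difference is compressed relative to the latent one.

```python
# Toy ceiling-effect demo: latent scores clipped to a 1-9 rating scale.
# All numbers are invented for illustration.
import numpy as np

rng = np.random.default_rng(1)
n = 10_000                                   # large n so the means are stable
control   = rng.normal(8.0, 1.5, n)          # hypothetical latent ratings
treatment = rng.normal(8.8, 1.5, n)          # true latent difference: 0.8 points

def to_scale(x, lo=1.0, hi=9.0):
    """Force latent scores onto a bounded 1-9 scale (ceiling at 9)."""
    return np.clip(x, lo, hi)

latent_diff   = treatment.mean() - control.mean()
observed_diff = to_scale(treatment).mean() - to_scale(control).mean()
print(f"Latent difference:   {latent_diff:.2f}")
print(f"Observed difference: {observed_diff:.2f}  (compressed by the ceiling)")
print(f"Share of treatment scores pushed to the ceiling: {np.mean(treatment >= 9):.0%}")
```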
4. Do I have to ask Galileo before I throw a stone from a tower? No.
Or, as Dale Barr (@dalejbarr) tweeted: “If you publish a study demonstrating an effect, you don’t OWN that effect.”
As several others argued, it usually is very helpful, sensible, and fruitful to include the original authors in the replication effort. In an ideal world we make a collaborative effort to increase knowledge; for a very good example of a “best practice” adversarial collaboration, see the paper and procedure by Dora Matzke and colleagues (2013). Here, proponents and skeptics of an effect worked together, with an impartial referee, on a confirmatory study.
To summarize, involvement of original authors is nice and often fruitful, but definitely not necessary.
5. Celebrate the debate.
Remember the old times when “debate” was carried out with a time lag of at least six months (until your comment was printed in the journal, if you were lucky) – and usually didn’t happen at all?
I *love* reading and participating in the active debate. This feels much more like science than what we used to do. What is happening here is not a shitstorm – it’s a debate! And the majority of comments really are directed at the issue and have a calm, constructive tone.
So, use the new media for post-publication review, #HIBAR (Had I Been A Reviewer), and real exchange. The rules of the game seem to be changing, slowly.
I wonder why almost all of the debate is about the cleanliness study. There is also a (triple) non-replication of the moral licensing effect, which as an effect is far more substantial for our field. It has even produced a spin-off: research on how we self-license in health matters – for example, taking vitamin pills and then feeling licensed to eat something “bad”. If this effect goes down, large parts of psychology go with it.
https://plus.google.com/101046916407340625977/posts/5xXwpVQcDTg
I don’t know whether the recommendations after “We know how to fix this” are meant sarcastically. With the possible exception of using confidence intervals – so long as they are not interpreted as dichotomously as in abuses of significance testing, and are supplemented by a series of intervals at different levels – I don’t see how the other approaches are fixes, especially for QRPs.