June 1, 2017

It is well known that publication bias and *p*-hacking
inflate effect size estimates from meta-analyses. In recent years,
methodologists have developed an ever-growing menu of statistical
approaches to correct for such overestimation. However, until now it was
unclear under which conditions they perform well, and what to do if they
disagree. In a project born out of a Twitter discussion, Evan Carter, Joe Hilgard,
Will Gervais and I ran a large simulation comparing the
performance of naive random effects meta-analysis (RE), trim-and-fill
(TF), *p*-curve, *p*-uniform, PET, PEESE, PET-PEESE, and the three-parameter selection model (3PSM).

Previous investigations typically looked only at publication bias *or*
questionable research practices (QRPs), but not both; used
non-representative study-level sample sizes; or compared only a few
bias-correcting techniques rather than all of them. Our goal was to
simulate a research literature that is as realistic as possible for
psychology. In order to simulate several research environments, we fully
crossed five experimental factors: (1) the true underlying
effect, δ (0, 0.2, 0.5, 0.8); (2) between-study
heterogeneity, τ (0, 0.2, 0.4); (3) the number of studies in the
meta-analytic sample, *k* (10, 30, 60, 100); (4) the percentage
of studies in the meta-analytic sample produced under publication bias
(0%, 60%, 90%); and (5) the use of QRPs in the literature that produced
the meta-analytic sample (none, medium, high).
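To get a feel for the size of this design, the fully crossed grid can be enumerated in a few lines. This is only a sketch of the design as described above, not the project's simulation code; the variable names are made up for illustration.

```python
from itertools import product

# Factor levels exactly as listed in the text; names are illustrative.
delta = [0, 0.2, 0.5, 0.8]        # true underlying effect
tau = [0, 0.2, 0.4]               # between-study heterogeneity
k = [10, 30, 60, 100]             # number of studies per meta-analysis
pub_bias = [0.0, 0.6, 0.9]        # share of studies produced under publication bias
qrp = ["none", "medium", "high"]  # QRP environment

# Fully crossing the factors means taking every combination of levels.
conditions = list(product(delta, tau, k, pub_bias, qrp))
print(len(conditions))  # 4 * 3 * 4 * 3 * 3 = 432
```

Each tuple in `conditions` describes one simulated research environment.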

This blog post summarizes some insights from our study, internally called “meta-showdown”. Check out the preprint and the interactive app metaExplorer. The fully reproducible and reusable simulation code is on Github, and more information is on OSF.

In this blog post, I will highlight some lessons that we learned during the project, primarily focusing on **what not to do when performing a meta-analysis**.

**Constraints on Generality
disclaimer: These recommendations apply to typical sample sizes, effect
sizes, and heterogeneities in psychology; other research literatures
might have different settings and therefore a different performance of
the methods. Furthermore, the recommendations rely on the modeling
assumptions of our simulation. We went a long way to make them as
realistic as possible, but other assumptions could lead to other
results.**

If studies have no publication bias, nothing can beat plain old
random effects meta-analysis: it has the highest power, the least bias,
and the highest efficiency compared to all other methods. Even in the
presence of some (though not extreme) QRPs, naive RE performs better
than all other methods. When can we expect no publication bias? If (and,
in my opinion, *only if*) we meta-analyze a set of registered reports.

But.

In *any* other setting except registered reports, a
consequential amount of publication bias must be expected. In the field
of psychology/psychiatry, more than 90% of all published hypothesis
tests are significant (Fanelli, 2011) despite the average power being
estimated as around 35% (Bakker, van Dijk, & Wicherts, 2012) – the
gap points towards a huge publication bias. In the presence of
publication bias, naive random effects meta-analysis and trim-and-fill
have false positive rates approaching 100%:

More thoughts about trim-and-fill’s inability to recover δ=0 are in Joe Hilgard’s blog post. (Note: this insight is not really new and has been shown multiple times before, for example by Moreno et al., 2009, and Simonsohn, Nelson, and Simmons, 2014).
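The inflation of false positives under publication bias can be reproduced with a toy simulation. The sketch below is not the meta-showdown code and rests on several simplifying assumptions: all studies have the same sample size, observed effects follow a normal approximation with variance 2/n, the true effect is zero with no heterogeneity, significant positive results are always published, and all other results are published with probability `p_nonsig` (a hypothetical parameter, not the publication bias percentage used in our design). The random effects estimator is DerSimonian-Laird, chosen here for simplicity.

```python
import math
import random
from statistics import NormalDist

def random_effects(effects, variances):
    """DerSimonian-Laird random effects pooling; returns (estimate, z)."""
    w = [1 / v for v in variances]
    sw = sum(w)
    fixed = sum(wi * y for wi, y in zip(w, effects)) / sw
    q = sum(wi * (y - fixed) ** 2 for wi, y in zip(w, effects))
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - (len(effects) - 1)) / c)   # truncated at zero
    w_re = [1 / (v + tau2) for v in variances]
    est = sum(wi * y for wi, y in zip(w_re, effects)) / sum(w_re)
    return est, est * math.sqrt(sum(w_re))          # z = est / SE(est)

def false_positive_rate(p_nonsig, k=60, n_per_group=20, reps=500, seed=1):
    """Share of meta-analyses with a significant positive pooled effect at delta = 0."""
    rng = random.Random(seed)
    var = 2 / n_per_group        # assumed sampling variance of d near zero
    se = math.sqrt(var)
    crit = NormalDist().inv_cdf(0.975)
    hits = 0
    for _ in range(reps):
        published = []
        while len(published) < k:
            d = rng.gauss(0.0, se)          # the true effect is zero
            significant = d / se > crit
            if significant or rng.random() < p_nonsig:
                published.append(d)
        _, z = random_effects(published, [var] * k)
        hits += z > crit
    return hits / reps

print(false_positive_rate(p_nonsig=1.0))    # no publication bias
print(false_positive_rate(p_nonsig=0.05))   # strong publication bias
```

Under these assumptions the first rate should stay near the nominal level, while the second should be far above it, even though no study in the simulated literature has a true effect.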

**Our recommendation: Never trust
meta-analyses based on naive random effects and trim-and-fill, unless
you can rule out publication bias. Results from previously published
meta-analyses based on these methods should be treated with a lot of
skepticism.**

As a default, heterogeneity should always be expected. Even under
the most controlled conditions, where many labs ran the same
computer-administered experiment, a large proportion showed significant
and substantial amounts of between-study heterogeneity (cf. ManyLabs 1
and 3; see also our supplementary document for more details). *p*-curve and *p*-uniform assume homogeneous effect sizes, and heterogeneity strongly degrades their performance:

As you can see, all other methods retain the nominal false positive rate, but *p*-curve and *p*-uniform go through the roof as soon as heterogeneity comes into play (see also McShane, Böckenholt, & Hansen, 2016; van Aert et al., 2016).

Under H1, heterogeneity leads to overestimation of the true effect:

(additional settings for these plots: no QRPs, no publication bias, *k* = 100 studies, true effect size = 0.5)

Note that in their presentation of *p*-curve, Simonsohn et al. (2014) emphasize that, in the presence of heterogeneity, *p*-curve is intended as an estimate of the average true effect size among the studies *submitted* to *p*-curve (see here, Supplement 2). *p*-curve
may indeed yield an accurate estimate of the true effect size among the
significant studies, but in our view, the goal of bias-correction in
meta-analysis is to estimate the average effect of all *conducted*
studies. Of course this latter estimation hinges on modeling
assumptions (e.g., that the effects are normally distributed), which can
be disputed, and there might be applications where indeed the
underlying true effect of all significant studies is more interesting.
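The difference between the two estimands can be made concrete with a small simulation. This is an illustrative sketch, not *p*-curve itself and not the project's code: it assumes normally distributed true effects, a normal approximation for the observed effect with standard error sqrt(2/n), and a one-sided significance filter. The function name and defaults are hypothetical.

```python
import math
import random
from statistics import NormalDist, fmean

def mean_true_effects(delta=0.5, tau=0.2, n_per_group=20, n_studies=20000, seed=1):
    """Return (mean true effect of all studies, mean true effect of significant ones)."""
    rng = random.Random(seed)
    se = math.sqrt(2 / n_per_group)     # assumed standard error of d
    crit = NormalDist().inv_cdf(0.975)
    all_true, sig_true = [], []
    for _ in range(n_studies):
        theta = rng.gauss(delta, tau)   # study-specific true effect
        d = rng.gauss(theta, se)        # observed effect
        all_true.append(theta)
        if d / se > crit:               # study is significant in the positive direction
            sig_true.append(theta)
    return fmean(all_true), fmean(sig_true)

for tau in (0.0, 0.2, 0.4):
    print(tau, mean_true_effects(tau=tau))
```

With τ = 0 both averages coincide. As τ grows, the significant studies are increasingly those that happened to have larger true effects, so their average true effect drifts above the average of all conducted studies.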

Furthermore, as McShane et al. (2016) demonstrate, *p*-curve and *p*-uniform are constrained versions of the more general three-parameter selection model (3PSM; Iyengar & Greenhouse, 1988).
The 3PSM estimates (a) the mean of the true effect, δ, (b) the
heterogeneity, τ, and (c) the probability that a non-significant result
enters the literature, *p*. The constraints of *p*-curve and *p*-uniform are: 100% publication bias (i.e., *p* = 0) and homogeneity (i.e., τ = 0). Hence, for the estimation of effect sizes, 3PSM seems to be a good replacement for *p*-curve and *p*-uniform, as it makes these constraints testable.
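To make the three parameters tangible, here is a minimal sketch of a log-likelihood for a selection model with a single significance cutoff. It is not the McShane et al. implementation; the one-tailed cutoff at .025, the function name, and the argument names are assumptions for illustration. Each observed effect gets a weight of 1 if it is significant and `p_nonsig` otherwise, and the density is renormalized by the probability of being published at all.

```python
import math
from statistics import NormalDist

def loglik_3psm(effects, ses, delta, tau, p_nonsig, alpha=0.025):
    """Log-likelihood of a one-cutoff selection model (illustrative sketch).

    Studies with a one-tailed p-value below `alpha` are always published;
    all others are published with probability `p_nonsig`.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha)
    total = 0.0
    for y, se in zip(effects, ses):
        # Marginal distribution of the observed effect: N(delta, se^2 + tau^2)
        marginal = NormalDist(delta, math.sqrt(se ** 2 + tau ** 2))
        weight = 1.0 if y / se > z_crit else p_nonsig
        if weight == 0.0:
            return -math.inf   # a non-significant study is impossible when p_nonsig = 0
        # Probability that a study with this standard error turns out significant
        p_sig = 1.0 - marginal.cdf(z_crit * se)
        normalizer = p_sig + p_nonsig * (1.0 - p_sig)
        total += math.log(weight * marginal.pdf(y) / normalizer)
    return total
```

In this form the constraints are easy to see: fixing `tau = 0` and `p_nonsig = 0` restricts the model to homogeneous effects and significant studies only, whereas leaving both free lets the data speak to them. Maximizing the function over `delta`, `tau`, and `p_nonsig` would require a numerical optimizer, which is omitted here.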

**Our recommendation: Do not use p-curve or p-uniform for effect size estimation when heterogeneity can be expected (which is nearly always the case).**

Many bias-correcting methods are driven by QRPs – the more QRPs, the stronger the downward correction. However, this effect can get so strong that methods overadjust in the opposite direction, even if all studies in the meta-analysis are of the same sign:

Note: You need to set the option “Keep negative estimates” to get this plot.

**Our recommendation: Ignore
bias-corrected results that go into the opposite direction; set the
estimate to zero, do not reject H₀.**

Typical small-study effects (e.g., by *p*-hacking or
publication bias) induce a negative correlation between sample size and
effect size – the smaller the sample, the larger the observed effect
size. PET-PEESE aims to correct for that relationship. In the absence of
bias and QRPs, however, random fluctuations can lead to a *positive*
correlation between sample size and effect size, which leads to a PET
and PEESE slope of the unintended sign. Without publication bias, this
reversal of the slope actually happens quite often.

See for example the next figure. The true effect size is zero (red dot), naive random effects meta-analysis slightly overestimates the true effect (see black dotted triangle), but PET and PEESE massively overadjust towards more positive effects:

PET-PEESE was never intended to correct in the reverse direction. An underlying biasing process would have to systematically remove small studies that show a significant result with larger effect sizes, and keep small studies with non-significant results. In the current incentive structure, I see no reason for such a process.

**Our recommendation: Ignore the PET-PEESE correction if it has the wrong sign.**
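The sign check can be sketched as follows. This is a simplified illustration rather than the code used in the project: PET is taken as a weighted least-squares regression of the effect size on its standard error, PEESE as the same regression on the sampling variance, both with inverse-variance weights, and the intercept serves as the corrected estimate. The conditional step that chooses between PET and PEESE is left out, effects are assumed to be coded so that the expected direction is positive, and the function names are hypothetical.

```python
def wls_line(x, y, w):
    """Weighted least-squares fit of y = intercept + slope * x."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return my - slope * mx, slope

def pet_peese_with_sign_check(effects, ses):
    """Return PET and PEESE intercepts, or None where the slope has the unintended sign."""
    w = [1 / se ** 2 for se in ses]                          # inverse-variance weights
    pet = wls_line(ses, effects, w)                          # predictor: standard error
    peese = wls_line([se ** 2 for se in ses], effects, w)    # predictor: sampling variance
    # A negative slope means less precise studies show smaller effects,
    # so the "correction" would push the estimate upward.
    return {
        "PET": pet[0] if pet[1] >= 0 else None,
        "PEESE": peese[0] if peese[1] >= 0 else None,
    }

# Effects that shrink as the standard error grows: both corrections are flagged.
print(pet_peese_with_sign_check([0.4, 0.3, 0.2, 0.1], [0.1, 0.2, 0.3, 0.4]))
```

A `None` entry only signals that the correction should be ignored; what to report instead is a separate decision.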

A bias can be more easily accepted if it is always conservative –
then one could conclude: “This method might miss some true effects, but *if*
it indicates an effect, we can be quite confident that it really
exists”. Depending on the conditions (i.e., how much publication bias,
how many QRPs, etc.), however, PET/PEESE sometimes shows huge
overestimation and sometimes huge underestimation.

For example, with no publication bias, some heterogeneity (τ=0.2), and severe QRPs, PET/PEESE *underestimates* the true effect of δ = 0.5:

In contrast, if no effect exists in reality, but there is strong publication bias, large heterogeneity, and no QRPs, PET/PEESE *overestimates* a lot:

In fact, the distribution of PET/PEESE estimates looks virtually identical for these two examples, although the underlying true effect is δ = 0.5 in the upper plot and δ = 0 in the lower plot. Furthermore, note the huge spread of PET/PEESE estimates (the error bars visualize the 95% quantiles of all simulated replications): Any single PET/PEESE estimate can be very far off.

**Our recommendation: As one cannot know the condition of reality, it is safest not to use PET/PEESE at all.**

Again, please consider the “Constraints on Generality” disclaimer above.

- When you can exclude publication bias (i.e., in the context of registered reports), do not use bias-correcting techniques. Even in the presence of some QRPs they perform worse than plain random effects meta-analysis.
- In any other setting except registered reports, expect publication bias, and do not use random effects meta-analysis or trim-and-fill. Both will give you a 100% false positive rate in typical settings, and a biased estimation.
- Under heterogeneity, *p*-curve and *p*-uniform overestimate the underlying effect and have false positive rates >= 50%.
- Even if all studies entering a meta-analysis point into the same direction (e.g., all are positive), bias-correcting techniques sometimes overadjust and return a significant estimate of the *opposite* direction. Ignore these results, set the estimate to zero, do not reject H₀.
- Sometimes PET/PEESE adjust into the wrong direction (i.e., increasing the estimated true effect size).

As with any general recommendations, there might be good reasons to ignore them.

- The *p*-uniform package (v. 0.0.2) very rarely does not provide a lower CI. In this case, ignore the estimate.
- Do not run *p*-curve or *p*-uniform on <=3 significant and directionally consistent studies. Although computationally possible, this gives hugely variable results, which are often very biased. See our supplemental material for more information and plots.
- If the 3PSM method (in the implementation of McShane et al., 2016) returns an incomplete covariance matrix, ignore the result (even if a point estimate is provided).

**Now you probably ask: But what should I use? Read our preprint for an answer!**
