In a recent post on the DataColada blog, Uri Simonsohn argued that “We cannot afford to study effect size in the lab”. The central message is: if we want accurate effect size (ES) estimates, we need large sample sizes (he suggests four-digit n’s). As this is hardly possible in the lab, we have to use other research tools, such as online studies, archival data, or more within-subject designs.
While I agree with the main point of the post, I’d like to discuss and extend some of its conclusions. As the DataColada blog has no comments section, I’ll comment in my own blog …
“Does it make sense to push for effect size reporting when we run small samples? I don’t see how.”
“Properly powered studies teach you almost nothing about effect size.”
It is true that the ES estimate with n=20 will be utterly imprecise, and reporting this ES estimate could mislead readers who give too much weight to the point estimate and do not take the huge CI into account (maybe because it has not been reported).
Still, and here’s my disagreement, I’d argue that small-n studies should also report the point estimate (along with the CI), as a meta-analysis of many imprecise small-n estimates can still give an unbiased and precise cumulative estimate. This, of course, would require that all estimates are reported, not only the significant ones (van Assen, van Aert, Nuijten, & Wicherts, 2014).
Here’s an example: we simulate a population with a true Cohen’s d of 0.8. Then we look at three scenarios: a) a single small-n study with 20 participants, b) a single large-n study with 400 participants, and c) 20 studies with n=20 each, which are meta-analyzed.
This single small-n study has g = 1.337 [0.64, 2.034] (g is an unbiased estimate of d). The point estimate is far off, but the true ES is in the CI.
This single large-n study has g = 0.76 [0.616, 0.903]. It is close to the true ES and has a quite narrow CI (i.e., high precision).
Now, here’s a meta-analysis of 20*n=20 studies:
The meta-analysis reveals g = 0.775 [0.630; 0.920]. This has nearly exactly the same CI width as the n=400 study, and a slightly different ES estimate.
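The simulation code is not shown here, so the following is only a minimal Python sketch of the three scenarios, not the original analysis. It assumes that n counts participants per group (the reported CI widths are consistent with that reading), uses a large-sample variance formula for g, and pools the small studies with a fixed-effect, inverse-variance meta-analysis. The names `simulate_study` and `fixed_effect_meta` and the seed are made up for illustration, and the numbers will differ from those quoted above.

```python
import math
import random
from statistics import NormalDist, mean, variance

TRUE_D = 0.8                      # true Cohen's d in the simulated population
Z = NormalDist().inv_cdf(0.975)   # critical value for a 95% CI


def simulate_study(n_per_group, rng):
    """Simulate one two-group study; return (Hedges' g, approx. variance of g)."""
    treat = [rng.gauss(TRUE_D, 1.0) for _ in range(n_per_group)]
    ctrl = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    pooled_sd = math.sqrt((variance(treat) + variance(ctrl)) / 2)
    d = (mean(treat) - mean(ctrl)) / pooled_sd
    df = 2 * n_per_group - 2
    j = 1 - 3 / (4 * df - 1)      # small-sample correction that turns d into g
    # Large-sample variance of d for two equal groups (an approximation).
    var_d = 2 / n_per_group + d**2 / (4 * n_per_group)
    return j * d, j**2 * var_d


def ci(estimate, var):
    """Normal-approximation 95% confidence interval."""
    half = Z * math.sqrt(var)
    return estimate - half, estimate + half


def fixed_effect_meta(studies):
    """Inverse-variance weighted mean of (estimate, variance) pairs."""
    weights = [1 / v for _, v in studies]
    pooled = sum(w * g for w, (g, _) in zip(weights, studies)) / sum(weights)
    return pooled, 1 / sum(weights)


rng = random.Random(2014)         # arbitrary seed, only for reproducibility
small = simulate_study(20, rng)
large = simulate_study(400, rng)
meta = fixed_effect_meta([simulate_study(20, rng) for _ in range(20)])

for label, (g, var) in [("single n=20", small),
                        ("single n=400", large),
                        ("meta of 20 x n=20", meta)]:
    lo, hi = ci(g, var)
    print(f"{label:>18}: g = {g:.3f} [{lo:.3f}; {hi:.3f}]")
```

Under these assumptions the pooled variance of 20 small studies is roughly the variance of one study with 20 times the sample, which is why the last two intervals come out about equally wide.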
Here’s a plot of the results:
To summarize, a single small-n study teaches us hardly anything about effect sizes, but many small-n studies do. Meta-analyses are only possible, however, if the ES is reported.
“But just how big an n do we need to study effect size? I am about to show that the answer has four-digits.”
In Uri’s post (and the linked R code) the precision issue is approached from the power side: if you increase power, you also increase precision. But you can also directly compute the necessary sample size for a desired precision. This is called the AIPE framework (“accuracy in parameter estimation”), made popular by Ken Kelley, Scott Maxwell, and Joseph Rausch (Kelley & Rausch, 2006; Kelley & Maxwell, 2003; Maxwell, Kelley, & Rausch, 2008). The necessary functions are implemented in the MBESS package for R. If you want a CI width of .10 around an expected ES of 0.5, you need 3170 participants:
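As a rough cross-check of that number, here is a small Python sketch. It is not the MBESS routine: it uses a large-sample normal approximation to the variance of d for two equally sized groups, and the function name `n_per_group_for_ci_width` is made up for illustration. Under these assumptions the result is 3170 per group, which suggests that the figure above is a per-group n (my inference, not something stated in the output).

```python
import math
from statistics import NormalDist


def n_per_group_for_ci_width(d, width, conf_level=0.95):
    """Smallest n per group so that the approximate CI around d has the given full width."""
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    # Large-sample approximation: Var(d) ~ (2 + d^2 / 4) / n for two groups of size n.
    var_times_n = 2 + d**2 / 4
    # Full width = 2 * z * sqrt(var_times_n / n); solve for n and round up.
    return math.ceil(var_times_n * (2 * z / width) ** 2)


print(n_per_group_for_ci_width(d=0.5, width=0.10))
```

Note that the expected effect size contributes little here: the term 2 dominates d^2 / 4, so the required n is driven almost entirely by the desired width.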
The same point has been made from a Bayesian point of view in a blog post by John Kruschke: notice the sample size on the x-axis.
In our own analysis of how correlations evolve with increasing sample size (Schönbrodt & Perugini, 2013; see also blog post), we conclude that for typical effect sizes in psychology, you need 250 participants to get sufficiently accurate and stable estimates of the ES:
It’s certainly hard to give general guidelines on how much precision is sensible, but here are the considerations on which we based our stability analyses. We used a CI-like “corridor of stability” (see Figure) with a half-width of w = .10, w = .15, and w = .20 (everything in the “correlation-metric”).
w = .20 was chosen for the following reason: The average reported effect size in psychology is around r = .21 (Richard, Bond, & Stokes-Zoota, 2003). For this effect size, an accuracy of w = .20 would result in a CI that is “just significant” and does not include a reversal of the sign of the effect. Hence, with the typical effect sizes we are dealing with in psychology, a CI with a half-width > .20 would not make much sense.
w = .10 was chosen as it corresponds to a “small effect size” à la Cohen. This is arbitrary, of course, but at least some anchor. And w = .15 is just in between.
Using these numbers and an ES estimate of, say, r = .29, a just tolerable precision would be [.10; .46] (w = .20), a tolerable precision [.15; .42] (w = .15) and a moderate precision [.20; .38] (w = .10).
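The three intervals are slightly asymmetric around .29. They can be reproduced to two decimals by adding and subtracting w on the Fisher-z scale and transforming back to r. That is my reconstruction for this sketch, not a description of the published procedure, and the function name `corridor` is made up for illustration.

```python
import math


def corridor(r, w):
    """Interval around r with half-width w on the Fisher-z scale, back-transformed to r."""
    z = math.atanh(r)              # Fisher's r-to-z transformation
    return math.tanh(z - w), math.tanh(z + w)


for w in (0.20, 0.15, 0.10):
    lo, hi = corridor(0.29, w)
    print(f"w = {w:.2f}: [{lo:.2f}; {hi:.2f}]")

# The "just significant" case: r = .21 with w = .20 barely excludes zero.
lo, hi = corridor(0.21, 0.20)
print(f"r = .21, w = .20: lower bound = {lo:.3f}")
```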
If we use this lower threshold of “just tolerable precision”, we would need around 200-250 participants per group for a two-sample group difference. While I am not sure whether we really need four-digit samples for typical scenarios, I am sure that we need at least three-digit samples when we want to talk about “precision”.
Regardless of the specific level of precision and method used, however, one thing is clear: Accuracy does not come cheap. We need far fewer participants for a hypothesis test (Is there a non-zero effect or not?) than for an accurate estimate.
With increasing sample size, unfortunately, you get diminishing returns on precision: As you can see in the dotted lines in Figure 2, the CI levels off, and you need disproportionately large sample sizes to squeeze out the last tiny percentages of a shrinking CI. If you follow Pareto’s principle, you should stop at some point. In scientific practice, accuracy will probably be achieved by meta-analyzing several studies (which also gives you an estimate of the ES variability and possible moderators) rather than by running one mega-study.
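The levelling-off follows from the standard error shrinking only with the square root of n. Here is a short Python sketch for a correlation, assuming the usual Fisher-z approximation SE = 1 / sqrt(n - 3); the function name `half_width_z` and the chosen sample sizes are made up for illustration.

```python
import math
from statistics import NormalDist

Z = NormalDist().inv_cdf(0.975)   # critical value for a 95% CI


def half_width_z(n):
    """Approximate 95% CI half-width for a correlation on the Fisher-z scale."""
    return Z / math.sqrt(n - 3)


for n in (50, 100, 200, 400, 800, 1600):
    print(f"n = {n:4d}: half-width ~ {half_width_z(n):.3f}")
```

Each doubling of n shrinks the half-width by a factor of only about 0.71, so halving the interval takes roughly four times as many participants.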
Hence: Always report your ES estimate, even in small-n studies.
Kelley, K., & Maxwell, S. E. (2003). Sample size for multiple regression: Obtaining regression coefficients that are accurate, not simply significant. Psychological Methods, 8, 305–321. doi:10.1037/1082-989X.8.3.305
Kelley, K., & Rausch, J. R. (2006). Sample size planning for the standardized mean difference: Accuracy in parameter estimation via narrow confidence intervals. Psychological Methods, 11, 363–385. doi:10.1037/1082-989X.11.4.363
Maxwell, S. E., Kelley, K., & Rausch, J. R. (2008). Sample size planning for statistical power and accuracy in parameter estimation. Annual Review of Psychology, 59, 537–563. doi:10.1146/annurev.psych.59.103006.093735
van Assen, M. A. L. M., van Aert, R. C. M., Nuijten, M. B., & Wicherts, J. M. (2014). Why publishing everything is more effective than selective publishing of statistically significant results. PLoS ONE, 9, e84896. doi:10.1371/journal.pone.0084896