In a recent post on the DataColada blog, Uri Simonsohn argued that “We cannot afford to study effect size in the lab”. The central message is: If we want accurate effect size (ES) estimates, we need large sample sizes (he suggests four-digit n’s). As this is hardly possible in the lab, we have to use other research tools, like online studies, archival data, or more within-subject designs.

While I agree with the main point of the post, I’d like to discuss and extend some of the conclusions. As the DataColada blog has no comments section, I’ll comment in my own blog …

“Does it make sense to push for effect size reporting when we run small samples? I don’t see how.”

“Properly powered studies teach you almost nothing about effect size.”

It is true that the ES estimate with *n*=20 will be utterly imprecise, and reporting this ES estimate could mislead readers who give too much importance to the point estimate and do not take the huge CI into account (maybe because it has not been reported).

**Still, and here’s my disagreement, I’d argue that small-*n* studies should also report the point estimate (along with the CI)**, as a meta-analysis of many imprecise small-*n* estimates can still give an unbiased and precise cumulative estimate. This, of course, would require that all estimates are reported, not only the significant ones (van Assen, van Aert, Nuijten, & Wicherts, 2014).
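To make the "all estimates, not only the significant ones" condition concrete, here is a small base-R simulation. It is only a sketch under my own assumptions (true d = 0.8, 20 participants per group, 5000 simulated studies, an arbitrary seed), and the helper `one_study` is hypothetical. It compares the average estimate when every study is reported with the average when only significant studies are.

```r
set.seed(1)  # arbitrary seed, chosen only for reproducibility

# One simulated two-group study with n per group; returns Cohen's d and the p-value
one_study <- function(n, delta) {
  x1 <- rnorm(n, mean = delta)
  x2 <- rnorm(n, mean = 0)
  sd_pooled <- sqrt((var(x1) + var(x2)) / 2)  # equal n, so a simple average of variances
  c(d = (mean(x1) - mean(x2)) / sd_pooled,
    p = t.test(x1, x2, var.equal = TRUE)$p.value)
}

# 5000 small studies (n = 20 per group) drawn from a population with d = 0.8
sims <- t(replicate(5000, one_study(n = 20, delta = 0.8)))

mean_all <- mean(sims[, "d"])                   # every study is reported
mean_sig <- mean(sims[sims[, "p"] < .05, "d"])  # only significant studies are reported
round(c(all = mean_all, significant_only = mean_sig), 2)
```

The average over all studies should land close to the true value, while the average over the significant subset is inflated, because small studies only reach significance when they happen to overestimate the effect.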

Here’s an example: we simulate a population with a true Cohen’s d of 0.8. Then we look at three scenarios: a) a single small-*n* study with 20 participants, b) a single large-*n* study with 400 participants, and c) 20 studies with *n*=20 each, which are meta-analyzed.

    set.seed(0xBEEF1)
    library(compute.es)
    library(metafor)
    library(ggplot2)

    # Two large populations whose means differ by 0.8 SD (true Cohen's d = 0.8)
    X1 <- rnorm(1000000, mean=0, sd=1)
    X2 <- rnorm(1000000, mean=0, sd=1) - 0.8

    ## Compute effect size for ...
    # ... a single n=20 study
    x1 <- sample(X1, 20)
    x2 <- sample(X2, 20)
    (t1 <- t.test(x1, x2))
    (ES1 <- tes(t1$statistic, 20, 20, dig=3))

This single small-*n* study has g = 1.337 [0.64, 2.034] (g is an unbiased estimate of d). The point estimate is far off the true value, but the true ES is in the CI.

This single large-*n* study has g = 0.76 [0.616, 0.903]. It is close to the true ES, and has a quite narrow CI (i.e., high precision).
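The snippet behind the large-*n* estimate is not shown here. As a stand-in, the following base-R sketch draws one such study, assuming 400 participants per group and a plain normal-theory CI for d. The helper `d_with_ci`, its field names (chosen to mirror `d`, `l.d` and `u.d` from `compute.es`), and the seed are my assumptions, so the numbers will not reproduce the values quoted above exactly.

```r
set.seed(2)  # arbitrary seed for this sketch, not the seed used in the post

# Hypothetical helper: Cohen's d with an approximate (normal-theory) CI
d_with_ci <- function(x1, x2, level = 0.95) {
  n1 <- length(x1); n2 <- length(x2)
  sd_pooled <- sqrt(((n1 - 1) * var(x1) + (n2 - 1) * var(x2)) / (n1 + n2 - 2))
  d <- (mean(x1) - mean(x2)) / sd_pooled
  # Large-sample standard error of d
  se <- sqrt((n1 + n2) / (n1 * n2) + d^2 / (2 * (n1 + n2)))
  z <- qnorm(1 - (1 - level) / 2)
  list(d = d, l.d = d - z * se, u.d = d + z * se)
}

# Scenario b): one study with 400 participants per group (assumed), true d = 0.8
x1 <- rnorm(400, mean = 0, sd = 1)
x2 <- rnorm(400, mean = 0, sd = 1) - 0.8
ES2 <- d_with_ci(x1, x2)
str(ES2)
```

With 400 per group the interval is roughly 0.29 wide, which is in line with the width of the interval reported above.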

Now, here’s a meta-analysis of 20 studies with *n*=20 each:

    # Simulate 20 small studies and collect their summary statistics
    dat <- data.frame()
    for (i in 1:20) {
      x1 <- sample(X1, 20)
      x2 <- sample(X2, 20)
      dat <- rbind(dat, data.frame(
        study=i,
        m1i=mean(x1),
        m2i=mean(x2),
        sd1i=sd(x1),
        sd2i=sd(x2),
        n1i=20, n2i=20
      ))
    }

    # Do a fixed-effect model meta-analysis
    es <- escalc("SMD", m1i=m1i, m2i=m2i, sd1i=sd1i, sd2i=sd2i, n1i=n1i, n2i=n2i, data=dat, append=TRUE)
    (meta <- rma(yi, vi, data=es, method="FE"))

The meta-analysis reveals g = 0.775 [0.630; 0.920]. This has nearly exactly the same CI width as the n=400 study, and a slightly different ES estimate.
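Why does pooling 20 small studies recover the precision of one large study? A fixed-effect meta-analysis weights each estimate by its inverse variance, and the pooled variance is the inverse of the summed weights. The sketch below does this by hand in base R. It is an illustration under my own assumptions (fresh simulated data, an arbitrary seed, the large-sample variance of d, no small-sample correction), not a re-implementation of what `metafor` does internally.

```r
set.seed(3)  # arbitrary seed for this sketch

k <- 20   # number of studies
n <- 20   # participants per group in each study
yi <- numeric(k)  # per-study effect sizes (Cohen's d)
vi <- numeric(k)  # per-study sampling variances

for (i in seq_len(k)) {
  x1 <- rnorm(n)
  x2 <- rnorm(n) - 0.8
  sd_pooled <- sqrt((var(x1) + var(x2)) / 2)
  d <- (mean(x1) - mean(x2)) / sd_pooled
  yi[i] <- d
  vi[i] <- 2 / n + d^2 / (4 * n)  # large-sample variance of d for equal group sizes
}

# Inverse-variance (fixed-effect) pooling
wi <- 1 / vi
pooled <- sum(wi * yi) / sum(wi)
se_pooled <- sqrt(1 / sum(wi))
ci <- pooled + c(-1, 1) * qnorm(.975) * se_pooled
round(c(estimate = pooled, ci.lb = ci[1], ci.ub = ci[2]), 3)
```

Each single study has a standard error of about 0.33, but the 20 weights add up, so the pooled standard error shrinks by roughly a factor of sqrt(20).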

Here’s a plot of the results:

    # ES1 and ES2 hold the results of the n=20 and n=400 studies, meta the meta-analysis
    res <- data.frame(
      n = factor(c("a) n=20", "b) n=400", "c) 20 * n=20\n(meta-analysis)"), ordered=TRUE),
      point_estimate = c(ES1$d, ES2$d, meta$b),
      ci.lower = c(ES1$l.d, ES2$l.d, meta$ci.lb),
      ci.upper = c(ES1$u.d, ES2$u.d, meta$ci.ub)
    )

    ggplot(res, aes(x=n, y=point_estimate, ymin=ci.lower, ymax=ci.upper)) +
      geom_pointrange() +
      theme_bw() +
      xlab("") + ylab("Cohen's d") +
      geom_hline(yintercept=0.8, linetype="dotted", color="darkgreen")

To summarize, a single small-*n* study hardly teaches us anything about effect sizes, but many small-*n* studies do. But meta-analyses are only possible if the ES is reported.

“But just how big an *n* do we need to study effect size? I am about to show that the answer has four-digits.”

In Uri’s post (and the linked R code) the precision issue is approached from the power side: if you increase power, you also increase precision. But you can also directly compute the necessary sample size for a desired precision. This is called the AIPE framework (“accuracy in parameter estimation”), made popular by Ken Kelley, Scott Maxwell, and Joseph Rausch (Kelley & Rausch, 2006; Kelley & Maxwell, 2003; Maxwell, Kelley, & Rausch, 2008). The necessary functions are implemented in the MBESS package for R. If you want a full CI width of .10 around an expected ES of 0.5, you need 3170 participants per group.
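As a rough cross-check on this figure, the required sample size can be approximated in base R from the large-sample variance of d. This is only a sketch: the helper `n_per_group_for_width` is hypothetical and uses a normal approximation, whereas MBESS (presumably via its `ss.aipe.smd` function) works with exact noncentral-t intervals, so the two can differ slightly in general.

```r
# Hypothetical helper based on a large-sample approximation; not the MBESS implementation
n_per_group_for_width <- function(d, width, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)
  # Var(d) is approximately 2/n + d^2/(4n) with n per group.
  # Solve 2 * z * sqrt(Var(d)) = width for n and round up.
  ceiling((2 + d^2 / 4) * (2 * z / width)^2)
}

n_per_group <- n_per_group_for_width(d = 0.5, width = 0.10)
n_per_group
```

For d = 0.5 and a full width of .10 the approximation gives 3170 per group, matching the number above.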

The same point has been made from a Bayesian point of view in a blog post from John Kruschke: notice the sample size on the x-axis.

In our own analysis on how correlations evolve with increasing sample size (Schönbrodt & Perugini, 2013; see also blog post), we conclude that for typical effect sizes in psychology, you need 250 participants to get sufficiently accurate and stable estimates of the ES.

### How much precision is needed?

It’s certainly hard to give general guidelines on how much precision is sensible, but here are the considerations on which we based our stability analyses. We used a CI-like “corridor of stability” (see Figure) with a half-width of w = .10, w = .15, and w = .20 (everything in the “correlation metric”).

w = .20 was chosen for the following reason: The average reported effect size in psychology is around *r* = .21 (Richard, Bond, & Stokes-Zoota, 2003). For this effect size, an accuracy of w = .20 would result in a CI that is “just significant” and does not include a reversal of the sign of the effect. Hence, with the typical effect sizes we are dealing with in psychology, a CI with a half-width > .20 would not make much sense.

w = .10 was chosen as it corresponds to a “small effect size” à la Cohen. This is arbitrary, of course, but at least some anchor. And w = .15 is just in between.

Using these numbers and an ES estimate of, say, *r* = .29, a just tolerable precision would be [.10; .46] (w = .20), a tolerable precision [.15; .42] (w = .15) and a moderate precision [.20; .38] (w = .10).
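To get a feeling for how such intervals relate to sample size, here is a base-R sketch of the ordinary Fisher-z confidence interval for a correlation. Note the assumption: this is a standard CI, not the corridor-of-stability procedure from our paper, and the helper `r_ci` is hypothetical.

```r
# Hypothetical helper: Fisher-z confidence interval for a correlation r at sample size n
r_ci <- function(r, n, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)
  # Transform to the z metric, build a symmetric interval, transform back
  tanh(atanh(r) + c(-1, 1) * z / sqrt(n - 3))
}

# Interval around r = .29 for a few sample sizes
for (n in c(50, 100, 200, 400)) {
  ci <- r_ci(0.29, n)
  cat(sprintf("n = %3d: [%.2f; %.2f]\n", n, ci[1], ci[2]))
}
```

Because the interval is symmetric in the z metric, it comes out slightly asymmetric in the correlation metric, just like the intervals quoted above.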

If we use this lower threshold of “just tolerable precision”, we would need around 200-250 participants per group for a two-sample group difference. While I am not sure whether we really need four-digit samples for typical scenarios, **I am sure that we need at least three-digit samples when we want to talk about “precision”.**

Regardless of the specific level of precision and method used, however, one thing is clear: **Accuracy does not come cheaply**. We need far fewer participants for a hypothesis test (Is there a non-zero effect or not?) than for an accurate estimate.

With increasing sample size, you unfortunately get diminishing returns on precision: as you can see in the dotted lines in Figure 2, the CI levels off, and you need disproportionately large sample sizes to squeeze out the last tiny percentages of a shrinking CI. If you follow Pareto’s principle, you should stop at some point. In scientific progress, accuracy will probably be achieved by meta-analyzing several studies (which also gives you an estimate of the ES variability and possible moderators) rather than by doing one mega-study.
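The diminishing returns follow from the fact that the CI width shrinks with the square root of the sample size. A minimal base-R sketch, assuming the large-sample standard error of d, an expected d of 0.5, and a hypothetical helper `ci_halfwidth`:

```r
# Hypothetical helper: approximate CI half-width for Cohen's d with n participants per group
ci_halfwidth <- function(n, d = 0.5, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)
  z * sqrt(2 / n + d^2 / (4 * n))
}

n <- c(25, 50, 100, 200, 400, 800, 1600)
data.frame(n_per_group = n, half_width = round(ci_halfwidth(n), 3))
```

Every halving of the half-width costs four times as many participants, so the early gains are cheap and the late ones are expensive.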

**Hence: Always report your ES estimate, even in small-*n* studies.**

### References

Kelley, K., & Maxwell, S. E. (2003). Sample Size for Multiple Regression: Obtaining Regression Coefficients That Are Accurate, Not Simply Significant. *Psychological Methods, 8*, 305–321. doi:10.1037/1082-989X.8.3.305

Kelley, K., & Rausch, J. R. (2006). Sample size planning for the standardized mean difference: Accuracy in parameter estimation via narrow confidence intervals. *Psychological Methods, 11*, 363–385. doi:10.1037/1082-989X.11.4.363

Maxwell, S. E., Kelley, K., & Rausch, J. R. (2008). Sample Size Planning for Statistical Power and Accuracy in Parameter Estimation. *Annual Review of Psychology, 59(1)*, 537–563. doi:10.1146/annurev.psych.59.103006.093735

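Richard, F. D., Bond, C. F., & Stokes-Zoota, J. J. (2003). One hundred years of social psychology quantitatively described. *Review of General Psychology, 7*, 331–363. doi:10.1037/1089-2680.7.4.331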

van Assen, M. A. L. M., van Aert, R. C. M., Nuijten, M. B., & Wicherts, J. M. (2014). Why publishing everything is more effective than selective publishing of statistically significant results. *PLoS ONE, 9*, e84896. doi:10.1371/journal.pone.0084896

Would it be feasible to quickly check the impact of

– priors

– within-designs

on the outcomes? For example, in cognitive neuroscience, almost all studies are within.

Sorry, I did not get your point here. The impact of within-designs on what?

At the end of the post by Simonsohn you’re responding to, he writes:

“Advocating a focus on effect size estimation, then, implies advocating for either:

1) Leaving the lab (e.g., mTurk, archival data).

2) Running within-subject designs.”

Do repeated-measures designs make it substantially easier to get a desirably narrow CI?

Sorry, took a while to figure this out.

There are different ways to calculate the effect size for repeated measures. Some take the correlation between the paired measures into account, some don’t (for an excellent overview, see Lakens, 2013).

A straightforward definition by Cohen defines the effect size as “difference score divided by standard deviation of this difference score”, which can be calculated as d_z = t / sqrt(n).
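The two ways of writing this definition give the same number, which is easy to check in base R. The simulated pre/post data, the seed, and the variable names below are assumptions made only for this illustration.

```r
set.seed(4)  # arbitrary seed for this sketch

n <- 30
pre  <- rnorm(n)
post <- pre + rnorm(n, mean = 0.5)  # simulated paired measurements
diffs <- post - pre

# d_z from the paired t statistic
tt <- t.test(post, pre, paired = TRUE)
d_z_from_t <- unname(tt$statistic) / sqrt(n)

# d_z directly as mean difference divided by the SD of the difference scores
d_z_direct <- mean(diffs) / sd(diffs)

all.equal(d_z_from_t, d_z_direct)
```

Both routes agree because the paired t statistic is just the mean difference divided by its standard error, sd(diffs) / sqrt(n).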

Using this definition, you can use the ss.aipe.smd function from MBESS, and take the n, which is for each group in an independent-groups design, as the n for a single within-design group. That means, for this definition of within-group effect size, you get the same precision with half of the sample size.

Q: Do repeated-measures designs make it substantially easier to get a desirably narrow CI?

A: Yes.

(Thanks to Ken Kelley for advice on how to calculate this!)

References:

Lakens, D. (2013). Calculating and reporting effect sizes to facilitate cumulative science: A practical primer for t-tests and ANOVAs. *Frontiers in Psychology, 4*, 863. doi:10.3389/fpsyg.2013.00863

Thank you very much 🙂 Although 400/2 is still a lot.

Nice post, and your point that reporting ES estimates is useful for future meta-analysts is a good one. However, should ES estimates from small studies come with a disclaimer, “for meta analysis use only”? I.e. should we discourage people from ‘interpreting’ them because they are so noisy, as Uri has shown?

For me, the process looks like:

1. We have data and analyze them

2. We build summary indexes for the results, which reduce the data to a few numbers, which (hopefully) are meaningful for the research question at hand. These summary numbers should convey all relevant information.

3. We communicate the results to other researchers and to a broader audience.

Concerning step 2, I argued that we should at least report the point estimate and the CI of the ES. The p-value alone is too reductive and misses important information.

Your idea with the disclaimer refers to step 3. I hope that in the future trained researchers always take the CI into account and do not put too much weight on the point estimate when the CI is wide.

*But*, given that even trained statisticians have misconceptions about what p-values, for example, really mean (Oakes, 1986; Haller & Krauss, 2002; Gigerenzer et al., 2004), a disclaimer probably won’t hurt!

And when we communicate to non-scientists, we definitely should add that disclaimer!

IMO it’d be sufficient, or at least an improvement, to report what the Confidence Interval says – which will usually amount to “the results are compatible with trivially small, but also moderately big effects”.

If I had to choose between “only CI” vs. “only point estimate”, I’d definitely vote for the first! This indeed would help not to focus too much on the point estimate.

This would, however, put some extra burden on a future meta-analyst, as s/he would have to calculate the midpoint of the CI to get the estimate (and in the case of bootstrapped CIs the point estimate is not always located at the midpoint).
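For a symmetric, normal-theory CI this back-calculation is short. The following base-R sketch uses a hypothetical helper `from_ci` and assumes a 95% interval that is symmetric on the reported scale. It would not be appropriate for bootstrapped or otherwise asymmetric intervals.

```r
# Hypothetical helper: recover point estimate and standard error from a symmetric CI
from_ci <- function(lower, upper, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)
  c(estimate = (lower + upper) / 2,    # midpoint of the interval
    se = (upper - lower) / (2 * z))    # half-width divided by the critical value
}

# Example: the interval reported for the large-n study in the post
recovered <- from_ci(0.616, 0.903)
round(recovered, 3)
```

The recovered midpoint rounds to the g = 0.76 reported for that study, and the standard error is what a meta-analyst would need for inverse-variance weighting.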

So maybe rather a disclaimer.

CIs as the only *inferential* statistic is however an option.

Include the descriptive statistics necessary for calculating the main measures (variance, mean, median …) and make the data publicly available, but focus the report of the inference on the bland CI.

But my original point was about what kind of narrative report the statistics should license. Currently, researchers feel justified to talk about “significant” differences based on *p* and, at times, measures of ES.

IMO, it would be superior, compared to what we’re doing right now, to restrict the Discussion and Results on what is licensed by a CI.

“Our data indicate that a wide range of values, including slightly positive and strongly positive, is believable for the difference between oranges and apples; however, our study leaves us with little confidence in negative, zero, or extremely high values.”

Very good point! I wholeheartedly agree.

I think for hypothesis tests we need at least 95% confidence (alpha .05), but it should be acceptable to discuss the potential relevance of an effect based on more narrow confidence intervals. E.g., to discuss the implications of an effect size that with 70% confidence is between .40 and .55.

I’d argue for study-wise confidence level 95%. For example, if the same sample were used to estimate 5 effect sizes, giving a 99% confidence interval for each effect would work.
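The arithmetic behind this suggestion is a Bonferroni-style split of the error rate across intervals. A one-line base-R sketch, with a hypothetical helper name:

```r
# Hypothetical helper: per-interval confidence level that keeps the study-wise
# level at `familywise_level` across k intervals (Bonferroni-style split)
per_interval_level <- function(k, familywise_level = 0.95) {
  1 - (1 - familywise_level) / k
}

level_5 <- per_interval_level(5)
level_5
```

With five effect sizes, the 5% study-wise error is divided into five parts of 1% each, which is where the 99% per interval comes from.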

Hi Felix, your post is very insightful. You’ve made me think very deeply about these ideas, thank you!

I wonder, though (about an almost inconsequential point), given that articles suffer from statistics clutter as it is, wouldn’t it be better if people just reported the necessary descriptives for computing effect sizes and their variances?

Imagine this throughout an article: main effect for A text, F(1,796) = 4.13, p = .04, d = -0.29, 95% CI [-0.57, -0.01], main effect for B text, F(1,796) = 2.13, p = .14, d = -0.21, 95% CI [-0.48, 0.07], and interaction AB text, F(1,796) = 5.79, p = .016, d = -0.34, 95% CI [-0.62, -0.06]. All of this text would likely include means, standard deviations and (rarely) sample sizes.

It seems to me that means, standard deviations, cell sizes, and test statistics would be more than enough to implement precise effect size estimation in future meta-analyses. Covariance matrices might be necessary for within-subjects designs and complex models, but, in general, it seems burdensome on readers and authors to require or even just encourage so many estimates.

That said, your website and your paper(s) are very enlightening, thank you!