Maybe you have encountered this situation: you run a large-scale study over the internet, and out of curiosity, you frequently check the correlation between two variables.

My experience with this practice is usually frustrating, as in small sample sizes (and we will see what “small” means in this context) correlations go up and down, change sign, move from “significant” to “non-significant” and back. As an example, see Figure 1 which shows the actual trajectory of a correlation, plotted against sample size (I also posted a video of this evolution).

The trajectory simply follows the order in which participants dropped into the study (i.e., the data have not been rearranged). In this case, the correlation started really strong (*r* = .69) and continuously decayed until its final *r* of .26. The light gray lines show some exemplary bootstrapped alternative trajectories.
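To make the idea of a trajectory concrete, here is a minimal sketch in Python (not the R code used for the study). It assumes simulated bivariate normal data with a true correlation of .26 instead of the real data set, and the function names `simulate_pairs` and `correlation_trajectory` are made up for this illustration. The "alternative" trajectory is obtained by resampling the pairs with replacement, which is only one plausible reading of how the gray lines were produced.

```python
import math
import random

def simulate_pairs(n, rho, rng):
    """Draw n (x, y) pairs from a standard bivariate normal with correlation rho."""
    pairs = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        noise = rng.gauss(0.0, 1.0)
        y = rho * x + math.sqrt(1.0 - rho ** 2) * noise
        pairs.append((x, y))
    return pairs

def correlation_trajectory(pairs, n_min=20):
    """Return [(n, r_n)], where r_n is Pearson's r of the first n pairs."""
    sx = sy = sxx = syy = sxy = 0.0
    trajectory = []
    for n, (x, y) in enumerate(pairs, start=1):
        # Running sums let us update r without rescanning the data.
        sx += x
        sy += y
        sxx += x * x
        syy += y * y
        sxy += x * y
        if n >= n_min:
            cov = sxy - sx * sy / n
            var_x = sxx - sx * sx / n
            var_y = syy - sy * sy / n
            trajectory.append((n, cov / math.sqrt(var_x * var_y)))
    return trajectory

rng = random.Random(1)  # fixed seed so the run is repeatable
pairs = simulate_pairs(500, rho=0.26, rng=rng)  # rho is an assumed "true" value

traj = correlation_trajectory(pairs)
# One alternative trajectory: resample the participants with replacement.
alt_traj = correlation_trajectory(rng.choices(pairs, k=len(pairs)))

for (n, r), (_, r_alt) in list(zip(traj, alt_traj))[::80]:
    print(f"n = {n:3d}   r = {r:+.2f}   resampled r = {r_alt:+.2f}")
```

Printing every 80th step is enough to see how much the early estimates wander compared to the later ones.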

In this particular case, at least the sign was stable (“There is a positive relationship in the population”, see also “Type-S errors”). Other trajectories in this data set, however, changed signs or their significance status. One correlation even changed from “negative significant” to “positive significant”!

Obviously, the estimate of a correlation stabilizes with increasing sample size. Now I wanted to know: At which sample size exactly can I expect a correlation to be stable? An informal query amongst colleagues revealed estimates between *n* = 80 and *n* = 150.

Together with Marco Perugini, I did a systematic analysis of this question. The results of this simulation study are reported in Schönbrodt & Perugini, “At what sample size do correlations stabilize?” (*Journal of Research in Personality*, doi:10.1016/j.jrp.2013.05.009). In this paper a “corridor of stability” (*COS*) has been utilized: Deviations from the true value are defined as tolerable as long as they stay within that corridor (see also Figure 1 for a COS of +/- .1). The point of stability (*POS*) is that sample size from which on a specific trajectory does not leave the COS anymore.

The point of stability depends on the effect size (How strong is the true correlation?), the width of the corridor of stability (How much deviation from the true value am I willing to accept?), and the confidence in the decision (How confident do I want to be that the trajectory does not leave the COS any more?). If you’re interested in the details: read the paper. It’s not long.
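As a rough illustration of the POS definition, the following Python sketch finds the first sample size after the last excursion from a given corridor. The function name `point_of_stability` and the toy trajectory are invented for this example, and the corridor bounds are passed in directly so that the sketch makes no assumption about how the paper constructs them.

```python
def point_of_stability(trajectory, lower, upper):
    """Smallest n from which on the trajectory stays inside [lower, upper].

    trajectory: list of (n, r) tuples sorted by n.
    Returns None if the last estimate is still outside the corridor.
    """
    pos = None
    for n, r in trajectory:
        if lower <= r <= upper:
            if pos is None:
                pos = n   # candidate: first n of the current in-corridor run
        else:
            pos = None    # the trajectory left the corridor again
    return pos

# Toy trajectory (made-up numbers) and a corridor of .16 to .36.
toy = [(20, 0.69), (21, 0.35), (22, 0.40), (23, 0.30), (24, 0.33)]
print(point_of_stability(toy, lower=0.16, upper=0.36))  # prints 23
```

The estimate at n = 21 is inside the corridor, but the trajectory leaves it again at n = 22, so the point of stability is 23 and not 21.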

The bottom line is: For typical scenarios in psychology (i.e., rho = .21, w = .1, confidence = 80%), **correlations stabilize when *n* approaches 250**. That means, estimates with n > 250 are not only significant, they also are fairly *accurate* (see also Kelley & Maxwell, 2003, and Maxwell, Kelley, & Rausch, 2008, for elaborated discussions on parameter accuracy).

# Additional analyses (not reported in the publication)

Figure 2 shows the distribution of POS values, depending on the half-width of the COS and on effect size rho. The horizontal axis is cut at n = 300, although several POS were > 300. It can be seen that all distributions have a very long tail. This makes the estimation of the 95th quantile very unstable. Therefore we used a larger number of 100’000 bootstrap replications in each experimental condition in order to get fairly stable estimates for the extreme quantiles.
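The logic behind such a distribution can be sketched in a few lines of Python. This is a simplified stand-in for the actual simulation and rests on several assumptions: data are drawn directly from a bivariate normal instead of being bootstrapped, the corridor is taken as rho ± w on the raw *r* scale, trajectories run from n = 20 to n = 600 (later POS values are censored at 600), and only 300 replications are used to keep the run fast. The function names are hypothetical.

```python
import math
import random

def simulated_pos(rho, w, n_min, n_max, rng):
    """Simulate one trajectory; return the n right after its last excursion."""
    sx = sy = sxx = syy = sxy = 0.0
    last_outside = n_min - 1
    for n in range(1, n_max + 1):
        x = rng.gauss(0.0, 1.0)
        y = rho * x + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        sx += x
        sy += y
        sxx += x * x
        syy += y * y
        sxy += x * y
        if n >= n_min:
            cov = sxy - sx * sy / n
            r = cov / math.sqrt((sxx - sx * sx / n) * (syy - sy * sy / n))
            if abs(r - rho) > w:      # assumed corridor: rho +/- w on the r scale
                last_outside = n
    return min(last_outside + 1, n_max)  # censored at n_max

def pos_quantile(pos_values, confidence):
    """Empirical quantile: the POS that `confidence` of the trajectories do not exceed."""
    ordered = sorted(pos_values)
    index = max(math.ceil(confidence * len(ordered)) - 1, 0)
    return ordered[index]

rng = random.Random(2013)
pos_values = [simulated_pos(0.3, 0.1, 20, 600, rng) for _ in range(300)]

for confidence in (0.80, 0.90, 0.95):
    print(f"{confidence:.0%} of trajectories are stable by n = "
          f"{pos_quantile(pos_values, confidence)}")
```

With so few replications the upper quantiles jump around from seed to seed, which is exactly the instability of the long tail described above.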

Finally, Figure 3 shows the probability that a trajectory leaves the COS with increasing sample size.

The dotted lines mark the confidence levels of 80%, 90%, and 95% which were used in the publication. The *n* where a curve intersects one of these dotted lines indicates the corresponding value reported in Table 1 of the publication. For example, if the true correlation is .3 (which is already more than the average effect size in psychology) and you collect 100 participants, there’s still a chance of 50% that your correlation will leave the corridor between .21 and .39 (which are the boundaries for w=.1).
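Note that .21 and .39 are not simply .3 ± .1. These boundaries are reproduced if the half-width w is added and subtracted on the Fisher *z* scale and the result is transformed back to *r*. That the corridor is built this way is an inference from the numbers in this post, so treat the following Python sketch as an assumption and check the paper for the exact definition.

```python
import math

def corridor_bounds(rho, w):
    """Corridor around rho with half-width w applied on the Fisher z scale (assumed)."""
    z = math.atanh(rho)                  # Fisher z transform of the true correlation
    return math.tanh(z - w), math.tanh(z + w)

lower, upper = corridor_bounds(rho=0.3, w=0.1)
print(f"{lower:.2f} to {upper:.2f}")     # prints 0.21 to 0.39
```

One consequence of this construction is that the corridor is slightly asymmetric around rho in the *r* metric, and it gets narrower for stronger correlations.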

What is the conclusion? Significance tests determine the sign of a correlation. This conclusion can be made with much lower sample sizes. However, when we want to make an accurate conclusion about the *size* of an effect with some confidence (and we do not want to make a “Type M” error), we need much larger samples.
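The gap between "significant" and "accurate" can be illustrated with the standard Fisher *z* approximation for a correlation's confidence interval. This Python sketch is not part of the simulation study; it assumes an observed *r* of .21, a two-sided 5% test, and the usual standard error of 1/sqrt(n - 3), and the function names are made up.

```python
import math
from statistics import NormalDist

Z_CRIT = NormalDist().inv_cdf(0.975)   # about 1.96 for a two-sided 5% test

def fisher_ci(r, n):
    """Approximate 95% confidence interval for a correlation via Fisher z."""
    z = math.atanh(r)
    half_width = Z_CRIT / math.sqrt(n - 3)
    return math.tanh(z - half_width), math.tanh(z + half_width)

def smallest_significant_n(r):
    """Smallest n at which an observed r is significant under this approximation."""
    n = 4
    while abs(math.atanh(r)) * math.sqrt(n - 3) <= Z_CRIT:
        n += 1
    return n

r = 0.21
n_sig = smallest_significant_n(r)
for n in (n_sig, 250):
    low, high = fisher_ci(r, n)
    print(f"n = {n:3d}: 95% CI from {low:.2f} to {high:.2f}")
```

At the smallest significant sample size the interval barely excludes zero and still spans a wide range of effect sizes; at n = 250 it is considerably tighter.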

The full R source code for the simulations can be downloaded here.

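*References:*

Kelley, K., & Maxwell, S. E. (2003). Sample size for multiple regression: Obtaining regression coefficients that are accurate, not simply significant. *Psychological Methods, 8*, 305–321.

Maxwell, S. E., Kelley, K., & Rausch, J. R. (2008). Sample size planning for statistical power and accuracy in parameter estimation. *Annual Review of Psychology, 59*, 537–563. doi:10.1146/annurev.psych.59.103006.093735

Schönbrodt, F. D., & Perugini, M. (2013). At what sample size do correlations stabilize? *Journal of Research in Personality, 47*, 609–612. doi:10.1016/j.jrp.2013.05.009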

This is an awesome article! You present a compelling argument with clear analysis and great (although expensive) recommendations. Thank you for your hard work and public discourse!

Nice argument and obviously useful. I’ve faced a lot of this myself and wondered about it. Sad that the n is so high. Often in medical research we need results in smaller samples, when expensive procedures or small populations can cause problems for doing large studies.

I found this post very interesting and insightful the first time I read it.

I ran into an issue at work today that made me think back to this. When analyzing data using a correlation test in Minitab, I get a p-value of 0.000 and a Pearson value of around -0.2, which seems to imply the opposite of what the p-value tells me. Now you say the higher the N the more stabilized it becomes, and in this case the N is over 500 samples. I was wondering if you had anything to read on this phenomenon or any suggestions/knowledge on which to believe or where to look for an error on my end. I’m not sure what is going wrong.

I lean towards believing the p-value more because 1) it tells me what I want to hear (haha) and 2) I know exactly what a p-value is while I have a more foggy understanding of the Pearson.

Thank you for your time,

Alec J. Greenspan