Supplementary material for Schönbrodt & Perugini (2013): At what sample size do correlations stabilize?
Maybe you have encountered this situation: you run a large-scale study over the internet, and out of curiosity, you frequently check the correlation between two variables.
In my experience this practice is usually frustrating: in small samples (and we will see what "small" means in this context), correlations go up and down, change sign, and move from "significant" to "non-significant" and back. As an example, Figure 1 shows the actual trajectory of a correlation, plotted against sample size (I also posted a video of this evolution).
The data are shown simply in the order in which participants entered the study (i.e., the data have not been rearranged). In this case, the correlation started out really strong (r = .69) and decayed continuously to its final value of r = .26. The light gray lines show some exemplary bootstrapped alternative trajectories.
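A trajectory like the one in Figure 1 can be sketched in a few lines. The following is a minimal illustration on simulated data, not the code behind the figure. The helper names (pearson_r, trajectory, simulate_pairs), the true correlation of .3, the starting sample size of 20, and the way the alternative trajectories are produced (resampling the cases with replacement and recomputing the running correlation) are all assumptions made for this example.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation of two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


def trajectory(xs, ys, n_min=20):
    """Correlation of the first n cases, for n = n_min, ..., len(xs)."""
    return [pearson_r(xs[:n], ys[:n]) for n in range(n_min, len(xs) + 1)]


def simulate_pairs(rho, n, rng):
    """Draw n bivariate-normal pairs with population correlation rho."""
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0, 1)
        xs.append(x)
        ys.append(rho * x + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1))
    return xs, ys


rng = random.Random(1)  # fixed seed so the run is repeatable
xs, ys = simulate_pairs(rho=0.3, n=300, rng=rng)

# Trajectory in the original "arrival order" of the simulated participants.
observed = trajectory(xs, ys)

# Hypothetical alternative trajectories: resample cases with replacement
# and recompute the running correlation for each resample.
alternatives = []
for _ in range(5):
    idx = [rng.randrange(len(xs)) for _ in range(len(xs))]
    alternatives.append(trajectory([xs[i] for i in idx], [ys[i] for i in idx]))

print(f"r at n = 20: {observed[0]:.2f}, r at n = 300: {observed[-1]:.2f}")
```

Plotting observed and the entries of alternatives against n = 20, ..., 300 gives a picture of the same kind as Figure 1.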
In this particular case, at least the sign was stable ("There is a positive relationship in the population", see also "Type-S errors"). Other trajectories in this data set, however, changed signs or their significance status. One correlation even changed from "negative significant" to "positive significant"!
Obviously, the estimate of a correlation stabilizes with increasing sample size. Now I wanted to know: At which sample size exactly can I expect a correlation to be stable? An informal query amongst colleagues revealed estimates between n = 80 and n = 150.
Together with Marco Perugini, I did a systematic analysis of this question. The results of this simulation study are reported here [PDF, 0.39 MB]. The paper uses a "corridor of stability" (COS): deviations from the true value are defined as tolerable as long as they stay within that corridor (see also Figure 1 for a COS of +/- .1). The point of stability (POS) is the sample size from which onward a specific trajectory no longer leaves the COS.
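The POS definition translates directly into a small function. This is a sketch under assumptions: the function name, its arguments, and the toy trajectory are made up for illustration and do not come from the paper's code.

```python
def point_of_stability(traj, lower, upper, n_min):
    """Return the sample size from which the trajectory stays inside the corridor.

    traj[i] is the correlation at sample size n_min + i, and the corridor is
    the closed interval [lower, upper]. Returns None if the trajectory is
    outside the corridor at the largest observed sample size.
    """
    last_outside = -1
    for i, r in enumerate(traj):
        if r < lower or r > upper:
            last_outside = i
    if last_outside == len(traj) - 1:
        return None
    return n_min + last_outside + 1


# Toy trajectory (made-up values) starting at n = 20, corridor .20 to .40.
toy = [0.55, 0.35, 0.42, 0.31, 0.27, 0.33]
print(point_of_stability(toy, lower=0.20, upper=0.40, n_min=20))  # 23
```

Note that the POS is not the first sample size at which the trajectory enters the corridor (n = 21 in the toy example) but the first sample size after its last excursion.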
The point of stability depends on the effect size (How strong is the true correlation?), the width of the corridor of stability (How much deviation from the true value am I willing to accept?), and the confidence in the decision (How confident do I want to be that the trajectory does not leave the COS any more?). If you’re interested in the details: read the paper. It’s not long.
The bottom line is: For typical scenarios in psychology, correlations stabilize when n approaches 250. That means that estimates with n > 250 are not only significant but also fairly accurate (see also Kelley & Maxwell, 2003, and Maxwell, Kelley, & Rausch, 2008, for detailed discussions of parameter accuracy).
Additional analyses (not reported in the publication)
Figure 2 shows the distribution of POS values, depending on the half-width of the COS and on the effect size rho. The horizontal axis is cut at n = 300, although several POS values were > 300. It can be seen that all distributions have a very long tail. This makes the estimation of the 95th quantile very unstable. Therefore we used a larger number of 100,000 bootstrap replications in each experimental condition in order to get fairly stable estimates for the extreme quantiles.
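The step from a distribution of POS values to its upper quantiles can be sketched as follows. This is a rough illustration under stated assumptions, not the simulation from the paper: it draws fresh bivariate-normal data instead of bootstrapping from a large population, it uses only 300 replications and a maximum sample size of 1000, and all function names are hypothetical. The corridor of .21 to .39 around a true correlation of .3 is taken from the example discussed with Figure 3.

```python
import math
import random


def running_correlations(xs, ys, n_min):
    """Correlation of the first n pairs for n = n_min, ..., len(xs)."""
    sx = sy = sxx = syy = sxy = 0.0
    out = []
    for n, (x, y) in enumerate(zip(xs, ys), start=1):
        sx += x
        sy += y
        sxx += x * x
        syy += y * y
        sxy += x * y
        if n >= n_min:
            cov = sxy - sx * sy / n
            var_x = sxx - sx * sx / n
            var_y = syy - sy * sy / n
            out.append(cov / math.sqrt(var_x * var_y))
    return out


def simulated_pos(rho, lower, upper, n_min, n_max, rng):
    """POS of one simulated trajectory (n_max + 1 if still outside at n_max)."""
    xs, ys = [], []
    for _ in range(n_max):
        x = rng.gauss(0, 1)
        xs.append(x)
        ys.append(rho * x + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1))
    last_outside = n_min - 1
    for n, r in enumerate(running_correlations(xs, ys, n_min), start=n_min):
        if r < lower or r > upper:
            last_outside = n
    return last_outside + 1


def quantile(values, p):
    """Empirical quantile: smallest value with at least a share p at or below it."""
    ordered = sorted(values)
    return ordered[max(0, math.ceil(p * len(ordered)) - 1)]


rng = random.Random(2013)  # fixed seed so the run is repeatable
pos_values = [simulated_pos(0.3, 0.21, 0.39, 20, 1000, rng) for _ in range(300)]
q80, q90, q95 = (quantile(pos_values, p) for p in (0.80, 0.90, 0.95))
print(q80, q90, q95)
```

With so few replications, the upper quantiles can shift noticeably from one seed to the next, which is the instability that motivated the much larger number of replications.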
Figure 3 shows the probability that a trajectory leaves the COS with increasing sample size.
The dotted lines mark the confidence levels of 80%, 90%, and 95% that were used in the publication. The sample sizes at which the curves intersect these dotted lines are the values reported in Table 1 of the publication. For example, if the true correlation is .3 (which is already more than the average effect size in psychology) and you collect 100 participants, there’s still a 50% chance that your correlation will leave the corridor between .21 and .39 (which are the boundaries for w = .1).
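The boundaries .21 and .39 are not simply .3 plus or minus .1. One reading that reproduces them is that the half-width w is applied on Fisher's z scale and the result is transformed back to r. Treat this as an assumption to be checked against the paper; the function name below is hypothetical.

```python
import math


def corridor(rho, w):
    """Corridor boundaries when the half-width w is applied on Fisher's z scale."""
    z = math.atanh(rho)  # Fisher's r-to-z transformation
    return math.tanh(z - w), math.tanh(z + w)  # back-transform to r


lower, upper = corridor(0.3, 0.1)
print(f"{lower:.2f} {upper:.2f}")  # 0.21 0.39
```

Under this reading, the corridor is slightly asymmetric around the true correlation on the r scale, and the asymmetry grows as rho moves away from zero.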
What is the conclusion? Significance tests determine the sign of a correlation, and this conclusion can be drawn from much smaller samples. However, when we want to draw an accurate conclusion about the size of an effect with some confidence (and we do not want to make a "Type M" error), we need much larger samples.
Finally, Figure 4 compares the POS values for several non-normal marginal distributions. It can be seen that the POS values are virtually identical. Please note that we only employed typical non-normal distributions (i.e., some skewness, somewhat heavier tails). For extreme deviations from normality or extreme outliers, results might be different.
For the full R source code for the simulations, see the right sidebar.