After calculating a correlation coefficient, it is usually reasonable to check its significance. Even if two variables are uncorrelated, the correlation coefficient computed from a finite sample will almost never be exactly zero. A sample correlation coefficient of exactly zero is even less probable than getting exactly 500 heads in 1000 coin tosses.
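The point about finite samples can be illustrated with a small simulation. This is only a sketch: the helper `pearson_r`, the seed and the sample size are choices made for this example and are not part of the library described on this page.

```python
import math
import random

def pearson_r(x, y):
    """Sample Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Two samples drawn independently, so the true correlation is zero.
rng = random.Random(12345)  # fixed seed, chosen arbitrarily for reproducibility
n = 1000
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
y = [rng.gauss(0.0, 1.0) for _ in range(n)]

r = pearson_r(x, y)
print(f"sample correlation of independent data: {r:+.4f}")
```

The printed value is small but non-zero, which is why a significance test is needed before a non-zero coefficient is read as evidence of a real relationship.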
The algorithms presented on this page provide three tests for the significance of a correlation coefficient. The two-tailed test checks the null hypothesis that the correlation between two variables is zero. The left-tailed test checks the null hypothesis that the correlation is non-negative (i.e. the correlation coefficient is greater than or equal to 0). The right-tailed test checks the null hypothesis that the correlation is non-positive (i.e. the correlation coefficient is less than or equal to 0).
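The relationship between the three tests can be sketched as follows. This example uses the Fisher z-transform with a normal approximation, which is a swapped-in textbook technique chosen so that the code needs only the Python standard library. It is not the method used by the subroutines on this page, and the function name `tail_p_values` is hypothetical.

```python
import math
from statistics import NormalDist

def tail_p_values(r, n):
    """Approximate p-values (two-tailed, left-tailed, right-tailed) for a
    sample correlation coefficient r computed from n pairs.

    Assumes -1 < r < 1 and n > 3. Under zero true correlation,
    atanh(r) * sqrt(n - 3) is approximately standard normal.
    """
    z = math.atanh(r) * math.sqrt(n - 3)
    cdf = NormalDist().cdf(z)
    both = 2.0 * min(cdf, 1.0 - cdf)  # H0: correlation is zero
    left = cdf                        # H0: correlation >= 0, rejected for strongly negative r
    right = 1.0 - cdf                 # H0: correlation <= 0, rejected for strongly positive r
    return both, left, right

both, left, right = tail_p_values(0.45, 30)
print(f"two-tailed: {both:.4f}  left-tailed: {left:.4f}  right-tailed: {right:.4f}")
```

A small p-value means the corresponding null hypothesis is rejected. For a positive coefficient only the two-tailed and right-tailed tests can become significant; the left-tailed p-value stays above 0.5.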
The significance test for Pearson's correlation coefficient is performed by the PearsonCorrelationSignificance subroutine. This subroutine requires the samples to be normally distributed, because the tails of the distribution of Pearson's correlation coefficient have been calculated for normal samples only. If the samples deviate slightly from the normal distribution, the test is still applicable, but its results will not be accurate. As the deviation increases, the results become less credible. Therefore, if you are not confident that the samples are close enough to normal, it is better to use a non-parametric correlation coefficient (Spearman's rank correlation coefficient) and the corresponding test, which does not require sample normality. That test is performed by the SpearmanRankCorrelationSignificance subroutine.
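For normal samples with zero true correlation, the statistic t = r * sqrt((n - 2) / (1 - r^2)) follows Student's t-distribution with n - 2 degrees of freedom. The sketch below computes a two-tailed p-value from this fact, integrating the t density numerically with Simpson's rule. It is an illustration under these assumptions only: the function names are hypothetical, and the subroutines named above may compute the tails differently.

```python
import math

def student_t_pdf(t, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(t * t / df))

def student_t_cdf(t, df, steps=2000):
    """CDF via Simpson's rule on [0, |t|] plus symmetry (steps must be even)."""
    a = abs(t)
    if a == 0.0:
        return 0.5
    h = a / steps
    total = student_t_pdf(0.0, df) + student_t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * student_t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if t > 0 else 0.5 - area

def pearson_t_test(r, n):
    """Return (t statistic, two-tailed p-value); requires -1 < r < 1 and n > 2."""
    df = n - 2
    t = r * math.sqrt(df / (1.0 - r * r))
    cdf = student_t_cdf(t, df)
    return t, 2.0 * min(cdf, 1.0 - cdf)

t_stat, p_value = pearson_t_test(0.6, 12)
print(f"t = {t_stat:.3f}, two-tailed p = {p_value:.4f}")
```

The p-value is only as trustworthy as the normality assumption behind the t-distribution, which is the limitation described above.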
As noted above, the significance test for rank correlation does not depend on the distribution of the samples. Another advantage of the non-parametric correlation coefficient is that it is less sensitive to outliers. If the sample size is small, a single large outlier can inflate Pearson's correlation coefficient and lead to a wrong conclusion. The effect of an outlier on Spearman's rank correlation coefficient is bounded from above regardless of the outlier's size, which makes this coefficient especially valuable when processing noisy data.
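The effect of a single outlier can be seen on a small made-up data set. The sketch below computes Spearman's coefficient as Pearson's coefficient of the ranks, using average ranks for ties. The helper names and the data are invented for this illustration and are not taken from the library.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson's coefficient of the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Ten points on a perfectly decreasing line, plus one outlier at (size, size).
def with_outlier(size):
    x = list(range(1, 11)) + [size]
    y = list(range(10, 0, -1)) + [size]
    return x, y

for size in (50, 5000):
    x, y = with_outlier(size)
    print(f"outlier {size}: Pearson {pearson_r(x, y):+.3f}, "
          f"Spearman {spearman_rho(x, y):+.3f}")
```

One added point turns Pearson's coefficient from -1 into a strongly positive value, and the effect grows with the outlier's size. Spearman's coefficient stays negative and does not change when the outlier is made larger, because only the outlier's rank enters the calculation.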