Correlation by Compression

Abstract. Discovering correlated variables is one of the core problems in data analysis. Many measures of correlation have been proposed, yet correlation itself remains surprisingly ill-defined in general. That is, most, if not all, existing measures make very strong assumptions about the data distribution or the type of dependency they can detect.

In this work, we provide a general theory of correlation that makes no such assumptions. Simply put, we propose correlation by compression. To this end, we propose two correlation measures built on solid information-theoretic foundations, namely Kolmogorov complexity. The proposed measures possess properties that any sensible correlation measure should have. Kolmogorov complexity, however, is not computable, and hence we propose practical, computable instantiations based on the Minimum Description Length (MDL) principle.
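To convey the flavour of such a measure, the identity below is standard algorithmic information theory; it is background, not necessarily the exact definition used in the paper. The algorithmic mutual information between two strings x and y compares describing them separately with describing them jointly,

\[ I(x : y) \;=\; K(x) + K(y) - K(x, y), \]

which is large when knowing one string makes the other cheap to describe, and (up to logarithmic terms) zero when the strings are algorithmically independent. Dividing by, for instance, \(\min\{K(x), K(y)\}\) gives a roughly normalised score between 0 and 1.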

In practice, we can apply the proposed measures to any type of data by instantiating them with any lossless real-world compressor that rewards dependencies. Extensive experiments show that the correlation measures work well in practice, have high statistical power, and find meaningful correlations on binary data, while they are easily extensible to other data types.
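As a flavour of how such an instantiation can look, the sketch below scores two byte strings by how much a general-purpose lossless compressor (zlib, used here purely as a stand-in for the MDL-based encodings of the paper) saves when compressing them together rather than separately. The function names and the normalisation are illustrative assumptions, not the measures from the paper or from the released source code.

import random
import zlib

def c(data: bytes) -> int:
    # Approximate description length: size in bytes after compression.
    return len(zlib.compress(data, 9))

def compression_correlation(x: bytes, y: bytes) -> float:
    # How much of the shorter string is "explained" by compressing jointly.
    cx, cy = c(x), c(y)
    cxy = c(x + y)                 # joint description via concatenation
    saved = cx + cy - cxy          # bytes saved by exploiting the dependency
    return max(0.0, saved / min(cx, cy))

if __name__ == "__main__":
    random.seed(0)
    a = bytes(random.randrange(256) for _ in range(2000))
    b = a[:1500] + bytes(random.randrange(256) for _ in range(500))   # shares structure with a
    r = bytes(random.randrange(256) for _ in range(2000))             # independent of a
    print(compression_correlation(a, b))   # high: large shared structure
    print(compression_correlation(a, r))   # close to 0: nothing shared

For binary 0/1 data, the setting evaluated in the paper, the same recipe applies with a compressor that rewards co-occurrence structure rather than a general-purpose byte compressor.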

Implementation

The Python source code (March 2017) by Kailash Budhathoki and Jilles Vreeken.

Related Publications

Budhathoki, K & Vreeken, J Correlation by Compression. In: Proceedings of the SIAM Conference on Data Mining (SDM), SIAM, 2017. (25% acceptance rate)
Budhathoki, K Correlation by Compression. M.Sc. Thesis, Saarland University, 2015.