Journal article

treeKL: A distance between high dimension empirical distributions

This paper offers a methodological contribution for computing the distance between two empirical distributions in a Euclidean space of very large dimension. We propose to use decision trees instead of relying on a standard quantization of the feature space. Our contribution is twofold. First, we define a new distance between empirical distributions, based on the Kullback-Leibler (KL) divergence between the distributions over the leaves of decision trees built for the two empirical distributions. Second, we propose a new procedure to build these unsupervised trees efficiently. The performance of this new metric is illustrated on image clustering and neuron classification. Results show that the tree-based method outperforms standard methods based on bag-of-features procedures. (C) 2012 Elsevier B.V. All rights reserved.
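To make the idea of a KL divergence over tree leaves concrete, the following is a minimal sketch and not the procedure of the paper. The abstract does not describe how the trees are built or how the two divergences are combined, so every such detail here is an assumption: the split rule (median of the highest-variance coordinate), the additive smoothing constant `eps`, the symmetrized sum of the two KL terms, and all function names (`build_tree`, `leaf_distribution`, `tree_kl`) are hypothetical.

```python
import math
from statistics import median, pvariance


def build_tree(points, max_depth=3, min_leaf=5):
    """Unsupervised tree (assumed rule): split at the median of the
    highest-variance coordinate. A leaf is represented by None."""
    if max_depth == 0 or len(points) < 2 * min_leaf:
        return None
    dims = range(len(points[0]))
    d = max(dims, key=lambda j: pvariance([p[j] for p in points]))
    t = median(p[d] for p in points)
    left = [p for p in points if p[d] <= t]
    right = [p for p in points if p[d] > t]
    if not left or not right:  # degenerate split: stop here
        return None
    return (d, t,
            build_tree(left, max_depth - 1, min_leaf),
            build_tree(right, max_depth - 1, min_leaf))


def leaf_id(tree, p):
    """Return the root-to-leaf path of point p as a string of 0s and 1s."""
    path = ""
    while tree is not None:
        d, t, left, right = tree
        if p[d] <= t:
            path, tree = path + "0", left
        else:
            path, tree = path + "1", right
    return path


def leaves(tree, path=""):
    """List the path identifiers of all leaves of the tree."""
    if tree is None:
        return [path]
    _, _, left, right = tree
    return leaves(left, path + "0") + leaves(right, path + "1")


def leaf_distribution(tree, points, eps=1e-3):
    """Smoothed histogram of the points over the leaves (eps is assumed)."""
    ids = leaves(tree)
    counts = {i: eps for i in ids}
    for p in points:
        counts[leaf_id(tree, p)] += 1.0
    total = sum(counts.values())
    return [counts[i] / total for i in ids]


def kl(p, q):
    """Discrete Kullback-Leibler divergence KL(p || q)."""
    return sum(a * math.log(a / b) for a, b in zip(p, q))


def tree_kl(x, y, **kw):
    """Assumed symmetrization: one tree per sample, sum of both KL terms."""
    tx, ty = build_tree(x, **kw), build_tree(y, **kw)
    return (kl(leaf_distribution(tx, x), leaf_distribution(tx, y))
            + kl(leaf_distribution(ty, y), leaf_distribution(ty, x)))
```

Under these assumptions, `tree_kl(x, x)` is zero and the value grows as the two samples occupy different leaves. The additive smoothing keeps the divergence finite when a leaf of one tree receives no points from the other sample.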
