On some graph-based two-sample tests for high dimension, low sample size data

Sarkar, Soham; Biswas, Rahul; Ghosh, Anil K.

doi:10.1007/s10994-019-05857-4

research article

2020

Machine Learning

Testing for equality of two high-dimensional distributions is a challenging problem, and this becomes even more challenging when the sample size is small. Over the last few decades, several graph-based two-sample tests have been proposed in the literature, which can be used for data of arbitrary dimensions. Most of these test statistics are computed using pairwise Euclidean distances among the observations. But, due to concentration of pairwise Euclidean distances, these tests have poor performance in many high-dimensional problems. Some of them can have powers even below the nominal level when the scale-difference between two distributions dominates the location-difference. To overcome these limitations, we introduce some new dissimilarity indices and use them to modify some popular graph-based tests. These modified tests use the distance concentration phenomenon to their advantage, and as a result, they outperform the corresponding tests based on the Euclidean distance in a wide variety of examples. We establish the high-dimensional consistency of these modified tests under fairly general conditions. Analyzing several simulated as well as real data sets, we demonstrate their usefulness in high dimension, low sample size situations.
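The distance concentration mentioned in the abstract can be illustrated with a small simulation. The following is a minimal sketch, not the authors' method or code: it assumes i.i.d. standard normal coordinates, and the function name `relative_spread` and its parameters `dim`, `n`, and `seed` are hypothetical, chosen only for this illustration. It computes the ratio of the standard deviation to the mean of all pairwise Euclidean distances in a small sample. Under this assumption the ratio shrinks as the dimension grows, which is why distance-based graphs carry little neighborhood information in high dimension, low sample size settings.

```python
import math
import random
import statistics


def relative_spread(dim, n=20, seed=0):
    """Return std/mean of the pairwise Euclidean distances among n points.

    Assumption for illustration only: each point has `dim` i.i.d. standard
    normal coordinates. This is not the setting or code of the paper.
    """
    rng = random.Random(seed)
    points = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]
    # All n * (n - 1) / 2 pairwise Euclidean distances.
    dists = [
        math.dist(points[i], points[j])
        for i in range(n)
        for j in range(i + 1, n)
    ]
    return statistics.pstdev(dists) / statistics.fmean(dists)


if __name__ == "__main__":
    # The printed ratio should decrease as the dimension increases.
    for dim in (2, 20, 200, 2000):
        print(dim, round(relative_spread(dim), 3))
```

A small ratio means that all pairwise distances are nearly equal relative to their size, so edges in a Euclidean minimum spanning tree or nearest-neighbor graph are chosen on the basis of very small differences.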

Type

research article

DOI

10.1007/s10994-019-05857-4

Web of Science ID

WOS:000496197500002

Authors

Sarkar, Soham; Biswas, Rahul; Ghosh, Anil K.

Publication date

2020

Publisher

Springer

Published in

Machine Learning

Volume

109

Start page

279

End page

306

Subjects

Computer Science, Art...

Computer Science

distance concentratio...

high-dimensional cons...

minimum spanning tree...

nearest neighbor

non-bipartite matchin...

permutation test

shortest hamiltonian ...

large numbers

multivariate

distributions

equality

laws

Peer reviewed

REVIEWED

EPFL units

SMAT

Available on Infoscience

November 27, 2019

Use this identifier to reference this record

https://infoscience.epfl.ch/handle/20.500.14299/163416