Abstract

Motivation: Unbiased clustering methods are needed to analyze growing numbers of complex data sets. Currently available clustering methods often depend on parameters that are set by the user, they lack stability, and are not applicable to small data sets. To overcome these shortcomings we used topological data analysis, an emerging field of mathematics that can discerns additional feature and discover hidden insights on data sets and has a wide application range. Results: We have developed a topology-based clustering method called Two-Tier Mapper (TTMap) for enhanced analysis of global gene expression datasets. First, TTMap discerns divergent features in the control group, adjusts for them, and identifies outliers. Second, the deviation of each test sample from the control group in a high-dimensional space is computed, and the test samples are clustered using a new Mapper-based topological algorithm at two levels: a global tier and local tiers. All parameters are either carefully chosen or data-driven, avoiding any user-induced bias. The method is stable, different datasets can be combined for analysis, and significant subgroups can be identified. It outperforms current clustering methods in sensitivity and stability on synthetic and biological datasets, in particular when sample sizes are small; outcome is not affected by removal of control samples, by choice of normalization, or by subselection of data. TTMap is readily applicable to complex, highly variable biological samples and holds promise for personalized medicine.

Details

Actions