Clustering is a method for discovering structure in data, widely used across many scientific disciplines. The two main clustering problems this dissertation considers are K-means and K-medoids. Both problems are NP-hard, and both have well-studied heuristic approximation algorithms. An example is Lloyd's algorithm for K-means, which is so widely used that it has become synonymous with the problem it attempts to solve.
A large part of this dissertation is about accelerating Lloyd's algorithm, and its mini-batch and K-medoids variants. The basic tool used to achieve these accelerations is the triangle inequality, which can be applied in a multitude of ways to eliminate costly distance calculations between data samples, as well as to reduce the number of comparisons of these distances.
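To illustrate the basic idea with a minimal sketch (not any of the specific algorithms developed in this dissertation): if a sample's distance to its currently assigned center is known, and inter-center distances have been precomputed, the triangle inequality can prove that certain other centers cannot be closer, so the corresponding sample-center distances are never computed. The function name and array layout below are assumptions made purely for illustration.

    import numpy as np

    def assign_with_pruning(x, centers, a, cc):
        """Assign sample x to its nearest center, skipping every candidate
        that the triangle inequality proves cannot beat the current best.
        centers is a (K, d) array, a is the current assignment of x, and
        cc[i, j] holds the precomputed inter-center distance."""
        d_best = np.linalg.norm(x - centers[a])
        best = a
        for j in range(len(centers)):
            if j == best:
                continue
            # If d(c_best, c_j) >= 2 d(x, c_best), then by the triangle inequality
            # d(x, c_j) >= d(c_best, c_j) - d(x, c_best) >= d(x, c_best),
            # so c_j cannot be closer and its distance to x is never computed.
            if cc[best, j] >= 2.0 * d_best:
                continue
            d_j = np.linalg.norm(x - centers[j])
            if d_j < d_best:
                d_best, best = d_j, j
        return best, d_best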
The first effective use of the triangle inequality to accelerate K-means was by Elkan (2003), with novel refinements appearing more recently in Hamerly (2010). In Chapter 1 we extend these approaches. First, we show that by using centers stored from previous iterations, one can greatly reduce the number of sample-center distance computations, with substantial improvements in algorithm execution time. We then present an improvement over previous triangle-inequality-based algorithms in low dimensions, which uses inter-center distances in a novel way.
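The sketch below shows the standard bound-maintenance step underlying the algorithms of Elkan and Hamerly, on which Chapter 1 builds; it is not the chapter's new contribution. The bounds remain valid after a center update because a center that moves by delta can change any sample-center distance by at most delta.

    import numpy as np

    def update_bounds(lower, upper, assignment, old_centers, new_centers):
        """Elkan/Hamerly-style bound maintenance after one center update.
        lower[i, j] under-estimates d(x_i, c_j) and upper[i] over-estimates
        the distance from x_i to its assigned center.  If center j moves by
        delta_j, the triangle inequality gives
            d(x, c_j_new) >= d(x, c_j_old) - delta_j,
            d(x, c_a_new) <= d(x, c_a_old) + delta_a,
        so every bound can be refreshed without touching a single sample."""
        delta = np.linalg.norm(new_centers - old_centers, axis=1)  # per-center movement
        lower = np.maximum(lower - delta[None, :], 0.0)
        upper = upper + delta[assignment]
        return lower, upper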
Chapter 2 considers the use of the triangle inequality to accelerate the mini-batch variant of Lloyd's algorithm (Sculley, 2010). The main difficulty of incorporating triangle inequality bounds in this setting is that centers can move significantly during the iterations in which a sample is unused, rendering its bounds ineffective. We propose a modified sampling scheme that reduces the length of these periods of dormancy, and present an algorithm which achieves an order of magnitude acceleration over the standard mini-batch algorithm.
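For reference, a minimal sketch of the standard mini-batch update of Sculley is given below; it is the baseline, not the modified sampling scheme of Chapter 2. Samples outside the current mini-batch are untouched, which is why their bounds go stale. The function name and the brute-force assignment within the batch are illustrative assumptions.

    import numpy as np

    def minibatch_kmeans_step(X, centers, counts, batch_size, rng):
        """One step of standard mini-batch K-means: draw a random mini-batch,
        assign it to the nearest centers, then nudge each selected center
        toward its assigned samples with a per-center learning rate 1/count."""
        batch = X[rng.choice(len(X), size=batch_size, replace=False)]
        # brute-force nearest-center assignment for the mini-batch only
        d = np.linalg.norm(batch[:, None, :] - centers[None, :, :], axis=2)
        nearest = d.argmin(axis=1)
        for x, j in zip(batch, nearest):
            counts[j] += 1
            eta = 1.0 / counts[j]                      # per-center learning rate
            centers[j] = (1.0 - eta) * centers[j] + eta * x
        return centers, counts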
We then turn attention to the K-medoids problem. In Chapter 3 we focus on the specific problem of determining the medoid of a set. With N samples in R^d, we present a simple algorithm of complexity O(N^{3/2}) to determine the medoid. It is the first sub-quadratic algorithm for this problem when d > 1. The algorithm makes use of the triangle inequality to eliminate all but O(N^{1/2}) samples as potential medoid candidates.
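The elimination idea can be illustrated as follows (a simplified sketch, not the exact Chapter 3 algorithm): once the energy E(x) = sum_j d(x, x_j) of one sample has been computed exactly, the triangle inequality gives E(y) >= E(x) - N d(x, y) for every other sample y, and y is discarded as a medoid candidate as soon as such a lower bound exceeds the best energy found so far.

    import numpy as np

    def medoid_with_elimination(X, rng):
        """Return the index and energy of the medoid of the rows of X,
        skipping candidates whose triangle-inequality lower bound already
        exceeds the best energy found so far (illustrative sketch only)."""
        N = len(X)
        best_energy, best_index = np.inf, -1
        lower = np.zeros(N)                  # lower bounds on each sample's energy
        for i in rng.permutation(N):
            if lower[i] >= best_energy:
                continue                     # eliminated without computing its energy
            d_i = np.linalg.norm(X - X[i], axis=1)
            energy = d_i.sum()
            if energy < best_energy:
                best_energy, best_index = energy, i
            # E(y) >= E(x_i) - N * d(x_i, y): tighten all remaining lower bounds
            lower = np.maximum(lower, energy - N * d_i)
        return best_index, best_energy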
Finally, in Chapter 4 we compare different K-medoids algorithms, and find that CLARANS, which iteratively replaces randomly selected centers with non-centers, avoids the local minima of other popular K-medoids algorithms. This motivates the use of CLARANS to initialize Lloyd's algorithm for K-means, resulting in lower final energies than K-means++ seeding. We use the triangle inequality to offset the increased computation required by CLARANS.
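A single CLARANS swap proposal can be sketched as follows; the dense distance matrix used here is only for clarity, since Chapter 4 uses triangle-inequality bounds precisely to avoid such costs, and the function name and interface are assumptions.

    import numpy as np

    def clarans_step(D, medoids, rng):
        """One randomized swap in the spirit of CLARANS: a random current
        medoid and a random non-medoid are exchanged, and the swap is kept
        only if it lowers the K-medoids energy (sum over samples of the
        distance to the nearest medoid).  D is the pairwise distance matrix."""
        non_medoids = np.setdiff1d(np.arange(D.shape[0]), medoids)
        proposal = medoids.copy()
        proposal[rng.integers(len(medoids))] = rng.choice(non_medoids)
        old_energy = D[:, medoids].min(axis=1).sum()
        new_energy = D[:, proposal].min(axis=1).sum()
        return (proposal, new_energy) if new_energy < old_energy else (medoids, old_energy)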