Novel Algorithms For Clustering

Newling, James Peter

doi:10.5075/epfl-thesis-8375

Newling, James Peter

2018

Télécharger

Formats

Format
BibTeX
MARC
MARCXML
DublinCore
EndNote
NLM
RefWorks
RIS

Fichiers

Résumé

Clustering is a method for discovering structure in data, widely used across many scientific disciplines. The two main clustering problems this dissertation considers are K-means and K-medoids. These are NP-hard problems in the number of samples and clusters, and both have well studied heuristic approximation algorithms. An example is Lloyd's algorithm for K-means, which is so widely used that it has become synonymous with the problem it attempts to solve. A large part of this dissertation is about accelerating Lloyd's algorithm, and its mini-batch and K-medoids variants. The basic tool used to achieve these accelerations is the triangle inequality, which can be applied in a multitude of ways to eliminate costly distance calculations between data samples, as well as to reduce the number of comparisons of these distances. The first effective use of the triangle inequality to accelerate K-means was by Elkan (2003) with novel refinements appearing more recently in Hamerly (2010). In Chapter 1 we extend these approaches. First, we show that by using centers stored from previous iterations, one can greatly reduce the number of sample-center distance computations, with substantial improvements in algorithm execution time. We then present an improvement over previous triangle inequality based algorithms for low-dimensions, which uses inter-center distances in a novel way. Chapter 2 considers the use of the triangle inequality to accelerate the mini-batch variant of Lloyd's algorithm (Sculley, 2011). The main difficulty of incorporating triangle inequality bounding in this setting is that clusters can move significantly during the iterations in which a sample is unused, which makes triangle inequality bounding ineffective. We propose a modified sampling scheme to reduce the length of these periods of dormancy, and present an algorithm which achieves an order of magnitude acceleration over the standard mini-batch algorithm. We then turn attention to the K-medoids problem. In Chapter 3 we focus on the specific problem of determining the medoid of a set. With N samples in R-d, we present a simple algorithm of complexity O(N^{3/2}) to determine the medoid. It is the first sub-quadratic algorithm for this problem when d > 1. The algorithm makes use of the triangle inequality to eliminate all but O(N^{1/2}) samples as potential medoid candidates. Finally, in Chapter 4 we compare different K-medoids algorithms, and find that CLARANS, which iteratively replaces randomly selected centers with non-centers, avoids the local minima of other popular K-medoids algorithms. This motivates the use of CLARANS for initializing Lloyd's algorithm for K-means, which results in improved final energies as compared to K-means++ seeding. We use the triangle inequality to offset the increased computation required by CLARANS.

Détails

Titre Novel Algorithms For Clustering

Auteur(s) Newling, James Peter

Directeur(s)

Fleuret, François

Pagination 148

Date 2018

Editeur Lausanne, EPFL

Mots-clés (libres)

clustering; triangle inequality; k-means; k-medoids; medoid; mini-batch

Langue Anglais

DOI https://doi.org/10.5075/epfl-thesis-8375

Laboratoires LIDIAP

Le document apparaît dans Production scientifique et compétences > STI - Faculté des sciences et techniques de l'ingénieur > IEM - Institute of Electrical and Micro Engineering > LIDIAP - Laboratoire de l'IDIAP
Production scientifique et compétences > Euler Center for Signal Processing
Production scientifique et compétences > Thèses EPFL
Travail produit à l'EPFL
Publié
Thèses

Date de création de la notice 2018-02-15

Actions

Aperçu

Sélectionner le fichier :