Repository logo

Infoscience

  • English
  • French
Log In
Logo EPFL, École polytechnique fédérale de Lausanne

Infoscience

  • English
  • French
Log In
  1. Home
  2. Academic and Research Output
  3. EPFL thesis
  4. Novel Algorithms For Clustering
 
doctoral thesis

Novel Algorithms For Clustering

Newling, James Peter  
2018

Clustering is a method for discovering structure in data, widely used across many scientific disciplines. The two main clustering problems this dissertation considers are K-means and K-medoids. These are NP-hard problems in the number of samples and clusters, and both have well studied heuristic approximation algorithms. An example is Lloyd's algorithm for K-means, which is so widely used that it has become synonymous with the problem it attempts to solve.

A large part of this dissertation is about accelerating Lloyd's algorithm, and its mini-batch and K-medoids variants. The basic tool used to achieve these accelerations is the triangle inequality, which can be applied in a multitude of ways to eliminate costly distance calculations between data samples, as well as to reduce the number of comparisons of these distances.

The first effective use of the triangle inequality to accelerate K-means was by Elkan (2003) with novel refinements appearing more recently in Hamerly (2010). In Chapter 1 we extend these approaches. First, we show that by using centers stored from previous iterations, one can greatly reduce the number of sample-center distance computations, with substantial improvements in algorithm execution time. We then present an improvement over previous triangle inequality based algorithms for low-dimensions, which uses inter-center distances in a novel way.

Chapter 2 considers the use of the triangle inequality to accelerate the mini-batch variant of Lloyd's algorithm (Sculley, 2011). The main difficulty of incorporating triangle inequality bounding in this setting is that clusters can move significantly during the iterations in which a sample is unused, which makes triangle inequality bounding ineffective. We propose a modified sampling scheme to reduce the length of these periods of dormancy, and present an algorithm which achieves an order of magnitude acceleration over the standard mini-batch algorithm.

We then turn attention to the K-medoids problem. In Chapter 3 we focus on the specific problem of determining the medoid of a set. With N samples in R-d, we present a simple algorithm of complexity O(N^{3/2}) to determine the medoid. It is the first sub-quadratic algorithm for this problem when d > 1. The algorithm makes use of the triangle inequality to eliminate all but O(N^{1/2}) samples as potential medoid candidates.

Finally, in Chapter 4 we compare different K-medoids algorithms, and find that CLARANS, which iteratively replaces randomly selected centers with non-centers, avoids the local minima of other popular K-medoids algorithms. This motivates the use of CLARANS for initializing Lloyd's algorithm for K-means, which results in improved final energies as compared to K-means++ seeding. We use the triangle inequality to offset the increased computation required by CLARANS.

  • Files
  • Details
  • Metrics
Type
doctoral thesis
DOI
10.5075/epfl-thesis-8375
Author(s)
Newling, James Peter  
Advisors
Fleuret, François  
Jury

Prof. Robert West (président) ; Dr François Fleuret (directeur de thèse) ; Prof. Martin Jaggi, Prof. Stéphane Marchand-Maillet, Prof. Raphael Sznitman (rapporteurs)

Date Issued

2018

Publisher

EPFL

Publisher place

Lausanne

Public defense year

2018-02-15

Thesis number

8375

Total of pages

148

Subjects

clustering

•

triangle inequality

•

k-means

•

k-medoids

•

medoid

•

mini-batch

EPFL units
LIDIAP  
Faculty
STI  
School
IEL  
Doctoral School
EDIC  
Available on Infoscience
February 15, 2018
Use this identifier to reference this record
https://infoscience.epfl.ch/handle/20.500.14299/144853
Logo EPFL, École polytechnique fédérale de Lausanne
  • Contact
  • infoscience@epfl.ch

  • Follow us on Facebook
  • Follow us on Instagram
  • Follow us on LinkedIn
  • Follow us on X
  • Follow us on Youtube
AccessibilityLegal noticePrivacy policyCookie settingsEnd User AgreementGet helpFeedback

Infoscience is a service managed and provided by the Library and IT Services of EPFL. © EPFL, tous droits réservés