Algorithms For Clustering Problems : Theoretical Guarantees and Empirical Evaluations

Norouzi Fard, Ashkan

doi:10.5075/epfl-thesis-8546

doctoral thesis

Algorithms For Clustering Problems : Theoretical Guarantees and Empirical Evaluations

2018

Clustering is a classic topic in combinatorial optimization and plays a central role in many areas, including data science and machine learning. In this thesis, we first focus on the dynamic facility location problem (i.e., the facility location problem in evolving metrics). We present a new LP-rounding algorithm for facility location problems, which yields the first constant factor approximation algorithm for the dynamic facility location problem. Our algorithm installs competing exponential clocks on clients and facilities, and connects every client by the path that repeatedly follows the smallest clock in the neighborhood. The use of exponential clocks gives rise to several properties that distinguish our approach from previous LP-roundings for facility location problems. In particular, we use \emph{no clustering} and we enable clients to connect through paths of \emph{arbitrary lengths}. In fact, the clustering-free nature of our algorithm is crucial for applying our LP-rounding approach to the dynamic problem.

Furthermore, we present both empirical and theoretical aspects of the $k$-means problem. The best known algorithm for $k$-means with a provable guarantee is a simple local-search heuristic that yields an approximation guarantee of $9+\epsilon$, a ratio that is known to be tight with respect to such methods. We overcome this barrier by presenting a new primal-dual approach that enables us (1) to exploit the geometric structure of $k$-means and (2) to satisfy the hard constraint that at most $k$ clusters are selected without deteriorating the approximation guarantee. Our main result is a $6.357$-approximation algorithm with respect to the standard LP relaxation. Our techniques are quite general and we also show improved guarantees for the general version of $k$-means where the underlying metric is not required to be Euclidean and for $k$-median in Euclidean metrics.

We also improve the running time of our algorithm to almost linear running time and still maintain a provable guarantee. We compare our algorithm with {\sc K-Means++} (a widely studied algorithm) and show that we obtain better accuracy with comparable and even better running time.

Name

EPFL_TH8546.pdf

Access type

openaccess

Size

3.61 MB

Format

Adobe PDF

Checksum (MD5)

f5f163368fed0b257900db0f71e8b468