This thesis focuses on designing spectral tools for graph clustering in sublinear time. With the emergence of big data, many traditional polynomial-time and even linear-time algorithms have become prohibitively expensive. Processing modern datasets requires a new set of algorithms that compute with extremely constrained resources, i.e., \emph{sublinear algorithms}. Clustering is a well-known technique for solving large-scale optimization problems in a wide variety of domains, including machine learning, data science, and graph analysis~\cite{aydin2016distributed, rolnick2016geocuts, gargi2011large}.
Efficient sublinear solutions for fundamental graph clustering problems require going well beyond classic techniques.
In this thesis, we present an \emph{optimal} sublinear-time algorithm for the \textit{$k$-clusterability testing problem}, i.e., quickly determining whether a graph can be partitioned into at most $k$ expanders or is far from any such graph. This generalizes the well-studied problem of testing graph expansion. Classic results on testing $k$-clusterability either consider the expansion testing problem (i.e., $k=1$ vs.\ $k\geq 2$)~\cite{KaleS_SIAMJC11,NachmiasS10}, or address larger values of $k$ under the assumption that the gap between the conductances of accepted and rejected graphs is at least logarithmic in the size of the graph~\cite{CzumajPS_STOC15}. We overcome these barriers by developing novel spectral techniques based on analyzing the spectrum of the Gram matrix of
random walk transition probabilities. We complement our algorithm with a matching lower bound on the query complexity of testing $k$-clusterability, which improves upon the long-standing lower bound for testing graph expansion.
Furthermore, we extend this result from the \textit{property testing} framework to an efficient clustering algorithm in the \textit{local computation algorithm} (LCA) model. We focus on a popular variant of graph clustering in which the input graph $G$ can be partitioned into $k$ expanders, each with outer conductance bounded by $\epsilon$. We construct a small-space data structure that classifies the vertices of $G$ according to the cluster they belong to, answering each query in sublinear time. Our spectral clustering oracle incurs $O(\epsilon \log k)$ error per cluster for any $\epsilon \ll 1/\log k$. Our main contribution is a sublinear-time oracle that provides dot product access to the spectral embedding of the graph. We estimate dot products with high precision using an appropriate linear transformation of the Gram matrix of random walk transition probabilities. Finally, using dot product access to the spectral embedding, we design the spectral clustering oracle. At a high level, our approach amounts to hyperplane partitioning in the spectral embedding of the graph, but it crucially operates on a nested sequence of carefully defined subspaces of the spectral embedding to achieve per-cluster recovery guarantees.
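
To illustrate why the Gram matrix of random walk transition probabilities carries information about the spectral embedding, consider the following sketch. It rests on simplifying assumptions that are ours and are not stated above: $G$ is $d$-regular on $n$ vertices, and the walk is lazy, so that its transition matrix $M$ is symmetric with eigendecomposition $M = \sum_{i=1}^{n} \lambda_i u_i u_i^{\top}$ and $1 = \lambda_1 \geq \dots \geq \lambda_n \geq 0$. The symbols $p_x^t$, $f_x$, and the walk length $t$ are illustrative notation. Writing $p_x^t = M^t \mathbf{1}_x$ for the distribution of a $t$-step walk started at vertex $x$, we have
\[
  \langle p_x^t, p_y^t \rangle
  \;=\; \mathbf{1}_x^{\top} M^{2t} \mathbf{1}_y
  \;=\; \sum_{i=1}^{n} \lambda_i^{2t}\, u_i(x)\, u_i(y),
\]
whereas the spectral embedding $f_x = \bigl(u_1(x), \dots, u_k(x)\bigr)$ satisfies
\[
  \langle f_x, f_y \rangle \;=\; \sum_{i=1}^{k} u_i(x)\, u_i(y).
\]
The two expressions differ only in the weights $\lambda_i^{2t}$. When the top $k$ eigenvalues are close to $1$ and the remaining ones are bounded away from $1$, the terms with $i > k$ decay geometrically in $t$, and a linear transformation that rescales the top $k$ eigendirections by $\lambda_i^{-2t}$ would recover $\langle f_x, f_y \rangle$ from the Gram matrix up to a small error. This is a heuristic reading only; it does not specify the actual transformation or its error analysis.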