Repository logo

Infoscience

  • English
  • French
Log In
Logo EPFL, École polytechnique fédérale de Lausanne

Infoscience

  • English
  • French
Log In
  1. Home
  2. Academic and Research Output
  3. Conferences, Workshops, Symposiums, and Seminars
  4. epsilon-Coresets for Clustering (with Outliers) in Doubling Metrics
 
conference paper

epsilon-Coresets for Clustering (with Outliers) in Doubling Metrics

Huang, Lingxiao  
•
Jiang, Shaofeng H. -C.
•
Li, Jian
Show more
January 1, 2018
2018 Ieee 59Th Annual Symposium On Foundations Of Computer Science (Focs)
59th IEEE Annual Symposium on Foundations of Computer Science (FOCS)

We study the problem of constructing epsilon-coresets for the (k, z)-clustering problem in a doubling metric M(X, d). An epsilon-coreset is a weighted subset S subset of X with weight function w : S -> R->= 0, such that for any k-subset C is an element of X, it holds that Sigma(x is an element of S) w(x).d(z) (x, C) is an element of (1 +/- epsilon) . Sigma(x is an element of X) d(z) (x, C).

We present an efficient algorithm that constructs an epsilon-coreset for the (k, z)-clustering problem in M(X, d), where the size of the coreset only depends on the parameters k, z, epsilon and the doubling dimension ddim(M). To the best of our knowledge, this is the first efficient epsilon-coreset construction of size independent of vertical bar X vertical bar for general clustering problems in doubling metrics.

To this end, we establish the first relation between the doubling dimension of M(X, d) and the shattering dimension (or VC-dimension) of the range space induced by the distance d. Such a relation is not known before, since one can easily construct instances in which neither one can be bounded by (some function of) the other. Surprisingly, we show that if we allow a small (1 +/- epsilon)-distortion of the distance function d (the distorted distance is called the smoothed distance function), the shattering dimension can be upper bounded by O(epsilon(-O(ddim(M)))). For the purpose of coreset construction, the above bound does not suffice as it only works for unweighted spaces. Therefore, we introduce the notion of tau-error probabilistic shattering dimension, and prove a (drastically better) upper bound of O(ddim(M).log(1/epsilon)+log log 1/tau) for the probabilistic shattering dimension for weighted doubling metrics. As it turns out, an upper bound for the probabilistic shattering dimension is enough for constructing a small coreset. We believe the new relation between doubling and shattering dimensions is of independent interest and may find other applications.

Furthermore, we study robust coresets for (k, z)-clustering with outliers in a doubling metric. We show an improved connection between alpha-approximation and robust coresets. This also leads to improvement upon the previous best known bound of the size of robust coreset for Euclidean space [Feldman and Langberg, STOC 11]. The new bound entails a few new results in clustering and property testing.

As another application, we show constant-sized (epsilon, k, z)-centroid sets in doubling metrics can be constructed by extending our coreset construction. Prior to our result, constant-sized centroid sets for general clustering problems were only known for Euclidean spaces. We can apply our centroid set to accelerate the local search algorithm (studied in [Friggstad et al., FOCS 2016]) for the (k, z)-clustering problem in doubling metrics.

  • Details
  • Metrics
Type
conference paper
DOI
10.1109/FOCS.2018.00082
Web of Science ID

WOS:000455014500073

Author(s)
Huang, Lingxiao  
Jiang, Shaofeng H. -C.
Li, Jian
Wu, Xuan
Date Issued

2018-01-01

Publisher

IEEE COMPUTER SOC

Publisher place

Los Alamitos

Published in
2018 Ieee 59Th Annual Symposium On Foundations Of Computer Science (Focs)
ISBN of the book

978-1-5386-4230-6

Series title/Series vol.

Annual IEEE Symposium on Foundations of Computer Science

Start page

814

End page

825

Subjects

Computer Science, Theory & Methods

•

Computer Science

•

coreset

•

clustering

•

doubling dimension

•

centroid set

•

outlier

•

hop-diameter

•

k-means

•

spanners

•

ptas

Editorial or Peer reviewed

REVIEWED

Written at

EPFL

EPFL units
LTHC  
Event nameEvent placeEvent date
59th IEEE Annual Symposium on Foundations of Computer Science (FOCS)

Paris, FRANCE

Oct 07-09, 2018

Available on Infoscience
January 23, 2019
Use this identifier to reference this record
https://infoscience.epfl.ch/handle/20.500.14299/153783
Logo EPFL, École polytechnique fédérale de Lausanne
  • Contact
  • infoscience@epfl.ch

  • Follow us on Facebook
  • Follow us on Instagram
  • Follow us on LinkedIn
  • Follow us on X
  • Follow us on Youtube
AccessibilityLegal noticePrivacy policyCookie settingsEnd User AgreementGet helpFeedback

Infoscience is a service managed and provided by the Library and IT Services of EPFL. © EPFL, tous droits réservés