Fri, 25 Jan 2019

10:00 - 11:00
L5

Coresets for clustering very large datasets

Stephane Chretien
(NPL)
Abstract

Clustering is a very important task in data analytics and is usually addressed using (i) statistical tools based on maximum likelihood estimators for mixture models, (ii) techniques based on network models such as the stochastic block model, or (iii) relaxations of the K-means approach based on semi-definite programming (or even simpler spectral approaches). Statistical approaches of type (i) often cannot be solved with sufficient guarantees because of the non-convexity of the underlying cost function. The other two approaches, (ii) and (iii), are amenable to convex programming but do not usually scale to large datasets. In the big data setting, one usually needs to resort to data subsampling, a preprocessing stage also known as "coreset selection": a small, weighted subset of the data is chosen so that the clustering cost computed on the subset approximates the cost on the full dataset. We will present this latter approach and the problem of selecting a coreset for the special cases of K-means and spectral-type relaxations.
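
For concreteness, the sketch below (not taken from the talk) illustrates one common coreset construction for K-means: importance sampling in the spirit of "lightweight coresets", where points far from the data mean are kept with higher probability and reweighted so that the sampled cost is unbiased. The function name lightweight_coreset and the use of numpy are assumptions made purely for illustration.

    # Illustrative sketch only, not the speaker's construction:
    # a lightweight coreset for K-means via importance sampling.
    import numpy as np

    def lightweight_coreset(X, m, rng=None):
        """Return m sampled rows of X and their importance weights."""
        rng = np.random.default_rng(rng)
        n = X.shape[0]
        # Sensitivity proxy: squared distance to the overall data mean.
        dists = np.sum((X - X.mean(axis=0)) ** 2, axis=1)
        # Mix a uniform term with the distance term so every point has
        # a non-zero probability of being selected.
        q = 0.5 / n + 0.5 * dists / dists.sum()
        idx = rng.choice(n, size=m, replace=True, p=q)
        weights = 1.0 / (m * q[idx])  # unbiased reweighting
        return X[idx], weights

    # Usage: run weighted K-means on the coreset instead of the full dataset,
    # e.g. C, w = lightweight_coreset(X_big, m=5_000)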

 
