Sparse Clustering for High-Dimensional Data

  • Abdullah Al-Shabili

Student thesis: Master's Thesis

Abstract

In the current era of digital revolution, the amount of data is continuously growing; in fact, studies show that as much as 90% of today's data were created in the last few years alone. One of the technology industry's great challenges is how to leverage the potential benefits of this enormous amount of data. Clustering analysis is one of the essential tasks in machine learning, since it aims to group data observations into several homogeneous groups. There are many practical applications of clustering embedded in our daily lives and our business decision making. However, traditional clustering algorithms are not efficient on exponentially growing data volumes and, specifically, on high-dimensional data. Clustering high-dimensional data, which have a large number of features compared with the number of data observations, poses a great challenge for the development of efficient machine learning algorithms. High-dimensional data are ubiquitous in various fields, such as computer vision, signal and image processing, and bioinformatics. Motivated by sparsity concepts, there have recently been great advances in tailoring traditional clustering approaches to high-dimensional data. The basic motivation behind these efficient clustering algorithms is the realization that, in practice, only a small fraction of the features is responsible for the clustering structure. For example, in gene expression data only a small fraction of the genes is responsible for some biological function. Hence, there is a need to identify the features relevant for clustering and to discard noisy features, which are a burden on the clustering algorithm; i.e., feature selection is required. Besides higher clustering accuracy, sparse clustering yields more interpretable results, because only the features responsible for differences between clusters are identified. In this work, we propose to use the Sparse-Group LASSO (SGL) penalty to exploit complex sparse models.
This model is capable of exploiting sparse group structures as well as within-group sparsity, which allows us to identify subsets of features that are noisy for clustering as a whole, as well as features that are noisy only for some cluster pairs. The proposed penalty is a generalization of penalties previously presented in the literature; i.e., existing algorithms can be viewed as special cases of our proposed algorithms. We apply the SGL penalty to derive three algorithms: SGL-Gaussian Mixture Models (SGL-GMM) with common covariance matrices, SGL-K-means, and SGL-Gaussian Mixture Models with cluster-specific covariance matrices (CS-SGL-GMM). We derive a modified Expectation-Maximization (EM) algorithm for each. Extensive experiments on both synthetic and real data validate the advantages of our algorithms over comparable algorithms from the literature.
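To make the penalty concrete, the sketch below computes a standard Sparse-Group LASSO penalty, a convex combination of an L1 term (within-group sparsity) and a group-lasso term (whole-group sparsity). The function name, the regularization parameters `lam` and `alpha`, and the group weighting by the square root of the group size are illustrative assumptions; the thesis' exact formulation inside the clustering objectives may differ.

```python
import numpy as np

def sgl_penalty(w, groups, lam=1.0, alpha=0.5):
    """Illustrative Sparse-Group LASSO penalty:

        lam * ( alpha * ||w||_1
                + (1 - alpha) * sum_g sqrt(p_g) * ||w_g||_2 )

    where p_g is the size of group g. The L1 term zeroes out
    individual features (within-group sparsity); the group term
    zeroes out whole feature groups at once.
    """
    l1 = np.sum(np.abs(w))                      # within-group sparsity
    group_term = sum(                           # whole-group sparsity
        np.sqrt(len(idx)) * np.linalg.norm(w[idx])
        for idx in groups
    )
    return lam * (alpha * l1 + (1.0 - alpha) * group_term)

# alpha = 1 recovers the plain LASSO and alpha = 0 the group LASSO,
# which is the sense in which SGL generalizes earlier penalties.
w = np.array([0.0, 0.0, 2.0, -1.0])
groups = [np.array([0, 1]), np.array([2, 3])]
print(sgl_penalty(w, groups, lam=1.0, alpha=0.5))
```

In the clustering algorithms above, a penalty of this form is added to the model's objective so that the estimated feature weights become sparse, performing feature selection during clustering.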
Date of Award: May 2017
Original language: American English

Keywords

  • Machine learning
  • clustering
  • K-means
  • Gaussian Mixture Models
  • sparsity
  • high-dimensional data
