Semiparametric Subsampling and Data Condensation for Large-Scale Data Analytics

Omar Alhussein, Paul D. Yoo, Sami Muhaidat, Jie Liang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Subsampling is often used to reduce the complexity of large datasets. However, subsampling methods must ensure that the subsampled data are representative of the original dataset. Here, we introduce a new clustering-based data condensation (subsampling) framework for large datasets. The framework relies on stratified sampling, Voronoi diagrams, and variational Bayes-based Gaussian mixture clustering. We tested the proposed framework on three large imbalanced benchmark datasets, namely cod-RNA, ds1.10, and ds1.100. The efficiency and generality of the proposed framework were assessed by comparing the predictive performance of the reduced datasets with that of the original datasets using two machine-learning classifiers, namely the random forest and the radial basis function network. The evaluation metrics were accuracy, F-measure, and reduction percentage. We found that very high reduction percentages can be achieved using our new framework while maintaining satisfactory predictive performance.
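
As a rough illustration of two ideas named in the abstract, the sketch below shows stratified subsampling of an imbalanced dataset together with two of the evaluation metrics. It is not the authors' framework: plain per-class random sampling stands in for the Voronoi-diagram and variational Bayes Gaussian mixture steps, and the function names, the `fraction` parameter, and the formula used for reduction percentage (share of points removed) are assumptions made for this example. The F-measure here is the standard F1 score, the harmonic mean of precision and recall.

```python
import random
from collections import defaultdict


def stratified_subsample(labels, fraction, seed=0):
    """Return sorted indices keeping roughly `fraction` of each class.

    Hypothetical helper: per-class random sampling only, not the
    clustering-based condensation described in the abstract.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    kept = []
    for indices in by_class.values():
        # Keep at least one point per class so a minority class never vanishes.
        n_keep = max(1, round(fraction * len(indices)))
        kept.extend(rng.sample(indices, n_keep))
    return sorted(kept)


def reduction_percentage(n_original, n_reduced):
    """Share of the original points that were removed, in percent (assumed definition)."""
    return 100.0 * (1.0 - n_reduced / n_original)


def f_measure(tp, fp, fn):
    """F1 score computed from true positives, false positives, and false negatives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2.0 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy imbalanced labels: 90 majority-class points, 10 minority-class points.
    labels = [0] * 90 + [1] * 10
    kept = stratified_subsample(labels, fraction=0.1)
    print(len(kept), reduction_percentage(len(labels), len(kept)))
```

Sampling within each class separately is what keeps the class ratio of an imbalanced dataset roughly intact after reduction; sampling from the pooled data could drop most or all of the minority class.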

Original language: British English
Title of host publication: 2019 IEEE Canadian Conference of Electrical and Computer Engineering, CCECE 2019
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781728103198
State: Published - May 2019
Event: 2019 IEEE Canadian Conference of Electrical and Computer Engineering, CCECE 2019 - Edmonton, Canada
Duration: 5 May 2019 → 8 May 2019

Publication series

Name: 2019 IEEE Canadian Conference of Electrical and Computer Engineering, CCECE 2019

Conference

Conference: 2019 IEEE Canadian Conference of Electrical and Computer Engineering, CCECE 2019
Country/Territory: Canada
City: Edmonton
Period: 5/05/19 → 8/05/19

Keywords

  • clustering
  • data condensation
  • machine learning
  • subsampling
