Semi-supervised latent Dirichlet allocation and its application for document classification

Di Wang, Marcus Thint, Ahmad Al-Rubaie

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

29 Scopus citations

Abstract

Latent Dirichlet Allocation (LDA) is an unsupervised topic modeling method widely applied in natural language processing. However, standard LDA cannot use supervised labels to incorporate expert knowledge into the learning procedure. This paper describes a semi-supervised LDA (ssLDA) method that supports multiple topic labels per document, so that available expert knowledge can be incorporated during model construction. This improvement aligns the resulting model with human expectations for topic modeling and extraction. We apply ssLDA to the document classification problem on benchmark datasets. We investigate and compare how the size of the training set and the proportion of supervised data affect the final model structure and improve prediction accuracy.

Original language: British English
Title of host publication: Proceedings of the 2012 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology Workshops, WI-IAT 2012
Pages: 306-310
Number of pages: 5
DOIs
State: Published - 2012
Event: 2012 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology Workshops, WI-IAT 2012 - Macau, China
Duration: 4 Dec 2012 → 7 Dec 2012

Publication series

Name: Proceedings of the 2012 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology Workshops, WI-IAT 2012

Conference

Conference: 2012 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology Workshops, WI-IAT 2012
Country/Territory: China
City: Macau
Period: 4/12/12 → 7/12/12

Keywords

  • Latent Dirichlet allocation (LDA)
  • natural language processing
  • semi-supervised LDA
  • semi-supervised learning
  • supervised learning
  • unsupervised learning
