A Hierarchical Word Mover’s Distance

  • Hazem Lashen

Student thesis: Master's Thesis

Abstract

This thesis introduces a novel hierarchical approach to computing document distances, addressing limitations in traditional methods such as Bag-of-Words (BoW) and Word Mover’s Distance (WMD). Our method combines hierarchical clustering with word embeddings to create a multi-level word hierarchy. The primary contribution is a technique that represents documents as weighted trees within this hierarchy. By propagating weights across the tree structure and tracking both the propagation work and the weight distribution, we create a rich document representation. This representation forms the basis for a new document distance metric. We evaluate our method on document classification tasks, where it outperforms the BoW and WMD baselines. It also offers a significant speedup over the ubiquitous WMD baseline, with an improvement of up to 7.5% in accuracy, a 10.7% improvement in precision, and a 14.3% improvement in F1 score on the Ohsumed benchmark. Additionally, we present an alternative formulation based solely on tree structure, which outperforms all baselines, including Term Frequency-Inverse Document Frequency. Our findings suggest that our method offers a nuanced and effective approach to capturing semantic relationships between documents, with potential applications in information retrieval, text clustering, and other natural language processing tasks.
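The abstract does not spell out the formulation, so the following is only a minimal sketch of the general idea of comparing documents as weighted trees, not the thesis's method. It swaps in the standard closed form for optimal transport on a tree metric (sum over edges of edge length times the absolute difference in subtree mass). The hierarchy `PARENT`, its edge lengths, and all function names are hypothetical; in the thesis the hierarchy is obtained by clustering word embeddings rather than written by hand.

```python
from collections import Counter

# Hypothetical toy hierarchy: child -> (parent, edge_length).
# Assumption: hand-written here; the thesis builds it from word embeddings.
PARENT = {
    "cat": ("animal", 1.0),
    "dog": ("animal", 1.0),
    "car": ("vehicle", 1.0),
    "bus": ("vehicle", 1.0),
    "animal": ("root", 2.0),
    "vehicle": ("root", 2.0),
}


def leaf_weights(tokens):
    """Normalized bag-of-words weights over in-vocabulary tokens."""
    counts = Counter(t for t in tokens if t in PARENT)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def subtree_mass(weights):
    """Propagate each word's weight to itself and every ancestor below the root."""
    mass = Counter()
    for word, weight in weights.items():
        node = word
        while node in PARENT:
            mass[node] += weight
            node = PARENT[node][0]
    return mass


def tree_distance(doc_a, doc_b):
    """Tree-metric transport cost: edge length times subtree mass difference."""
    mass_a = subtree_mass(leaf_weights(doc_a))
    mass_b = subtree_mass(leaf_weights(doc_b))
    nodes = set(mass_a) | set(mass_b)
    return sum(PARENT[n][1] * abs(mass_a[n] - mass_b[n]) for n in nodes)


if __name__ == "__main__":
    print(tree_distance(["cat"], ["dog"]))  # words under the same parent
    print(tree_distance(["cat"], ["car"]))  # words under different parents
```

In this sketch, words that share a parent cluster end up closer than words whose paths only meet at the root, and the cost is computed in a single pass over the tree nodes rather than by solving a transport problem between all word pairs.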
Date of Award: 10 Dec 2024
Original language: American English
Supervisor: Andreas Henschel

Keywords

  • Document Distance Metrics
  • Document Classification
  • Optimal Transport
  • Earth Mover’s Distance
  • Machine Learning
  • Artificial Intelligence
