Person Identification from Noisy and Emotional Speech

Student thesis: Doctoral Thesis

Abstract

Speech signals are highly susceptible to emotional influences and acoustic interference, which poses significant challenges for real-time speech processing applications. Isolating the dominant signal from such external influences is crucial for reliable performance, and signal processing approaches alone cannot handle emotional, noise-corrupted speech effectively. From a machine learning perspective, person identification from such speech is a complex classification problem: classes must be identified reliably from audio samples that may be ambiguous even to human listeners, and their emotional content must be characterized accurately. To capture the subtle variations in speech that human listeners perceive readily, machine learning models must learn feature representations that account for the non-linearity of audio signals. In this research, we approach these signal processing problems as fine-grained classification tasks within the realm of artificial intelligence. A significant challenge in this field is the scarcity of data for training and evaluation; to address it, we employ cross-modal knowledge transfer from vision-based models, specifically VGG16 and the Vision Transformer (ViT). Our investigation focuses on learned feature extraction and cross-modality knowledge transfer to tackle these inherent challenges and improve speaker identification performance. We design an end-to-end transformer model tailored to emotional and noisy speech, trained and evaluated on four emotional speech datasets in English and Arabic, augmented with various noises. The other key contributions of this thesis include deep-sparse feature extraction, learned source segregation, CNN-based categorization, a CochleaWave filterbank, and speaker count estimation, all addressing critical challenges in processing real-world speech data. This work aims to advance the field by demonstrating the efficacy of our model in handling the dual challenges of noise and emotional variability, thereby improving the robustness and accuracy of speaker identification systems in practical applications.
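
The abstract names cross-modal transfer from VGG16 and ViT and noise-augmented training, but, as an abstract, gives no implementation details. The following is a minimal illustrative sketch of the general idea, not the thesis pipeline: waveforms are mixed with noise at a chosen SNR, converted to log-mel spectrograms, and fed to an ImageNet-pretrained VGG16 whose classification head is replaced for speaker classes. The 16 kHz sampling rate, 128 mel bands, 224×224 resizing, and NUM_SPEAKERS value are all hypothetical assumptions.

```python
# Illustrative sketch (not the thesis implementation): cross-modal transfer
# from a pretrained vision backbone to speaker identification on log-mel
# spectrograms, with additive-noise augmentation at a chosen SNR.
import torch
import torchaudio
import torchvision

def add_noise(speech: torch.Tensor, noise: torch.Tensor, snr_db: float) -> torch.Tensor:
    """Mix noise into speech at the requested signal-to-noise ratio (dB)."""
    noise = noise[..., : speech.shape[-1]]
    p_speech = speech.pow(2).mean()
    p_noise = noise.pow(2).mean().clamp(min=1e-10)
    scale = torch.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Log-mel front end; the channel is repeated 3x so the RGB-pretrained
# backbone accepts the spectrogram as an image.
mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=128)
to_db = torchaudio.transforms.AmplitudeToDB()

def spectrogram_image(waveform: torch.Tensor) -> torch.Tensor:
    spec = to_db(mel(waveform))                      # (1, n_mels, time)
    spec = torch.nn.functional.interpolate(
        spec.unsqueeze(0), size=(224, 224), mode="bilinear", align_corners=False
    ).squeeze(0)
    return spec.repeat(3, 1, 1)                      # (3, 224, 224)

NUM_SPEAKERS = 24  # hypothetical number of enrolled speakers
model = torchvision.models.vgg16(weights=torchvision.models.VGG16_Weights.IMAGENET1K_V1)
model.classifier[6] = torch.nn.Linear(4096, NUM_SPEAKERS)  # replace ImageNet head

# One forward pass on a noisy spectrogram "image".
speech = torch.randn(1, 16000)   # stand-in for a 1 s utterance at 16 kHz
noise = torch.randn(1, 16000)    # stand-in for a recorded noise clip
noisy = add_noise(speech, noise, snr_db=5.0)
logits = model(spectrogram_image(noisy).unsqueeze(0))  # (1, NUM_SPEAKERS)
```

A ViT backbone could be swapped in the same way, e.g. torchvision.models.vit_b_16 with its classification head replaced; the thesis's end-to-end transformer, CochleaWave filterbank, and other contributions are not represented here.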
Date of Award: 19 Dec 2024
Original language: American English
Supervisor: Naoufel Werghi

Keywords

  • Speaker identification
  • Transformer
  • Convolutional Neural Network
  • Emotion
  • Noise
