ViT-LSTM synergy: a multi-feature approach for speaker identification and mask detection

Ali Bou Nassif, Ismail Shahin, Mohamed Bader, Abdelfatah Ahmed, Naoufel Werghi

Research output: Contribution to journal › Article › peer-review

1 Scopus citation

Abstract

The global health crisis caused by the COVID-19 pandemic has brought new challenges to speaker identification systems, particularly due to the acoustic alterations caused by the widespread use of face masks. Aiming to mitigate these distortions and improve the accuracy of speaker identification, this study introduces a novel two-level classification system, leveraging a unique integration of Vision Transformers (ViT) and Long Short-Term Memory (LSTM). This ViT-LSTM model was trained and tested on an extensive dataset composed of diverse speakers, both masked and unmasked, allowing a comprehensive evaluation of its capabilities. Our experimental results demonstrate remarkable improvements in speaker identification, with an accuracy score of 95.67%, significantly surpassing traditional and other deep learning-based methods. Moreover, our framework also shows considerable strength in detecting the presence of a mask, achieving an accuracy of 91.15% and outperforming existing state-of-the-art models. This study provides the first-ever benchmark for mask detection in the context of speaker identification, opening new pathways for research in this emerging area and presenting a robust solution for speaker identification in the era of face masks.

Original language: British English
Pages (from-to): 22569-22586
Number of pages: 18
Journal: Neural Computing and Applications
Volume: 36
Issue number: 35
State: Published - Dec 2024

Keywords

  • Convolutional neural networks (CNN)
  • Deep learning
  • Long short-term memory (LSTM)
  • Mask detection
  • Speaker identification
  • Two-level classification
  • Vision Transformers (ViT)
