TY - JOUR
T1 - ViT-LSTM synergy: a multi-feature approach for speaker identification and mask detection
T2 - Neural Computing and Applications
AU - Nassif, Ali Bou
AU - Shahin, Ismail
AU - Bader, Mohamed
AU - Ahmed, Abdelfatah
AU - Werghi, Naoufel
N1 - Publisher Copyright:
© The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature 2024.
PY - 2024/12
Y1 - 2024/12
AB - The global health crisis caused by the COVID-19 pandemic has brought new challenges to speaker identification systems, particularly due to the acoustic alterations introduced by the widespread use of face masks. Aiming to mitigate these distortions and improve the accuracy of speaker identification, this study introduces a novel two-level classification system that leverages a unique integration of Vision Transformers (ViT) and Long Short-Term Memory (LSTM) networks. This ViT-LSTM model was trained and tested on an extensive dataset of diverse speakers, both masked and unmasked, allowing a comprehensive evaluation of its capabilities. Our experimental results demonstrate remarkable improvements in speaker identification, with an accuracy of 95.67%, significantly surpassing traditional and other deep learning-based methods. Moreover, our framework shows considerable strength in detecting the presence of a mask, achieving an accuracy of 91.15% and outperforming existing state-of-the-art models. This study provides the first benchmark for mask detection in the context of speaker identification, opening new pathways for research in this emerging area and presenting a robust solution for speaker identification in the era of face masks.
KW - Convolutional neural networks (CNN)
KW - Deep learning
KW - Long short-term memory (LSTM)
KW - Mask detection
KW - Speaker identification
KW - Two-level classification
KW - Vision Transformers (ViT)
UR - http://www.scopus.com/inward/record.url?scp=85204531206&partnerID=8YFLogxK
DO - 10.1007/s00521-024-10389-7
M3 - Article
AN - SCOPUS:85204531206
SN - 0941-0643
VL - 36
SP - 22569
EP - 22586
JO - Neural Computing and Applications
JF - Neural Computing and Applications
IS - 35
ER -