TY - GEN
T1 - CLIFS
T2 - 31st IEEE International Conference on Image Processing, ICIP 2024
AU - Ahmed, Abdelfatah
AU - Velayudhan, Divya
AU - ElMezain, Mahmoud
AU - Alradi, Muaz Khalifa
AU - Boudiaf, Abderrahmene
AU - Hassan, Taimur
AU - Deriche, Mohamed
AU - Bennamoun, Mohammed
AU - Werghi, Naoufel
N1 - Publisher Copyright: © 2024 IEEE.
PY - 2024
Y1 - 2024
AB - Baggage screening in airports is a cornerstone of airport security. The advent of computer vision technologies in recent years has led to the development of several automated systems for identifying security threats in baggage scans. However, existing methods struggle to adapt to new threat categories when faced with a scarcity of data samples and the rapid emergence of new threats. Hence, in this paper, we propose a novel CLIP-driven few-shot framework (CLIFS) to explore the potential of multi-modality, using text-image fusion through contrastive learning to learn relevant contextual features for recognizing security threats with limited samples. By integrating features from GPT-4-generated captions with image features, CLIFS leverages both visual and textual data to significantly improve threat classification performance in a few-shot learning context. Our proposed CLIFS was rigorously tested on the publicly available SIXray baggage X-ray dataset, where it outperformed the state of the art by 31.3% in accuracy and 28.40% in F1-score in the challenging 5-shot scenario, demonstrating its robustness and effectiveness in classifying threats from limited data samples.
KW - Baggage threat classification
KW - Contrastive Language Image Pretraining
KW - Contrastive Loss
KW - Few-shot learning
KW - Vision-Language Model
UR - https://www.scopus.com/pages/publications/85204712172
U2 - 10.1109/ICIP51287.2024.10647879
DO - 10.1109/ICIP51287.2024.10647879
M3 - Conference contribution
AN - SCOPUS:85204712172
T3 - Proceedings - International Conference on Image Processing, ICIP
SP - 753
EP - 759
BT - 2024 IEEE International Conference on Image Processing, ICIP 2024 - Proceedings
PB - IEEE Computer Society
Y2 - 27 October 2024 through 30 October 2024
ER -