ELTrack: Events-Language Description for Visual Object Tracking

Mohamad Yousif Abdulkareem Alansari, Khaled Alnuaimi, Sara Alansari, Sajid Javed

Research output: Contribution to journal › Article › peer-review

1 Scopus citation

Abstract

The integration of Natural Language (NL) descriptions into Visual Object Tracking (VOT) has shown promise in enhancing RGB-based tracking by providing richer, contextually aware information that helps to address issues such as appearance variations, model drift, and ambiguous target representation. However, the growing complexity of VOT tasks, particularly in scenarios involving fast-moving objects and challenging lighting conditions, calls for more robust and adaptable tracking frameworks. Traditional visual trackers, which rely solely on RGB data, often struggle under these conditions. Event cameras offer a promising alternative: because of their very low latency, they capture changes in a scene as they happen, making them highly effective where conventional visible-light cameras fail, such as in low-light environments or when tracking rapid motion. Despite progress in both event-based and NL tracking, the fusion of events and NL remains underexplored due to the lack of large-scale NL-described datasets and event-based benchmarks. To address these gaps, we present ELTrack, a novel multi-modal NL-based VOT framework that, to the best of our knowledge, is the first to integrate event data with NL descriptions in VOT. ELTrack synthesizes event data, filters out noise, and applies imprinting and a step decay function to introduce a novel event image representation called Pseudo-Frames. Additionally, we generate NL descriptions using a Visual-Language (VL) image-captioning module featuring BLIP-2 and GPT-4. These modalities are integrated through a superimpose fusion module to enhance tracking performance. ELTrack is a generic pipeline that can be combined with any existing state-of-the-art (SoTA) tracker. Extensive experiments demonstrate that ELTrack achieves significantly better performance across a variety of publicly available VOT datasets. The source code of ELTrack is publicly available at: https://github.com/HamadYA/ELTrack-Correlating-Events-and-Language-for-Visual-Tracking.
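The abstract describes the Pseudo-Frame construction (event accumulation with imprinting and a step decay) and the superimpose fusion only at a high level. The minimal Python sketch below illustrates one plausible reading of those two steps; all function names, parameter values, and the exact decay and blending schemes are assumptions made for illustration and are not taken from the ELTrack implementation.

import numpy as np

def pseudo_frame(events, height, width, t_ref, bin_ms=10.0, decay=0.5, n_steps=4):
    """Render events (t_us, x, y, polarity) into an intensity image with step decay.

    Hypothetical reading of the Pseudo-Frame idea: events within the
    accumulation window are imprinted onto an image, and older events are
    attenuated in discrete steps (decay ** step) rather than continuously.
    """
    frame = np.zeros((height, width), dtype=np.float32)
    for t, x, y, p in events:
        age_ms = (t_ref - t) / 1000.0
        if age_ms < 0 or age_ms >= bin_ms * n_steps:
            continue  # event falls outside the accumulation window
        step = int(age_ms // bin_ms)   # which decay step the event falls into
        weight = decay ** step         # step decay: older events contribute less
        frame[int(y), int(x)] += weight * (1.0 if p > 0 else -1.0)
    # normalize signed activity to [0, 1] so it can be imprinted on the RGB frame
    span = np.abs(frame).max()
    if span == 0:
        return np.full_like(frame, 0.5)
    return 0.5 + frame / (2.0 * span)

def superimpose(rgb, event_img, alpha=0.4):
    """Blend the event pseudo-frame into the RGB frame (simple superimpose fusion)."""
    fused = (1.0 - alpha) * rgb.astype(np.float32) + alpha * 255.0 * event_img[..., None]
    return fused.clip(0, 255).astype(np.uint8)

In this reading, the fused image keeps the RGB appearance while imprinting recent motion cues from the event stream; the actual ELTrack representation may differ in its windowing, normalization, and blending details, for which the linked repository is the authoritative reference.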

Original language: British English
Pages (from-to): 31351-31367
Number of pages: 17
Journal: IEEE Access
Volume: 13
DOIs
State: Published - 2025

Keywords

  • Event camera
  • multi-modal fusion
  • neuromorphic vision
  • visual object tracking (VOT)
  • visual-language object tracking
