Abstract
Vision-Language Models (VLMs) have recently advanced Visual Object Tracking (VOT) performance. In a VLM, a vision encoder produces visual representations, and a text encoder computes textual embeddings from natural language descriptions. By aligning the visual and textual representations, VLMs achieve robust performance in complex and diverse tracking scenarios, effectively handling challenges such as motion blur, occlusion, fast motion, and similar-object distractors. However, the input textual description in many existing VLM-based trackers incorporates class and semantic details without any contextual information. Some recent VLM-based state-of-the-art (SOTA) trackers address this by implicitly predicting important attributes of the target object and encoding them as textual descriptions within the tracking paradigm. However, these SOTA methods neglect the contextual relationship among the predicted attributes. In this work, we propose an Attention-based Image-Text alignment Tracker (AITtrack) for robust VOT. AITtrack simplifies VLM-based tracking using attention-based visual and textual alignment modules. It utilizes a region-of-interest (ROI) text-guided encoder, which leverages existing pre-trained language models to implicitly extract and encode textual features, and a simple image encoder to encode visual features. A simple alignment module combines the encoded visual and textual features, thereby inherently exposing the semantic relationship between the template and search frames and their surroundings, and providing rich encodings for improved tracking performance. We employ a simple decoder that takes past predictions as spatiotemporal clues to effectively model target appearance changes without the need for complex customized post-processing and prediction heads.
Extensive experiments on six publicly available VOT benchmark datasets demonstrate the strong capabilities of AITtrack, with an average success-rate gain of 2.0%.
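As an illustration of what an attention-based image-text alignment step can look like, the following is a minimal sketch using single-head cross-attention, in which visual tokens act as queries and text embeddings act as keys and values. It is not the authors' implementation: the function name `cross_attention_align`, the residual fusion, the random projection matrices, and all dimensions are assumptions chosen only to make the example runnable.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention_align(visual, text, w_q, w_k, w_v):
    """Fuse text features into visual tokens (hypothetical helper).

    visual: (n_vis, d) visual tokens, e.g. template and search patches
    text:   (n_txt, d) text-token embeddings
    w_q, w_k, w_v: (d, d) projection matrices (random here, learned in practice)
    """
    q = visual @ w_q                          # queries come from the image side
    k = text @ w_k                            # keys come from the text side
    v = text @ w_v                            # values come from the text side
    scores = q @ k.T / np.sqrt(q.shape[-1])   # scaled dot-product similarity
    weights = softmax(scores, axis=-1)        # (n_vis, n_txt); each row sums to 1
    fused = visual + weights @ v              # residual fusion (an assumption)
    return fused, weights


# Toy inputs: 64 visual tokens and 4 text tokens with 16-dimensional features.
rng = np.random.default_rng(0)
d, n_vis, n_txt = 16, 64, 4
visual = rng.normal(size=(n_vis, d))
text = rng.normal(size=(n_txt, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

fused, weights = cross_attention_align(visual, text, w_q, w_k, w_v)
```

Each row of `weights` indicates how strongly one visual token attends to each text token, and `fused` keeps the shape of the visual input so that a downstream decoder could consume it unchanged.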
| Original language | British English |
|---|---|
| Pages (from-to) | 67095-67111 |
| Number of pages | 17 |
| Journal | IEEE Access |
| Volume | 13 |
| State | Published - 2025 |
Keywords
- autoregressive tracking
- multi-modal tracking
- vision transformer
- vision-language model
- visual object tracking
Title
AITtrack: Attention-Based Image-Text Alignment for Visual Tracking