A survey of the vision transformers and their CNN-transformer based variants

A. Khan, Z. Rauf, A. Sohail, A. Khan, H. Asif, A. Asif, U. Farooq

    Research output: Contribution to journal › Article › peer-review

    8 Scopus citations


    Vision transformers have become popular as a possible substitute for convolutional neural networks (CNNs) in a variety of computer vision applications. These transformers, with their ability to focus on global relationships in images, offer large learning capacity. However, they may suffer from limited generalization, as they do not tend to model local correlations in images. Recently, hybrids of the convolution operation and the self-attention mechanism have emerged in vision transformers, exploiting both local and global image representations. These hybrid vision transformers, also referred to as CNN-Transformer architectures, have demonstrated remarkable results in vision applications. Given the rapidly growing number of hybrid vision transformers, it has become necessary to provide a taxonomy and explanation of these hybrid architectures. This survey presents a taxonomy of recent vision transformer architectures and, more specifically, of hybrid vision transformers. Additionally, the key features of these architectures, such as attention mechanisms, positional embeddings, multi-scale processing, and convolution, are discussed. In contrast to previous survey papers that focus primarily on individual vision transformer architectures or CNNs, this survey uniquely emphasizes the emerging trend of hybrid vision transformers. By showcasing the potential of hybrid vision transformers to deliver exceptional performance across a range of computer vision tasks, this survey sheds light on the future directions of this rapidly evolving class of architectures. © 2023, The Author(s), under exclusive licence to Springer Nature B.V.
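The abstract's central idea — pairing a convolution (local mixing of neighboring tokens) with self-attention (global mixing across all tokens) inside one block — can be sketched in a few lines of NumPy. This is an illustrative toy, not any specific architecture from the survey; all function names, the residual-connection layout, and the 1-D depthwise kernel are assumptions chosen for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def local_conv(tokens, kernel):
    # Depthwise 1-D convolution over the token sequence:
    # each token is mixed only with its immediate neighbors (local correlation).
    n, _ = tokens.shape
    k = len(kernel)
    padded = np.pad(tokens, ((k // 2, k // 2), (0, 0)))
    out = np.zeros_like(tokens)
    for i in range(n):
        out[i] = sum(kernel[j] * padded[i + j] for j in range(k))
    return out

def self_attention(tokens, Wq, Wk, Wv):
    # Global mixing: every token attends to every other token.
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores) @ v

def hybrid_block(tokens, kernel, Wq, Wk, Wv):
    # A hypothetical CNN-Transformer block: convolution captures local
    # structure, attention captures global relationships; residual
    # connections combine the two representations.
    x = tokens + local_conv(tokens, kernel)
    return x + self_attention(x, Wq, Wk, Wv)

rng = np.random.default_rng(0)
n, d = 16, 8                       # 16 image tokens, 8-dim embeddings
tokens = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = hybrid_block(tokens, np.array([0.25, 0.5, 0.25]), Wq, Wk, Wv)
print(out.shape)  # (16, 8)
```

Real hybrid designs differ in where the convolution sits (patch embedding stem, inside attention, or as parallel branches), but the local-then-global composition above is the shared pattern the survey's taxonomy organizes.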
    Original language: British English
    Pages (from-to): 2917-2970
    Number of pages: 54
    Journal: Artificial Intelligence Review
    State: Published - 2023


    • Attention mechanisms
    • Auto encoder
    • Channel boosting
    • Computer vision
    • Convolution
    • Convolutional neural networks
    • Deep learning
    • Deep neural networks
    • Hybrid vision transformers
    • Image processing
    • Image representation
    • Network architecture
    • Self-attention
    • Taxonomies
    • Transformer


