Abstract
Transformer models have become central to deep learning, driving advances in applications ranging from language understanding to image recognition. Despite their success, deploying these models in real-time applications, particularly on edge devices, poses significant challenges due to their computational intensity and memory demands. To overcome these challenges, we introduce a novel Hybrid Dynamic Pruning (HDP) technique, an efficient algorithm-architecture co-design approach that accelerates transformers by exploiting head sparsity, block sparsity, and approximation to reduce attention computations and memory accesses. Motivated by the substantial redundancy observed in attention scores and attention heads, we propose a novel integer-based block pruning method that prunes unimportant blocks of the attention matrix at run time, and an integer-based head pruning method that detects and prunes unimportant heads at an early stage at run time. We further propose an approximation method that reduces attention computations. To support these methods efficiently and with low latency, we propose the HDP Accelerator (HDPA), a co-processor architecture synthesized in two configurations, HDPA-edge and HDPA-server, to meet the needs of mobile and server platforms. Extensive experiments with different transformer models and benchmarks demonstrate that HDPA-server achieves 481× and 381× speedups in attention-layer computation over an Intel i7-1185G7 CPU and an NVIDIA T4 GPU, respectively. Compared to other state-of-the-art accelerators, HDPA achieves 1.26× to 2.08× higher throughput, 1.3× to 18× greater MAC efficiency, and 1.1× to 5.1× better energy efficiency when normalized to the same computational load.
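As context for the block-pruning idea summarized above, the sketch below shows how an integer-based importance estimate can gate which blocks of the attention matrix are computed exactly. It is a minimal, single-head NumPy illustration under assumed parameters (block size, threshold, 8-bit quantization, and always keeping diagonal blocks); it is not the paper's HDP algorithm or the HDPA hardware datapath.

```python
import numpy as np

def block_pruned_attention(Q, K, V, block=16, threshold=0.0):
    """Single-head attention with run-time block pruning (illustration only).

    A cheap integer estimate of each block of Q @ K^T decides whether that
    block takes part in the exact attention computation. The block size,
    threshold, and 8-bit quantization scheme are hypothetical choices for
    this sketch, not the HDP/HDPA design.
    """
    n, d = Q.shape
    scores = np.full((n, n), -np.inf)  # pruned blocks stay at -inf (zero probability)

    # Low-precision copies used only for the importance estimate.
    Qq = np.clip(np.round(Q * 127 / np.abs(Q).max()), -127, 127).astype(np.int32)
    Kq = np.clip(np.round(K * 127 / np.abs(K).max()), -127, 127).astype(np.int32)

    for i in range(0, n, block):
        for j in range(0, n, block):
            est = Qq[i:i + block] @ Kq[j:j + block].T  # integer-only estimate
            if i != j and est.mean() <= threshold:
                continue  # skip (prune) unimportant off-diagonal blocks
            # Important blocks get the exact scaled dot-product scores.
            scores[i:i + block, j:j + block] = (
                Q[i:i + block] @ K[j:j + block].T / np.sqrt(d)
            )

    # Row-wise softmax; pruned entries contribute zero probability.
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ V

# Toy usage: 64 tokens, 32-dimensional head.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((64, 32)) for _ in range(3))
out = block_pruned_attention(Q, K, V)
```

The point of the sketch is that the pruning decision itself uses only integer arithmetic, so the expensive floating-point score, softmax, and value-accumulation work is spent only on blocks judged important.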
| Original language | British English |
|---|---|
| Journal | IEEE Transactions on Artificial Intelligence |
| DOIs | |
| State | Accepted/In press - 2025 |
UN SDGs
This output contributes to the following UN Sustainable Development Goals (SDGs)
- SDG 7 Affordable and Clean Energy
Keywords
- approximation
- dynamic pruning
- hardware acceleration
- self-attention
- transformer