TY - JOUR
T1 - A Multiplier-Free RNS-Based CNN Accelerator Exploiting Bit-Level Sparsity
AU - Sakellariou, Vasilis
AU - Paliouras, Vassilis
AU - Kouretas, Ioannis
AU - Saleh, Hani
AU - Stouraitis, Thanos
N1 - Publisher Copyright: © 2013 IEEE.
PY - 2024/4/1
Y1 - 2024/4/1
AB - In this work, a Residue Numbering System (RNS)-based Convolutional Neural Network (CNN) accelerator utilizing a multiplier-free distributed-arithmetic Processing Element (PE) is proposed. A method for maximizing the utilization of the arithmetic hardware resources is presented. It leads to an increase of the system's throughput, by exploiting bit-level sparsity within the weight vectors. The proposed PE design takes advantage of the properties of RNS and Canonical Signed Digit (CSD) encoding to achieve higher energy efficiency and effective processing rate, without requiring any compression mechanism or introducing any approximation. An extensive design space exploration for various parameters (RNS base, PE micro-architecture, encoding) using analytical models as well as experimental results from CNN benchmarks is conducted and the various trade-offs are analyzed. A complete end-to-end RNS accelerator is developed based on the proposed PE. The introduced accelerator is compared to traditional binary and RNS counterparts as well as to other state-of-the-art systems. Implementation results in a 22-nm process show that the proposed PE can lead to 1.85× and 1.54× more energy-efficient processing compared to binary and conventional RNS, respectively, with a 1.88× maximum increase of effective throughput for the employed benchmarks. Compared to a state-of-the-art, all-digital, RNS-based system, the proposed accelerator is 8.87× and 1.11× more energy- and area-efficient, respectively.
KW - AI hardware accelerator
KW - canonical signed digit
KW - RNS
UR - http://www.scopus.com/inward/record.url?scp=85167789269&partnerID=8YFLogxK
U2 - 10.1109/TETC.2023.3301590
DO - 10.1109/TETC.2023.3301590
M3 - Article
AN - SCOPUS:85167789269
SN - 2168-6750
VL - 12
SP - 667
EP - 683
JO - IEEE Transactions on Emerging Topics in Computing
JF - IEEE Transactions on Emerging Topics in Computing
IS - 2
ER -