Hardware-level Design Techniques for Energy-Efficient Inference of AI Models on Edge Devices

  • Huruy Tesfai

Student thesis: Doctoral Thesis

Abstract

Deep learning (DL) continues to revolutionize many fields, including computer vision, robotics, and natural language processing. The increasing demand for real-time processing and privacy preservation necessitates efficient Deep Neural Network (DNN) inference directly on edge devices. However, the computational demands of DNNs often limit deployment on resource-constrained devices at the edge. This thesis addresses some of these challenges by introducing a set of techniques for efficient low-power DL inference, enabling accurate deployment of DL models on resource-constrained edge devices.

The work is organized around three key contributions. First, we introduce a custom gradient estimation method for low-precision, hardware-friendly quantization schemes. As DNN models are usually over-parameterized, there are many smaller models that closely approximate the full model. One effective technique for addressing the associated energy and storage costs is low bit-width quantization. To benefit from quantization schemes that are easy to implement in hardware, such as power-of-two (POT), various gradient estimation methods were explored. We propose and implement a quantization-error-aware gradient estimation method that steers weight updates to be as close to the projection steps as possible. Moreover, the clipping or scaling coefficients of the quantization scheme are learned jointly with the model parameters to minimize quantization error. Per-channel quantization was also applied to the quantized models to minimize the accuracy degradation caused by the rigid resolution of POT quantization. We show that comparable accuracy can be achieved when using the proposed gradient estimation for POT quantization, even at precisions as low as 2 and 3 bits.
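To illustrate the forward projection step of a power-of-two scheme, the following is a minimal NumPy sketch. The level set, the scaling coefficient `alpha`, and the function name are illustrative assumptions rather than the thesis's exact formulation, and the custom gradient estimation used during training is not shown.

```python
import numpy as np

def pot_quantize(w, bits=3, alpha=1.0):
    """Project weights onto signed power-of-two (POT) levels.

    With b bits, one bit holds the sign and the rest select an exponent,
    giving levels alpha * {0, 2^0, 2^-1, ..., 2^-(2^(b-1) - 2)}.
    Illustrative level set; the thesis's exact scheme may differ.
    """
    n_exp = 2 ** (bits - 1) - 1                    # number of nonzero magnitudes
    levels = alpha * 2.0 ** (-np.arange(n_exp))    # [1, 0.5, 0.25, ...]
    levels = np.concatenate(([0.0], levels))
    # Project each |w| onto the nearest level, then restore the sign.
    idx = np.argmin(np.abs(np.abs(w)[..., None] - levels), axis=-1)
    return np.sign(w) * levels[idx]

w = np.array([0.9, -0.3, 0.04, -0.6])
print(pot_quantize(w, bits=3))   # nearest POT levels: 1.0, -0.25, 0.0, -0.5
```

Because multiplying by a POT level reduces to a bit shift in hardware, this projection is what makes the scheme inexpensive to implement; its rigid level spacing near zero is the "rigid resolution" that per-channel quantization mitigates.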

Second, to achieve further reductions in power consumption, we exploit the direct correlation between dynamic power in CMOS technology and the switching activity caused by data-bit transitions from one cycle to the next. This work introduces a one-time sorting and re-ordering scheme for the parameters of pre-trained models. This approach minimizes switching activity during weight fetches within matrix multiplication and convolution operations. The proposed re-ordering process ensures that the model's output remains mathematically equivalent, while eliminating the indexing overhead otherwise needed to keep track of the sorted filters. Minimizing switching activity proves to be an effective approach to curbing dynamic power consumption, depending on the compute architecture used, resulting in prolonged battery life. The proposed method has been validated on various pre-trained networks, including GoogLeNet, MobileNet, AlexNet, and SqueezeNet, among others.
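The effect of ordering on switching activity can be illustrated with a toy toggle count. The helper below is hypothetical and only demonstrates the metric: consecutive sorted weights differ in fewer bits, so fewer bus lines toggle per fetch. The thesis's one-time re-ordering additionally preserves mathematical equivalence of the model output and avoids indexing overhead, which this sketch does not capture.

```python
import numpy as np

def switching_activity(words):
    """Total bit toggles on a bus that streams `words` one per cycle.

    Each cycle's toggles are the Hamming distance (popcount of the XOR)
    between the current word and the previous one.
    """
    w = np.asarray(words, dtype=np.uint8)
    toggles = np.bitwise_xor(w[:-1], w[1:])
    return int(sum(bin(int(t)).count("1") for t in toggles))

rng = np.random.default_rng(0)
weights = rng.integers(0, 256, size=64, dtype=np.uint8)   # stand-in 8-bit weights
before = switching_activity(weights)
after = switching_activity(np.sort(weights))
print(before, after)   # sorted order yields fewer total toggles
```

Since dynamic power scales with the number of toggling lines per cycle, lowering this count directly lowers the dynamic power of the weight-fetch datapath.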

Finally, an energy-efficient systolic array was implemented using a multiplier design that allows the sharing of repetitive logic. The proposed encoding logic is shared across multiple processing elements (PEs) of the systolic array, reducing switching activity and, consequently, area and dynamic power consumption. Encoding is applied at the input of the array to remove repetitive logic inside the PE units. Compared with existing designs, the proposed approach achieves significant power savings: our radix-16 design demonstrates a 22% reduction in power consumption relative to an optimal DesignWare multiplier and a 19% reduction relative to a Booth radix-4 design. Furthermore, this work introduces a full tree of lower height with pre-computation of multiplicand multiples using multiplexer networks. The proposed design is more efficient, consuming 35% less power and requiring 13% less area than a standard Booth-recoded multiplier of comparable speed. Hence, the techniques presented in this thesis are suitable for deploying multiply-intensive applications such as DNNs and other algorithms.
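For context, the standard Booth radix-4 recoding that such multiplier designs build on can be sketched as a behavioral Python model. This is only the textbook recoding; the thesis's contributions, sharing the encoding logic across PEs, the radix-16 variant, and the multiplexer-based pre-computation of multiples, are not reproduced here.

```python
def booth_radix4_digits(y, bits=8):
    """Recode a two's-complement multiplier into radix-4 Booth digits.

    Each digit d_i is in {-2, -1, 0, 1, 2} and weights 4^i, so the product
    x*y = sum_i d_i * x * 4^i needs only shifted/negated multiples of x,
    roughly halving the number of partial products versus radix-2.
    """
    y &= (1 << bits) - 1          # view y as a two's-complement bit pattern
    padded = y << 1               # implicit 0 appended below the LSB
    table = {0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
             0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}
    # Scan overlapping 3-bit groups (y[i+1], y[i], y[i-1]), two bits per step.
    return [table[(padded >> i) & 0b111] for i in range(0, bits, 2)]

def booth_multiply(x, y, bits=8):
    """Multiply via Booth digits: a sum of shifted multiples of x."""
    return sum(d * x * (4 ** i)
               for i, d in enumerate(booth_radix4_digits(y, bits)))

print(booth_radix4_digits(13))   # digits [1, -1, 1, 0]: 1 - 4 + 16 = 13
print(booth_multiply(7, 13))     # 91
```

In hardware, each digit selects one pre-computed multiple of the multiplicand through a multiplexer; sharing that selection logic across PEs is what removes the repeated encoding hardware inside each unit.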
Date of Award: 4 Jul 2024
Original language: American English
Supervisor: Hani Saleh (Supervisor)

Keywords

  • Artificial Intelligence
  • Deep Neural Networks
  • Quantization
  • Hardware Accelerators
  • Systolic Array
  • Edge-AI
  • FPGA
  • ASIC
  • Edge Computing
