A Dynamic Sliding Window Based Tensor Communication Scheduling Framework for Distributed Deep Learning

  • Yunqi Gao
  • Bing Hu
  • Mahdi Boloursaz Mashhadi
  • Wei Wang
  • Rahim Tafazolli
  • Merouane Debbah

Research output: Contribution to journal › Article › peer-review

1 Scopus citation

Abstract

Simultaneous tensor communication can effectively improve the scalability of distributed deep learning on large clusters. However, communicating a fixed number of tensor blocks concurrently violates the priority-based scheduling strategy and cannot minimize communication overheads. In this paper, we propose a novel simultaneous tensor communication framework, namely D-Credit, which transmits tensor blocks based on dynamic sliding windows to minimize per-iteration time in distributed DNN training. We build the mathematical model of D-Credit in two phases: (1) the overlap of gradient communication and backward propagation, and (2) the overlap of gradient communication and forward computation. We derive the optimal window sizes for the second phase analytically, and develop a greedy algorithm to efficiently determine the dynamic window sizes for the first phase of D-Credit. We implement the D-Credit architecture on the PyTorch framework. Experimental results on two different GPU clusters demonstrate that, in terms of training speed, D-Credit achieves up to 1.26x, 1.21x, 1.48x and 1.53x speedup compared to ByteScheduler, DeAR, PyTorch-DDP and WFBP, respectively. In terms of energy consumption, D-Credit saves up to 17.8% and 25.1% of the training energy compared to ByteScheduler and WFBP, respectively.
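The core mechanism described above is a priority queue of gradient tensor blocks drained through a sliding window whose size changes between the backward-propagation and forward-computation phases. The following is a minimal, self-contained Python sketch of that idea, not the authors' implementation: the names `TensorBlock` and `window_size_for`, the placeholder window-size policy, and the completion model are all hypothetical; D-Credit derives its window sizes analytically and with a greedy algorithm, as the abstract states.

```python
# Illustrative sketch of dynamic sliding-window tensor scheduling.
# All names and the window-size policy here are hypothetical; they
# only mirror the high-level description in the abstract.

import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class TensorBlock:
    priority: int                        # lower = needed earlier in the forward pass
    name: str = field(compare=False)
    size_mb: float = field(compare=False)


def window_size_for(phase: str) -> int:
    """Placeholder policy: a small window during backward propagation
    (phase 1) to respect priorities, a wider one during forward
    computation (phase 2) to keep the link busy. D-Credit instead
    derives phase-2 sizes analytically and phase-1 sizes greedily."""
    return 2 if phase == "backward" else 4


def schedule(blocks, phase: str) -> list:
    """Transmit up to `window` highest-priority blocks concurrently,
    refilling the sliding window as transmissions complete."""
    ready = list(blocks)
    heapq.heapify(ready)                 # priority queue of pending blocks
    in_flight, completion_order = [], []
    while ready or in_flight:
        window = window_size_for(phase)
        # Fill the sliding window with the most urgent pending blocks.
        while ready and len(in_flight) < window:
            in_flight.append(heapq.heappop(ready))
        # Toy completion model: the smallest in-flight block finishes first.
        done = min(in_flight, key=lambda b: b.size_mb)
        in_flight.remove(done)
        completion_order.append(done.name)
    return completion_order


if __name__ == "__main__":
    grads = [TensorBlock(priority=i, name=f"layer{i}.grad", size_mb=4.0 + i)
             for i in range(6)]
    print(schedule(grads, phase="backward"))
```

With a window of size 1 this degenerates to strict priority scheduling (serialized transfers); with a very large window it approaches unconstrained concurrent communication, which is what the abstract argues breaks priority ordering. The dynamic window is the middle ground between the two.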

Original language: British English
Pages (from-to): 1080-1095
Number of pages: 16
Journal: IEEE Transactions on Network Science and Engineering
Volume: 12
Issue number: 2
State: Published - 2025

Keywords

  • communication scheduling
  • data parallelism
  • distributed deep learning
  • generative pre-trained transformer (GPT)
  • tensor partitioning

