Abstract
Geo-Distributed Machine Learning (GDML) aims to train large-scale machine learning models across geographically dispersed datacenters. However, the performance of GDML systems is constrained by limited Wide Area Network (WAN) bandwidth and by the straggler problem. Existing GDML designs often have conflicting effects when addressing these two challenges, while in-network computing approaches are typically restricted to a single datacenter rather than the more complex GDML setting. To overcome these limitations, this paper proposes L3DML, which facilitates GDML using P4-based Software-Defined Networking (SDN). Our approach incorporates three key innovations. First, we introduce a novel network addressing scheme that enables location-specific in-network gradient aggregation for GDML, eliminating the need for parameter servers. Second, we use the P4 data plane to integrate lossless gradient transmission within switches. Third, we address the straggler problem with a dedicated set of Deep Reinforcement Learning (DRL) models and a corresponding rate synchronization routing approach. L3DML is implemented on a prototype system consisting of several Intel Tofino switches and the Spirent network emulator. Experimental results indicate that L3DML outperforms existing solutions in goodput, model accuracy, and training speedup for large-scale GDML.
Recommended Citation
X. Hou et al., "L3DML: Facilitating Geo-Distributed Machine Learning in Network Layer," IEEE Transactions on Network and Service Management, Institute of Electrical and Electronics Engineers, Jan 2024.
The definitive version is available at https://doi.org/10.1109/TNSM.2024.3509031
Department(s)
Computer Science
Publication Status
Early Access
Keywords and Phrases
Deep reinforcement learning; Distributed training; In-network computing; Network-layer addressing; P4
International Standard Serial Number (ISSN)
1932-4537
Document Type
Article - Journal
Document Version
Citation
File Type
text
Language(s)
English
Rights
© 2025 Institute of Electrical and Electronics Engineers. All rights reserved.
Publication Date
01 Jan 2024
Comments
Natural Science Foundation of Beijing Municipality, Grant 4242008