Abstract

Geo-Distributed Machine Learning (GDML) aims to train large-scale machine learning models across geographically dispersed datacenters. However, the performance of GDML systems is constrained by limited Wide Area Network (WAN) bandwidth and by the straggler problem. Existing GDML designs often produce conflicting effects when addressing these two challenges, while existing in-network computing efforts are typically restricted to single-datacenter environments rather than the more complex GDML setting. To overcome these limitations, this paper proposes L3DML to facilitate GDML using P4-based Software-Defined Networking (SDN). Our approach incorporates three key innovations. First, we introduce a novel network addressing scheme that enables location-specific in-network gradient aggregation for GDML, eliminating the need for parameter servers. Second, we leverage the P4 data plane to implement lossless gradient transmission within switches. Third, we address the straggler problem by employing a dedicated Deep Reinforcement Learning (DRL) model set and a corresponding rate-synchronization routing approach. L3DML is implemented on a prototype system consisting of several Intel Tofino switches and the Spirent network emulator. Experimental results indicate that L3DML outperforms existing solutions in terms of goodput, model accuracy, and training speedup for large-scale GDML.
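The following is a minimal host-side sketch of the general idea of in-network gradient aggregation that the abstract refers to: gradients are quantized to integers, summed at an aggregation point as fragments arrive, and the aggregate is returned to all workers, removing the parameter server from the path. The class name, fixed-point scale, and control flow are illustrative assumptions for exposition only and are not L3DML's P4 data-plane implementation.

import numpy as np

SCALE = 2 ** 16  # assumed fixed-point scale; programmable switches add integers, not floats

def quantize(grad: np.ndarray) -> np.ndarray:
    """Convert a float gradient fragment to integers so a switch ALU could sum it."""
    return np.round(grad * SCALE).astype(np.int64)

def dequantize(agg: np.ndarray, num_workers: int) -> np.ndarray:
    """Recover the averaged float gradient from the integer aggregate."""
    return agg.astype(np.float64) / (SCALE * num_workers)

class ToySwitchAggregator:
    """Illustrative stand-in for an in-network aggregation point (not L3DML's pipeline)."""
    def __init__(self, num_workers: int, size: int):
        self.num_workers = num_workers
        self.buffer = np.zeros(size, dtype=np.int64)
        self.seen = 0

    def receive(self, int_grad: np.ndarray):
        # Accumulate integer gradient fragments as they arrive from workers.
        self.buffer += int_grad
        self.seen += 1
        if self.seen == self.num_workers:
            # All fragments arrived: return ("multicast") the aggregate and reset state.
            agg = self.buffer.copy()
            self.buffer = np.zeros_like(self.buffer)
            self.seen = 0
            return agg
        return None

# Usage: three workers push gradients for the same model slice.
rng = np.random.default_rng(0)
grads = [rng.normal(size=8) for _ in range(3)]
switch = ToySwitchAggregator(num_workers=3, size=8)
for g in grads:
    result = switch.receive(quantize(g))
averaged = dequantize(result, num_workers=3)
assert np.allclose(averaged, np.mean(grads, axis=0), atol=1e-4)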

Department(s)

Computer Science

Publication Status

Early Access

Comments

Natural Science Foundation of Beijing Municipality, Grant 4242008

Keywords and Phrases

Deep reinforcement learning; Distributed training; In-network computing; Network-layer addressing; P4

International Standard Serial Number (ISSN)

1932-4537

Document Type

Article - Journal

Document Version

Citation

File Type

text

Language(s)

English

Rights

© 2025 Institute of Electrical and Electronics Engineers, All rights reserved.

Publication Date

01 Jan 2024
