Skip to content

[draft] Elastic weight update setup and acceleration #1101#1188

Open
sjmshsh wants to merge 4 commits intoinclusionAI:mainfrom
sjmshsh:feature/elastic_weight_update
Open

[draft] Elastic weight update setup and acceleration #1101#1188
sjmshsh wants to merge 4 commits intoinclusionAI:mainfrom
sjmshsh:feature/elastic_weight_update

Conversation

@sjmshsh
Copy link
Copy Markdown

@sjmshsh sjmshsh commented Apr 15, 2026

Change Summary: Elastic Topology, Megatron Pipeline, and Archon Weight Sync

Background

This document summarizes the code changes completed for the remaining A / B / C work items:

  • A: auto-scaling awareness, health monitoring, and dynamic topology handling
  • B: Megatron distributed weight update pipelining
  • C: targeted regression test additions

It also records the hidden issues fixed along the way and the latest validation status.

Scope of Changes

A. Elastic topology and health monitoring

The inference side was extended so that remote inference workers can react to topology changes more safely.

Key additions:

  • New InferenceEngineConfig fields for health monitoring and topology control
  • Background health monitoring support in RemoteInfEngine
  • Automatic removal of unhealthy servers after consecutive failures
  • Automatic discovery of newly available servers
  • Topology change signaling for training-side group rebuild
  • Target-address filtering support for partial weight updates

Primary files:

  • areal/api/cli_args.py
  • areal/infra/remote_inf_engine.py

Implemented behavior:

  • enable_health_monitor
  • health_check_interval_seconds
  • health_check_failure_threshold
  • topology_change_cooldown_seconds
  • enable_topology_discovery
  • consume_group_rebuild_request()
  • get_last_topology_change_time()
  • active server tracking and selective target address resolution

Expected impact:

  • rollout workers can tolerate server churn more gracefully
  • unhealthy servers can be removed from the active set automatically
  • newly discovered servers can be synchronized and included later
  • topology rebuild decisions can be coordinated with less inconsistency

B. Megatron single-pending-bucket pipeline

Megatron distributed weight update flow was changed from a fully serialized pattern to a single pending bucket pipeline model.

Primary file:

  • areal/engine/megatron_engine.py

Key additions:

  • _PendingWeightUpdateBucket
  • _update_bucket_weights_from_distributed_async()
  • _wait_pending_weight_update_bucket()

What changed in practice:

  • the next bucket can issue the remote update request earlier
  • tensor broadcast can proceed asynchronously for the current bucket
  • the previous bucket is waited on in a controlled order
  • the control path and data path are no longer forced into a fully serialized sequence

Expected impact:

  • reduced idle gaps between HTTP control and XCCL communication
  • better overlap between request scheduling and weight broadcast
  • behavior closer to the already improved FSDP-style pipeline

C. Regression test additions

Targeted tests were added to cover the new behavior.

Primary files:

  • tests/test_rollout_controller.py
  • tests/test_megatron_engine.py

Added coverage:

  • unhealthy server deregistration after repeated health failures
  • discovery of a newly available inference server
  • target_server_addresses filtering logic
  • Megatron pending bucket wait order regression

Archon Improvements

Archon was also updated so that it participates in elastic topology handling rather than only carrying protocol-compatible metadata.

Primary file:

  • areal/experimental/engine/archon_weight_sync.py

Key changes:

  • added topology-aware rebuild flow via maybe_rebuild_weight_update_group()
  • added disk fallback sync for newly joined inference servers
  • added topology cooldown handling before immediate rebuild
  • added group destroy and recreate flow for updated server topology
  • synchronized topology rebuild decisions across ranks with broadcasted state

Expected impact:

  • Archon can follow topology changes in a way that is much closer to the main training engines
  • newly joined inference servers can be warmed up before entering the XCCL group
  • rebuild decisions are less likely to diverge across ranks

Important Hidden Issues Fixed

1. Disk fallback path source was incorrect

Problem:

  • some training-side fallback logic used self.config.fileroot
  • the training engine config does not reliably own that field

Fix:

  • fallback path resolution was changed to read from rollout_engine.config.fileroot

Affected areas:

  • areal/engine/fsdp_engine.py
  • areal/engine/megatron_engine.py
  • areal/experimental/engine/archon_weight_sync.py

Why it matters:

  • without this change, real topology expansion could fail during disk-based synchronization of newly added inference servers

2. Conditional barrier deadlock risk during rebuild

Problem:

  • previous rebuild logic could allow only part of the ranks to enter a barrier
  • that creates a real deadlock risk in distributed execution

Fix:

  • rebuild flow was normalized so all ranks enter the synchronization points consistently
  • only the designated rank or head performs group destroy operations
  • topology change decisions are aligned across ranks before rebuild

Why it matters:

  • reduces the chance of distributed hangs during server join/leave events
  • makes rebuild behavior more deterministic across the process group

Files Involved

Core implementation files:

  • areal/api/cli_args.py
  • areal/infra/remote_inf_engine.py
  • areal/engine/fsdp_engine.py
  • areal/engine/megatron_engine.py
  • areal/experimental/engine/archon_weight_sync.py

Test files:

  • tests/test_rollout_controller.py
  • tests/test_megatron_engine.py

Validation Status

Latest known status from this round:

  • targeted lint checks for the touched files were cleaned up
  • targeted test execution was attempted afterward
  • full automated verification was not completed in the current handoff flow

This document should therefore be treated as a change summary, not as a final release certification.

Remaining Gaps

The following items were not fully completed in this round:

  • AWEX still remains at the protocol-reserved stage and does not yet provide a true CUDA IPC or shared-memory implementation
  • topology discovery currently mainly reuses the existing name resolution path, and could be extended further if a stronger production registry exists
  • stronger timeout-based fault eviction and forced shrink/continue behavior for distributed rebuilds is still not fully closed-loop

Suggested Next Steps

If this work is continued, the highest-value follow-up items are:

  • add integration tests for join, cooldown, delayed rebuild, and unhealthy node removal flows
  • add a minimal runnable AWEX skeleton so the protocol path is complete even before a zero-copy implementation is introduced
  • validate topology rebuild behavior under a real multi-rank environment instead of only targeted local checks

@sjmshsh sjmshsh requested a review from garrett4wade as a code owner April 15, 2026 07:23
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the project title in the README.md file from 'AReaL' to 'AReaLs'. Feedback indicates that this change is likely a typo, as it introduces inconsistency with the rest of the documentation where the original name is used, and a suggestion has been provided to revert it.

Comment thread README.md Outdated
@@ -1,5 +1,5 @@
<h1 align="center">
<em>AReaL</em>: A Large-Scale Asynchronous Reinforcement Learning System
<em>AReaLs</em>: A Large-Scale Asynchronous Reinforcement Learning System
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The project name is consistently referred to as "AReaL" throughout the documentation (e.g., lines 15, 19, 23), repository URLs, and citations. Changing it to "AReaLs" in the main header appears to be a typo and introduces inconsistency with the rest of the file.

Suggested change
<em>AReaLs</em>: A Large-Scale Asynchronous Reinforcement Learning System
<em>AReaL</em>: A Large-Scale Asynchronous Reinforcement Learning System

sunnyxyli added 2 commits April 15, 2026 16:41
@sjmshsh sjmshsh requested review from nuzant and rchardx as code owners April 15, 2026 08:41
@sjmshsh sjmshsh changed the title [Feature] Elastic weight update setup and acceleration #1101 [draft] Elastic weight update setup and acceleration #1101 Apr 15, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant