[draft] Elastic weight update setup and acceleration #1101 by sjmshsh · Pull Request #1188 · inclusionAI/AReaL

sjmshsh · 2026-04-15T07:23:21Z

Change Summary: Elastic Topology, Megatron Pipeline, and Archon Weight Sync

Background

This document summarizes the code changes completed for the remaining A / B / C work items:

A: auto-scaling awareness, health monitoring, and dynamic topology handling
B: Megatron distributed weight update pipelining
C: targeted regression test additions

It also records the hidden issues fixed along the way and the latest validation status.

Scope of Changes

A. Elastic topology and health monitoring

The inference side was extended so that remote inference workers can react to topology changes more safely.

Key additions:

New InferenceEngineConfig fields for health monitoring and topology control
Background health monitoring support in RemoteInfEngine
Automatic removal of unhealthy servers after consecutive failures
Automatic discovery of newly available servers
Topology change signaling for training-side group rebuild
Target-address filtering support for partial weight updates

Primary files:

areal/api/cli_args.py
areal/infra/remote_inf_engine.py

Implemented behavior:

enable_health_monitor
health_check_interval_seconds
health_check_failure_threshold
topology_change_cooldown_seconds
enable_topology_discovery
consume_group_rebuild_request()
get_last_topology_change_time()
active server tracking and selective target address resolution

Expected impact:

rollout workers can tolerate server churn more gracefully
unhealthy servers can be removed from the active set automatically
newly discovered servers can be synchronized and included later
topology rebuild decisions can be coordinated with less inconsistency

B. Megatron single-pending-bucket pipeline

Megatron distributed weight update flow was changed from a fully serialized pattern to a single pending bucket pipeline model.

Primary file:

areal/engine/megatron_engine.py

Key additions:

_PendingWeightUpdateBucket
_update_bucket_weights_from_distributed_async()
_wait_pending_weight_update_bucket()

What changed in practice:

the next bucket can issue the remote update request earlier
tensor broadcast can proceed asynchronously for the current bucket
the previous bucket is waited on in a controlled order
the control path and data path are no longer forced into a fully serialized sequence

Expected impact:

reduced idle gaps between HTTP control and XCCL communication
better overlap between request scheduling and weight broadcast
behavior closer to the already improved FSDP-style pipeline

C. Regression test additions

Targeted tests were added to cover the new behavior.

Primary files:

tests/test_rollout_controller.py
tests/test_megatron_engine.py

Added coverage:

unhealthy server deregistration after repeated health failures
discovery of a newly available inference server
target_server_addresses filtering logic
Megatron pending bucket wait order regression

Archon Improvements

Archon was also updated so that it participates in elastic topology handling rather than only carrying protocol-compatible metadata.

Primary file:

areal/experimental/engine/archon_weight_sync.py

Key changes:

added topology-aware rebuild flow via maybe_rebuild_weight_update_group()
added disk fallback sync for newly joined inference servers
added topology cooldown handling before immediate rebuild
added group destroy and recreate flow for updated server topology
synchronized topology rebuild decisions across ranks with broadcasted state

Expected impact:

Archon can follow topology changes in a way that is much closer to the main training engines
newly joined inference servers can be warmed up before entering the XCCL group
rebuild decisions are less likely to diverge across ranks

Important Hidden Issues Fixed

1. Disk fallback path source was incorrect

Problem:

some training-side fallback logic used self.config.fileroot
the training engine config does not reliably own that field

Fix:

fallback path resolution was changed to read from rollout_engine.config.fileroot

Affected areas:

areal/engine/fsdp_engine.py
areal/engine/megatron_engine.py
areal/experimental/engine/archon_weight_sync.py

Why it matters:

without this change, real topology expansion could fail during disk-based synchronization of newly added inference servers

2. Conditional barrier deadlock risk during rebuild

Problem:

previous rebuild logic could allow only part of the ranks to enter a barrier
that creates a real deadlock risk in distributed execution

Fix:

rebuild flow was normalized so all ranks enter the synchronization points consistently
only the designated rank or head performs group destroy operations
topology change decisions are aligned across ranks before rebuild

Why it matters:

reduces the chance of distributed hangs during server join/leave events
makes rebuild behavior more deterministic across the process group

Files Involved

Core implementation files:

areal/api/cli_args.py
areal/infra/remote_inf_engine.py
areal/engine/fsdp_engine.py
areal/engine/megatron_engine.py
areal/experimental/engine/archon_weight_sync.py

Test files:

tests/test_rollout_controller.py
tests/test_megatron_engine.py

Validation Status

Latest known status from this round:

targeted lint checks for the touched files were cleaned up
targeted test execution was attempted afterward
full automated verification was not completed in the current handoff flow

This document should therefore be treated as a change summary, not as a final release certification.

Remaining Gaps

The following items were not fully completed in this round:

AWEX still remains at the protocol-reserved stage and does not yet provide a true CUDA IPC or shared-memory implementation
topology discovery currently mainly reuses the existing name resolution path, and could be extended further if a stronger production registry exists
stronger timeout-based fault eviction and forced shrink/continue behavior for distributed rebuilds is still not fully closed-loop

Suggested Next Steps

If this work is continued, the highest-value follow-up items are:

add integration tests for join, cooldown, delayed rebuild, and unhealthy node removal flows
add a minimal runnable AWEX skeleton so the protocol path is complete even before a zero-copy implementation is introduced
validate topology rebuild behavior under a real multi-rank environment instead of only targeted local checks

gemini-code-assist

Code Review

This pull request updates the project title in the README.md file from 'AReaL' to 'AReaLs'. Feedback indicates that this change is likely a typo, as it introduces inconsistency with the rest of the documentation where the original name is used, and a suggestion has been provided to revert it.

gemini-code-assist · 2026-04-15T07:24:10Z

@@ -1,5 +1,5 @@
 <h1 align="center">
-<em>AReaL</em>: A Large-Scale Asynchronous Reinforcement Learning System
+<em>AReaLs</em>: A Large-Scale Asynchronous Reinforcement Learning System


The project name is consistently referred to as "AReaL" throughout the documentation (e.g., lines 15, 19, 23), repository URLs, and citations. Changing it to "AReaLs" in the main header appears to be a typo and introduces inconsistency with the rest of the file.

Suggested change

<em>AReaLs</em>: A Large-Scale Asynchronous Reinforcement Learning System

<em>AReaL</em>: A Large-Scale Asynchronous Reinforcement Learning System

init pr

6630395

sjmshsh requested a review from garrett4wade as a code owner April 15, 2026 07:23

gemini-code-assist bot reviewed Apr 15, 2026

View reviewed changes

sunnyxyli added 2 commits April 15, 2026 16:41

v1 code

18838f3

v1 code

777cdd0

sjmshsh requested review from nuzant and rchardx as code owners April 15, 2026 08:41

sjmshsh changed the title ~~[Feature] Elastic weight update setup and acceleration #1101~~ [draft] Elastic weight update setup and acceleration #1101 Apr 15, 2026

v1 code

b64d1cd

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[draft] Elastic weight update setup and acceleration #1101#1188

[draft] Elastic weight update setup and acceleration #1101#1188
sjmshsh wants to merge 4 commits intoinclusionAI:mainfrom
sjmshsh:feature/elastic_weight_update

sjmshsh commented Apr 15, 2026 •

edited

Loading

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

gemini-code-assist bot Apr 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

	<em>AReaLs</em>: A Large-Scale Asynchronous Reinforcement Learning System
	<em>AReaL</em>: A Large-Scale Asynchronous Reinforcement Learning System

Conversation

sjmshsh commented Apr 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Change Summary: Elastic Topology, Megatron Pipeline, and Archon Weight Sync

Background

Scope of Changes

A. Elastic topology and health monitoring

B. Megatron single-pending-bucket pipeline

C. Regression test additions

Archon Improvements

Important Hidden Issues Fixed

1. Disk fallback path source was incorrect

2. Conditional barrier deadlock risk during rebuild

Files Involved

Validation Status

Remaining Gaps

Suggested Next Steps

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Apr 15, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

sjmshsh commented Apr 15, 2026 •

edited

Loading