[draft] Elastic weight update setup and acceleration #1101 #1188
Open
sjmshsh wants to merge 4 commits into inclusionAI:main
Contributor
Code Review
This pull request updates the project title in the README.md file from 'AReaL' to 'AReaLs'. Feedback indicates that this change is likely a typo, as it introduces inconsistency with the rest of the documentation where the original name is used, and a suggestion has been provided to revert it.
@@ -1,5 +1,5 @@
 <h1 align="center">
-<em>AReaL</em>: A Large-Scale Asynchronous Reinforcement Learning System
+<em>AReaLs</em>: A Large-Scale Asynchronous Reinforcement Learning System
Contributor
The project name is consistently referred to as "AReaL" throughout the documentation (e.g., lines 15, 19, 23), repository URLs, and citations. Changing it to "AReaLs" in the main header appears to be a typo and introduces inconsistency with the rest of the file.
Suggested change
-<em>AReaLs</em>: A Large-Scale Asynchronous Reinforcement Learning System
+<em>AReaL</em>: A Large-Scale Asynchronous Reinforcement Learning System
added 2 commits
April 15, 2026 16:41
Change Summary: Elastic Topology, Megatron Pipeline, and Archon Weight Sync
Background
This document summarizes the code changes completed for the remaining A / B / C work items, which are described under Scope of Changes below.
It also records the hidden issues fixed along the way and the latest validation status.
Scope of Changes
A. Elastic topology and health monitoring
The inference side was extended so that remote inference workers can react to topology changes more safely.
Key additions:
- InferenceEngineConfig fields for health monitoring and topology control
- RemoteInfEngine
Primary files:
- areal/api/cli_args.py
- areal/infra/remote_inf_engine.py
Implemented behavior:
- enable_health_monitor
- health_check_interval_seconds
- health_check_failure_threshold
- topology_change_cooldown_seconds
- enable_topology_discovery
- consume_group_rebuild_request()
- get_last_topology_change_time()
Expected impact:
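
The health-monitoring fields and the two accessor methods listed in this section could interact roughly as in the following minimal sketch. Only the field and method names come from this summary. The types, the default values, the `HealthMonitorConfig` and `TopologyTracker` classes, and the exact threshold and cooldown semantics are assumptions made for illustration, not the actual `InferenceEngineConfig` or `RemoteInfEngine` code.

```python
import time
from dataclasses import dataclass


@dataclass
class HealthMonitorConfig:
    # Field names come from the summary; types and defaults are assumptions.
    enable_health_monitor: bool = True
    health_check_interval_seconds: float = 5.0
    health_check_failure_threshold: int = 3
    topology_change_cooldown_seconds: float = 30.0
    enable_topology_discovery: bool = False


class TopologyTracker:
    """Hypothetical helper: turns repeated health-check failures into rebuild requests."""

    def __init__(self, config, clock=time.monotonic):
        self.config = config
        self._clock = clock
        self._failures = {}  # server address -> consecutive failed checks
        self._rebuild_requested = False
        self._last_change_time = None

    def record_health_check(self, address, healthy):
        if not self.config.enable_health_monitor:
            return
        if healthy:
            self._failures[address] = 0
            return
        self._failures[address] = self._failures.get(address, 0) + 1
        if self._failures[address] < self.config.health_check_failure_threshold:
            return
        now = self._clock()
        in_cooldown = (
            self._last_change_time is not None
            and now - self._last_change_time
            < self.config.topology_change_cooldown_seconds
        )
        if not in_cooldown:
            # Assumed semantics: a topology change is flagged at most once per cooldown window.
            self._rebuild_requested = True
            self._last_change_time = now
            self._failures[address] = 0

    def consume_group_rebuild_request(self):
        # Returns True at most once per request, then clears the flag.
        requested = self._rebuild_requested
        self._rebuild_requested = False
        return requested

    def get_last_topology_change_time(self):
        return self._last_change_time
```

In this sketch a trainer-side loop would call `consume_group_rebuild_request()` before each weight update and rebuild its communication group only when it returns `True`; the cooldown keeps a flapping server from triggering back-to-back rebuilds.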
B. Megatron single-pending-bucket pipeline
The Megatron distributed weight update flow was changed from a fully serialized pattern to a single-pending-bucket pipeline model.
Primary file:
- areal/engine/megatron_engine.py
Key additions:
- _PendingWeightUpdateBucket
- _update_bucket_weights_from_distributed_async()
- _wait_pending_weight_update_bucket()
What changed in practice:
Expected impact:
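
The single-pending-bucket idea described in this section can be sketched as follows. This is a simplified illustration under stated assumptions, not the code in `areal/engine/megatron_engine.py`: `PendingBucket`, `pipelined_update`, `prepare`, and `send` are hypothetical names, and a one-worker thread pool stands in for the asynchronous distributed transfer.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class PendingBucket:
    """Hypothetical stand-in for an in-flight bucket (not the PR's class)."""

    index: int
    future: Future


def pipelined_update(
    buckets: Iterable[list],
    prepare: Callable[[list], list],
    send: Callable[[list], None],
) -> List[int]:
    """Prepare bucket i+1 while bucket i is still being sent.

    At most one send is in flight at any time.
    """
    completed: List[int] = []
    pending: Optional[PendingBucket] = None
    with ThreadPoolExecutor(max_workers=1) as pool:
        for index, bucket in enumerate(buckets):
            payload = prepare(bucket)  # overlaps with the pending send, if any
            if pending is not None:
                # Wait for the previous bucket (and surface its errors)
                # before launching the next send.
                pending.future.result()
                completed.append(pending.index)
            pending = PendingBucket(index, pool.submit(send, payload))
        if pending is not None:  # drain the final bucket
            pending.future.result()
            completed.append(pending.index)
    return completed
```

A fully serialized version would call `send(prepare(bucket))` in a plain loop; the sketch differs only in that preparation of the next bucket overlaps with the transfer of the current one, while memory stays bounded to one outstanding bucket.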
C. Regression test additions
Targeted tests were added to cover the new behavior.
Primary files:
- tests/test_rollout_controller.py
- tests/test_megatron_engine.py
Added coverage:
- target_server_addresses filtering logic
Archon Improvements
Archon was also updated so that it participates in elastic topology handling rather than only carrying protocol-compatible metadata.
Primary file:
- areal/experimental/engine/archon_weight_sync.py
Key changes:
- maybe_rebuild_weight_update_group()
Expected impact:
Important Hidden Issues Fixed
1. Disk fallback path source was incorrect
Problem:
- self.config.fileroot
Fix:
- rollout_engine.config.fileroot
Affected areas:
- areal/engine/fsdp_engine.py
- areal/engine/megatron_engine.py
- areal/experimental/engine/archon_weight_sync.py
Why it matters:
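
The change of path source can be illustrated with a short sketch. Only the attribute path `rollout_engine.config.fileroot` comes from this summary; the `RolloutConfig` and `RolloutEngine` classes, the `disk_fallback_dir` helper, and the `weight_update/v{version}` directory layout are hypothetical and exist only to make the example runnable.

```python
import os
from dataclasses import dataclass


@dataclass
class RolloutConfig:  # hypothetical rollout-side config
    fileroot: str


@dataclass
class RolloutEngine:  # hypothetical stand-in for the rollout engine handle
    config: RolloutConfig


def disk_fallback_dir(rollout_engine: RolloutEngine, version: int) -> str:
    # Resolve the fallback directory from the rollout engine's config,
    # not from the training engine's own config.
    # The "weight_update/v{version}" layout is illustrative only.
    return os.path.join(
        rollout_engine.config.fileroot, "weight_update", f"v{version}"
    )
```

For example, `disk_fallback_dir(RolloutEngine(RolloutConfig("/shared/rollout")), 3)` resolves under `/shared/rollout` regardless of what the training engine's own `fileroot` is set to.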
2. Conditional barrier deadlock risk during rebuild
Problem:
Fix:
Why it matters:
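
The specifics of this problem and its fix are not recorded here. As a generic illustration of the hazard named in the heading (not the PR's actual code), the sketch below simulates ranks as threads: when each rank decides locally whether to enter a barrier, the ranks that do enter wait for peers that never arrive. `run_ranks` and its parameters are hypothetical, and `threading.Barrier` with a timeout stands in for a distributed barrier, which would hang rather than raise.

```python
import threading


def run_ranks(needs_rebuild_per_rank, conditional, timeout=0.5):
    """Simulate ranks as threads; return True if every rank passed the barrier."""
    world_size = len(needs_rebuild_per_rank)
    barrier = threading.Barrier(world_size)
    ok = [True] * world_size

    def rank_fn(rank):
        if conditional and not needs_rebuild_per_rank[rank]:
            return  # this rank skips the barrier entirely
        try:
            barrier.wait(timeout=timeout)
        except threading.BrokenBarrierError:
            ok[rank] = False  # a real collective would hang here instead

    threads = [
        threading.Thread(target=rank_fn, args=(r,)) for r in range(world_size)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return all(ok)
```

With `conditional=True` and ranks that disagree, at least one rank times out; with `conditional=False` every rank reaches the barrier and all pass. A common remedy, offered here only as an assumption about what such a fix might look like, is to make the decision collectively first so that either all ranks or no ranks enter the barrier.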
Files Involved
Core implementation files:
- areal/api/cli_args.py
- areal/infra/remote_inf_engine.py
- areal/engine/fsdp_engine.py
- areal/engine/megatron_engine.py
- areal/experimental/engine/archon_weight_sync.py
Test files:
- tests/test_rollout_controller.py
- tests/test_megatron_engine.py
Validation Status
Latest known status from this round:
This document should therefore be treated as a change summary, not as a final release certification.
Remaining Gaps
The following items were not fully completed in this round:
Suggested Next Steps
If this work is continued, the highest-value follow-up items are: