Add cross-job checkpoint promotion script and training shape E2E tests#52
Draft
morgendave wants to merge 5 commits into main from
Conversation
When the original RLOR trainer job has been deleted, `forge rft promote` fails with HTTP 404 because it tries to list checkpoints on a job that no longer exists. This script works around the problem by:
1. Reading checkpoints.jsonl to find the DCP state_path for the target step
2. Spinning up a temporary service-mode RLOR trainer with the same base model
3. Loading the cross-job DCP checkpoint into the temporary trainer
4. Exporting a sampler (HF-format) checkpoint from the loaded weights
5. Promoting the sampler checkpoint to a deployable Fireworks model
6. Cleaning up (deleting the temporary trainer)

Supports:
- Auto-inference of training shape from base model name
- Direct promotion fallback when sampler_path exists
- Dry-run mode to preview the plan without creating resources
- Automatic cleanup of the temporary trainer on success or failure

Co-authored-by: zhiweiz <morgendave@users.noreply.github.com>
Root cause analysis:
- All Qwen3 (legacy) LoRA training shape versions have public=False
- This prevents external accounts (like AILabs) from resolving them
via resolve_training_profile()
- The shapes exist and have latestValidated versions, but those
versions aren't marked public
- Qwen 3.5 (qwen3p5-*) LoRA shapes are properly configured with
public=True and work correctly
Failing shapes (public=False on latestValidated):
- qwen3-4b-256k-h200-lora
- qwen3-8b-256k-h200-lora
- qwen3-4b-minimum-h200-lora
- qwen3-4b-minimum-h200-forward-lora
- qwen3-235b-2507-instruct-128k-b200-lora
- qwen3-235b-2507-instruct-128k-b200-forward-only-lora
Working shapes (public=True):
- qwen3p5-9b-256k-lora
- qwen3p5-27b-256k-lora
- qwen3p5-35b-a3b-256k-lora
- qwen3p5-397b-a17b-256k-lora
- qwen3-vl-8b-256k-h200-lora
Fix: Mark the latestValidated version as public=True for each
failing legacy shape, or direct users to the new qwen3p5 shapes.
Also updated cross_job_promote.py shape map to include Qwen 3.5 shapes.
Co-authored-by: zhiweiz <morgendave@users.noreply.github.com>
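The shape-map update mentioned above can be sketched as follows; the dict name, key format, and `infer_shape` helper are assumptions for illustration, while the shape names come from the working list in this comment:

```python
# Hypothetical sketch of the cross_job_promote.py shape map after adding
# the Qwen 3.5 entries. The mapping name and base-model keys are assumed;
# the shape names are the public=True shapes listed above.
BASE_MODEL_TO_LORA_SHAPE = {
    "qwen3p5-9b": "qwen3p5-9b-256k-lora",
    "qwen3p5-27b": "qwen3p5-27b-256k-lora",
    "qwen3p5-35b-a3b": "qwen3p5-35b-a3b-256k-lora",
    "qwen3p5-397b-a17b": "qwen3p5-397b-a17b-256k-lora",
}

def infer_shape(base_model: str) -> str:
    """Auto-infer a training shape from a base model name (sketch)."""
    for key, shape in BASE_MODEL_TO_LORA_SHAPE.items():
        if key in base_model:
            return shape
    raise ValueError(f"no training shape known for {base_model}")
```

Substring matching keeps the map tolerant of fully qualified model refs (e.g. an `accounts/.../models/...` prefix) without hard-coding account names.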
Updated the test to focus on the actual customer issue: AILabs can't access qwen3p5-*-lora shapes and gets a permission error.

The test now covers three phases:
1. Shape metadata validation (exists, trainerMode, versions, public/validated)
2. Dependency chain access (base model, deployment shape, deployment shape version: all must be accessible and public)
3. Job lifecycle E2E (opt-in: creates and deletes a real RLOR trainer job)

All qwen3p5 LoRA shapes pass from the pyroworks account:
- Shape versions have latestValidated=True and public=True
- Deployment shape versions are public and validated
- Base models are READY and accessible
- Job creation succeeds

The permission issue for AILabs is likely at the account level:
- Accelerator quota (B200/B300 access)
- Service-mode RLOR permission
- Or an SDK version mismatch

Run with the AILabs API key to reproduce:
FIREWORKS_API_KEY=<ailabs-key> python test_qwen3_lora_shapes.py -v

Co-authored-by: zhiweiz <morgendave@users.noreply.github.com>
Now uses the actual fireworks.training.sdk to exercise:
1. resolve_training_profile(): the exact API the SDK calls (GET /versions?filter=latest_validated=true&pageSize=1)
2. TrainerJobConfig validation with a shape ref
3. Job lifecycle via mgr.create() with the ?trainingShape= query param

Key findings from reading the SDK source:
- resolve_training_profile() filters versions with 'latest_validated=true' server-side
- If the server returns no versions, it raises: 'No latest validated training-shape version was returned'
- Job creation passes the shape ref as a QUERY PARAMETER (?trainingShape=...), not in the request body
- The server does shape-to-infra resolution and validation

All 4 qwen3p5-*-lora shapes pass resolution and config validation. The permission issue for AILabs is server-side; either:
- The 'latest_validated=true' filter returns empty for their account
- Or job creation with ?trainingShape= fails at server validation

Run with any account's API key to diagnose:
FIREWORKS_API_KEY=<key> python test_qwen3_lora_shapes.py -v

Co-authored-by: zhiweiz <morgendave@users.noreply.github.com>
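The two requests described above can be sketched as URL builders; this is only a sketch for diagnosis, and the base URL and resource path segments (`/v1/...`, `trainerJobs`) are assumptions, while the `filter=latest_validated=true&pageSize=1` and `?trainingShape=` parameters are the ones quoted from the SDK source:

```python
from urllib.parse import urlencode

API_BASE = "https://api.fireworks.ai"  # assumed base URL

def resolve_versions_url(shape_ref: str) -> str:
    # resolve_training_profile() asks the server for the single latest
    # validated version of the shape (filter is applied server-side).
    query = urlencode({"filter": "latest_validated=true", "pageSize": 1})
    return f"{API_BASE}/v1/{shape_ref}/versions?{query}"

def create_job_url(account: str, shape_ref: str) -> str:
    # The shape ref travels as a query parameter, not in the request body;
    # the "trainerJobs" collection name here is an assumption.
    query = urlencode({"trainingShape": shape_ref})
    return f"{API_BASE}/v1/accounts/{account}/trainerJobs?{query}"
```

Note that `urlencode` percent-encodes the `=` inside `latest_validated=true` (as `%3D`), which is what a server-side filter expression expects when passed as a single query value.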
Co-authored-by: zhiweiz <morgendave@users.noreply.github.com>
Root Cause & E2E Results
Customer 404 error: RESOLVED

SDK v1.0.0a35 bug: `resolve_training_profile()` looks up shapes under the caller's own account instead of `accounts/fireworks/`. Fixed in v1.0.0a61.

Fix: `pip install --pre 'fireworks-ai[training]>=1.0.0a61'`

GRPO LoRA E2E results
Tested cookbook GRPO with `qwen3p5-9b-256k-lora` (LoRA rank 16) on the `aialabs` account:
- Shape resolution works (`resolve_training_profile`)
- Training works without a reference model (`kl_beta=0`)
- Training with a reference model (`kl_beta>0`) fails: `save_weights_for_sampler_ext` hangs after creating a `lora_rank=0` client
Cookbook LoRA GRPO issue

The cookbook's `rl_loop.py` has two problems for LoRA GRPO with `kl_beta > 0`:
1. Reference job creation passes `lora_rank=cfg.lora_rank` (e.g. 16) to the forward-only reference shape, which rejects it with HTTP 400 (the shape expects `FORWARD_ONLY` mode, not `LORA_TRAINER`).
2. Inline reference (the correct approach: create a `lora_rank=0` client on the same policy trainer): creating a second `FiretitanServiceClient` with `lora_rank=0` on the same trainer endpoint causes `save_weights_for_sampler_ext` to hang indefinitely. This appears to be a trainer-side issue where having two active sessions (one LoRA, one full-param) blocks checkpoint operations.
CI coverage gap

The existing smoke tests (`test_grpo_smoke.py`) use `kl_beta=0` and `lora_rank=0`, so they don't cover the LoRA + reference-model path.

Cleaned up the `aialabs` account.

Slack Thread
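One way to close the coverage gap above is to enumerate the smoke-test config matrix explicitly; this is a minimal sketch, and the `0.04` KL value and config-dict shape are illustrative assumptions, not values from the repo:

```python
# Hypothetical sketch: enumerate GRPO smoke-test configs so that the
# LoRA + reference-model path (kl_beta > 0, lora_rank > 0) is covered.
from itertools import product

KL_BETAS = [0.0, 0.04]   # 0.04 is an assumed example nonzero value
LORA_RANKS = [0, 16]

configs = [
    {"kl_beta": kl, "lora_rank": rank}
    for kl, rank in product(KL_BETAS, LORA_RANKS)
]

# The existing smoke test only exercises kl_beta=0, lora_rank=0;
# this matrix adds the missing LoRA + reference combination.
assert {"kl_beta": 0.04, "lora_rank": 16} in configs
```

Until the trainer-side hang described above is fixed, the `kl_beta>0, lora_rank>0` cell would need to be marked as an expected failure rather than skipped silently.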