
Add cross-job checkpoint promotion script and training shape E2E tests #52

Draft
morgendave wants to merge 5 commits into main from
cursor/cross-job-checkpoint-promotion-39fb

Conversation

morgendave commented Apr 1, 2026

Root Cause & E2E Results

Customer 404 error — RESOLVED

SDK v1.0.0a35 bug: resolve_training_profile() looks up shapes under the caller's own account instead of accounts/fireworks/. Fixed in v1.0.0a61.

Fix: pip install --pre 'fireworks-ai[training]>=1.0.0a61'

GRPO LoRA E2E — Results

Tested cookbook GRPO with qwen3p5-9b-256k-lora (LoRA rank 16) on aialabs account:

| Test | Result |
| --- | --- |
| Shape resolution (`resolve_training_profile`) | PASS |
| Job creation (LoRA policy trainer, B200x2) | PASS |
| Trainer reaches RUNNING (~6 min) | PASS |
| Deployment creation + hotload | PASS |
| Inference sampling (5 rounds, 80 completions) | PASS |
| GRPO without reference (`kl_beta=0`) | PASS (full pipeline) |
| GRPO with inline LoRA reference (`kl_beta>0`) | BLOCKED — hangs on `save_weights_for_sampler_ext` after creating a `lora_rank=0` client |

Cookbook LoRA GRPO issue

The cookbook's rl_loop.py has two problems for LoRA GRPO with kl_beta > 0:

  1. Reference job creation passes lora_rank=cfg.lora_rank (e.g. 16) to the forward-only reference shape, which rejects it with HTTP 400 (the shape expects FORWARD_ONLY mode, not LORA_TRAINER).

  2. The inline-reference approach (creating a lora_rank=0 client on the same policy trainer) is the correct pattern, but creating a second FiretitanServiceClient with lora_rank=0 on the same trainer endpoint causes save_weights_for_sampler_ext to hang indefinitely. This appears to be a trainer-side issue: two active sessions (one LoRA, one full-param) block checkpoint operations.

CI coverage gap

The existing smoke tests (test_grpo_smoke.py) use kl_beta=0 and lora_rank=0 — they don't cover the LoRA + reference model path.
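One way to close this gap is to parametrize the smoke tests over the full kl_beta / lora_rank matrix. A minimal sketch (the helper name and config fields are hypothetical, not taken from test_grpo_smoke.py):

```python
from itertools import product

def grpo_smoke_matrix(kl_betas=(0.0, 0.1), lora_ranks=(0, 16)):
    """Enumerate GRPO smoke-test configs so the LoRA + reference-model
    path (kl_beta > 0 with lora_rank > 0) is exercised, not just the
    kl_beta=0 / lora_rank=0 case the current tests cover."""
    return [
        {"kl_beta": kb, "lora_rank": lr, "needs_reference": kb > 0}
        for kb, lr in product(kl_betas, lora_ranks)
    ]
```

With the defaults this yields four configs, one of which (kl_beta=0.1, lora_rank=16) is exactly the path that is currently blocked.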

Cleaned up

  • Deleted 8 stale trainer jobs from aialabs account
  • Scaled down 3 test deployments to zero

Slack Thread


When the original RLOR trainer job has been deleted, `forge rft promote`
fails with HTTP 404 because it tries to list checkpoints on a job that
no longer exists.

This script works around the problem by:
1. Reading checkpoints.jsonl to find the DCP state_path for the target step
2. Spinning up a temporary service-mode RLOR trainer with the same base model
3. Loading the cross-job DCP checkpoint into the temporary trainer
4. Exporting a sampler (HF-format) checkpoint from the loaded weights
5. Promoting the sampler checkpoint to a deployable Fireworks model
6. Cleaning up (deleting the temporary trainer)
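Step 1 might be sketched as follows; the record field names (`step`, `state_path`) are assumptions about the checkpoints.jsonl layout, not confirmed from the script:

```python
import json

def find_state_path(jsonl_path: str, target_step: int) -> str:
    """Scan a checkpoints.jsonl manifest for the DCP state_path
    recorded at the requested training step."""
    with open(jsonl_path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            rec = json.loads(line)
            if rec.get("step") == target_step:
                return rec["state_path"]
    raise ValueError(f"no checkpoint entry for step {target_step} in {jsonl_path}")
```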

Supports:
- Auto-inference of training shape from base model name
- Direct promotion fallback when sampler_path exists
- Dry-run mode to preview the plan without creating resources
- Automatic cleanup of temporary trainer on success or failure
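The shape auto-inference could look like the sketch below. The mapping reuses the qwen3p5 LoRA shape names listed later in this thread; the matching rule (longest substring match on the base model name) is an assumption, not the script's actual logic:

```python
# Hypothetical map: base-model name fragment -> LoRA training shape.
SHAPE_MAP = {
    "qwen3p5-9b": "qwen3p5-9b-256k-lora",
    "qwen3p5-27b": "qwen3p5-27b-256k-lora",
    "qwen3p5-35b-a3b": "qwen3p5-35b-a3b-256k-lora",
    "qwen3p5-397b-a17b": "qwen3p5-397b-a17b-256k-lora",
}

def infer_training_shape(base_model: str) -> str:
    """Pick the training shape whose key fragment appears in the base
    model name, preferring the longest (most specific) match."""
    matches = [k for k in SHAPE_MAP if k in base_model]
    if not matches:
        raise ValueError(f"no training shape known for {base_model}")
    return SHAPE_MAP[max(matches, key=len)]
```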

Co-authored-by: zhiweiz <morgendave@users.noreply.github.com>

vercel Bot commented Apr 1, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| functional-chat | Ready | Preview, Comment | Apr 16, 2026 10:31pm |
| transcription-demo | Ready | Preview, Comment | Apr 16, 2026 10:31pm |


Root cause analysis:
- All Qwen3 (legacy) LoRA training shape versions have public=False
- This prevents external accounts (like AILabs) from resolving them
  via resolve_training_profile()
- The shapes exist and have latestValidated versions, but those
  versions aren't marked public
- Qwen 3.5 (qwen3p5-*) LoRA shapes are properly configured with
  public=True and work correctly

Failing shapes (public=False on latestValidated):
  - qwen3-4b-256k-h200-lora
  - qwen3-8b-256k-h200-lora
  - qwen3-4b-minimum-h200-lora
  - qwen3-4b-minimum-h200-forward-lora
  - qwen3-235b-2507-instruct-128k-b200-lora
  - qwen3-235b-2507-instruct-128k-b200-forward-only-lora

Working shapes (public=True):
  - qwen3p5-9b-256k-lora
  - qwen3p5-27b-256k-lora
  - qwen3p5-35b-a3b-256k-lora
  - qwen3p5-397b-a17b-256k-lora
  - qwen3-vl-8b-256k-h200-lora

Fix: Mark the latestValidated version as public=True for each
     failing legacy shape, or direct users to the new qwen3p5 shapes.
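A quick audit for this condition can be scripted. The sketch below assumes each shape is represented as a dict with a latestValidated version object carrying a public flag (field names inferred from the analysis above, not from an actual API response):

```python
def partition_by_visibility(shapes):
    """Split shapes into (resolvable, failing) by whether their
    latestValidated version is marked public -- the condition
    resolve_training_profile() needs to succeed from another account."""
    resolvable, failing = [], []
    for shape in shapes:
        version = shape.get("latestValidated") or {}
        (resolvable if version.get("public") else failing).append(shape["name"])
    return resolvable, failing
```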

Also updated cross_job_promote.py shape map to include Qwen 3.5 shapes.

Co-authored-by: zhiweiz <morgendave@users.noreply.github.com>
cursor Bot changed the title from "Add cross-job checkpoint promotion script" to "Add cross-job checkpoint promotion script and training shape E2E tests" on Apr 16, 2026
Updated test to focus on the actual customer issue: AILabs can't access
qwen3p5-*-lora shapes with a permission error.

Test now covers three phases:
1. Shape metadata validation (exists, trainerMode, versions, public/validated)
2. Dependency chain access (base model, deployment shape, deployment
   shape version — all must be accessible and public)
3. Job lifecycle E2E (opt-in: creates and deletes a real RLOR trainer job)

All qwen3p5 LoRA shapes pass from the pyroworks account:
- Shape versions have latestValidated=True and public=True
- Deployment shape versions are public and validated
- Base models are READY and accessible
- Job creation succeeds

The permission issue for AILabs is likely at the account level:
- Accelerator quota (B200/B300 access)
- Service-mode RLOR permission
- Or SDK version mismatch

Run with AILabs API key to reproduce:
  FIREWORKS_API_KEY=<ailabs-key> python test_qwen3_lora_shapes.py -v

Co-authored-by: zhiweiz <morgendave@users.noreply.github.com>
Now uses the actual fireworks.training.sdk to exercise:
1. resolve_training_profile() - the exact API the SDK calls
   (GET /versions?filter=latest_validated=true&pageSize=1)
2. TrainerJobConfig validation with shape ref
3. Job lifecycle via mgr.create() with ?trainingShape= query param

Key findings from reading the SDK source:
- resolve_training_profile() filters versions with
  'latest_validated=true' server-side
- If the server returns no versions, it raises:
  'No latest validated training-shape version was returned'
- Job creation passes the shape ref as a QUERY PARAMETER
  (?trainingShape=...), not in the request body
- The server does shape-to-infra resolution and validation
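The two request shapes from those findings can be illustrated with plain URL construction. This is a sketch: the base URL and resource path names are illustrative assumptions, only the query-parameter behavior is taken from the findings above:

```python
from urllib.parse import urlencode

BASE = "https://api.fireworks.ai/v1"  # assumed base URL

def versions_request_url(shape_ref: str) -> str:
    """The resolve_training_profile() lookup: list a shape's versions,
    filtered server-side to the latest validated one."""
    query = urlencode({"filter": "latest_validated=true", "pageSize": 1})
    return f"{BASE}/{shape_ref}/versions?{query}"

def create_job_url(account: str, shape_ref: str) -> str:
    """Job creation: the shape ref travels as a query parameter
    (?trainingShape=...), not in the request body."""
    query = urlencode({"trainingShape": shape_ref})
    return f"{BASE}/accounts/{account}/trainerJobs?{query}"
```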

All 4 qwen3p5-*-lora shapes pass resolution and config validation.
The permission issue for AILabs is server-side — either:
- The 'latest_validated=true' filter returns empty for their account
- Or the job creation with ?trainingShape= fails at server validation

Run with any account's API key to diagnose:
  FIREWORKS_API_KEY=<key> python test_qwen3_lora_shapes.py -v

Co-authored-by: zhiweiz <morgendave@users.noreply.github.com>
Co-authored-by: zhiweiz <morgendave@users.noreply.github.com>