
Add cross-job checkpoint promotion script and training shape E2E tests #52

Draft
morgendave wants to merge 5 commits into main from
cursor/cross-job-checkpoint-promotion-39fb

Conversation

morgendave commented Apr 1, 2026

Root Cause & E2E Results

Customer 404 error — RESOLVED

SDK v1.0.0a35 bug: resolve_training_profile() looks up shapes under the caller's own account instead of accounts/fireworks/. Fixed in v1.0.0a61.

Fix: pip install --pre 'fireworks-ai[training]>=1.0.0a61'

GRPO LoRA E2E — Results

Tested cookbook GRPO with qwen3p5-9b-256k-lora (LoRA rank 16) on aialabs account:

| Test | Result |
| --- | --- |
| Shape resolution (`resolve_training_profile`) | PASS |
| Job creation (LoRA policy trainer, B200x2) | PASS |
| Trainer reaches RUNNING (~6 min) | PASS |
| Deployment creation + hotload | PASS |
| Inference sampling (5 rounds, 80 completions) | PASS |
| GRPO without reference (`kl_beta=0`) | PASS (full pipeline) |
| GRPO with inline LoRA reference (`kl_beta>0`) | BLOCKED — hangs on `save_weights_for_sampler_ext` after creating a `lora_rank=0` client |

Cookbook LoRA GRPO issue

The cookbook's rl_loop.py has two problems for LoRA GRPO with kl_beta > 0:

  1. Reference job creation passes lora_rank=cfg.lora_rank (e.g. 16) to the forward-only reference shape, which rejects it with HTTP 400 (the shape expects FORWARD_ONLY mode, not LORA_TRAINER).

  2. The inline-reference approach (creating a lora_rank=0 client on the same policy trainer) is the correct pattern, but creating a second FiretitanServiceClient with lora_rank=0 on the same trainer endpoint causes save_weights_for_sampler_ext to hang indefinitely. This appears to be a trainer-side issue: two active sessions (one LoRA, one full-param) block checkpoint operations.

CI coverage gap

The existing smoke tests (test_grpo_smoke.py) use kl_beta=0 and lora_rank=0 — they don't cover the LoRA + reference model path.
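One way to close this gap is to parametrize the smoke tests over the full kl_beta / lora_rank matrix. A minimal sketch (the helper name and config fields are hypothetical, not taken from test_grpo_smoke.py):

```python
from itertools import product

def grpo_smoke_matrix(kl_betas=(0.0, 0.1), lora_ranks=(0, 16)):
    """Enumerate GRPO smoke-test configs so the LoRA + reference-model
    path (kl_beta > 0 with lora_rank > 0) is exercised, not just the
    kl_beta=0 / lora_rank=0 case the current tests cover."""
    return [
        {"kl_beta": kb, "lora_rank": lr, "needs_reference": kb > 0}
        for kb, lr in product(kl_betas, lora_ranks)
    ]
```

With the defaults this yields four configs, one of which (kl_beta=0.1, lora_rank=16) is exactly the path that is currently blocked.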

Cleaned up

  • Deleted 8 stale trainer jobs from aialabs account
  • Scaled down 3 test deployments to zero

Slack Thread


When the original RLOR trainer job has been deleted, `forge rft promote`
fails with HTTP 404 because it tries to list checkpoints on a job that
no longer exists.

This script works around the problem by:
1. Reading checkpoints.jsonl to find the DCP state_path for the target step
2. Spinning up a temporary service-mode RLOR trainer with the same base model
3. Loading the cross-job DCP checkpoint into the temporary trainer
4. Exporting a sampler (HF-format) checkpoint from the loaded weights
5. Promoting the sampler checkpoint to a deployable Fireworks model
6. Cleaning up (deleting the temporary trainer)
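Step 1 might be sketched as follows; the record field names (`step`, `state_path`) are assumptions about the checkpoints.jsonl layout, not confirmed from the script:

```python
import json

def find_state_path(jsonl_path: str, target_step: int) -> str:
    """Scan a checkpoints.jsonl manifest for the DCP state_path
    recorded at the requested training step."""
    with open(jsonl_path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            rec = json.loads(line)
            if rec.get("step") == target_step:
                return rec["state_path"]
    raise ValueError(f"no checkpoint entry for step {target_step} in {jsonl_path}")
```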

Supports:
- Auto-inference of training shape from base model name
- Direct promotion fallback when sampler_path exists
- Dry-run mode to preview the plan without creating resources
- Automatic cleanup of temporary trainer on success or failure
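The shape auto-inference could look like the sketch below. The mapping reuses the qwen3p5 LoRA shape names listed later in this thread; the matching rule (longest substring match on the base model name) is an assumption, not the script's actual logic:

```python
# Hypothetical map: base-model name fragment -> LoRA training shape.
SHAPE_MAP = {
    "qwen3p5-9b": "qwen3p5-9b-256k-lora",
    "qwen3p5-27b": "qwen3p5-27b-256k-lora",
    "qwen3p5-35b-a3b": "qwen3p5-35b-a3b-256k-lora",
    "qwen3p5-397b-a17b": "qwen3p5-397b-a17b-256k-lora",
}

def infer_training_shape(base_model: str) -> str:
    """Pick the training shape whose key fragment appears in the base
    model name, preferring the longest (most specific) match."""
    matches = [k for k in SHAPE_MAP if k in base_model]
    if not matches:
        raise ValueError(f"no training shape known for {base_model}")
    return SHAPE_MAP[max(matches, key=len)]
```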

Co-authored-by: zhiweiz <morgendave@users.noreply.github.com>

vercel Bot commented Apr 1, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| functional-chat | Ready | Preview, Comment | Apr 16, 2026 10:31pm |
| transcription-demo | Ready | Preview, Comment | Apr 16, 2026 10:31pm |


Root cause analysis:
- All Qwen3 (legacy) LoRA training shape versions have public=False
- This prevents external accounts (like AILabs) from resolving them
  via resolve_training_profile()
- The shapes exist and have latestValidated versions, but those
  versions aren't marked public
- Qwen 3.5 (qwen3p5-*) LoRA shapes are properly configured with
  public=True and work correctly

Failing shapes (public=False on latestValidated):
  - qwen3-4b-256k-h200-lora
  - qwen3-8b-256k-h200-lora
  - qwen3-4b-minimum-h200-lora
  - qwen3-4b-minimum-h200-forward-lora
  - qwen3-235b-2507-instruct-128k-b200-lora
  - qwen3-235b-2507-instruct-128k-b200-forward-only-lora

Working shapes (public=True):
  - qwen3p5-9b-256k-lora
  - qwen3p5-27b-256k-lora
  - qwen3p5-35b-a3b-256k-lora
  - qwen3p5-397b-a17b-256k-lora
  - qwen3-vl-8b-256k-h200-lora

Fix: Mark the latestValidated version as public=True for each
     failing legacy shape, or direct users to the new qwen3p5 shapes.
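A quick audit for this condition can be scripted. The sketch below assumes each shape is represented as a dict with a latestValidated version object carrying a public flag (field names inferred from the analysis above, not from an actual API response):

```python
def partition_by_visibility(shapes):
    """Split shapes into (resolvable, failing) by whether their
    latestValidated version is marked public -- the condition
    resolve_training_profile() needs to succeed from another account."""
    resolvable, failing = [], []
    for shape in shapes:
        version = shape.get("latestValidated") or {}
        (resolvable if version.get("public") else failing).append(shape["name"])
    return resolvable, failing
```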

Also updated cross_job_promote.py shape map to include Qwen 3.5 shapes.

Co-authored-by: zhiweiz <morgendave@users.noreply.github.com>
cursor Bot changed the title from "Add cross-job checkpoint promotion script" to "Add cross-job checkpoint promotion script and training shape E2E tests" on Apr 16, 2026
Updated test to focus on the actual customer issue: AILabs can't access
qwen3p5-*-lora shapes with a permission error.

Test now covers three phases:
1. Shape metadata validation (exists, trainerMode, versions, public/validated)
2. Dependency chain access (base model, deployment shape, deployment
   shape version — all must be accessible and public)
3. Job lifecycle E2E (opt-in: creates and deletes a real RLOR trainer job)

All qwen3p5 LoRA shapes pass from the pyroworks account:
- Shape versions have latestValidated=True and public=True
- Deployment shape versions are public and validated
- Base models are READY and accessible
- Job creation succeeds

The permission issue for AILabs is likely at the account level:
- Accelerator quota (B200/B300 access)
- Service-mode RLOR permission
- Or SDK version mismatch

Run with AILabs API key to reproduce:
  FIREWORKS_API_KEY=<ailabs-key> python test_qwen3_lora_shapes.py -v

Co-authored-by: zhiweiz <morgendave@users.noreply.github.com>
Now uses the actual fireworks.training.sdk to exercise:
1. resolve_training_profile() - the exact API the SDK calls
   (GET /versions?filter=latest_validated=true&pageSize=1)
2. TrainerJobConfig validation with shape ref
3. Job lifecycle via mgr.create() with ?trainingShape= query param

Key findings from reading the SDK source:
- resolve_training_profile() filters versions with
  'latest_validated=true' server-side
- If the server returns no versions, it raises:
  'No latest validated training-shape version was returned'
- Job creation passes the shape ref as a QUERY PARAMETER
  (?trainingShape=...), not in the request body
- The server does shape-to-infra resolution and validation
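The two request shapes from those findings can be illustrated with plain URL construction. This is a sketch: the base URL and resource path names are illustrative assumptions, only the query-parameter behavior is taken from the findings above:

```python
from urllib.parse import urlencode

BASE = "https://api.fireworks.ai/v1"  # assumed base URL

def versions_request_url(shape_ref: str) -> str:
    """The resolve_training_profile() lookup: list a shape's versions,
    filtered server-side to the latest validated one."""
    query = urlencode({"filter": "latest_validated=true", "pageSize": 1})
    return f"{BASE}/{shape_ref}/versions?{query}"

def create_job_url(account: str, shape_ref: str) -> str:
    """Job creation: the shape ref travels as a query parameter
    (?trainingShape=...), not in the request body."""
    query = urlencode({"trainingShape": shape_ref})
    return f"{BASE}/accounts/{account}/trainerJobs?{query}"
```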

All 4 qwen3p5-*-lora shapes pass resolution and config validation.
The permission issue for AILabs is server-side — either:
- The 'latest_validated=true' filter returns empty for their account
- Or the job creation with ?trainingShape= fails at server validation

Run with any account's API key to diagnose:
  FIREWORKS_API_KEY=<key> python test_qwen3_lora_shapes.py -v

Co-authored-by: zhiweiz <morgendave@users.noreply.github.com>
Co-authored-by: zhiweiz <morgendave@users.noreply.github.com>