Fix model registration #529

Kovbo · 2026-01-21T22:02:38Z

When I start a new training job with a new model, it gives me the error:

  File "/home/sky/sky_workdir/.venv/lib/python3.11/site-packages/openai/_base_client.py", line 1597, in request
    raise self._make_status_error_from_response(err.response) from None
openai.NotFoundError: Error code: 404 - {'error': {'message': 'The model `0115` does not exist.', 'type': 'NotFoundError', 'param': None, 'code': 404}}

This looks related to "Multi-checkpoint inference for pipelined training".

The issue:
The server registers the checkpoint under model_name@step:

  lora_name = f"{model_name}@{step}"  # "0115@0"
  lora_modules = [f'{{"name": "{lora_name}", ...}}']

But the training script's client uses model.name (just "0115"), causing:
openai.NotFoundError: The model `0115` does not exist.

The server had "0115@0" registered, but the client asked for "0115".
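
The mismatch can be reproduced in isolation with a small sketch. The `registered_names` set and the `resolve` helper are hypothetical stand-ins for the server's lookup, not the project's actual code:

```python
# Hypothetical stand-in for the server's table of registered adapter names.
model_name, step = "0115", 0
registered_names = {f"{model_name}@{step}"}  # only "0115@0" is registered


def resolve(requested: str) -> str:
    """Mimic the server's lookup: unknown names produce a 404-style error."""
    if requested not in registered_names:
        raise LookupError(f"The model `{requested}` does not exist.")
    return requested


print(resolve("0115@0"))  # the step-qualified name resolves
try:
    resolve(model_name)  # the plain name, as the training client sends it
except LookupError as err:
    print(err)
```

The second call fails because the plain name was never added to the set, which is the same situation the 404 above reports.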

Potential fix:
Register the checkpoint under both names, so that:

Training works (client uses model_name)
Multi-checkpoint support preserved (can still access model_name@step)
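
A minimal sketch of the dual registration, assuming each entry is a JSON object keyed by name. The `build_lora_modules` helper, the `path` key, and the checkpoint path are illustrative assumptions and may not match the merged change:

```python
import json


def build_lora_modules(model_name: str, step: int, checkpoint_path: str) -> list[str]:
    """Return JSON-encoded registration entries for one checkpoint.

    The "path" key and this helper are assumptions made for illustration.
    """
    names = [
        model_name,              # plain name used by the training client
        f"{model_name}@{step}",  # step-qualified name for multi-checkpoint access
    ]
    return [json.dumps({"name": name, "path": checkpoint_path}) for name in names]


modules = build_lora_modules("0115", 0, "/tmp/checkpoints/0000")
registered = [json.loads(m)["name"] for m in modules]
print(registered)
```

Both entries point at the same checkpoint, so a request for either "0115" or "0115@0" would resolve to the same weights.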

Fix model registration

76e1c2b

Kovbo requested review from bradhilton and corbt January 21, 2026 22:02

bradhilton approved these changes Jan 21, 2026


bradhilton merged commit c2f0e39 into main Jan 23, 2026
2 checks passed

bradhilton deleted the fix/new-model-registration branch January 23, 2026 03:36
