@Kovbo Kovbo commented Jan 21, 2026

Starting a new training job with a new model fails with this error:

  File "/home/sky/sky_workdir/.venv/lib/python3.11/site-packages/openai/_base_client.py", line 1597, in request
    raise self._make_status_error_from_response(err.response) from None
openai.NotFoundError: Error code: 404 - {'error': {'message': 'The model `0115` does not exist.', 'type': 'NotFoundError', 'param': None, 'code': 404}}

It looks like this is related to Multi-checkpoint inference for pipelined training.

The issue:
The server registers the checkpoint only under model_name@step:

  lora_name = f"{model_name}@{step}"  # "0115@0"
  lora_modules = [f'{{"name": "{lora_name}", ...}}']

But the training script's client uses model.name (just "0115"), causing:
openai.NotFoundError: The model `0115` does not exist.

The server had "0115@0" registered, but the client asked for "0115".

Potential fix:
Register the checkpoint under both names, so that:

  • Training works (client uses model_name)
  • Multi-checkpoint support preserved (can still access model_name@step)
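
A minimal sketch of the dual registration described above, not the merged change itself. The helper name build_lora_modules, the "path" field, and the example checkpoint path are assumptions for illustration; the original snippet elides every field after "name".

```python
import json


def build_lora_modules(model_name: str, step: int, checkpoint_path: str) -> list[str]:
    """Return one JSON module spec per name the checkpoint should answer to.

    Hypothetical helper: the "path" field is assumed, since the original
    snippet elides everything after "name".
    """
    names = [
        f"{model_name}@{step}",  # step-qualified name, e.g. "0115@0"
        model_name,              # bare name the training client requests, e.g. "0115"
    ]
    # json.dumps avoids the manual brace escaping needed in an f-string.
    return [json.dumps({"name": name, "path": checkpoint_path}) for name in names]


# Example values only; the checkpoint path is made up.
modules = build_lora_modules("0115", 0, "/tmp/checkpoints/0000")
registered = {json.loads(m)["name"] for m in modules}
```

With both entries present, a request for "0115" and a request for "0115@0" would each find a registered name, assuming the server matches the requested model against the "name" field.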

@Kovbo Kovbo requested review from bradhilton and corbt January 21, 2026 22:02
@bradhilton bradhilton merged commit c2f0e39 into main Jan 23, 2026
2 checks passed
@bradhilton bradhilton deleted the fix/new-model-registration branch January 23, 2026 03:36