@sjmiller609 (Collaborator) commented Jan 30, 2026

  • Use the lib/resources module for checking if a starting or creating VM should be accepted
  • Clean up mdev for stopped instances
  • Fail on standby request if vGPU is enabled
  • Include disk I/O in the resources API (accidentally omitted previously)

Note

Medium Risk
Touches core instance lifecycle (create/start/stop/restore/standby) and resource admission logic, which can prevent workloads from starting if misconfigured; changes are largely additive and surfaced via explicit 409 errors.

Overview
Instance admission now uses lib/resources for aggregate capacity checks. instances.Manager gains a pluggable ResourceValidator (wired in cmd/api/main.go) and CreateInstance, StartInstance, and RestoreInstance reject requests when CPU/memory/network/disk I/O/GPU capacity is insufficient, surfacing ErrInsufficientResources.
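As a rough illustration of the pluggable validator described above, here is a minimal Go sketch. Only the names ResourceValidator, ErrInsufficientResources, CreateInstance, and Manager come from this PR; the Request struct, the Validate method, and the toy capacityValidator are assumptions made for the example and are not taken from lib/resources.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrInsufficientResources mirrors the error name mentioned in the PR summary.
var ErrInsufficientResources = errors.New("insufficient resources")

// Request is a hypothetical bundle of what one instance asks for.
type Request struct {
	VCPUs       int
	MemoryBytes int64
	DiskIOBps   int64
}

// ResourceValidator is the pluggable hook; this method set is an assumption.
type ResourceValidator interface {
	Validate(req Request) error
}

// capacityValidator is a toy validator that compares a request to free capacity.
type capacityValidator struct {
	free Request
}

func (v *capacityValidator) Validate(req Request) error {
	if req.VCPUs > v.free.VCPUs || req.MemoryBytes > v.free.MemoryBytes || req.DiskIOBps > v.free.DiskIOBps {
		return fmt.Errorf("%w: requested %+v, free %+v", ErrInsufficientResources, req, v.free)
	}
	return nil
}

// Manager stands in for instances.Manager; a nil validator skips admission checks.
type Manager struct {
	validator ResourceValidator
}

// CreateInstance runs the admission check before doing any real work.
func (m *Manager) CreateInstance(req Request) error {
	if m.validator != nil {
		if err := m.validator.Validate(req); err != nil {
			return err
		}
	}
	// Actual instance creation would happen here.
	return nil
}

func main() {
	m := &Manager{validator: &capacityValidator{
		free: Request{VCPUs: 4, MemoryBytes: 8 << 30, DiskIOBps: 100 << 20},
	}}
	fmt.Println(m.CreateInstance(Request{VCPUs: 2, MemoryBytes: 4 << 30, DiskIOBps: 10 << 20}))
	err := m.CreateInstance(Request{VCPUs: 8, MemoryBytes: 4 << 30})
	fmt.Println(errors.Is(err, ErrInsufficientResources))
}
```

Keeping the validator behind an interface lets tests inject a stub while the real wiring stays in cmd/api/main.go, as the summary describes.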

API behavior changes: POST /instances now returns 409 for insufficient resources, StartInstance returns 409 with insufficient_resources, and the OpenAPI client/spec are updated accordingly. The resources endpoint now includes disk I/O capacity/status and per-instance disk_io_bps allocations.
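One way the 409 mapping could look on the server side is sketched below in Go. The 409 status and the insufficient_resources code come from the summary above; the apiError body shape, the statusFor and writeError helpers, and the internal_error fallback code are hypothetical and may not match the real OpenAPI spec.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// ErrInsufficientResources mirrors the error name mentioned in the PR summary.
var ErrInsufficientResources = errors.New("insufficient resources")

// apiError is a hypothetical error body; the real schema lives in the OpenAPI spec.
type apiError struct {
	Code    string `json:"code"`
	Message string `json:"message"`
}

// statusFor maps a domain error to an HTTP status and a machine-readable code.
func statusFor(err error) (int, string) {
	if errors.Is(err, ErrInsufficientResources) {
		return http.StatusConflict, "insufficient_resources"
	}
	// Fallback code is an assumption for this sketch.
	return http.StatusInternalServerError, "internal_error"
}

// writeError serializes the mapped error as JSON.
func writeError(w http.ResponseWriter, err error) {
	status, code := statusFor(err)
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(status)
	_ = json.NewEncoder(w).Encode(apiError{Code: code, Message: err.Error()})
}

func main() {
	rec := httptest.NewRecorder()
	writeError(rec, fmt.Errorf("start instance: %w", ErrInsufficientResources))
	fmt.Println(rec.Code, rec.Body.String())
}
```

Using errors.Is rather than direct comparison keeps the mapping working when the manager wraps the sentinel error with extra context.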

vGPU lifecycle and state rules tightened: standby is blocked for vGPU instances, vGPU mdevs are recreated on start and destroyed/cleared on stop, and allocations tracking now includes DiskIOBps. Legacy aggregate CPU/memory env limits and related instance-side aggregate limit code/tests are removed in favor of oversubscription ratios.
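The vGPU state rules above can be sketched as a small in-memory model in Go. Everything here is hypothetical (the Instance fields, the ErrStandbyWithVGPU name, and the createMdev/destroyMdev callbacks); the sketch only illustrates the ordering the summary describes, not the project's actual mdev handling.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrStandbyWithVGPU is a hypothetical error for the blocked transition.
var ErrStandbyWithVGPU = errors.New("standby is not supported for vGPU instances")

// Instance is a simplified stand-in for the real instance record.
type Instance struct {
	State       string // "running", "stopped", or "standby"
	VGPUProfile string // empty means no vGPU was requested
	MdevUUID    string // set only while an mdev exists for this instance
}

// Standby refuses vGPU instances and leaves their state untouched.
func Standby(inst *Instance) error {
	if inst.VGPUProfile != "" {
		return ErrStandbyWithVGPU
	}
	inst.State = "standby"
	return nil
}

// Stop destroys the mdev (if any) and clears the stored UUID.
func Stop(inst *Instance, destroyMdev func(uuid string) error) error {
	if inst.MdevUUID != "" {
		if err := destroyMdev(inst.MdevUUID); err != nil {
			return fmt.Errorf("destroy mdev %s: %w", inst.MdevUUID, err)
		}
		inst.MdevUUID = ""
	}
	inst.State = "stopped"
	return nil
}

// Start recreates the mdev for vGPU instances before marking them running.
func Start(inst *Instance, createMdev func(profile string) (string, error)) error {
	if inst.VGPUProfile != "" {
		uuid, err := createMdev(inst.VGPUProfile)
		if err != nil {
			return fmt.Errorf("create mdev for profile %s: %w", inst.VGPUProfile, err)
		}
		inst.MdevUUID = uuid
	}
	inst.State = "running"
	return nil
}

func main() {
	inst := &Instance{State: "running", VGPUProfile: "example-profile", MdevUUID: "mdev-1"}
	fmt.Println(Standby(inst))
	_ = Stop(inst, func(string) error { return nil })
	fmt.Printf("%q %q\n", inst.State, inst.MdevUUID)
	_ = Start(inst, func(string) (string, error) { return "mdev-2", nil })
	fmt.Printf("%q %q\n", inst.State, inst.MdevUUID)
}
```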

Written by Cursor Bugbot for commit ca5e720. This will update automatically on new commits.

@github-actions bot commented Jan 30, 2026

✱ Stainless preview builds

This PR will update the hypeman SDKs with the following commit message.

feat: Use resources module for input validation

⚠️ hypeman-typescript

There was a regression in your SDK.
generate ⚠️ · build ✅ · lint ✅ · test ✅

npm install https://pkg.stainless.com/s/hypeman-typescript/8dc08a468794e1fead711c5c1b538d02cd7f49c9/dist.tar.gz

⚠️ hypeman-go

There was a regression in your SDK.
generate ⚠️ · lint ✅ · test ✅

go get github.com/stainless-sdks/hypeman-go@af678e8c794307a6bd47476acff3ca42a7a52546

⚠️ hypeman-cli (conflict)

There was a regression in your SDK.

Last updated: 2026-01-30 20:24:40 UTC

@sjmiller609 marked this pull request as ready for review January 30, 2026 18:14
@sjmiller609 changed the title from "Use resources module for input validation" to "fix: resource limits for starting instances" Jan 30, 2026

@cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

@hiroTamada (Contributor) left a comment

Looks good overall - nice refactor to centralize resource validation in lib/resources instead of the ad-hoc aggregate checks.

I agree with Bugbot's findings, which should be addressed (follow-up PRs are fine):

  1. High: Missing network/diskIO limits on start - startInstance passes 0, 0, 0 for network download/upload and disk I/O instead of the stored values. Restarted instances won't have their bandwidth validated against capacity.

  2. Medium: RestoreInstance missing validation - restoreInstance doesn't call validateResourceAllocation, allowing standby→running transitions to bypass capacity checks.

These are edge cases (start from stopped, restore from standby) but could lead to oversubscription if resource availability changed while the instance was down.
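To make the two findings concrete, here is a hedged Go sketch of the difference between validating with zeroed I/O values and validating with the stored allocation. validateResourceAllocation, startInstance, and restoreInstance are named in this review, but the signatures, the Allocation struct, and the capacity comparison below are assumptions for illustration only.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrInsufficientResources mirrors the error name mentioned in the PR summary.
var ErrInsufficientResources = errors.New("insufficient resources")

// Allocation holds per-instance values persisted at create time (field names assumed).
type Allocation struct {
	VCPUs              int
	MemoryBytes        int64
	NetworkDownloadBps int64
	NetworkUploadBps   int64
	DiskIOBps          int64
}

// validateResourceAllocation compares a wanted allocation to free capacity.
// The signature and body are assumptions, not the project's implementation.
func validateResourceAllocation(free, want Allocation) error {
	switch {
	case want.VCPUs > free.VCPUs:
		return fmt.Errorf("%w: cpu", ErrInsufficientResources)
	case want.MemoryBytes > free.MemoryBytes:
		return fmt.Errorf("%w: memory", ErrInsufficientResources)
	case want.NetworkDownloadBps > free.NetworkDownloadBps:
		return fmt.Errorf("%w: network download", ErrInsufficientResources)
	case want.NetworkUploadBps > free.NetworkUploadBps:
		return fmt.Errorf("%w: network upload", ErrInsufficientResources)
	case want.DiskIOBps > free.DiskIOBps:
		return fmt.Errorf("%w: disk I/O", ErrInsufficientResources)
	}
	return nil
}

// startInstanceZeroed reproduces the reported pattern: I/O dimensions are passed as 0.
func startInstanceZeroed(free, stored Allocation) error {
	return validateResourceAllocation(free, Allocation{
		VCPUs:       stored.VCPUs,
		MemoryBytes: stored.MemoryBytes,
	})
}

// startInstance validates every stored dimension.
func startInstance(free, stored Allocation) error {
	return validateResourceAllocation(free, stored)
}

// restoreInstance applies the same check on the standby-to-running path.
func restoreInstance(free, stored Allocation) error {
	return validateResourceAllocation(free, stored)
}

func main() {
	free := Allocation{VCPUs: 4, MemoryBytes: 8 << 30, NetworkDownloadBps: 1 << 30, NetworkUploadBps: 1 << 30, DiskIOBps: 50 << 20}
	stored := Allocation{VCPUs: 2, MemoryBytes: 2 << 30, DiskIOBps: 100 << 20}
	fmt.Println("zeroed:", startInstanceZeroed(free, stored))
	fmt.Println("stored:", startInstance(free, stored))
	fmt.Println("restore:", restoreInstance(free, stored))
}
```

In this toy setup the zeroed variant admits an instance whose stored disk I/O allocation exceeds free capacity, which is the oversubscription risk described above.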

@sjmiller609 merged commit cbb694a into main Jan 30, 2026
3 of 4 checks passed
@sjmiller609 deleted the stop-fixes branch January 30, 2026 20:22