Skip to content

feat(instances): add ResizeInstance for CPU/memory resizing#183

Merged
poyrazK merged 76 commits intomainfrom
fix/resize-quota-bug
May 2, 2026
Merged

feat(instances): add ResizeInstance for CPU/memory resizing#183
poyrazK merged 76 commits intomainfrom
fix/resize-quota-bug

Conversation

@poyrazK
Copy link
Copy Markdown
Owner

@poyrazK poyrazK commented Apr 26, 2026

Summary

  • Add POST /instances/:id/resize endpoint to resize running/stopped instances to different instance types
  • Multi-backend support: Docker (warm resize via ContainerUpdate) and Libvirt (cold resize via domain XML redefine)
  • Delta-based quota enforcement: CheckQuota for upsize, DecrementUsage for downsize
  • Proper error propagation: QuotaExceeded returns 429, quota decrement failures no longer silently accumulate
  • Context timeouts on Libvirt operations to prevent unbounded waits
  • Comprehensive test coverage: unit, handler, and E2E tests
  • Swagger docs updated with 429 response code

Breaking Change

Response format change (for API consumers):

The success response for POST /instances/:id/resize now includes additional fields:

  • Old response: {"message": "instance resized"}
  • New response: {"message": "instance resized", "instance_type": "...", "status": "..."}

Clients should be updated to handle the new instance_type and status fields.

Test plan

  • All handler tests pass
  • All unit tests pass
  • E2E tests pass (requires Docker)
  • Race detection tests pass
  • CI green on feat/instance-resize

Summary by CodeRabbit

  • New Features

    • Resize success responses now include the requested instance type and instance status.
  • Bug Fixes

    • Improved resize reliability with stronger quota handling, rollback behavior, and error aggregation.
    • Added timeout protection for certain backend resize operations to avoid indefinite waits.
  • Documentation

    • API docs updated: quota failures map to HTTP 429 (Too Many Requests).
    • Docs clarify libvirt-backed resizes may need a brief cold restart while container-backed resizes can be live.
  • Tests

    • Resize tests expanded and adjusted to cover new success/failure and rollback scenarios.

poyrazK added 30 commits April 26, 2026 18:21
Add PermissionInstanceResize RBAC permission and ResizeInstance
method to both ComputeBackend and InstanceService interfaces.
ResizeInstance(ctx, id, cpu, memory) allows changing the
CPU/memory of a running or stopped instance.
Adds ContainerUpdate to dockerClient interface and implements
ResizeInstance which calls docker client.ContainerUpdate with
cpu (NanoCPUs) and memory (bytes) limits for live resize.
Implements ResizeInstance for Libvirt using a cold resize:
stop → update domain XML (memory/vcpu) → start.
Uses regex replacement on domain XML to update <memory> and
<vcpu> elements, then redefines and restarts the domain.
Resolves instance by id/name, validates target instance type,
checks quota delta (new-old resources for vcpus/memory),
calls compute.ResizeInstance, updates instance record,
and logs the resize operation. Decrements quota on downscale.
POST /api/v1/instances/:id/resize accepts { instance_type: "basic-4" }
and calls service.ResizeInstance. Requires PermissionInstanceResize.
Firecracker returns "resize not supported" error.
Noop backend returns nil (no-op).
Add ResizeInstance to MockInstanceService, MockComputeBackend,
instanceServiceMock (handler tests), and mockCompute (libvirt).
Fixes lint failure where k8s test mocks did not implement the new
ResizeInstance method added to ports.InstanceService.
Fixes lint failures in k8s and workers test files where mocks did not
implement the ResizeInstance method added to ports.InstanceService and
ports.ComputeBackend interfaces.
Add comprehensive test coverage for the ResizeInstance feature:

- Unit tests: 12 subtests covering upsize, downsize, same-size,
  by-uuid lookup, not found, type validation, quota enforcement,
  compute errors, record errors, and RBAC failures
- Handler tests: 6 subtests covering success, invalid ID, invalid
  body, empty instance type, not found, and quota exceeded
- E2E tests: upsize, downsize, and invalid type scenarios

E2E tests require a running server with Docker backend.
- Document POST /instances/:id/resize in api-reference.md
- Update FEATURES.md instance lifecycle description
- Add ADR-025 documenting the resize decision (cold vs warm,
  quota delta enforcement, Libvirt vs Docker strategy)
The resize endpoint returns {"message": "..."}, not the instance
data. Verification happens via GET /instances/:id after resize.
basic-4 doesn't exist in the seeded instance_types table.
Resize upsize: basic-2 → standard-1 (1→2 vCPU, 1024→2048MB).
Resize downsize: standard-1 → basic-2 (2→1 vCPU, 2048→1024MB).
Adds instance:resize permission and seeds it to developer role so
that users with developer role can resize instances.

File: internal/repositories/postgres/migrations/107_seed_instance_resize_permission.up.sql
File: internal/repositories/postgres/migrations/107_seed_instance_resize_permission.down.sql
…esize

- Pass vCPU count (cpu/1e9) instead of NanoCPUs to applyDomainResize
- Only call DomainCreate if instance was originally running, preserving
  stopped state after resize

File: internal/repositories/libvirt/adapter.go
File: internal/repositories/firecracker/adapter_noop.go
- Add same-size short-circuit to avoid unnecessary compute calls
- Validate Status and ContainerID before proceeding with resize
- Use deltaMemMB in MB (not GB) to preserve sub-GB changes
- Preserve QuotaExceeded error kind instead of wrapping as Forbidden
- Capture quota errors and log failures after successful resize
- Add rollback: revert compute and undo quota on repo update failure
- Record INSTANCE_RESIZE event for audit trail

File: internal/core/services/instance.go
… coverage

- Remove uuid.Parse block so handler passes raw idStr to service
- Fix binding error to use errors.New instead of errors.Wrap
- Update InvalidID test to verify name-based resize works
- Add response body assertion to Success test

File: internal/handlers/instance_handler.go
File: internal/handlers/instance_handler_test.go
- Add 401/403 responses and minLength:1 to instance_type in swagger.json
- Update unit tests: add eventSvc mock, fix memory quota values (MB not GB),
  add Status field to test instances, add rollback expectations for
  UpdateRecordError test
- Add containerUpdateResp and containerUpdateErr to fakeDockerClient for
  error/warning path testing

File: docs/swagger/swagger.json
File: internal/core/services/instance_unit_test.go
File: internal/repositories/docker/fakes_test.go
applyDomainResize now uses encoding/xml to unmarshal domain XML into
a struct, modify memory/vcpu fields, and marshal back. This is more
robust than regex replacement which breaks on XML format changes.

File: internal/repositories/libvirt/adapter.go
Replaces 4 identical polling loops with a shared helper function that
accepts desired status and timeout. Removes ~70 lines of duplicated
code across TestComputeE2E and TestResizeInstance tests.

File: tests/compute_e2e_test.go
The 107_seed_instance_resize_permission migration incorrectly referenced
a non-existent "permissions" table and used permission_id column that
doesn't exist in role_permissions. The role_permissions table uses a
"permission" TEXT column (not permission_id UUID). This fixes the migration
to match the actual schema defined in 027_create_rbac.up.sql.
The swagger generator removed minLength:1 from instance_type since the
handler validation now validates empty string separately. This brings
docs in sync with the actual generated swagger output.
Two changes to address gocyclo lint failure (>30 complexity threshold):
1. Return error when quota updates fail post-resize instead of just
   logging (early return reduces complexity)
2. The thelper lint was already fixed via t.Helper() in prior session

This keeps ResizeInstance below the 30-complexity guard while preserving
all rollback and error-handling behavior.
…omplexity

Breaks ResizeInstance into focused helpers: resolveInstance,
resolveInstanceTypes, validateResize, executeResize, completeResize.
Each helper handles one concern, keeping ResizeInstance's main flow
at ~10 branches (below 30 threshold).

Also removes unused `desired` parameter from waitForInstanceStatus
since all callers pass domain.StatusRunning (fixes unparam lint).
All callers pass the same 90-second timeout, so hardcode it directly
in the helper to satisfy unparam lint.
The QuotaExceeded_CPU and QuotaExceeded_Memory tests were missing
Compute mock expectations, causing nil pointer dereference panics
when executeResize tried to call Compute.ResizeInstance.
1. Magic constants: Add NanoCPUsPerVCPU (1e9) and BytesPerMB (1024*1024)
   constants, replace raw literals in executeResize, completeResize, and rollback

2. Lookup error masking: Only fall through to UUID lookup when GetByName
   returns a NotFound error (not arbitrary errors)

3. Handler empty-id guard: Add id != "" check in ResizeInstance handler,
   consistent with Stop/Get/Terminate

4. Log key snake_case: Fix memory_ki_b → memory_kib in domain resize log

5. DB rollback: Stop discarding errors from compensating actions; add
   rollbackErrs slice with structured error messages; only attempt quota
   reversal when forward increment succeeded (cpuIncremented/memIncremented
   flags); compose and return original update error + rollback errors

6. applyDomainResize XML: Remove struct-based marshal/unmarshal (stripped
   elements); revert to targeted regex replacements preserving all other
   XML elements/attributes; remove unused domainXML struct and encoding/xml
   import

7. Down migration: Restrict DELETE to same role targeted by up migration
   (role = 'developer') instead of deleting for all roles

8. Undefine/define order: Add rollback DomainDefineXML on define failure
   to prevent permanent domain loss if redefine fails
Copilot AI review requested due to automatic review settings April 29, 2026 17:16
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copilot AI review requested due to automatic review settings April 29, 2026 18:18
@poyrazK poyrazK force-pushed the fix/resize-quota-bug branch from 798ed07 to 532a03a Compare April 29, 2026 18:18
@poyrazK poyrazK force-pushed the fix/resize-quota-bug branch from 532a03a to 588f74f Compare April 29, 2026 18:30
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copilot AI review requested due to automatic review settings April 30, 2026 09:32
# Conflicts:
#	internal/repositories/libvirt/adapter.go
@poyrazK poyrazK force-pushed the fix/resize-quota-bug branch from 6337712 to 9924cac Compare April 30, 2026 09:34
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 5 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.


// 3. DB update
inst.InstanceType = newInstanceType
inst.Version++
Comment on lines 827 to +830
deltaMemMB := newIT.MemoryMB - oldIT.MemoryMB
memoryGB := deltaMemMB / 1024

// 1. Quota changes first — fail fast before any VM state change
Comment on lines 256 to 270
if wasRunning {
if err := a.client.DomainCreate(ctx, newDom); err != nil {
if rbErr := a.rollbackFromSnapshot(ctx, id, snapshotName, domXML); rbErr != nil {
return fmt.Errorf("failed to start domain after resize (instance_id=%s): %w", id, errors.Join(err, rbErr))
// Rollback: undefine the new definition and restore the original domain.
undefineErr := a.client.DomainUndefine(ctx, newDom)
if undefineErr != nil {
a.logger.Error("failed to undefine new domain after DomainCreate failure", "domain", id, "error", undefineErr)
}
restoredDom, rollbackErr := a.client.DomainDefineXML(ctx, domXML)
if rollbackErr != nil {
return fmt.Errorf("failed to start domain after resize (instance_id=%s), rollback also failed: original error: %w; rollback error: %w", id, err, rollbackErr)
}
if restartErr := a.client.DomainCreate(ctx, restoredDom); restartErr != nil {
return fmt.Errorf("failed to start domain after resize (instance_id=%s), rollback redef succeeded but restart failed: %w", id, restartErr)
}
return fmt.Errorf("failed to start domain after resize: %w", err)
Comment thread tests/compute_e2e_test.go
Comment on lines 154 to 160
t.Run("TerminateInstance", func(t *testing.T) {
if !instanceReady {
t.Skip("Instance did not reach RUNNING state, skipping terminate")
}
resp := deleteRequest(t, client, fmt.Sprintf(testutil.TestRouteFormat, testutil.TestBaseURL, testutil.TestRouteInstances, instanceID), token)
defer func() { _ = resp.Body.Close() }()

Comment thread tests/compute_e2e_test.go
Comment on lines 256 to 263
// 5. Terminate Instance
t.Run("TerminateInstance", func(t *testing.T) {
if !instanceReady {
t.Skip("Instance did not reach RUNNING state, skipping terminate")
}
resp := deleteRequest(t, client, fmt.Sprintf(testutil.TestRouteFormat, testutil.TestBaseURL, testutil.TestRouteInstances, instanceID), token)
defer func() { _ = resp.Body.Close() }()

poyrazK added 2 commits April 30, 2026 12:59
- Add ResizeInstanceResponse struct with message, instance_type, and status fields
- Update swagger annotation to use new response schema
- Regenerate swagger docs
Copilot AI review requested due to automatic review settings April 30, 2026 10:12
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@poyrazK poyrazK force-pushed the fix/resize-quota-bug branch from 4e65712 to 66a5495 Compare April 30, 2026 12:40
poyrazK added 3 commits April 30, 2026 15:44
# Conflicts:
#	internal/api/setup/router.go
#	internal/core/services/notify_unit_test.go
#	internal/repositories/libvirt/adapter.go
#	internal/repositories/noop/adapters.go
Copilot AI review requested due to automatic review settings May 2, 2026 10:47
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copy link
Copy Markdown
Owner Author

@poyrazK poyrazK left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's okay to merge

@poyrazK poyrazK merged commit fbfd467 into main May 2, 2026
23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants