Skip to content

feat: snapshot-before-resize for Libvirt cold resize#182

Merged
poyrazK merged 14 commits intomainfrom
feat/snapshot-before-resize
Apr 29, 2026
Merged

feat: snapshot-before-resize for Libvirt cold resize#182
poyrazK merged 14 commits intomainfrom
feat/snapshot-before-resize

Conversation

@poyrazK
Copy link
Copy Markdown
Owner

@poyrazK poyrazK commented Apr 25, 2026

Summary

Add snapshot-before-resize safety net for Libvirt/KVM cold resize operations.

  • Add CreateSnapshot, RestoreSnapshot, DeleteSnapshot to ComputeBackend interface
  • Libvirt adapter implements QCOW2 volume-level snapshots for rollback on resize failure
  • Docker/Firecracker backends implement as no-ops (hot resize doesn't need snapshots)
  • ResizeInstance in Libvirt adapter creates a pre-resize snapshot, rolls back on failure, deletes on success

Test plan

  • Unit tests pass
  • Build compiles
  • Manual test with Libvirt instance (cold resize path)
  • Verify rollback works when resize fails mid-way

Related

ADR-025 flagged cold resize risk; this implements the snapshot-before-resize pattern.

Summary by CodeRabbit

  • New Features

    • Added instance resizing capability via new POST /instances/:id/resize endpoint
    • Supports changing CPU and memory allocation for running and stopped instances
    • Includes quota enforcement and permission-based access control
  • Documentation

    • Updated API reference and architecture documentation with resize operation details
    • Added RBAC permission for instance resize operations
  • Tests

    • Added comprehensive unit and end-to-end tests for resize functionality

Copilot AI review requested due to automatic review settings April 25, 2026 18:32
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 25, 2026

Warning

Rate limit exceeded

@poyrazK has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 33 minutes and 18 seconds before requesting another review.

To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: bbe3a384-a9dd-4ca0-9bd6-3de4804d1afe

📥 Commits

Reviewing files that changed from the base of the PR and between e455525 and e786c4e.

📒 Files selected for processing (14)
  • docs/adr/ADR-025-instance-resize.md
  • internal/core/ports/compute.go
  • internal/core/services/mock_compute_test.go
  • internal/platform/resilient_compute.go
  • internal/platform/resilient_compute_test.go
  • internal/repositories/docker/adapter.go
  • internal/repositories/firecracker/adapter.go
  • internal/repositories/firecracker/adapter_noop.go
  • internal/repositories/libvirt/adapter.go
  • internal/repositories/libvirt/adapter_unit_test.go
  • internal/repositories/libvirt/lb_proxy_test.go
  • internal/repositories/noop/adapters.go
  • internal/workers/database_failover_worker_test.go
  • internal/workers/pipeline_worker_test.go
📝 Walkthrough

Walkthrough

This PR implements instance resizing functionality, adding a POST /instances/:id/resize API endpoint that allows changing CPU/memory allocation for running or stopped instances. Implementation includes service layer logic with quota enforcement and rollback, compute backend adapters for Docker and Libvirt, RBAC authorization, database migrations, and comprehensive test coverage.

Changes

Cohort / File(s) Summary
Documentation
docs/FEATURES.md, docs/api-reference.md, docs/adr/ADR-025-instance-resize.md, docs/swagger/docs.go, docs/swagger/swagger.json, docs/swagger/swagger.yaml
Introduces API documentation for instance resizing endpoint, including request/response schemas, error codes, RBAC permission (instance:resize), and architectural decision record detailing cold resize for Libvirt, warm resize for Docker, quota enforcement, and rollback procedures.
RBAC Authorization
internal/core/domain/rbac.go
Adds PermissionInstanceResize constant for instance resize authorization.
Service Layer
internal/core/services/instance.go, internal/core/services/instance_unit_test.go
Implements ResizeInstance method with RBAC checks, instance resolution, instance-type validation, quota enforcement (delta-based for upsize/downsize), compute resize delegation, persistent storage update, and comprehensive rollback on failure. Includes 569 lines of unit tests covering success paths, quota scenarios, error cases, and event recording.
Port Definitions
internal/core/ports/compute.go, internal/core/ports/instance.go
Extends ComputeBackend interface with ResizeInstance and snapshot lifecycle methods (CreateSnapshot, RestoreSnapshot, DeleteSnapshot); extends InstanceService with ResizeInstance method.
HTTP Layer
internal/api/setup/router.go, internal/handlers/instance_handler.go, internal/handlers/instance_handler_test.go
Registers POST /instances/:id/resize route with PermissionInstanceResize guard; implements handler validating required instance_type field and delegating to service; includes 113 lines of handler tests covering success, validation, and error paths.
Resilient Adapter
internal/platform/resilient_compute.go, internal/platform/resilient_compute_test.go
Wraps ResizeInstance and snapshot methods with bulkhead, circuit breaker, and timeout protection via callProtected pipeline.
Backend Implementations
internal/repositories/docker/adapter.go, internal/repositories/docker/fakes_test.go, internal/repositories/libvirt/adapter.go, internal/repositories/libvirt/adapter_unit_test.go, internal/repositories/firecracker/adapter.go, internal/repositories/firecracker/adapter_noop.go
Docker: implements resize via ContainerUpdate and snapshot methods as no-ops. Libvirt: implements cold-resize workflow with domain XML patching, snapshot archiving, rollback-on-failure, and 418 lines of unit tests. Firecracker: all methods return "not supported" errors.
No-op Backends
internal/repositories/noop/adapters.go, internal/repositories/k8s/mocks_test.go, internal/repositories/k8s/kubeadm_provisioner_test.go, internal/repositories/libvirt/lb_proxy_test.go
Adds resize/snapshot methods to no-op compute backend and test mocks; removes unused NoopExecutionLedger type and durable queue methods.
Test Infrastructure
internal/core/services/mock_compute_test.go, internal/workers/database_failover_worker_test.go, internal/workers/healing_worker_test.go, internal/workers/pipeline_worker_test.go
Expands mock implementations of ComputeBackend and InstanceService to support resize and snapshot operations via testify framework.
Database Migrations
internal/repositories/postgres/migrations/107_seed_instance_resize_permission.up.sql, internal/repositories/postgres/migrations/107_seed_instance_resize_permission.down.sql
Idempotent migration granting developer role the instance:resize permission.
HTTP Utilities
pkg/httputil/response.go
Maps errors.QuotaExceeded to HTTP 429 (Too Many Requests) status code.
E2E Tests
tests/compute_e2e_test.go
Refactors instance readiness polling into shared helper; adds three new resize E2E tests for upsize, downsize, and invalid type validation with quota-aware assertions.

Sequence Diagram

sequenceDiagram
    participant Client
    participant Handler as HTTP Handler
    participant Service as InstanceService
    participant Repo as Repository
    participant Backend as ComputeBackend
    
    Client->>Handler: POST /instances/:id/resize<br/>{instance_type}
    activate Handler
    Handler->>Handler: Validate request
    Handler->>Service: ResizeInstance(ctx, idOrName, newType)
    deactivate Handler
    
    activate Service
    Service->>Service: Check RBAC permission
    Service->>Repo: Resolve instance by UUID/name
    activate Repo
    Repo-->>Service: Instance
    deactivate Repo
    
    Service->>Repo: Load old & new InstanceType
    activate Repo
    Repo-->>Service: InstanceType records
    deactivate Repo
    
    alt Type unchanged
        Service-->>Client: No-op (same type)
    else Type changed
        Service->>Backend: ResizeInstance(ctx, id, cpu, memory)
        activate Backend
        Backend-->>Service: error or success
        deactivate Backend
        
        alt Compute succeeded
            Service->>Repo: Check/update quota (delta)
            activate Repo
            Repo-->>Service: QuotaExceeded or success
            deactivate Repo
            
            alt Quota ok
                Service->>Repo: Persist new InstanceType
                activate Repo
                Repo-->>Service: error or success
                deactivate Repo
                
                alt Persistence failed
                    Service->>Backend: Rollback compute
                    activate Backend
                    Backend-->>Service: -
                    deactivate Backend
                    Service-->>Client: Error (500)
                else Persistence ok
                    Service->>Service: Record event & audit
                    Service-->>Client: Success (200)
                end
            else Quota exceeded
                Service->>Backend: Rollback compute
                activate Backend
                Backend-->>Service: -
                deactivate Backend
                Service-->>Client: Error (429)
            end
        else Compute failed
            Service-->>Client: Error (500)
        end
    end
    deactivate Service
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Poem

🐰 A resize for instances bold,
CPU dance and memory retold,
Cold snapshots, warm containers gleam,
Quota guards protect the dream,
Rollback whispers if things go wrong! 📦

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 12.50% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'feat: snapshot-before-resize for Libvirt cold resize' directly summarizes the main change: introducing snapshot-before-resize functionality for Libvirt cold resize operations, which is a core feature added throughout the PR.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/snapshot-before-resize

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
Review rate limit: 0/1 reviews remaining, refill in 33 minutes and 18 seconds.

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copilot AI review requested due to automatic review settings April 26, 2026 21:28
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copilot AI review requested due to automatic review settings April 28, 2026 10:40
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 15

🧹 Nitpick comments (5)
tests/compute_e2e_test.go (1)

199-227: Consider verifying instance type via GET after successful resize.

The resize test handles both 200 (success) and 429 (quota exceeded), which is appropriate. However, when the response is 200, you're parsing res.Data.InstanceType from the resize response body. The resize handler returns gin.H{"message": "instance resized"} (per the handler implementation), which doesn't include instance_type.

The assertion on line 219 would check against an empty string, not the actual updated type. You already have VerifyResize (lines 230-246) that correctly fetches via GET, so consider removing the inline assertion or updating the handler to return the instance data.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/compute_e2e_test.go` around lines 199 - 227, The inline assertion in
the t.Run("Resize") test is wrong because the resize handler returns
gin.H{"message":"instance resized"} (no instance_type), so when resp.StatusCode
== http.StatusOK stop decoding into res.Data.InstanceType and instead call the
existing VerifyResize helper to fetch the instance via GET and assert the
instance type, or simply remove the decode/assert block; update the success
branch to invoke VerifyResize(testutil.TestBaseURL, testutil.TestRouteInstances,
instanceID, token, "standard-1") (or equivalent) so the test verifies the actual
instance type via GET rather than parsing the resize response body.
internal/handlers/instance_handler_test.go (1)

578-594: Test name/assertion mismatch: uses errors.Forbidden but named "QuotaExceeded".

The test uses errors.Forbidden which maps to HTTP 403, but the test is named "QuotaExceeded". Looking at pkg/httputil/response.go, errors.QuotaExceeded maps to 429 (TooManyRequests).

The E2E test (TestResizeInstance) expects 429 for quota exceeded scenarios. For consistency:

Option 1: Fix error type to match test name
 	t.Run("QuotaExceeded", func(t *testing.T) {
 		// ...
-		mockSvc.On("ResizeInstance", mock.Anything, id.String(), "basic-4").Return(errors.New(errors.Forbidden, "insufficient quota")).Once()
+		mockSvc.On("ResizeInstance", mock.Anything, id.String(), "basic-4").Return(errors.New(errors.QuotaExceeded, "insufficient quota")).Once()
 		// ...
-		assert.Equal(t, http.StatusForbidden, w.Code)
+		assert.Equal(t, http.StatusTooManyRequests, w.Code)
 	})
Option 2: Rename test to match actual behavior
-	t.Run("QuotaExceeded", func(t *testing.T) {
+	t.Run("Forbidden", func(t *testing.T) {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/handlers/instance_handler_test.go` around lines 578 - 594, The test
"QuotaExceeded" in internal/handlers/instance_handler_test.go calls the service
mock for ResizeInstance and returns errors.Forbidden but the test name and E2E
expectations expect a quota-exceeded error (errors.QuotaExceeded -> HTTP 429);
update the mocked return in the ResizeInstance expectation to return
errors.QuotaExceeded (preserving the message) so the handler test, the
ResizeInstance route, and the E2E TestResizeInstance are consistent with a 429
outcome.
docs/swagger/swagger.yaml (1)

2334-2340: Tighten validation for instance_type.

required still permits "". Add minLength: 1 so SDKs/clients fail fast on empty values.

Suggested diff
   httphandlers.ResizeInstanceRequest:
     properties:
       instance_type:
+        minLength: 1
         type: string
     required:
     - instance_type
     type: object
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/swagger/swagger.yaml` around lines 2334 - 2340, The OpenAPI schema for
httphandlers.ResizeInstanceRequest currently requires instance_type but allows
an empty string; update the instance_type property in the
httphandlers.ResizeInstanceRequest schema to include "minLength: 1" so empty
strings are rejected (i.e., add minLength: 1 under the instance_type property to
tighten validation).
docs/swagger/swagger.json (2)

4005-4030: Document 403 Forbidden for RBAC-denied resize requests.

Because this PR introduces instance:resize, clients can hit authorization failures even with valid input. Adding 403 here improves API contract accuracy.

📌 Suggested doc update
                 "responses": {
                     "200": {
                         "description": "OK",
                         "schema": {
                             "$ref": "#/definitions/httputil.Response"
                         }
                     },
+                    "403": {
+                        "description": "Forbidden",
+                        "schema": {
+                            "$ref": "#/definitions/httputil.Response"
+                        }
+                    },
                     "400": {
                         "description": "Bad Request",
                         "schema": {
                             "$ref": "#/definitions/httputil.Response"
                         }
                     },
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/swagger/swagger.json` around lines 4005 - 4030, Add a 403 response entry
to the responses object for the instance resize endpoint in
docs/swagger/swagger.json to document RBAC-denied requests: update the
"responses" block (the same block that currently lists "200","400","404","500")
to include a "403" key with "description": "Forbidden" and the same schema
reference ("$ref": "#/definitions/httputil.Response") so the API contract
reflects authorization failures introduced by the new instance:resize
permission.

11222-11232: Consider tightening instance_type schema metadata.

Adding a short description and basic constraints (for example minLength) would make generated docs/SDK validation clearer.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/swagger/swagger.json` around lines 11222 - 11232, The schema for
httphandlers.ResizeInstanceRequest exposes instance_type as an unconstrained
string; add metadata to tighten it by adding a "description" (e.g., "Target
instance type identifier") and basic validation constraints such as "minLength":
1 and an optional "pattern" or "enum" if known; update the
httphandlers.ResizeInstanceRequest -> properties.instance_type entry to include
these fields so generated docs and SDKs surface clear validation and intent.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/adr/ADR-025-instance-resize.md`:
- Around line 42-50: ADR-025 conflicts with current code: after a successful
resize the code in internal/core/services/instance.go collects quotaErrs from
IncrementUsage/DecrementUsage and returns an internal error, whereas the ADR
says failures should only be logged. Either update the ADR text to state that
quota update failures cause the service to return an internal error (reference
quotaErrs, IncrementUsage, DecrementUsage) or change the implementation in
internal/core/services/instance.go to swallow quotaErrs (log them) and not
return an error after a successful resize so behavior matches the ADR. Make the
change consistently and update the ADR or code comments to reference the final
decision.

In `@docs/swagger/swagger.yaml`:
- Around line 5010-5050: The OpenAPI path /instances/{id}/resize is missing
documented 401/403 responses for auth failures; update the operation to include
"401" and "403" responses (both using the existing schema reference
'#/definitions/httputil.Response') so clients can handle Unauthorized/Forbidden
cases and align with RBAC-protected endpoints.

In `@internal/core/services/instance_unit_test.go`:
- Around line 1738-1787: The test reveals that ResizeInstance/completeResize
increments vCPU usage (tenantSvc.IncrementUsage) before checking memory quota
(tenantSvc.CheckQuota) and on memory quota failure does not call
tenantSvc.DecrementUsage to roll back the prior vCPU increment; update the
implementation in ResizeInstance/completeResize so that after a failed
tenantSvc.CheckQuota( ..., "memory", deltaMemMB) it calls
tenantSvc.DecrementUsage(ctx, tenantID, "vcpus", deltaVCPUs) (or a generic
rollback helper) to undo the earlier IncrementUsage, and add corresponding error
handling/logging to ensure the decrement error is surfaced or logged while
returning the original quota error.
- Around line 1689-1736: The test reveals compute resize is performed in
executeResize before completeResize calls tenantSvc.CheckQuota, leaving no
rollback on quota failure; fix by either moving the quota validation to run
before executeResize or (preferred) adding a compute rollback in completeResize:
if tenantSvc.CheckQuota(...) returns an error, call compute.ResizeInstance(ctx,
oldContainerID, oldVcpusBytes, oldMemoryBytes) to restore previous size
(matching the existing repo.Update rollback pattern), log the rollback and then
return the quota error so state remains consistent; ensure you reference
executeResize, completeResize, tenantSvc.CheckQuota, compute.ResizeInstance and
repo.Update when making the change.

In `@internal/core/services/instance.go`:
- Around line 756-760: The resize flow currently calls s.executeResize(...)
before quota validation in s.completeResize(...), risking a mutated backend if
quota fails; move or replicate the quota check to run before mutating the
backend: call the same quota validation (the CheckQuota logic used inside
completeResize for upsizes) with tenantID and newInstanceType/newIT prior to
calling s.executeResize(ctx, target, newIT), and only proceed to s.executeResize
when that pre-check succeeds (or perform a DB quota reservation prior to
executeResize and release on failure) so the backend is never mutated when quota
would ultimately reject the operation.
- Around line 891-896: The code currently logs each quota update error
(s.logger.Error with "quota update failed after resize") but then returns an
Internal error when quotaErrs > 0, which incorrectly signals the resize failed;
change this to treat quota updates as best-effort: remove the errors.Wrap return
and instead log a single aggregated warning including tenantID and quotaErrs,
optionally emit an audit/metric indicating quota drift, and return nil so the
successful resize (and its success event/audit path) proceeds; keep the existing
per-error logging but do not abort after s.repo.Update has already committed.
- Around line 732-735: The current code collapses all errors from
resolveInstance into a NotFound error; change the logic in the caller of
resolveInstance so that if err != nil you return the original err (or wrap it
preserving its type/message) and only return errors.New(errors.NotFound,
"instance not found") when err == nil and inst == nil. Reference resolveInstance
and the surrounding function handling the inst, err tuple and ensure you do not
overwrite repository/internal errors with a 404.
- Around line 817-879: The memory delta is in MB but tenant quotas are in GB;
convert deltaMemMB to GB before calling tenantSvc methods and when rolling back.
Compute a deltaMemGB (e.g. deltaMemGB := deltaMemMB / 1024, use appropriate
rounding if needed) and replace uses of deltaMemMB with deltaMemGB in calls to
s.tenantSvc.CheckQuota, s.tenantSvc.IncrementUsage, s.tenantSvc.DecrementUsage
and in corresponding rollback/rollbackErr messages (references: deltaMemMB,
memIncremented, the memory quota branch and the rollback block where memory
decrement/increment and error messages are constructed). Ensure both positive
and negative memory flows use the converted GB value.

In `@internal/platform/resilient_compute.go`:
- Around line 140-162: The ResizeInstance, CreateSnapshot, RestoreSnapshot, and
DeleteSnapshot wrappers in ResilientCompute use r.opts.CallTimeout (short 2m)
which can prematurely time out long Libvirt cold-resize flows; update each to
call r.callProtected with the long timeout bucket (e.g. r.opts.LongCallTimeout
or the configured long-duration field on opts) instead of r.opts.CallTimeout so
these operations use the extended timeout; locate the methods ResizeInstance,
CreateSnapshot, RestoreSnapshot, DeleteSnapshot and replace the timeout argument
passed into callProtected accordingly.

In `@internal/repositories/libvirt/adapter_unit_test.go`:
- Around line 882-925: The test currently creates snapshotPath but never wires
it into RestoreSnapshot because execCommand always returns a no-op "true" and
the test asserts an error; change the execCommand stub in the LibvirtAdapter
used by this test so that when called with name == "tar" and arg[1] == "xzf" it
actually executes the real tar binary against the created snapshotPath (e.g.
return exec.Command("tar", "xzf", snapshotPath, "-C", tmpDir)) so the archive
you built is actually extracted, then change the final assertion from
require.Error(t, err) to require.NoError(t, err) to verify the success path of
RestoreSnapshot; update references to tmpDir, snapshotPath, execCommand and
RestoreSnapshot in this test accordingly.

In `@internal/repositories/libvirt/adapter.go`:
- Around line 29-33: The three compiled regex globals (memoryRe, currentMemRe,
vcpuRe) should be moved off package scope into fields on LibvirtAdapter (e.g.,
MemoryRe, CurrentMemRe, VcpuRe) and initialized inside NewLibvirtAdapter so they
are still compiled once per adapter instance; update any code that references
memoryRe/currentMemRe/vcpuRe to use the new LibvirtAdapter fields and remove the
package-level vars entirely to comply with the “no globals” rule.
- Around line 205-235: The rollbackFromSnapshot helper is being reused for
distinct failure phases but each phase needs different recovery steps; update
the resize flow around applyDomainResize, DomainUndefine, DomainDefineXML, and
DomainCreate to use phase-specific rollback functions (e.g.,
rollbackBeforeUndefine, rollbackAfterUndefine, rollbackAfterDefine) or extend
rollbackFromSnapshot with a phase enum so you can: (1) before undefine — avoid
redefining when original domain still exists and instead ensure volumes are
untouched; (2) after undefine — recreate/define the original domain by using
lookup by UUID/name and recreate volumes if needed (handle DomainLookupByName
failure), and (3) after new XML is defined — if later steps fail, first
undefine/remove the partially defined new domain before restoring the original
XML; in all rollbacks ensure the VM is restarted when wasRunning was true and
surface distinct errors combining original error and rollback error. Reference
applyDomainResize, rollbackFromSnapshot, DomainUndefine, DomainDefineXML,
DomainCreate, and the new phase-specific rollback helpers when making changes.
- Around line 253-260: The ReplaceAllString calls for memoryRe, currentMemRe,
and vcpuRe are replacing entire elements (dropping attributes like placement,
current, cpuset, unit); update the replacements to preserve matched attributes
and only change the element text node — for example, use capture groups in the
regexes (or ReplaceAllStringFunc) to retain the opening tag/attributes and
closing tag and substitute just the numeric content so memoryRe, currentMemRe
and vcpuRe rewrite only the inner value (still using memoryKiB and vcpus) rather
than overwriting the whole element.

In `@internal/repositories/noop/adapters.go`:
- Around line 404-410: The NoopTaskQueue implementation removed the
durable-queue methods and therefore no longer satisfies ports.DurableTaskQueue;
restore the missing methods (EnsureGroup, Receive, Ack, Nack, ReclaimStale) on
the NoopTaskQueue type so it can be injected where durable behavior is expected
(e.g., where provision_worker.go calls those methods). Implement these methods
as no-ops or simple stubs consistent with existing Enqueue/Dequeue behavior
(return appropriate zero values and nil errors) and ensure method signatures
exactly match those declared in ports.DurableTaskQueue so the type again
satisfies the interface.

In `@tests/compute_e2e_test.go`:
- Around line 257-316: TestResizeInstanceDownsize currently launches the
instance with "instance_type": "basic-2" and then resizes to the same "basic-2",
so it never verifies an actual downsize; update the LaunchInstance payload in
TestResizeInstanceDownsize to use a larger type (e.g., "standard-1") and keep
the Resize payload targeting "basic-2", so the test actually exercises a
downsize flow for the instance created by instanceID and the existing
waitForInstanceStatus/Resize/Terminate steps.

---

Nitpick comments:
In `@docs/swagger/swagger.json`:
- Around line 4005-4030: Add a 403 response entry to the responses object for
the instance resize endpoint in docs/swagger/swagger.json to document
RBAC-denied requests: update the "responses" block (the same block that
currently lists "200","400","404","500") to include a "403" key with
"description": "Forbidden" and the same schema reference ("$ref":
"#/definitions/httputil.Response") so the API contract reflects authorization
failures introduced by the new instance:resize permission.
- Around line 11222-11232: The schema for httphandlers.ResizeInstanceRequest
exposes instance_type as an unconstrained string; add metadata to tighten it by
adding a "description" (e.g., "Target instance type identifier") and basic
validation constraints such as "minLength": 1 and an optional "pattern" or
"enum" if known; update the httphandlers.ResizeInstanceRequest ->
properties.instance_type entry to include these fields so generated docs and
SDKs surface clear validation and intent.

In `@docs/swagger/swagger.yaml`:
- Around line 2334-2340: The OpenAPI schema for
httphandlers.ResizeInstanceRequest currently requires instance_type but allows
an empty string; update the instance_type property in the
httphandlers.ResizeInstanceRequest schema to include "minLength: 1" so empty
strings are rejected (i.e., add minLength: 1 under the instance_type property to
tighten validation).

In `@internal/handlers/instance_handler_test.go`:
- Around line 578-594: The test "QuotaExceeded" in
internal/handlers/instance_handler_test.go calls the service mock for
ResizeInstance and returns errors.Forbidden but the test name and E2E
expectations expect a quota-exceeded error (errors.QuotaExceeded -> HTTP 429);
update the mocked return in the ResizeInstance expectation to return
errors.QuotaExceeded (preserving the message) so the handler test, the
ResizeInstance route, and the E2E TestResizeInstance are consistent with a 429
outcome.

In `@tests/compute_e2e_test.go`:
- Around line 199-227: The inline assertion in the t.Run("Resize") test is wrong
because the resize handler returns gin.H{"message":"instance resized"} (no
instance_type), so when resp.StatusCode == http.StatusOK stop decoding into
res.Data.InstanceType and instead call the existing VerifyResize helper to fetch
the instance via GET and assert the instance type, or simply remove the
decode/assert block; update the success branch to invoke
VerifyResize(testutil.TestBaseURL, testutil.TestRouteInstances, instanceID,
token, "standard-1") (or equivalent) so the test verifies the actual instance
type via GET rather than parsing the resize response body.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 97154a85-0867-404e-b52b-ee76f338259e

📥 Commits

Reviewing files that changed from the base of the PR and between deb6ff0 and e455525.

📒 Files selected for processing (34)
  • docs/FEATURES.md
  • docs/adr/ADR-025-instance-resize.md
  • docs/api-reference.md
  • docs/swagger/docs.go
  • docs/swagger/swagger.json
  • docs/swagger/swagger.yaml
  • internal/api/setup/router.go
  • internal/core/domain/rbac.go
  • internal/core/ports/compute.go
  • internal/core/ports/instance.go
  • internal/core/services/instance.go
  • internal/core/services/instance_unit_test.go
  • internal/core/services/mock_compute_test.go
  • internal/handlers/instance_handler.go
  • internal/handlers/instance_handler_test.go
  • internal/platform/resilient_compute.go
  • internal/platform/resilient_compute_test.go
  • internal/repositories/docker/adapter.go
  • internal/repositories/docker/fakes_test.go
  • internal/repositories/firecracker/adapter.go
  • internal/repositories/firecracker/adapter_noop.go
  • internal/repositories/k8s/kubeadm_provisioner_test.go
  • internal/repositories/k8s/mocks_test.go
  • internal/repositories/libvirt/adapter.go
  • internal/repositories/libvirt/adapter_unit_test.go
  • internal/repositories/libvirt/lb_proxy_test.go
  • internal/repositories/noop/adapters.go
  • internal/repositories/postgres/migrations/107_seed_instance_resize_permission.down.sql
  • internal/repositories/postgres/migrations/107_seed_instance_resize_permission.up.sql
  • internal/workers/database_failover_worker_test.go
  • internal/workers/healing_worker_test.go
  • internal/workers/pipeline_worker_test.go
  • pkg/httputil/response.go
  • tests/compute_e2e_test.go

Comment on lines +42 to +50
### Quota Enforcement

The service calculates a **delta** between old and new instance types:

- If `deltaCPU > 0` or `deltaMem > 0` (upsize): calls `tenantSvc.CheckQuota` before proceeding
- If downsize (`delta < 0`): quota check is skipped (releasing resources back to the pool)
- If same size (`delta == 0`): no quota interaction

After a successful resize, usage counters are updated with the delta (`IncrementUsage` for upsize, `DecrementUsage` for downsize). Failures in usage updates are logged but not propagated — a future background reconciliation worker could correct drift.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

ADR and implementation disagree on quota-update failures.

This section says quota usage update failures are only logged, but internal/core/services/instance.go currently returns an internal error after a successful resize when quotaErrs is non-empty. The ADR should match the actual response behavior, or the service should be updated to match the ADR.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/adr/ADR-025-instance-resize.md` around lines 42 - 50, ADR-025 conflicts
with current code: after a successful resize the code in
internal/core/services/instance.go collects quotaErrs from
IncrementUsage/DecrementUsage and returns an internal error, whereas the ADR
says failures should only be logged. Either update the ADR text to state that
quota update failures cause the service to return an internal error (reference
quotaErrs, IncrementUsage, DecrementUsage) or change the implementation in
internal/core/services/instance.go to swallow quotaErrs (log them) and not
return an error after a successful resize so behavior matches the ADR. Make the
change consistently and update the ADR or code comments to reference the final
decision.

Comment thread docs/swagger/swagger.yaml
Comment on lines +5010 to +5050
/instances/{id}/resize:
post:
consumes:
- application/json
description: Change the instance type (CPU/memory) of an existing instance
parameters:
- description: Instance ID
in: path
name: id
required: true
type: string
- description: Resize request
in: body
name: request
required: true
schema:
$ref: '#/definitions/httphandlers.ResizeInstanceRequest'
produces:
- application/json
responses:
"200":
description: OK
schema:
$ref: '#/definitions/httputil.Response'
"400":
description: Bad Request
schema:
$ref: '#/definitions/httputil.Response'
"404":
description: Not Found
schema:
$ref: '#/definitions/httputil.Response'
"500":
description: Internal Server Error
schema:
$ref: '#/definitions/httputil.Response'
security:
- APIKeyAuth: []
summary: Resize an instance
tags:
- instances
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Document auth failures for the resize endpoint.

This operation is secured, but 401/403 are not documented. Adding them improves client behavior and aligns with RBAC-protected endpoints.

Suggested diff
       responses:
         "200":
           description: OK
           schema:
             $ref: '#/definitions/httputil.Response'
+        "401":
+          description: Unauthorized
+          schema:
+            $ref: '#/definitions/httputil.Response'
+        "403":
+          description: Forbidden
+          schema:
+            $ref: '#/definitions/httputil.Response'
         "400":
           description: Bad Request
           schema:
             $ref: '#/definitions/httputil.Response'
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/swagger/swagger.yaml` around lines 5010 - 5050, The OpenAPI path
/instances/{id}/resize is missing documented 401/403 responses for auth
failures; update the operation to include "401" and "403" responses (both using
the existing schema reference '#/definitions/httputil.Response') so clients can
handle Unauthorized/Forbidden cases and align with RBAC-protected endpoints.

Comment thread internal/core/services/instance_unit_test.go
Comment on lines +1738 to +1787
t.Run("QuotaExceeded_Memory", func(t *testing.T) {
repo := new(MockInstanceRepo)
typeRepo := new(MockInstanceTypeRepo)
rbacSvc := new(MockRBACService)
tenantSvc := new(MockTenantService)
compute := new(MockComputeBackend)

svc := services.NewInstanceService(services.InstanceServiceParams{
Repo: repo,
InstanceTypeRepo: typeRepo,
RBAC: rbacSvc,
TenantSvc: tenantSvc,
Compute: compute,
Logger: slog.Default(),
})

ctx := context.Background()
userID := uuid.New()
tenantID := uuid.New()
instanceID := uuid.New()
ctx = appcontext.WithUserID(ctx, userID)
ctx = appcontext.WithTenantID(ctx, tenantID)

inst := &domain.Instance{
ID: instanceID,
UserID: userID,
TenantID: tenantID,
Status: domain.StatusRunning,
InstanceType: "basic-2",
ContainerID: "cid-1",
}

oldType := &domain.InstanceType{ID: "basic-2", VCPUs: 2, MemoryMB: 2048}
newType := &domain.InstanceType{ID: "basic-4", VCPUs: 4, MemoryMB: 4096}

rbacSvc.On("Authorize", mock.Anything, userID, tenantID, domain.PermissionInstanceResize, "test-inst").Return(nil).Once()
repo.On("GetByName", mock.Anything, "test-inst").Return(inst, nil).Once()
typeRepo.On("GetByID", mock.Anything, "basic-2").Return(oldType, nil).Once()
typeRepo.On("GetByID", mock.Anything, "basic-4").Return(newType, nil).Once()
compute.On("ResizeInstance", mock.Anything, "cid-1", int64(4e9), int64(4096*1024*1024)).Return(nil).Once()
tenantSvc.On("CheckQuota", mock.Anything, tenantID, "vcpus", 2).Return(nil).Once()
tenantSvc.On("IncrementUsage", mock.Anything, tenantID, "vcpus", 2).Return(nil).Once()
tenantSvc.On("CheckQuota", mock.Anything, tenantID, "memory", 2048).Return(fmt.Errorf("insufficient memory quota")).Once()

err := svc.ResizeInstance(ctx, "test-inst", "basic-4")

require.Error(t, err)
assert.Contains(t, err.Error(), "insufficient memory quota")
mock.AssertExpectationsForObjects(t, repo, typeRepo, rbacSvc, tenantSvc, compute)
})
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Test exposes missing vCPU rollback when memory quota check fails.

In this test, vCPU IncrementUsage succeeds (line 1779), then memory CheckQuota fails (line 1780). There's no DecrementUsage mock for vcpus, indicating the implementation doesn't roll back the vCPU increment when memory quota check fails.

Looking at context snippet 1 (completeResize), the rollback logic only handles IncrementUsage failures, not CheckQuota failures:

if err := s.tenantSvc.CheckQuota(ctx, tenantID, "memory", deltaMemMB); err != nil {
    return err  // No rollback of vCPU increment
}

The test correctly reflects implementation behavior, but this may be a quota accounting leak.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/core/services/instance_unit_test.go` around lines 1738 - 1787, The
test reveals that ResizeInstance/completeResize increments vCPU usage
(tenantSvc.IncrementUsage) before checking memory quota (tenantSvc.CheckQuota)
and on memory quota failure does not call tenantSvc.DecrementUsage to roll back
the prior vCPU increment; update the implementation in
ResizeInstance/completeResize so that after a failed tenantSvc.CheckQuota( ...,
"memory", deltaMemMB) it calls tenantSvc.DecrementUsage(ctx, tenantID, "vcpus",
deltaVCPUs) (or a generic rollback helper) to undo the earlier IncrementUsage,
and add corresponding error handling/logging to ensure the decrement error is
surfaced or logged while returning the original quota error.

Comment on lines +732 to +735
inst, err := s.resolveInstance(ctx, idOrName)
if err != nil || inst == nil {
return errors.New(errors.NotFound, "instance not found")
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Don't collapse repository failures into NotFound.

This rewrites every resolveInstance failure as instance not found. If the repo returns an internal/storage error, the service will now hide it as a 404 instead of surfacing the real failure.

Suggested fix
 	inst, err := s.resolveInstance(ctx, idOrName)
-	if err != nil || inst == nil {
+	if err != nil {
+		return err
+	}
+	if inst == nil {
 		return errors.New(errors.NotFound, "instance not found")
 	}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
inst, err := s.resolveInstance(ctx, idOrName)
if err != nil || inst == nil {
return errors.New(errors.NotFound, "instance not found")
}
inst, err := s.resolveInstance(ctx, idOrName)
if err != nil {
return err
}
if inst == nil {
return errors.New(errors.NotFound, "instance not found")
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/core/services/instance.go` around lines 732 - 735, The current code
collapses all errors from resolveInstance into a NotFound error; change the
logic in the caller of resolveInstance so that if err != nil you return the
original err (or wrap it preserving its type/message) and only return
errors.New(errors.NotFound, "instance not found") when err == nil and inst ==
nil. Reference resolveInstance and the surrounding function handling the inst,
err tuple and ensure you do not overwrite repository/internal errors with a 404.

Comment thread internal/repositories/libvirt/adapter.go Outdated
Comment thread internal/repositories/libvirt/adapter.go
Comment on lines +253 to +260
// Replace <memory unit="KiB">...</memory> or <memory>...</memory>
result = memoryRe.ReplaceAllString(result, fmt.Sprintf(`<memory unit="KiB">%d</memory>`, memoryKiB))

// Replace <currentMemory unit="KiB">...</currentMemory> or <currentMemory>...</currentMemory>
result = currentMemRe.ReplaceAllString(result, fmt.Sprintf(`<currentMemory unit="KiB">%d</currentMemory>`, memoryKiB))

// Replace <vcpu placement="static">...</vcpu> or <vcpu>...</vcpu>
result = vcpuRe.ReplaceAllString(result, fmt.Sprintf(`<vcpu>%d</vcpu>`, vcpus))
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

These replacements strip existing libvirt attributes.

Each ReplaceAllString rewrites the whole node, so attributes like placement, current, cpuset, or an existing unit are discarded. That can change semantics or make a previously valid domain XML fail to redefine. Update only the text node or preserve the matched attributes when rewriting these elements.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/repositories/libvirt/adapter.go` around lines 253 - 260, The
ReplaceAllString calls for memoryRe, currentMemRe, and vcpuRe are replacing
entire elements (dropping attributes like placement, current, cpuset, unit);
update the replacements to preserve matched attributes and only change the
element text node — for example, use capture groups in the regexes (or
ReplaceAllStringFunc) to retain the opening tag/attributes and closing tag and
substitute just the numeric content so memoryRe, currentMemRe and vcpuRe rewrite
only the inner value (still using memoryKiB and vcpus) rather than overwriting
the whole element.

Comment on lines +404 to 410
// NoopTaskQueue is a no-op task queue.
type NoopTaskQueue struct{}

func (q *NoopTaskQueue) Enqueue(ctx context.Context, queue string, payload interface{}) error {
return nil
}
func (q *NoopTaskQueue) Dequeue(ctx context.Context, queue string) (string, error) { return "", nil }
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

NoopTaskQueue no longer satisfies the durable queue contract.

ports.DurableTaskQueue in internal/core/ports/task_queue.go:33-61 still requires EnsureGroup, Receive, Ack, Nack, and ReclaimStale, and internal/workers/provision_worker.go:65-245 still calls those methods on the injected queue. Dropping them here means this noop can no longer stand in for the durable queue in tests or system wiring.

Proposed fix
 type NoopTaskQueue struct{}
 
 func (q *NoopTaskQueue) Enqueue(ctx context.Context, queue string, payload interface{}) error {
 	return nil
 }
 func (q *NoopTaskQueue) Dequeue(ctx context.Context, queue string) (string, error) { return "", nil }
+func (q *NoopTaskQueue) EnsureGroup(ctx context.Context, queueName, groupName string) error {
+	return nil
+}
+func (q *NoopTaskQueue) Receive(ctx context.Context, queueName, groupName, consumerName string) (*ports.DurableMessage, error) {
+	return nil, nil
+}
+func (q *NoopTaskQueue) Ack(ctx context.Context, queueName, groupName, messageID string) error {
+	return nil
+}
+func (q *NoopTaskQueue) Nack(ctx context.Context, queueName, groupName, messageID string) error {
+	return nil
+}
+func (q *NoopTaskQueue) ReclaimStale(ctx context.Context, queueName, groupName, consumerName string, minIdleMs int64, count int64) ([]ports.DurableMessage, error) {
+	return nil, nil
+}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// NoopTaskQueue is a no-op task queue.
type NoopTaskQueue struct{}
func (q *NoopTaskQueue) Enqueue(ctx context.Context, queue string, payload interface{}) error {
return nil
}
func (q *NoopTaskQueue) Dequeue(ctx context.Context, queue string) (string, error) { return "", nil }
// NoopTaskQueue is a no-op task queue.
type NoopTaskQueue struct{}
func (q *NoopTaskQueue) Enqueue(ctx context.Context, queue string, payload interface{}) error {
return nil
}
func (q *NoopTaskQueue) Dequeue(ctx context.Context, queue string) (string, error) { return "", nil }
func (q *NoopTaskQueue) EnsureGroup(ctx context.Context, queueName, groupName string) error {
return nil
}
func (q *NoopTaskQueue) Receive(ctx context.Context, queueName, groupName, consumerName string) (*ports.DurableMessage, error) {
return nil, nil
}
func (q *NoopTaskQueue) Ack(ctx context.Context, queueName, groupName, messageID string) error {
return nil
}
func (q *NoopTaskQueue) Nack(ctx context.Context, queueName, groupName, messageID string) error {
return nil
}
func (q *NoopTaskQueue) ReclaimStale(ctx context.Context, queueName, groupName, consumerName string, minIdleMs int64, count int64) ([]ports.DurableMessage, error) {
return nil, nil
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/repositories/noop/adapters.go` around lines 404 - 410, The
NoopTaskQueue implementation removed the durable-queue methods and therefore no
longer satisfies ports.DurableTaskQueue; restore the missing methods
(EnsureGroup, Receive, Ack, Nack, ReclaimStale) on the NoopTaskQueue type so it
can be injected where durable behavior is expected (e.g., where
provision_worker.go calls those methods). Implement these methods as no-ops or
simple stubs consistent with existing Enqueue/Dequeue behavior (return
appropriate zero values and nil errors) and ensure method signatures exactly
match those declared in ports.DurableTaskQueue so the type again satisfies the
interface.

Comment thread tests/compute_e2e_test.go
Comment on lines +257 to +316
func TestResizeInstanceDownsize(t *testing.T) {
if err := waitForServer(); err != nil {
t.Fatalf("Failing Resize Downsize E2E test: %v", err)
}

client := &http.Client{Timeout: 60 * time.Second}
token := registerAndLogin(t, client, "resize-down-tester@thecloud.local", "Resize Down Tester")

var instanceID string
instanceName := fmt.Sprintf("e2e-resize-down-%d-%s", time.Now().UnixNano()%1000, uuid.New().String())

// 1. Launch Instance with basic-2 type
t.Run("LaunchInstance", func(t *testing.T) {
payload := map[string]string{
"name": instanceName,
"image": "nginx:alpine",
"instance_type": "basic-2",
"ports": "0:80",
}
resp := postRequest(t, client, testutil.TestBaseURL+testutil.TestRouteInstances, token, payload)
defer func() { _ = resp.Body.Close() }()

require.Equal(t, http.StatusAccepted, resp.StatusCode)

var res struct {
Data struct {
ID string `json:"id"`
} `json:"data"`
}
require.NoError(t, json.NewDecoder(resp.Body).Decode(&res))
instanceID = res.Data.ID
})

// 2. Wait for Running
t.Run("WaitForRunning", func(t *testing.T) {
lastStatus := waitForInstanceStatus(t, client, token, instanceID)
if lastStatus != domain.StatusRunning {
t.Skipf("Instance did not reach running state within timeout. Last status: %s", lastStatus)
}
})

// 3. Downsize to basic-2
t.Run("Resize", func(t *testing.T) {
payload := map[string]string{
"instance_type": "basic-2",
}
resp := postRequest(t, client, fmt.Sprintf("%s%s/%s/resize", testutil.TestBaseURL, testutil.TestRouteInstances, instanceID), token, payload)
defer func() { _ = resp.Body.Close() }()

assert.Equal(t, http.StatusOK, resp.StatusCode)
})

// 4. Terminate
t.Run("TerminateInstance", func(t *testing.T) {
resp := deleteRequest(t, client, fmt.Sprintf(testutil.TestRouteFormat, testutil.TestBaseURL, testutil.TestRouteInstances, instanceID), token)
defer func() { _ = resp.Body.Close() }()

assert.Equal(t, http.StatusOK, resp.StatusCode)
})
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Test does not actually test downsize—instance type unchanged.

The test is named TestResizeInstanceDownsize but:

  • Instance is launched with "instance_type": "basic-2" (line 273)
  • Resize requests "instance_type": "basic-2" (line 301)

This tests a same-size no-op, not a downsize. For a proper downsize test, launch with a larger type (e.g., standard-1) and resize to basic-2.

Proposed fix for actual downsize test
 func TestResizeInstanceDownsize(t *testing.T) {
 	// ...
 
-	// 1. Launch Instance with basic-2 type
+	// 1. Launch Instance with standard-1 type (larger)
 	t.Run("LaunchInstance", func(t *testing.T) {
 		payload := map[string]string{
 			"name":         instanceName,
 			"image":        "nginx:alpine",
-			"instance_type": "basic-2",
+			"instance_type": "standard-1",
 			"ports":        "0:80",
 		}
 		// ...
 	})
 
 	// ...
 
-	// 3. Downsize to basic-2
+	// 3. Downsize from standard-1 to basic-2
 	t.Run("Resize", func(t *testing.T) {
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
func TestResizeInstanceDownsize(t *testing.T) {
if err := waitForServer(); err != nil {
t.Fatalf("Failing Resize Downsize E2E test: %v", err)
}
client := &http.Client{Timeout: 60 * time.Second}
token := registerAndLogin(t, client, "resize-down-tester@thecloud.local", "Resize Down Tester")
var instanceID string
instanceName := fmt.Sprintf("e2e-resize-down-%d-%s", time.Now().UnixNano()%1000, uuid.New().String())
// 1. Launch Instance with basic-2 type
t.Run("LaunchInstance", func(t *testing.T) {
payload := map[string]string{
"name": instanceName,
"image": "nginx:alpine",
"instance_type": "basic-2",
"ports": "0:80",
}
resp := postRequest(t, client, testutil.TestBaseURL+testutil.TestRouteInstances, token, payload)
defer func() { _ = resp.Body.Close() }()
require.Equal(t, http.StatusAccepted, resp.StatusCode)
var res struct {
Data struct {
ID string `json:"id"`
} `json:"data"`
}
require.NoError(t, json.NewDecoder(resp.Body).Decode(&res))
instanceID = res.Data.ID
})
// 2. Wait for Running
t.Run("WaitForRunning", func(t *testing.T) {
lastStatus := waitForInstanceStatus(t, client, token, instanceID)
if lastStatus != domain.StatusRunning {
t.Skipf("Instance did not reach running state within timeout. Last status: %s", lastStatus)
}
})
// 3. Downsize to basic-2
t.Run("Resize", func(t *testing.T) {
payload := map[string]string{
"instance_type": "basic-2",
}
resp := postRequest(t, client, fmt.Sprintf("%s%s/%s/resize", testutil.TestBaseURL, testutil.TestRouteInstances, instanceID), token, payload)
defer func() { _ = resp.Body.Close() }()
assert.Equal(t, http.StatusOK, resp.StatusCode)
})
// 4. Terminate
t.Run("TerminateInstance", func(t *testing.T) {
resp := deleteRequest(t, client, fmt.Sprintf(testutil.TestRouteFormat, testutil.TestBaseURL, testutil.TestRouteInstances, instanceID), token)
defer func() { _ = resp.Body.Close() }()
assert.Equal(t, http.StatusOK, resp.StatusCode)
})
}
func TestResizeInstanceDownsize(t *testing.T) {
if err := waitForServer(); err != nil {
t.Fatalf("Failing Resize Downsize E2E test: %v", err)
}
client := &http.Client{Timeout: 60 * time.Second}
token := registerAndLogin(t, client, "resize-down-tester@thecloud.local", "Resize Down Tester")
var instanceID string
instanceName := fmt.Sprintf("e2e-resize-down-%d-%s", time.Now().UnixNano()%1000, uuid.New().String())
// 1. Launch Instance with standard-1 type (larger)
t.Run("LaunchInstance", func(t *testing.T) {
payload := map[string]string{
"name": instanceName,
"image": "nginx:alpine",
"instance_type": "standard-1",
"ports": "0:80",
}
resp := postRequest(t, client, testutil.TestBaseURL+testutil.TestRouteInstances, token, payload)
defer func() { _ = resp.Body.Close() }()
require.Equal(t, http.StatusAccepted, resp.StatusCode)
var res struct {
Data struct {
ID string `json:"id"`
} `json:"data"`
}
require.NoError(t, json.NewDecoder(resp.Body).Decode(&res))
instanceID = res.Data.ID
})
// 2. Wait for Running
t.Run("WaitForRunning", func(t *testing.T) {
lastStatus := waitForInstanceStatus(t, client, token, instanceID)
if lastStatus != domain.StatusRunning {
t.Skipf("Instance did not reach running state within timeout. Last status: %s", lastStatus)
}
})
// 3. Downsize from standard-1 to basic-2
t.Run("Resize", func(t *testing.T) {
payload := map[string]string{
"instance_type": "basic-2",
}
resp := postRequest(t, client, fmt.Sprintf("%s%s/%s/resize", testutil.TestBaseURL, testutil.TestRouteInstances, instanceID), token, payload)
defer func() { _ = resp.Body.Close() }()
assert.Equal(t, http.StatusOK, resp.StatusCode)
})
// 4. Terminate
t.Run("TerminateInstance", func(t *testing.T) {
resp := deleteRequest(t, client, fmt.Sprintf(testutil.TestRouteFormat, testutil.TestBaseURL, testutil.TestRouteInstances, instanceID), token)
defer func() { _ = resp.Body.Close() }()
assert.Equal(t, http.StatusOK, resp.StatusCode)
})
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/compute_e2e_test.go` around lines 257 - 316, TestResizeInstanceDownsize
currently launches the instance with "instance_type": "basic-2" and then resizes
to the same "basic-2", so it never verifies an actual downsize; update the
LaunchInstance payload in TestResizeInstanceDownsize to use a larger type (e.g.,
"standard-1") and keep the Resize payload targeting "basic-2", so the test
actually exercises a downsize flow for the instance created by instanceID and
the existing waitForInstanceStatus/Resize/Terminate steps.

poyrazK added 10 commits April 28, 2026 15:17
Add CreateSnapshot, RestoreSnapshot, and DeleteSnapshot methods to
ComputeBackend interface. Libvirt adapter implements these via QCOW2
volume snapshots. Docker/Firecracker backends implement as no-ops since
hot resize doesn't need snapshots.

ResizeInstance in Libvirt adapter now:
1. Creates pre-resize snapshot before making changes
2. On failure: rolls back from snapshot and redefines domain
3. On success: deletes snapshot to avoid accumulation

Also update ResilientCompute wrapper, all mock backends in tests,
and fix nil pointer dereference bug in rollback path.
- rollbackFromSnapshot now calls DomainDefineXML after RestoreSnapshot
  to properly redefine the domain with original XML after DomainUndefine
- Use UUID-based snapshot names to avoid second-precision collisions
- Add unit tests for CreateSnapshot, RestoreSnapshot, DeleteSnapshot
- Add unit tests for ResizeInstance rollback on DomainDefineXML and
  DomainUndefine failures
- Add unit tests for rollbackFromSnapshot success and failure paths
…aky tests

- Check rollbackFromSnapshot errors at all 3 call sites in ResizeInstance
  (was ignoring errors at applyDomainResize and DomainUndefine failure paths)
- Skip RollbackFromSnapshot/success and RollbackFromSnapshot/redefine_fails_after_restore
  since they require real tar archive contents that the execCommand mock cannot provide
  (rollback path is covered by ResizeInstance_RollbackOnFailure tests)
…regexes

- Line 232-233 (DomainCreate failure): check rollbackFromSnapshot error and
  include it in the returned error. This is the 4th rollback call site in
  ResizeInstance — all now checked consistently.
- Pre-compile 3 regexes (memoryRe, currentMemRe, vcpuRe) at package level
  instead of re-compiling on every applyDomainResize call.
snapshotPath(id, name string) now returns the snapshot archive path,
replacing 3 identical inline format strings in CreateSnapshot,
RestoreSnapshot, and DeleteSnapshot.
Document the snapshot creation step (step 4) and the rollback
mechanism that restores volume and redefines domain on failure.
Also document edge cases: snapshot creation failure is non-fatal,
and rollback itself can fail.
- TestLibvirtAdapter_ApplyDomainResize: 5 subtests covering regex replacement
  (both replaced, missing both elements, memory without unit attr,
  resize-to-same-values error, vcpu with placement attr)
- TestLibvirtAdapter_ResizeInstance_RollbackOnFailure: add
  rollback_on_DomainCreate_failure subtest (was previously untested)

Removed incorrect SnapshotCreationFails_ResizeProceeds test from
instance_unit_test.go — InstanceService.ResizeInstance does not call
CreateSnapshot (snapshot creation is adapter-internal).
@poyrazK poyrazK force-pushed the feat/snapshot-before-resize branch from d4ee00d to 043a052 Compare April 28, 2026 15:06
Copy link
Copy Markdown
Owner Author

@poyrazK poyrazK left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's okay to merge

@poyrazK poyrazK merged commit 5e30fdc into main Apr 29, 2026
22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants