feat: snapshot-before-resize for Libvirt cold resize by poyrazK · Pull Request #182 · poyrazK/thecloud

poyrazK · 2026-04-25T18:32:35Z

Summary

Add snapshot-before-resize safety net for Libvirt/KVM cold resize operations.

Add CreateSnapshot, RestoreSnapshot, DeleteSnapshot to ComputeBackend interface
Libvirt adapter implements QCOW2 volume-level snapshots for rollback on resize failure
Docker/Firecracker backends implement as no-ops (hot resize doesn't need snapshots)
ResizeInstance in Libvirt adapter creates a pre-resize snapshot, rolls back on failure, deletes on success

Test plan

Unit tests pass
Build compiles
Manual test with Libvirt instance (cold resize path)
Verify rollback works when resize fails mid-way

Summary by CodeRabbit

New Features
- Added instance resizing capability via new POST /instances/:id/resize endpoint
- Supports changing CPU and memory allocation for running and stopped instances
- Includes quota enforcement and permission-based access control
Documentation
- Updated API reference and architecture documentation with resize operation details
- Added RBAC permission for instance resize operations
Tests
- Added comprehensive unit and end-to-end tests for resize functionality

coderabbitai · 2026-04-25T18:32:41Z

Warning

Rate limit exceeded

@poyrazK has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 33 minutes and 18 seconds before requesting another review.

To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: bbe3a384-a9dd-4ca0-9bd6-3de4804d1afe

📥 Commits

Reviewing files that changed from the base of the PR and between e455525 and e786c4e.

📒 Files selected for processing (14)

docs/adr/ADR-025-instance-resize.md
internal/core/ports/compute.go
internal/core/services/mock_compute_test.go
internal/platform/resilient_compute.go
internal/platform/resilient_compute_test.go
internal/repositories/docker/adapter.go
internal/repositories/firecracker/adapter.go
internal/repositories/firecracker/adapter_noop.go
internal/repositories/libvirt/adapter.go
internal/repositories/libvirt/adapter_unit_test.go
internal/repositories/libvirt/lb_proxy_test.go
internal/repositories/noop/adapters.go
internal/workers/database_failover_worker_test.go
internal/workers/pipeline_worker_test.go

📝 Walkthrough

Walkthrough

This PR implements instance resizing functionality, adding a POST /instances/:id/resize API endpoint that allows changing CPU/memory allocation for running or stopped instances. Implementation includes service layer logic with quota enforcement and rollback, compute backend adapters for Docker and Libvirt, RBAC authorization, database migrations, and comprehensive test coverage.

Changes

Cohort / File(s)	Summary
Documentation `docs/FEATURES.md`, `docs/api-reference.md`, `docs/adr/ADR-025-instance-resize.md`, `docs/swagger/docs.go`, `docs/swagger/swagger.json`, `docs/swagger/swagger.yaml`	Introduces API documentation for instance resizing endpoint, including request/response schemas, error codes, RBAC permission (`instance:resize`), and architectural decision record detailing cold resize for Libvirt, warm resize for Docker, quota enforcement, and rollback procedures.
RBAC Authorization `internal/core/domain/rbac.go`	Adds `PermissionInstanceResize` constant for instance resize authorization.
Service Layer `internal/core/services/instance.go`, `internal/core/services/instance_unit_test.go`	Implements `ResizeInstance` method with RBAC checks, instance resolution, instance-type validation, quota enforcement (delta-based for upsize/downsize), compute resize delegation, persistent storage update, and comprehensive rollback on failure. Includes 569 lines of unit tests covering success paths, quota scenarios, error cases, and event recording.
Port Definitions `internal/core/ports/compute.go`, `internal/core/ports/instance.go`	Extends `ComputeBackend` interface with `ResizeInstance` and snapshot lifecycle methods (`CreateSnapshot`, `RestoreSnapshot`, `DeleteSnapshot`); extends `InstanceService` with `ResizeInstance` method.
HTTP Layer `internal/api/setup/router.go`, `internal/handlers/instance_handler.go`, `internal/handlers/instance_handler_test.go`	Registers `POST /instances/:id/resize` route with `PermissionInstanceResize` guard; implements handler validating required `instance_type` field and delegating to service; includes 113 lines of handler tests covering success, validation, and error paths.
Resilient Adapter `internal/platform/resilient_compute.go`, `internal/platform/resilient_compute_test.go`	Wraps `ResizeInstance` and snapshot methods with bulkhead, circuit breaker, and timeout protection via `callProtected` pipeline.
Backend Implementations `internal/repositories/docker/adapter.go`, `internal/repositories/docker/fakes_test.go`, `internal/repositories/libvirt/adapter.go`, `internal/repositories/libvirt/adapter_unit_test.go`, `internal/repositories/firecracker/adapter.go`, `internal/repositories/firecracker/adapter_noop.go`	Docker: implements resize via `ContainerUpdate` and snapshot methods as no-ops. Libvirt: implements cold-resize workflow with domain XML patching, snapshot archiving, rollback-on-failure, and 418 lines of unit tests. Firecracker: all methods return "not supported" errors.
No-op Backends `internal/repositories/noop/adapters.go`, `internal/repositories/k8s/mocks_test.go`, `internal/repositories/k8s/kubeadm_provisioner_test.go`, `internal/repositories/libvirt/lb_proxy_test.go`	Adds resize/snapshot methods to no-op compute backend and test mocks; removes unused `NoopExecutionLedger` type and durable queue methods.
Test Infrastructure `internal/core/services/mock_compute_test.go`, `internal/workers/database_failover_worker_test.go`, `internal/workers/healing_worker_test.go`, `internal/workers/pipeline_worker_test.go`	Expands mock implementations of `ComputeBackend` and `InstanceService` to support resize and snapshot operations via testify framework.
Database Migrations `internal/repositories/postgres/migrations/107_seed_instance_resize_permission.up.sql`, `internal/repositories/postgres/migrations/107_seed_instance_resize_permission.down.sql`	Idempotent migration granting `developer` role the `instance:resize` permission.
HTTP Utilities `pkg/httputil/response.go`	Maps `errors.QuotaExceeded` to HTTP 429 (Too Many Requests) status code.
E2E Tests `tests/compute_e2e_test.go`	Refactors instance readiness polling into shared helper; adds three new resize E2E tests for upsize, downsize, and invalid type validation with quota-aware assertions.

Sequence Diagram

sequenceDiagram
    participant Client
    participant Handler as HTTP Handler
    participant Service as InstanceService
    participant Repo as Repository
    participant Backend as ComputeBackend
    
    Client->>Handler: POST /instances/:id/resize<br/>{instance_type}
    activate Handler
    Handler->>Handler: Validate request
    Handler->>Service: ResizeInstance(ctx, idOrName, newType)
    deactivate Handler
    
    activate Service
    Service->>Service: Check RBAC permission
    Service->>Repo: Resolve instance by UUID/name
    activate Repo
    Repo-->>Service: Instance
    deactivate Repo
    
    Service->>Repo: Load old & new InstanceType
    activate Repo
    Repo-->>Service: InstanceType records
    deactivate Repo
    
    alt Type unchanged
        Service-->>Client: No-op (same type)
    else Type changed
        Service->>Backend: ResizeInstance(ctx, id, cpu, memory)
        activate Backend
        Backend-->>Service: error or success
        deactivate Backend
        
        alt Compute succeeded
            Service->>Repo: Check/update quota (delta)
            activate Repo
            Repo-->>Service: QuotaExceeded or success
            deactivate Repo
            
            alt Quota ok
                Service->>Repo: Persist new InstanceType
                activate Repo
                Repo-->>Service: error or success
                deactivate Repo
                
                alt Persistence failed
                    Service->>Backend: Rollback compute
                    activate Backend
                    Backend-->>Service: -
                    deactivate Backend
                    Service-->>Client: Error (500)
                else Persistence ok
                    Service->>Service: Record event & audit
                    Service-->>Client: Success (200)
                end
            else Quota exceeded
                Service->>Backend: Rollback compute
                activate Backend
                Backend-->>Service: -
                deactivate Backend
                Service-->>Client: Error (429)
            end
        else Compute failed
            Service-->>Client: Error (500)
        end
    end
    deactivate Service

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

fix(ha): PR #101 HA implementation bug fixes #154: Introduces internal/platform/resilient_compute.go base adapter that this PR extends with ResizeInstance and snapshot method wrappers using the existing callProtected pipeline infrastructure.

Poem

🐰 A resize for instances bold,
CPU dance and memory retold,
Cold snapshots, warm containers gleam,
Quota guards protect the dream,
Rollback whispers if things go wrong! 📦

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 12.50% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'feat: snapshot-before-resize for Libvirt cold resize' directly summarizes the main change: introducing snapshot-before-resize functionality for Libvirt cold resize operations, which is a core feature added throughout the PR.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/snapshot-before-resize

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Review rate limit: 0/1 reviews remaining, refill in 33 minutes and 18 seconds.}

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

coderabbitai

Actionable comments posted: 15

🧹 Nitpick comments (5)

tests/compute_e2e_test.go (1)

199-227: Consider verifying instance type via GET after successful resize.

The resize test handles both 200 (success) and 429 (quota exceeded), which is appropriate. However, when the response is 200, you're parsing res.Data.InstanceType from the resize response body. The resize handler returns gin.H{"message": "instance resized"} (per the handler implementation), which doesn't include instance_type.

The assertion on line 219 would check against an empty string, not the actual updated type. You already have VerifyResize (lines 230-246) that correctly fetches via GET, so consider removing the inline assertion or updating the handler to return the instance data.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/compute_e2e_test.go` around lines 199 - 227, The inline assertion in
the t.Run("Resize") test is wrong because the resize handler returns
gin.H{"message":"instance resized"} (no instance_type), so when resp.StatusCode
== http.StatusOK stop decoding into res.Data.InstanceType and instead call the
existing VerifyResize helper to fetch the instance via GET and assert the
instance type, or simply remove the decode/assert block; update the success
branch to invoke VerifyResize(testutil.TestBaseURL, testutil.TestRouteInstances,
instanceID, token, "standard-1") (or equivalent) so the test verifies the actual
instance type via GET rather than parsing the resize response body.

internal/handlers/instance_handler_test.go (1)

578-594: Test name/assertion mismatch: uses errors.Forbidden but named "QuotaExceeded".

The test uses errors.Forbidden which maps to HTTP 403, but the test is named "QuotaExceeded". Looking at pkg/httputil/response.go, errors.QuotaExceeded maps to 429 (TooManyRequests).

The E2E test (TestResizeInstance) expects 429 for quota exceeded scenarios. For consistency:

Option 1: Fix error type to match test name

 	t.Run("QuotaExceeded", func(t *testing.T) {
 		// ...
-		mockSvc.On("ResizeInstance", mock.Anything, id.String(), "basic-4").Return(errors.New(errors.Forbidden, "insufficient quota")).Once()
+		mockSvc.On("ResizeInstance", mock.Anything, id.String(), "basic-4").Return(errors.New(errors.QuotaExceeded, "insufficient quota")).Once()
 		// ...
-		assert.Equal(t, http.StatusForbidden, w.Code)
+		assert.Equal(t, http.StatusTooManyRequests, w.Code)
 	})

Option 2: Rename test to match actual behavior

-	t.Run("QuotaExceeded", func(t *testing.T) {
+	t.Run("Forbidden", func(t *testing.T) {

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@internal/handlers/instance_handler_test.go` around lines 578 - 594, The test
"QuotaExceeded" in internal/handlers/instance_handler_test.go calls the service
mock for ResizeInstance and returns errors.Forbidden but the test name and E2E
expectations expect a quota-exceeded error (errors.QuotaExceeded -> HTTP 429);
update the mocked return in the ResizeInstance expectation to return
errors.QuotaExceeded (preserving the message) so the handler test, the
ResizeInstance route, and the E2E TestResizeInstance are consistent with a 429
outcome.

docs/swagger/swagger.yaml (1)

2334-2340: Tighten validation for instance_type.

required still permits "". Add minLength: 1 so SDKs/clients fail fast on empty values.

Suggested diff

   httphandlers.ResizeInstanceRequest:
     properties:
       instance_type:
+        minLength: 1
         type: string
     required:
     - instance_type
     type: object

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@docs/swagger/swagger.yaml` around lines 2334 - 2340, The OpenAPI schema for
httphandlers.ResizeInstanceRequest currently requires instance_type but allows
an empty string; update the instance_type property in the
httphandlers.ResizeInstanceRequest schema to include "minLength: 1" so empty
strings are rejected (i.e., add minLength: 1 under the instance_type property to
tighten validation).

docs/swagger/swagger.json (2)

4005-4030: Document 403 Forbidden for RBAC-denied resize requests.

Because this PR introduces instance:resize, clients can hit authorization failures even with valid input. Adding 403 here improves API contract accuracy.

📌 Suggested doc update

                 "responses": {
                     "200": {
                         "description": "OK",
                         "schema": {
                             "$ref": "#/definitions/httputil.Response"
                         }
                     },
+                    "403": {
+                        "description": "Forbidden",
+                        "schema": {
+                            "$ref": "#/definitions/httputil.Response"
+                        }
+                    },
                     "400": {
                         "description": "Bad Request",
                         "schema": {
                             "$ref": "#/definitions/httputil.Response"
                         }
                     },

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@docs/swagger/swagger.json` around lines 4005 - 4030, Add a 403 response entry
to the responses object for the instance resize endpoint in
docs/swagger/swagger.json to document RBAC-denied requests: update the
"responses" block (the same block that currently lists "200","400","404","500")
to include a "403" key with "description": "Forbidden" and the same schema
reference ("$ref": "#/definitions/httputil.Response") so the API contract
reflects authorization failures introduced by the new instance:resize
permission.

11222-11232: Consider tightening instance_type schema metadata.

Adding a short description and basic constraints (for example minLength) would make generated docs/SDK validation clearer.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@docs/swagger/swagger.json` around lines 11222 - 11232, The schema for
httphandlers.ResizeInstanceRequest exposes instance_type as an unconstrained
string; add metadata to tighten it by adding a "description" (e.g., "Target
instance type identifier") and basic validation constraints such as "minLength":
1 and an optional "pattern" or "enum" if known; update the
httphandlers.ResizeInstanceRequest -> properties.instance_type entry to include
these fields so generated docs and SDKs surface clear validation and intent.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/adr/ADR-025-instance-resize.md`:
- Around line 42-50: ADR-025 conflicts with current code: after a successful
resize the code in internal/core/services/instance.go collects quotaErrs from
IncrementUsage/DecrementUsage and returns an internal error, whereas the ADR
says failures should only be logged. Either update the ADR text to state that
quota update failures cause the service to return an internal error (reference
quotaErrs, IncrementUsage, DecrementUsage) or change the implementation in
internal/core/services/instance.go to swallow quotaErrs (log them) and not
return an error after a successful resize so behavior matches the ADR. Make the
change consistently and update the ADR or code comments to reference the final
decision.

In `@docs/swagger/swagger.yaml`:
- Around line 5010-5050: The OpenAPI path /instances/{id}/resize is missing
documented 401/403 responses for auth failures; update the operation to include
"401" and "403" responses (both using the existing schema reference
'#/definitions/httputil.Response') so clients can handle Unauthorized/Forbidden
cases and align with RBAC-protected endpoints.

In `@internal/core/services/instance_unit_test.go`:
- Around line 1738-1787: The test reveals that ResizeInstance/completeResize
increments vCPU usage (tenantSvc.IncrementUsage) before checking memory quota
(tenantSvc.CheckQuota) and on memory quota failure does not call
tenantSvc.DecrementUsage to roll back the prior vCPU increment; update the
implementation in ResizeInstance/completeResize so that after a failed
tenantSvc.CheckQuota( ..., "memory", deltaMemMB) it calls
tenantSvc.DecrementUsage(ctx, tenantID, "vcpus", deltaVCPUs) (or a generic
rollback helper) to undo the earlier IncrementUsage, and add corresponding error
handling/logging to ensure the decrement error is surfaced or logged while
returning the original quota error.
- Around line 1689-1736: The test reveals compute resize is performed in
executeResize before completeResize calls tenantSvc.CheckQuota, leaving no
rollback on quota failure; fix by either moving the quota validation to run
before executeResize or (preferred) adding a compute rollback in completeResize:
if tenantSvc.CheckQuota(...) returns an error, call compute.ResizeInstance(ctx,
oldContainerID, oldVcpusBytes, oldMemoryBytes) to restore previous size
(matching the existing repo.Update rollback pattern), log the rollback and then
return the quota error so state remains consistent; ensure you reference
executeResize, completeResize, tenantSvc.CheckQuota, compute.ResizeInstance and
repo.Update when making the change.

In `@internal/core/services/instance.go`:
- Around line 756-760: The resize flow currently calls s.executeResize(...)
before quota validation in s.completeResize(...), risking a mutated backend if
quota fails; move or replicate the quota check to run before mutating the
backend: call the same quota validation (the CheckQuota logic used inside
completeResize for upsizes) with tenantID and newInstanceType/newIT prior to
calling s.executeResize(ctx, target, newIT), and only proceed to s.executeResize
when that pre-check succeeds (or perform a DB quota reservation prior to
executeResize and release on failure) so the backend is never mutated when quota
would ultimately reject the operation.
- Around line 891-896: The code currently logs each quota update error
(s.logger.Error with "quota update failed after resize") but then returns an
Internal error when quotaErrs > 0, which incorrectly signals the resize failed;
change this to treat quota updates as best-effort: remove the errors.Wrap return
and instead log a single aggregated warning including tenantID and quotaErrs,
optionally emit an audit/metric indicating quota drift, and return nil so the
successful resize (and its success event/audit path) proceeds; keep the existing
per-error logging but do not abort after s.repo.Update has already committed.
- Around line 732-735: The current code collapses all errors from
resolveInstance into a NotFound error; change the logic in the caller of
resolveInstance so that if err != nil you return the original err (or wrap it
preserving its type/message) and only return errors.New(errors.NotFound,
"instance not found") when err == nil and inst == nil. Reference resolveInstance
and the surrounding function handling the inst, err tuple and ensure you do not
overwrite repository/internal errors with a 404.
- Around line 817-879: The memory delta is in MB but tenant quotas are in GB;
convert deltaMemMB to GB before calling tenantSvc methods and when rolling back.
Compute a deltaMemGB (e.g. deltaMemGB := deltaMemMB / 1024, use appropriate
rounding if needed) and replace uses of deltaMemMB with deltaMemGB in calls to
s.tenantSvc.CheckQuota, s.tenantSvc.IncrementUsage, s.tenantSvc.DecrementUsage
and in corresponding rollback/rollbackErr messages (references: deltaMemMB,
memIncremented, the memory quota branch and the rollback block where memory
decrement/increment and error messages are constructed). Ensure both positive
and negative memory flows use the converted GB value.

In `@internal/platform/resilient_compute.go`:
- Around line 140-162: The ResizeInstance, CreateSnapshot, RestoreSnapshot, and
DeleteSnapshot wrappers in ResilientCompute use r.opts.CallTimeout (short 2m)
which can prematurely time out long Libvirt cold-resize flows; update each to
call r.callProtected with the long timeout bucket (e.g. r.opts.LongCallTimeout
or the configured long-duration field on opts) instead of r.opts.CallTimeout so
these operations use the extended timeout; locate the methods ResizeInstance,
CreateSnapshot, RestoreSnapshot, DeleteSnapshot and replace the timeout argument
passed into callProtected accordingly.

In `@internal/repositories/libvirt/adapter_unit_test.go`:
- Around line 882-925: The test currently creates snapshotPath but never wires
it into RestoreSnapshot because execCommand always returns a no-op "true" and
the test asserts an error; change the execCommand stub in the LibvirtAdapter
used by this test so that when called with name == "tar" and arg[1] == "xzf" it
actually executes the real tar binary against the created snapshotPath (e.g.
return exec.Command("tar", "xzf", snapshotPath, "-C", tmpDir)) so the archive
you built is actually extracted, then change the final assertion from
require.Error(t, err) to require.NoError(t, err) to verify the success path of
RestoreSnapshot; update references to tmpDir, snapshotPath, execCommand and
RestoreSnapshot in this test accordingly.

In `@internal/repositories/libvirt/adapter.go`:
- Around line 29-33: The three compiled regex globals (memoryRe, currentMemRe,
vcpuRe) should be moved off package scope into fields on LibvirtAdapter (e.g.,
MemoryRe, CurrentMemRe, VcpuRe) and initialized inside NewLibvirtAdapter so they
are still compiled once per adapter instance; update any code that references
memoryRe/currentMemRe/vcpuRe to use the new LibvirtAdapter fields and remove the
package-level vars entirely to comply with the “no globals” rule.
- Around line 205-235: The rollbackFromSnapshot helper is being reused for
distinct failure phases but each phase needs different recovery steps; update
the resize flow around applyDomainResize, DomainUndefine, DomainDefineXML, and
DomainCreate to use phase-specific rollback functions (e.g.,
rollbackBeforeUndefine, rollbackAfterUndefine, rollbackAfterDefine) or extend
rollbackFromSnapshot with a phase enum so you can: (1) before undefine — avoid
redefining when original domain still exists and instead ensure volumes are
untouched; (2) after undefine — recreate/define the original domain by using
lookup by UUID/name and recreate volumes if needed (handle DomainLookupByName
failure), and (3) after new XML is defined — if later steps fail, first
undefine/remove the partially defined new domain before restoring the original
XML; in all rollbacks ensure the VM is restarted when wasRunning was true and
surface distinct errors combining original error and rollback error. Reference
applyDomainResize, rollbackFromSnapshot, DomainUndefine, DomainDefineXML,
DomainCreate, and the new phase-specific rollback helpers when making changes.
- Around line 253-260: The ReplaceAllString calls for memoryRe, currentMemRe,
and vcpuRe are replacing entire elements (dropping attributes like placement,
current, cpuset, unit); update the replacements to preserve matched attributes
and only change the element text node — for example, use capture groups in the
regexes (or ReplaceAllStringFunc) to retain the opening tag/attributes and
closing tag and substitute just the numeric content so memoryRe, currentMemRe
and vcpuRe rewrite only the inner value (still using memoryKiB and vcpus) rather
than overwriting the whole element.

In `@internal/repositories/noop/adapters.go`:
- Around line 404-410: The NoopTaskQueue implementation removed the
durable-queue methods and therefore no longer satisfies ports.DurableTaskQueue;
restore the missing methods (EnsureGroup, Receive, Ack, Nack, ReclaimStale) on
the NoopTaskQueue type so it can be injected where durable behavior is expected
(e.g., where provision_worker.go calls those methods). Implement these methods
as no-ops or simple stubs consistent with existing Enqueue/Dequeue behavior
(return appropriate zero values and nil errors) and ensure method signatures
exactly match those declared in ports.DurableTaskQueue so the type again
satisfies the interface.

In `@tests/compute_e2e_test.go`:
- Around line 257-316: TestResizeInstanceDownsize currently launches the
instance with "instance_type": "basic-2" and then resizes to the same "basic-2",
so it never verifies an actual downsize; update the LaunchInstance payload in
TestResizeInstanceDownsize to use a larger type (e.g., "standard-1") and keep
the Resize payload targeting "basic-2", so the test actually exercises a
downsize flow for the instance created by instanceID and the existing
waitForInstanceStatus/Resize/Terminate steps.

---

Nitpick comments:
In `@docs/swagger/swagger.json`:
- Around line 4005-4030: Add a 403 response entry to the responses object for
the instance resize endpoint in docs/swagger/swagger.json to document
RBAC-denied requests: update the "responses" block (the same block that
currently lists "200","400","404","500") to include a "403" key with
"description": "Forbidden" and the same schema reference ("$ref":
"#/definitions/httputil.Response") so the API contract reflects authorization
failures introduced by the new instance:resize permission.
- Around line 11222-11232: The schema for httphandlers.ResizeInstanceRequest
exposes instance_type as an unconstrained string; add metadata to tighten it by
adding a "description" (e.g., "Target instance type identifier") and basic
validation constraints such as "minLength": 1 and an optional "pattern" or
"enum" if known; update the httphandlers.ResizeInstanceRequest ->
properties.instance_type entry to include these fields so generated docs and
SDKs surface clear validation and intent.

In `@docs/swagger/swagger.yaml`:
- Around line 2334-2340: The OpenAPI schema for
httphandlers.ResizeInstanceRequest currently requires instance_type but allows
an empty string; update the instance_type property in the
httphandlers.ResizeInstanceRequest schema to include "minLength: 1" so empty
strings are rejected (i.e., add minLength: 1 under the instance_type property to
tighten validation).

In `@internal/handlers/instance_handler_test.go`:
- Around line 578-594: The test "QuotaExceeded" in
internal/handlers/instance_handler_test.go calls the service mock for
ResizeInstance and returns errors.Forbidden but the test name and E2E
expectations expect a quota-exceeded error (errors.QuotaExceeded -> HTTP 429);
update the mocked return in the ResizeInstance expectation to return
errors.QuotaExceeded (preserving the message) so the handler test, the
ResizeInstance route, and the E2E TestResizeInstance are consistent with a 429
outcome.

In `@tests/compute_e2e_test.go`:
- Around line 199-227: The inline assertion in the t.Run("Resize") test is wrong
because the resize handler returns gin.H{"message":"instance resized"} (no
instance_type), so when resp.StatusCode == http.StatusOK stop decoding into
res.Data.InstanceType and instead call the existing VerifyResize helper to fetch
the instance via GET and assert the instance type, or simply remove the
decode/assert block; update the success branch to invoke
VerifyResize(testutil.TestBaseURL, testutil.TestRouteInstances, instanceID,
token, "standard-1") (or equivalent) so the test verifies the actual instance
type via GET rather than parsing the resize response body.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 97154a85-0867-404e-b52b-ee76f338259e

📥 Commits

Reviewing files that changed from the base of the PR and between deb6ff0 and e455525.

📒 Files selected for processing (34)

docs/FEATURES.md
docs/adr/ADR-025-instance-resize.md
docs/api-reference.md
docs/swagger/docs.go
docs/swagger/swagger.json
docs/swagger/swagger.yaml
internal/api/setup/router.go
internal/core/domain/rbac.go
internal/core/ports/compute.go
internal/core/ports/instance.go
internal/core/services/instance.go
internal/core/services/instance_unit_test.go
internal/core/services/mock_compute_test.go
internal/handlers/instance_handler.go
internal/handlers/instance_handler_test.go
internal/platform/resilient_compute.go
internal/platform/resilient_compute_test.go
internal/repositories/docker/adapter.go
internal/repositories/docker/fakes_test.go
internal/repositories/firecracker/adapter.go
internal/repositories/firecracker/adapter_noop.go
internal/repositories/k8s/kubeadm_provisioner_test.go
internal/repositories/k8s/mocks_test.go
internal/repositories/libvirt/adapter.go
internal/repositories/libvirt/adapter_unit_test.go
internal/repositories/libvirt/lb_proxy_test.go
internal/repositories/noop/adapters.go
internal/repositories/postgres/migrations/107_seed_instance_resize_permission.down.sql
internal/repositories/postgres/migrations/107_seed_instance_resize_permission.up.sql
internal/workers/database_failover_worker_test.go
internal/workers/healing_worker_test.go
internal/workers/pipeline_worker_test.go
pkg/httputil/response.go
tests/compute_e2e_test.go

coderabbitai · 2026-04-28T10:50:15Z

+### Quota Enforcement
+
+The service calculates a **delta** between old and new instance types:
+
+- If `deltaCPU > 0` or `deltaMem > 0` (upsize): calls `tenantSvc.CheckQuota` before proceeding
+- If downsize (`delta < 0`): quota check is skipped (releasing resources back to the pool)
+- If same size (`delta == 0`): no quota interaction
+
+After a successful resize, usage counters are updated with the delta (`IncrementUsage` for upsize, `DecrementUsage` for downsize). Failures in usage updates are logged but not propagated — a future background reconciliation worker could correct drift.


⚠️ Potential issue | 🟡 Minor

ADR and implementation disagree on quota-update failures.

This section says quota usage update failures are only logged, but internal/core/services/instance.go currently returns an internal error after a successful resize when quotaErrs is non-empty. The ADR should match the actual response behavior, or the service should be updated to match the ADR.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@docs/adr/ADR-025-instance-resize.md` around lines 42 - 50, ADR-025 conflicts with current code: after a successful resize the code in internal/core/services/instance.go collects quotaErrs from IncrementUsage/DecrementUsage and returns an internal error, whereas the ADR says failures should only be logged. Either update the ADR text to state that quota update failures cause the service to return an internal error (reference quotaErrs, IncrementUsage, DecrementUsage) or change the implementation in internal/core/services/instance.go to swallow quotaErrs (log them) and not return an error after a successful resize so behavior matches the ADR. Make the change consistently and update the ADR or code comments to reference the final decision.

coderabbitai · 2026-04-28T10:50:15Z

+  /instances/{id}/resize:
+    post:
+      consumes:
+      - application/json
+      description: Change the instance type (CPU/memory) of an existing instance
+      parameters:
+      - description: Instance ID
+        in: path
+        name: id
+        required: true
+        type: string
+      - description: Resize request
+        in: body
+        name: request
+        required: true
+        schema:
+          $ref: '#/definitions/httphandlers.ResizeInstanceRequest'
+      produces:
+      - application/json
+      responses:
+        "200":
+          description: OK
+          schema:
+            $ref: '#/definitions/httputil.Response'
+        "400":
+          description: Bad Request
+          schema:
+            $ref: '#/definitions/httputil.Response'
+        "404":
+          description: Not Found
+          schema:
+            $ref: '#/definitions/httputil.Response'
+        "500":
+          description: Internal Server Error
+          schema:
+            $ref: '#/definitions/httputil.Response'
+      security:
+      - APIKeyAuth: []
+      summary: Resize an instance
+      tags:
+      - instances


⚠️ Potential issue | 🟡 Minor

Document auth failures for the resize endpoint.

This operation is secured, but 401/403 are not documented. Adding them improves client behavior and aligns with RBAC-protected endpoints.

Suggested diff

responses: "200": description: OK schema: $ref: '#/definitions/httputil.Response' + "401": + description: Unauthorized + schema: + $ref: '#/definitions/httputil.Response' + "403": + description: Forbidden + schema: + $ref: '#/definitions/httputil.Response' "400": description: Bad Request schema: $ref: '#/definitions/httputil.Response'

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@docs/swagger/swagger.yaml` around lines 5010 - 5050, The OpenAPI path /instances/{id}/resize is missing documented 401/403 responses for auth failures; update the operation to include "401" and "403" responses (both using the existing schema reference '#/definitions/httputil.Response') so clients can handle Unauthorized/Forbidden cases and align with RBAC-protected endpoints.

coderabbitai · 2026-04-28T10:50:15Z

+	t.Run("QuotaExceeded_Memory", func(t *testing.T) {
+		repo := new(MockInstanceRepo)
+		typeRepo := new(MockInstanceTypeRepo)
+		rbacSvc := new(MockRBACService)
+		tenantSvc := new(MockTenantService)
+		compute := new(MockComputeBackend)
+
+		svc := services.NewInstanceService(services.InstanceServiceParams{
+			Repo:             repo,
+			InstanceTypeRepo: typeRepo,
+			RBAC:             rbacSvc,
+			TenantSvc:        tenantSvc,
+			Compute:          compute,
+			Logger:           slog.Default(),
+		})
+
+		ctx := context.Background()
+		userID := uuid.New()
+		tenantID := uuid.New()
+		instanceID := uuid.New()
+		ctx = appcontext.WithUserID(ctx, userID)
+		ctx = appcontext.WithTenantID(ctx, tenantID)
+
+		inst := &domain.Instance{
+			ID:           instanceID,
+			UserID:       userID,
+			TenantID:     tenantID,
+			Status:       domain.StatusRunning,
+			InstanceType: "basic-2",
+			ContainerID:  "cid-1",
+		}
+
+		oldType := &domain.InstanceType{ID: "basic-2", VCPUs: 2, MemoryMB: 2048}
+		newType := &domain.InstanceType{ID: "basic-4", VCPUs: 4, MemoryMB: 4096}
+
+		rbacSvc.On("Authorize", mock.Anything, userID, tenantID, domain.PermissionInstanceResize, "test-inst").Return(nil).Once()
+		repo.On("GetByName", mock.Anything, "test-inst").Return(inst, nil).Once()
+		typeRepo.On("GetByID", mock.Anything, "basic-2").Return(oldType, nil).Once()
+		typeRepo.On("GetByID", mock.Anything, "basic-4").Return(newType, nil).Once()
+		compute.On("ResizeInstance", mock.Anything, "cid-1", int64(4e9), int64(4096*1024*1024)).Return(nil).Once()
+		tenantSvc.On("CheckQuota", mock.Anything, tenantID, "vcpus", 2).Return(nil).Once()
+		tenantSvc.On("IncrementUsage", mock.Anything, tenantID, "vcpus", 2).Return(nil).Once()
+		tenantSvc.On("CheckQuota", mock.Anything, tenantID, "memory", 2048).Return(fmt.Errorf("insufficient memory quota")).Once()
+
+		err := svc.ResizeInstance(ctx, "test-inst", "basic-4")
+
+		require.Error(t, err)
+		assert.Contains(t, err.Error(), "insufficient memory quota")
+		mock.AssertExpectationsForObjects(t, repo, typeRepo, rbacSvc, tenantSvc, compute)
+	})


⚠️ Potential issue | 🟡 Minor

Test exposes missing vCPU rollback when memory quota check fails.

In this test, vCPU IncrementUsage succeeds (line 1779), then memory CheckQuota fails (line 1780). There's no DecrementUsage mock for vcpus, indicating the implementation doesn't roll back the vCPU increment when memory quota check fails.

Looking at context snippet 1 (completeResize), the rollback logic only handles IncrementUsage failures, not CheckQuota failures:

if err := s.tenantSvc.CheckQuota(ctx, tenantID, "memory", deltaMemMB); err != nil { return err // No rollback of vCPU increment }

The test correctly reflects implementation behavior, but this may be a quota accounting leak.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@internal/core/services/instance_unit_test.go` around lines 1738 - 1787, The test reveals that ResizeInstance/completeResize increments vCPU usage (tenantSvc.IncrementUsage) before checking memory quota (tenantSvc.CheckQuota) and on memory quota failure does not call tenantSvc.DecrementUsage to roll back the prior vCPU increment; update the implementation in ResizeInstance/completeResize so that after a failed tenantSvc.CheckQuota( ..., "memory", deltaMemMB) it calls tenantSvc.DecrementUsage(ctx, tenantID, "vcpus", deltaVCPUs) (or a generic rollback helper) to undo the earlier IncrementUsage, and add corresponding error handling/logging to ensure the decrement error is surfaced or logged while returning the original quota error.

coderabbitai · 2026-04-28T10:50:15Z

+	inst, err := s.resolveInstance(ctx, idOrName)
+	if err != nil || inst == nil {
+		return errors.New(errors.NotFound, "instance not found")
+	}


⚠️ Potential issue | 🟠 Major

Don't collapse repository failures into NotFound.

This rewrites every resolveInstance failure as instance not found. If the repo returns an internal/storage error, the service will now hide it as a 404 instead of surfacing the real failure.

Suggested fix

inst, err := s.resolveInstance(ctx, idOrName) - if err != nil || inst == nil { + if err != nil { + return err + } + if inst == nil { return errors.New(errors.NotFound, "instance not found") }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

inst, err := s.resolveInstance(ctx, idOrName)

if err != nil || inst == nil {

return errors.New(errors.NotFound, "instance not found")

}

inst, err := s.resolveInstance(ctx, idOrName)

if err != nil {

return err

}

if inst == nil {

return errors.New(errors.NotFound, "instance not found")

}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@internal/core/services/instance.go` around lines 732 - 735, The current code collapses all errors from resolveInstance into a NotFound error; change the logic in the caller of resolveInstance so that if err != nil you return the original err (or wrap it preserving its type/message) and only return errors.New(errors.NotFound, "instance not found") when err == nil and inst == nil. Reference resolveInstance and the surrounding function handling the inst, err tuple and ensure you do not overwrite repository/internal errors with a 404.

coderabbitai · 2026-04-28T10:50:15Z

+	// Replace <memory unit="KiB">...</memory> or <memory>...</memory>
+	result = memoryRe.ReplaceAllString(result, fmt.Sprintf(`<memory unit="KiB">%d</memory>`, memoryKiB))
+
+	// Replace <currentMemory unit="KiB">...</currentMemory> or <currentMemory>...</currentMemory>
+	result = currentMemRe.ReplaceAllString(result, fmt.Sprintf(`<currentMemory unit="KiB">%d</currentMemory>`, memoryKiB))
+
+	// Replace <vcpu placement="static">...</vcpu> or <vcpu>...</vcpu>
+	result = vcpuRe.ReplaceAllString(result, fmt.Sprintf(`<vcpu>%d</vcpu>`, vcpus))


⚠️ Potential issue | 🟠 Major

These replacements strip existing libvirt attributes.

Each ReplaceAllString rewrites the whole node, so attributes like placement, current, cpuset, or an existing unit are discarded. That can change semantics or make a previously valid domain XML fail to redefine. Update only the text node or preserve the matched attributes when rewriting these elements.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@internal/repositories/libvirt/adapter.go` around lines 253 - 260, The ReplaceAllString calls for memoryRe, currentMemRe, and vcpuRe are replacing entire elements (dropping attributes like placement, current, cpuset, unit); update the replacements to preserve matched attributes and only change the element text node — for example, use capture groups in the regexes (or ReplaceAllStringFunc) to retain the opening tag/attributes and closing tag and substitute just the numeric content so memoryRe, currentMemRe and vcpuRe rewrite only the inner value (still using memoryKiB and vcpus) rather than overwriting the whole element.

coderabbitai · 2026-04-28T10:50:15Z

+// NoopTaskQueue is a no-op task queue.
 type NoopTaskQueue struct{}

 func (q *NoopTaskQueue) Enqueue(ctx context.Context, queue string, payload interface{}) error {
 	return nil
 }
 func (q *NoopTaskQueue) Dequeue(ctx context.Context, queue string) (string, error) { return "", nil }


⚠️ Potential issue | 🟠 Major

NoopTaskQueue no longer satisfies the durable queue contract.

ports.DurableTaskQueue in internal/core/ports/task_queue.go:33-61 still requires EnsureGroup, Receive, Ack, Nack, and ReclaimStale, and internal/workers/provision_worker.go:65-245 still calls those methods on the injected queue. Dropping them here means this noop can no longer stand in for the durable queue in tests or system wiring.

Proposed fix

type NoopTaskQueue struct{} func (q *NoopTaskQueue) Enqueue(ctx context.Context, queue string, payload interface{}) error { return nil } func (q *NoopTaskQueue) Dequeue(ctx context.Context, queue string) (string, error) { return "", nil } +func (q *NoopTaskQueue) EnsureGroup(ctx context.Context, queueName, groupName string) error { + return nil +} +func (q *NoopTaskQueue) Receive(ctx context.Context, queueName, groupName, consumerName string) (*ports.DurableMessage, error) { + return nil, nil +} +func (q *NoopTaskQueue) Ack(ctx context.Context, queueName, groupName, messageID string) error { + return nil +} +func (q *NoopTaskQueue) Nack(ctx context.Context, queueName, groupName, messageID string) error { + return nil +} +func (q *NoopTaskQueue) ReclaimStale(ctx context.Context, queueName, groupName, consumerName string, minIdleMs int64, count int64) ([]ports.DurableMessage, error) { + return nil, nil +}

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

// NoopTaskQueue is a no-op task queue.

type NoopTaskQueue struct{}

func (q *NoopTaskQueue) Enqueue(ctx context.Context, queue string, payload interface{}) error {

return nil

}

func (q *NoopTaskQueue) Dequeue(ctx context.Context, queue string) (string, error) { return "", nil }

// NoopTaskQueue is a no-op task queue.

type NoopTaskQueue struct{}

func (q *NoopTaskQueue) Enqueue(ctx context.Context, queue string, payload interface{}) error {

return nil

}

func (q *NoopTaskQueue) Dequeue(ctx context.Context, queue string) (string, error) { return "", nil }

func (q *NoopTaskQueue) EnsureGroup(ctx context.Context, queueName, groupName string) error {

return nil

}

func (q *NoopTaskQueue) Receive(ctx context.Context, queueName, groupName, consumerName string) (*ports.DurableMessage, error) {

return nil, nil

}

func (q *NoopTaskQueue) Ack(ctx context.Context, queueName, groupName, messageID string) error {

return nil

}

func (q *NoopTaskQueue) Nack(ctx context.Context, queueName, groupName, messageID string) error {

return nil

}

func (q *NoopTaskQueue) ReclaimStale(ctx context.Context, queueName, groupName, consumerName string, minIdleMs int64, count int64) ([]ports.DurableMessage, error) {

return nil, nil

}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@internal/repositories/noop/adapters.go` around lines 404 - 410, The NoopTaskQueue implementation removed the durable-queue methods and therefore no longer satisfies ports.DurableTaskQueue; restore the missing methods (EnsureGroup, Receive, Ack, Nack, ReclaimStale) on the NoopTaskQueue type so it can be injected where durable behavior is expected (e.g., where provision_worker.go calls those methods). Implement these methods as no-ops or simple stubs consistent with existing Enqueue/Dequeue behavior (return appropriate zero values and nil errors) and ensure method signatures exactly match those declared in ports.DurableTaskQueue so the type again satisfies the interface.

coderabbitai · 2026-04-28T10:50:15Z

+func TestResizeInstanceDownsize(t *testing.T) {
+	if err := waitForServer(); err != nil {
+		t.Fatalf("Failing Resize Downsize E2E test: %v", err)
+	}
+
+	client := &http.Client{Timeout: 60 * time.Second}
+	token := registerAndLogin(t, client, "resize-down-tester@thecloud.local", "Resize Down Tester")
+
+	var instanceID string
+	instanceName := fmt.Sprintf("e2e-resize-down-%d-%s", time.Now().UnixNano()%1000, uuid.New().String())
+
+	// 1. Launch Instance with basic-2 type
+	t.Run("LaunchInstance", func(t *testing.T) {
+		payload := map[string]string{
+			"name":         instanceName,
+			"image":        "nginx:alpine",
+			"instance_type": "basic-2",
+			"ports":        "0:80",
+		}
+		resp := postRequest(t, client, testutil.TestBaseURL+testutil.TestRouteInstances, token, payload)
+		defer func() { _ = resp.Body.Close() }()
+
+		require.Equal(t, http.StatusAccepted, resp.StatusCode)
+
+		var res struct {
+			Data struct {
+				ID string `json:"id"`
+			} `json:"data"`
+		}
+		require.NoError(t, json.NewDecoder(resp.Body).Decode(&res))
+		instanceID = res.Data.ID
+	})
+
+	// 2. Wait for Running
+	t.Run("WaitForRunning", func(t *testing.T) {
+		lastStatus := waitForInstanceStatus(t, client, token, instanceID)
+		if lastStatus != domain.StatusRunning {
+			t.Skipf("Instance did not reach running state within timeout. Last status: %s", lastStatus)
+		}
+	})
+
+	// 3. Downsize to basic-2
+	t.Run("Resize", func(t *testing.T) {
+		payload := map[string]string{
+			"instance_type": "basic-2",
+		}
+		resp := postRequest(t, client, fmt.Sprintf("%s%s/%s/resize", testutil.TestBaseURL, testutil.TestRouteInstances, instanceID), token, payload)
+		defer func() { _ = resp.Body.Close() }()
+
+		assert.Equal(t, http.StatusOK, resp.StatusCode)
+	})
+
+	// 4. Terminate
+	t.Run("TerminateInstance", func(t *testing.T) {
+		resp := deleteRequest(t, client, fmt.Sprintf(testutil.TestRouteFormat, testutil.TestBaseURL, testutil.TestRouteInstances, instanceID), token)
+		defer func() { _ = resp.Body.Close() }()
+
+		assert.Equal(t, http.StatusOK, resp.StatusCode)
+	})
+}


⚠️ Potential issue | 🟡 Minor

Test does not actually test downsize—instance type unchanged.

The test is named TestResizeInstanceDownsize but:

Instance is launched with "instance_type": "basic-2" (line 273)

Resize requests "instance_type": "basic-2" (line 301)

This tests a same-size no-op, not a downsize. For a proper downsize test, launch with a larger type (e.g., standard-1) and resize to basic-2.

Proposed fix for actual downsize test

func TestResizeInstanceDownsize(t *testing.T) { // ... - // 1. Launch Instance with basic-2 type + // 1. Launch Instance with standard-1 type (larger) t.Run("LaunchInstance", func(t *testing.T) { payload := map[string]string{ "name": instanceName, "image": "nginx:alpine", - "instance_type": "basic-2", + "instance_type": "standard-1", "ports": "0:80", } // ... }) // ... - // 3. Downsize to basic-2 + // 3. Downsize from standard-1 to basic-2 t.Run("Resize", func(t *testing.T) {

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

func TestResizeInstanceDownsize(t *testing.T) {

if err := waitForServer(); err != nil {

t.Fatalf("Failing Resize Downsize E2E test: %v", err)

}

client := &http.Client{Timeout: 60 * time.Second}

token := registerAndLogin(t, client, "resize-down-tester@thecloud.local", "Resize Down Tester")

var instanceID string

instanceName := fmt.Sprintf("e2e-resize-down-%d-%s", time.Now().UnixNano()%1000, uuid.New().String())

// 1. Launch Instance with basic-2 type

t.Run("LaunchInstance", func(t *testing.T) {

payload := map[string]string{

"name": instanceName,

"image": "nginx:alpine",

"instance_type": "basic-2",

"ports": "0:80",

}

resp := postRequest(t, client, testutil.TestBaseURL+testutil.TestRouteInstances, token, payload)

defer func() { _ = resp.Body.Close() }()

require.Equal(t, http.StatusAccepted, resp.StatusCode)

var res struct {

Data struct {

ID string `json:"id"`

} `json:"data"`

}

require.NoError(t, json.NewDecoder(resp.Body).Decode(&res))

instanceID = res.Data.ID

})

// 2. Wait for Running

t.Run("WaitForRunning", func(t *testing.T) {

lastStatus := waitForInstanceStatus(t, client, token, instanceID)

if lastStatus != domain.StatusRunning {

t.Skipf("Instance did not reach running state within timeout. Last status: %s", lastStatus)

}

})

// 3. Downsize to basic-2

t.Run("Resize", func(t *testing.T) {

payload := map[string]string{

"instance_type": "basic-2",

}

resp := postRequest(t, client, fmt.Sprintf("%s%s/%s/resize", testutil.TestBaseURL, testutil.TestRouteInstances, instanceID), token, payload)

defer func() { _ = resp.Body.Close() }()

assert.Equal(t, http.StatusOK, resp.StatusCode)

})

// 4. Terminate

t.Run("TerminateInstance", func(t *testing.T) {

resp := deleteRequest(t, client, fmt.Sprintf(testutil.TestRouteFormat, testutil.TestBaseURL, testutil.TestRouteInstances, instanceID), token)

defer func() { _ = resp.Body.Close() }()

assert.Equal(t, http.StatusOK, resp.StatusCode)

})

}

func TestResizeInstanceDownsize(t *testing.T) {

if err := waitForServer(); err != nil {

t.Fatalf("Failing Resize Downsize E2E test: %v", err)

}

client := &http.Client{Timeout: 60 * time.Second}

token := registerAndLogin(t, client, "resize-down-tester@thecloud.local", "Resize Down Tester")

var instanceID string

instanceName := fmt.Sprintf("e2e-resize-down-%d-%s", time.Now().UnixNano()%1000, uuid.New().String())

// 1. Launch Instance with standard-1 type (larger)

t.Run("LaunchInstance", func(t *testing.T) {

payload := map[string]string{

"name": instanceName,

"image": "nginx:alpine",

"instance_type": "standard-1",

"ports": "0:80",

}

resp := postRequest(t, client, testutil.TestBaseURL+testutil.TestRouteInstances, token, payload)

defer func() { _ = resp.Body.Close() }()

require.Equal(t, http.StatusAccepted, resp.StatusCode)

var res struct {

Data struct {

ID string `json:"id"`

} `json:"data"`

}

require.NoError(t, json.NewDecoder(resp.Body).Decode(&res))

instanceID = res.Data.ID

})

// 2. Wait for Running

t.Run("WaitForRunning", func(t *testing.T) {

lastStatus := waitForInstanceStatus(t, client, token, instanceID)

if lastStatus != domain.StatusRunning {

t.Skipf("Instance did not reach running state within timeout. Last status: %s", lastStatus)

}

})

// 3. Downsize from standard-1 to basic-2

t.Run("Resize", func(t *testing.T) {

payload := map[string]string{

"instance_type": "basic-2",

}

resp := postRequest(t, client, fmt.Sprintf("%s%s/%s/resize", testutil.TestBaseURL, testutil.TestRouteInstances, instanceID), token, payload)

defer func() { _ = resp.Body.Close() }()

assert.Equal(t, http.StatusOK, resp.StatusCode)

})

// 4. Terminate

t.Run("TerminateInstance", func(t *testing.T) {

resp := deleteRequest(t, client, fmt.Sprintf(testutil.TestRouteFormat, testutil.TestBaseURL, testutil.TestRouteInstances, instanceID), token)

defer func() { _ = resp.Body.Close() }()

assert.Equal(t, http.StatusOK, resp.StatusCode)

})

}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/compute_e2e_test.go` around lines 257 - 316, TestResizeInstanceDownsize currently launches the instance with "instance_type": "basic-2" and then resizes to the same "basic-2", so it never verifies an actual downsize; update the LaunchInstance payload in TestResizeInstanceDownsize to use a larger type (e.g., "standard-1") and keep the Resize payload targeting "basic-2", so the test actually exercises a downsize flow for the instance created by instanceID and the existing waitForInstanceStatus/Resize/Terminate steps.

Add CreateSnapshot, RestoreSnapshot, and DeleteSnapshot methods to ComputeBackend interface. Libvirt adapter implements these via QCOW2 volume snapshots. Docker/Firecracker backends implement as no-ops since hot resize doesn't need snapshots. ResizeInstance in Libvirt adapter now: 1. Creates pre-resize snapshot before making changes 2. On failure: rolls back from snapshot and redefines domain 3. On success: deletes snapshot to avoid accumulation Also update ResilientCompute wrapper, all mock backends in tests, and fix nil pointer dereference bug in rollback path.

- rollbackFromSnapshot now calls DomainDefineXML after RestoreSnapshot to properly redefine the domain with original XML after DomainUndefine - Use UUID-based snapshot names to avoid second-precision collisions - Add unit tests for CreateSnapshot, RestoreSnapshot, DeleteSnapshot - Add unit tests for ResizeInstance rollback on DomainDefineXML and DomainUndefine failures - Add unit tests for rollbackFromSnapshot success and failure paths

…aky tests - Check rollbackFromSnapshot errors at all 3 call sites in ResizeInstance (was ignoring errors at applyDomainResize and DomainUndefine failure paths) - Skip RollbackFromSnapshot/success and RollbackFromSnapshot/redefine_fails_after_restore since they require real tar archive contents that the execCommand mock cannot provide (rollback path is covered by ResizeInstance_RollbackOnFailure tests)

…regexes - Line 232-233 (DomainCreate failure): check rollbackFromSnapshot error and include it in the returned error. This is the 4th rollback call site in ResizeInstance — all now checked consistently. - Pre-compile 3 regexes (memoryRe, currentMemRe, vcpuRe) at package level instead of re-compiling on every applyDomainResize call.

snapshotPath(id, name string) now returns the snapshot archive path, replacing 3 identical inline format strings in CreateSnapshot, RestoreSnapshot, and DeleteSnapshot.

Document the snapshot creation step (step 4) and the rollback mechanism that restores volume and redefines domain on failure. Also document edge cases: snapshot creation failure is non-fatal, and rollback itself can fail.

- TestLibvirtAdapter_ApplyDomainResize: 5 subtests covering regex replacement (both replaced, missing both elements, memory without unit attr, resize-to-same-values error, vcpu with placement attr) - TestLibvirtAdapter_ResizeInstance_RollbackOnFailure: add rollback_on_DomainCreate_failure subtest (was previously untested) Removed incorrect SnapshotCreationFails_ResizeProceeds test from instance_unit_test.go — InstanceService.ResizeInstance does not call CreateSnapshot (snapshot creation is adapter-internal).

…lint)

poyrazK

It's okay to merge

Copilot AI review requested due to automatic review settings April 25, 2026 18:32

Copilot started reviewing on behalf of poyrazK April 25, 2026 18:33 View session

Copilot AI reviewed Apr 25, 2026

View reviewed changes

Copilot AI review requested due to automatic review settings April 26, 2026 21:28

Copilot started reviewing on behalf of poyrazK April 26, 2026 21:28 View session

Copilot AI reviewed Apr 26, 2026

View reviewed changes

Copilot AI review requested due to automatic review settings April 28, 2026 10:40

coderabbitai Bot reviewed Apr 28, 2026

View reviewed changes

poyrazK added 10 commits April 28, 2026 15:17

refactor: extract snapshotPath helper to eliminate DRY violation

8f5c6b7

snapshotPath(id, name string) now returns the snapshot archive path, replacing 3 identical inline format strings in CreateSnapshot, RestoreSnapshot, and DeleteSnapshot.

docs: update ADR-025 with snapshot-before-resize safety net

b0b654b

Document the snapshot creation step (step 4) and the rollback mechanism that restores volume and redefines domain on failure. Also document edge cases: snapshot creation failure is non-fatal, and rollback itself can fail.

ci: retest after e455525

003bd9a

ci: trigger re-test on new commits

fe0c0d5

fix: add regex fields to LibvirtAdapter in tests for nil pointer safety

043a052

poyrazK force-pushed the feat/snapshot-before-resize branch from d4ee00d to 043a052 Compare April 28, 2026 15:06

poyrazK added 4 commits April 28, 2026 18:17

fix: use %v not %s for error values in rollback error messages (error…

97357b0

…lint)

fix: use errors.Join for multiple errors in rollback (errorlint)

d7119ea

fix: update DomainCreate failure test to use errors.Join format

0c8a95d

chore: remove dead TestLibvirtAdapter_RollbackFromSnapshot

e786c4e

poyrazK commented Apr 29, 2026

View reviewed changes

poyrazK merged commit 5e30fdc into main Apr 29, 2026
22 checks passed

coderabbitai Bot mentioned this pull request Apr 29, 2026

feat(instances): add ResizeInstance for CPU/memory resizing #183

Merged

5 tasks

Conversation

poyrazK commented Apr 25, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Related

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Apr 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rate limit exceeded

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 28, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 28, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot Apr 28, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 28, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot Apr 28, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 28, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 28, 2026

Choose a reason for hiding this comment

Uh oh!

poyrazK left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

poyrazK commented Apr 25, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Apr 25, 2026 •

edited

Loading