feat: snapshot-before-resize for Libvirt cold resize (#182)
Conversation
📝 Walkthrough

This PR implements instance resizing functionality.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Client
    participant Handler as HTTP Handler
    participant Service as InstanceService
    participant Repo as Repository
    participant Backend as ComputeBackend
    Client->>Handler: POST /instances/:id/resize<br/>{instance_type}
    activate Handler
    Handler->>Handler: Validate request
    Handler->>Service: ResizeInstance(ctx, idOrName, newType)
    deactivate Handler
    activate Service
    Service->>Service: Check RBAC permission
    Service->>Repo: Resolve instance by UUID/name
    activate Repo
    Repo-->>Service: Instance
    deactivate Repo
    Service->>Repo: Load old & new InstanceType
    activate Repo
    Repo-->>Service: InstanceType records
    deactivate Repo
    alt Type unchanged
        Service-->>Client: No-op (same type)
    else Type changed
        Service->>Backend: ResizeInstance(ctx, id, cpu, memory)
        activate Backend
        Backend-->>Service: error or success
        deactivate Backend
        alt Compute succeeded
            Service->>Repo: Check/update quota (delta)
            activate Repo
            Repo-->>Service: QuotaExceeded or success
            deactivate Repo
            alt Quota ok
                Service->>Repo: Persist new InstanceType
                activate Repo
                Repo-->>Service: error or success
                deactivate Repo
                alt Persistence failed
                    Service->>Backend: Rollback compute
                    activate Backend
                    Backend-->>Service: -
                    deactivate Backend
                    Service-->>Client: Error (500)
                else Persistence ok
                    Service->>Service: Record event & audit
                    Service-->>Client: Success (200)
                end
            else Quota exceeded
                Service->>Backend: Rollback compute
                activate Backend
                Backend-->>Service: -
                deactivate Backend
                Service-->>Client: Error (429)
            end
        else Compute failed
            Service-->>Client: Error (500)
        end
    end
    deactivate Service
```
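The branching in the diagram can be condensed into a small status-mapping sketch. This is a hypothetical simplification: the real service also performs the compute rollbacks shown in the diagram before returning an error, and records an event/audit entry on success.

```go
package main

import (
	"errors"
	"fmt"
)

// errDemo stands in for any step failure in the demonstration below.
var errDemo = errors.New("step failed")

// resizeOutcome condenses the branching in the sequence diagram into a pure
// function: given which step failed, return the HTTP status the client sees.
func resizeOutcome(sameType bool, computeErr, quotaErr, persistErr error) int {
	switch {
	case sameType:
		return 200 // no-op: type unchanged
	case computeErr != nil:
		return 500 // compute resize failed
	case quotaErr != nil:
		return 429 // quota exceeded after compute succeeded (compute is rolled back)
	case persistErr != nil:
		return 500 // persistence failed (compute is rolled back)
	default:
		return 200 // success: record event & audit
	}
}

func main() {
	fmt.Println(resizeOutcome(false, nil, errDemo, nil)) // quota-exceeded path → 429
}
```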
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 15
🧹 Nitpick comments (5)
tests/compute_e2e_test.go (1)

199-227: Consider verifying instance type via GET after successful resize.

The resize test handles both 200 (success) and 429 (quota exceeded), which is appropriate. However, when the response is 200, you're parsing `res.Data.InstanceType` from the resize response body. The resize handler returns `gin.H{"message": "instance resized"}` (per the handler implementation), which doesn't include `instance_type`. The assertion on line 219 would check against an empty string, not the actual updated type. You already have `VerifyResize` (lines 230-246) that correctly fetches via GET, so consider removing the inline assertion or updating the handler to return the instance data.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/compute_e2e_test.go` around lines 199-227: the inline assertion in the t.Run("Resize") test is wrong because the resize handler returns gin.H{"message": "instance resized"} (no instance_type), so when resp.StatusCode == http.StatusOK stop decoding into res.Data.InstanceType and instead call the existing VerifyResize helper to fetch the instance via GET and assert the instance type, or simply remove the decode/assert block; update the success branch to invoke VerifyResize(testutil.TestBaseURL, testutil.TestRouteInstances, instanceID, token, "standard-1") (or equivalent) so the test verifies the actual instance type via GET rather than parsing the resize response body.

internal/handlers/instance_handler_test.go (1)
578-594: Test name/assertion mismatch: uses `errors.Forbidden` but named "QuotaExceeded".

The test uses `errors.Forbidden`, which maps to HTTP 403, but the test is named "QuotaExceeded". Looking at `pkg/httputil/response.go`, `errors.QuotaExceeded` maps to 429 (TooManyRequests). The E2E test (`TestResizeInstance`) expects 429 for quota exceeded scenarios. For consistency:

Option 1: Fix error type to match test name

```diff
 t.Run("QuotaExceeded", func(t *testing.T) {
     // ...
-    mockSvc.On("ResizeInstance", mock.Anything, id.String(), "basic-4").Return(errors.New(errors.Forbidden, "insufficient quota")).Once()
+    mockSvc.On("ResizeInstance", mock.Anything, id.String(), "basic-4").Return(errors.New(errors.QuotaExceeded, "insufficient quota")).Once()
     // ...
-    assert.Equal(t, http.StatusForbidden, w.Code)
+    assert.Equal(t, http.StatusTooManyRequests, w.Code)
 })
```

Option 2: Rename test to match actual behavior

```diff
-t.Run("QuotaExceeded", func(t *testing.T) {
+t.Run("Forbidden", func(t *testing.T) {
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@internal/handlers/instance_handler_test.go` around lines 578-594: the test "QuotaExceeded" mocks ResizeInstance to return errors.Forbidden, but the test name and E2E expectations expect a quota-exceeded error (errors.QuotaExceeded -> HTTP 429); update the mocked return in the ResizeInstance expectation to return errors.QuotaExceeded (preserving the message) so the handler test, the ResizeInstance route, and the E2E TestResizeInstance are consistent with a 429 outcome.

docs/swagger/swagger.yaml (1)
2334-2340: Tighten validation for `instance_type`.

`required` still permits `""`. Add `minLength: 1` so SDKs/clients fail fast on empty values.

Suggested diff

```diff
   httphandlers.ResizeInstanceRequest:
     properties:
       instance_type:
+        minLength: 1
         type: string
     required:
     - instance_type
     type: object
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@docs/swagger/swagger.yaml` around lines 2334-2340: the OpenAPI schema for httphandlers.ResizeInstanceRequest currently requires instance_type but allows an empty string; add "minLength: 1" under the instance_type property so empty strings are rejected.

docs/swagger/swagger.json (2)
4005-4030: Document `403 Forbidden` for RBAC-denied resize requests.

Because this PR introduces `instance:resize`, clients can hit authorization failures even with valid input. Adding `403` here improves API contract accuracy.

📌 Suggested doc update

```diff
 "responses": {
     "200": {
         "description": "OK",
         "schema": {
             "$ref": "#/definitions/httputil.Response"
         }
     },
+    "403": {
+        "description": "Forbidden",
+        "schema": {
+            "$ref": "#/definitions/httputil.Response"
+        }
+    },
     "400": {
         "description": "Bad Request",
         "schema": {
             "$ref": "#/definitions/httputil.Response"
         }
     },
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@docs/swagger/swagger.json` around lines 4005-4030: add a "403" response entry with "description": "Forbidden" and the same schema reference ("$ref": "#/definitions/httputil.Response") to the responses block for the instance resize endpoint so the API contract reflects authorization failures introduced by the new instance:resize permission.

11222-11232: Consider tightening `instance_type` schema metadata.

Adding a short description and basic constraints (for example `minLength`) would make generated docs/SDK validation clearer.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@docs/swagger/swagger.json` around lines 11222-11232: the schema for httphandlers.ResizeInstanceRequest exposes instance_type as an unconstrained string; add a "description" (e.g., "Target instance type identifier") and basic validation constraints such as "minLength": 1 and an optional "pattern" or "enum" if known, so generated docs and SDKs surface clear validation and intent.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/adr/ADR-025-instance-resize.md`:
- Around line 42-50: ADR-025 conflicts with current code: after a successful
resize the code in internal/core/services/instance.go collects quotaErrs from
IncrementUsage/DecrementUsage and returns an internal error, whereas the ADR
says failures should only be logged. Either update the ADR text to state that
quota update failures cause the service to return an internal error (reference
quotaErrs, IncrementUsage, DecrementUsage) or change the implementation in
internal/core/services/instance.go to swallow quotaErrs (log them) and not
return an error after a successful resize so behavior matches the ADR. Make the
change consistently and update the ADR or code comments to reference the final
decision.
In `@docs/swagger/swagger.yaml`:
- Around line 5010-5050: The OpenAPI path /instances/{id}/resize is missing
documented 401/403 responses for auth failures; update the operation to include
"401" and "403" responses (both using the existing schema reference
'#/definitions/httputil.Response') so clients can handle Unauthorized/Forbidden
cases and align with RBAC-protected endpoints.
In `@internal/core/services/instance_unit_test.go`:
- Around line 1738-1787: The test reveals that ResizeInstance/completeResize
increments vCPU usage (tenantSvc.IncrementUsage) before checking memory quota
(tenantSvc.CheckQuota) and on memory quota failure does not call
tenantSvc.DecrementUsage to roll back the prior vCPU increment; update the
implementation in ResizeInstance/completeResize so that after a failed
tenantSvc.CheckQuota( ..., "memory", deltaMemMB) it calls
tenantSvc.DecrementUsage(ctx, tenantID, "vcpus", deltaVCPUs) (or a generic
rollback helper) to undo the earlier IncrementUsage, and add corresponding error
handling/logging to ensure the decrement error is surfaced or logged while
returning the original quota error.
- Around line 1689-1736: The test reveals compute resize is performed in
executeResize before completeResize calls tenantSvc.CheckQuota, leaving no
rollback on quota failure; fix by either moving the quota validation to run
before executeResize or (preferred) adding a compute rollback in completeResize:
if tenantSvc.CheckQuota(...) returns an error, call compute.ResizeInstance(ctx,
oldContainerID, oldVcpusBytes, oldMemoryBytes) to restore previous size
(matching the existing repo.Update rollback pattern), log the rollback and then
return the quota error so state remains consistent; ensure you reference
executeResize, completeResize, tenantSvc.CheckQuota, compute.ResizeInstance and
repo.Update when making the change.
In `@internal/core/services/instance.go`:
- Around line 756-760: The resize flow currently calls s.executeResize(...)
before quota validation in s.completeResize(...), risking a mutated backend if
quota fails; move or replicate the quota check to run before mutating the
backend: call the same quota validation (the CheckQuota logic used inside
completeResize for upsizes) with tenantID and newInstanceType/newIT prior to
calling s.executeResize(ctx, target, newIT), and only proceed to s.executeResize
when that pre-check succeeds (or perform a DB quota reservation prior to
executeResize and release on failure) so the backend is never mutated when quota
would ultimately reject the operation.
- Around line 891-896: The code currently logs each quota update error
(s.logger.Error with "quota update failed after resize") but then returns an
Internal error when quotaErrs > 0, which incorrectly signals the resize failed;
change this to treat quota updates as best-effort: remove the errors.Wrap return
and instead log a single aggregated warning including tenantID and quotaErrs,
optionally emit an audit/metric indicating quota drift, and return nil so the
successful resize (and its success event/audit path) proceeds; keep the existing
per-error logging but do not abort after s.repo.Update has already committed.
- Around line 732-735: The current code collapses all errors from
resolveInstance into a NotFound error; change the logic in the caller of
resolveInstance so that if err != nil you return the original err (or wrap it
preserving its type/message) and only return errors.New(errors.NotFound,
"instance not found") when err == nil and inst == nil. Reference resolveInstance
and the surrounding function handling the inst, err tuple and ensure you do not
overwrite repository/internal errors with a 404.
- Around line 817-879: The memory delta is in MB but tenant quotas are in GB;
convert deltaMemMB to GB before calling tenantSvc methods and when rolling back.
Compute a deltaMemGB (e.g. deltaMemGB := deltaMemMB / 1024, use appropriate
rounding if needed) and replace uses of deltaMemMB with deltaMemGB in calls to
s.tenantSvc.CheckQuota, s.tenantSvc.IncrementUsage, s.tenantSvc.DecrementUsage
and in corresponding rollback/rollbackErr messages (references: deltaMemMB,
memIncremented, the memory quota branch and the rollback block where memory
decrement/increment and error messages are constructed). Ensure both positive
and negative memory flows use the converted GB value.
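The unit conversion in the prompt above is easy to get subtly wrong. A sketch of the MB-to-GB delta conversion (`deltaMemMB` as in the prompt; the round-away-from-zero choice is an assumption here, made so a partial GB still counts against, or credits back to, the tenant quota):

```go
package main

import "fmt"

// toGB converts a memory delta in MB to GB for quota accounting, rounding
// away from zero so a partial GB is never silently dropped. The 1024 MB/GB
// factor mirrors the prompt's suggestion; the rounding policy is illustrative.
func toGB(deltaMemMB int64) int64 {
	if deltaMemMB >= 0 {
		return (deltaMemMB + 1023) / 1024
	}
	return -((-deltaMemMB + 1023) / 1024)
}

func main() {
	fmt.Println(toGB(2048), toGB(1536), toGB(-2048)) // → 2 2 -2
}
```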
In `@internal/platform/resilient_compute.go`:
- Around line 140-162: The ResizeInstance, CreateSnapshot, RestoreSnapshot, and
DeleteSnapshot wrappers in ResilientCompute use r.opts.CallTimeout (short 2m)
which can prematurely time out long Libvirt cold-resize flows; update each to
call r.callProtected with the long timeout bucket (e.g. r.opts.LongCallTimeout
or the configured long-duration field on opts) instead of r.opts.CallTimeout so
these operations use the extended timeout; locate the methods ResizeInstance,
CreateSnapshot, RestoreSnapshot, DeleteSnapshot and replace the timeout argument
passed into callProtected accordingly.
In `@internal/repositories/libvirt/adapter_unit_test.go`:
- Around line 882-925: The test currently creates snapshotPath but never wires
it into RestoreSnapshot because execCommand always returns a no-op "true" and
the test asserts an error; change the execCommand stub in the LibvirtAdapter
used by this test so that when called with name == "tar" and arg[1] == "xzf" it
actually executes the real tar binary against the created snapshotPath (e.g.
return exec.Command("tar", "xzf", snapshotPath, "-C", tmpDir)) so the archive
you built is actually extracted, then change the final assertion from
require.Error(t, err) to require.NoError(t, err) to verify the success path of
RestoreSnapshot; update references to tmpDir, snapshotPath, execCommand and
RestoreSnapshot in this test accordingly.
In `@internal/repositories/libvirt/adapter.go`:
- Around line 29-33: The three compiled regex globals (memoryRe, currentMemRe,
vcpuRe) should be moved off package scope into fields on LibvirtAdapter (e.g.,
MemoryRe, CurrentMemRe, VcpuRe) and initialized inside NewLibvirtAdapter so they
are still compiled once per adapter instance; update any code that references
memoryRe/currentMemRe/vcpuRe to use the new LibvirtAdapter fields and remove the
package-level vars entirely to comply with the “no globals” rule.
- Around line 205-235: The rollbackFromSnapshot helper is being reused for
distinct failure phases but each phase needs different recovery steps; update
the resize flow around applyDomainResize, DomainUndefine, DomainDefineXML, and
DomainCreate to use phase-specific rollback functions (e.g.,
rollbackBeforeUndefine, rollbackAfterUndefine, rollbackAfterDefine) or extend
rollbackFromSnapshot with a phase enum so you can: (1) before undefine — avoid
redefining when original domain still exists and instead ensure volumes are
untouched; (2) after undefine — recreate/define the original domain by using
lookup by UUID/name and recreate volumes if needed (handle DomainLookupByName
failure), and (3) after new XML is defined — if later steps fail, first
undefine/remove the partially defined new domain before restoring the original
XML; in all rollbacks ensure the VM is restarted when wasRunning was true and
surface distinct errors combining original error and rollback error. Reference
applyDomainResize, rollbackFromSnapshot, DomainUndefine, DomainDefineXML,
DomainCreate, and the new phase-specific rollback helpers when making changes.
- Around line 253-260: The ReplaceAllString calls for memoryRe, currentMemRe,
and vcpuRe are replacing entire elements (dropping attributes like placement,
current, cpuset, unit); update the replacements to preserve matched attributes
and only change the element text node — for example, use capture groups in the
regexes (or ReplaceAllStringFunc) to retain the opening tag/attributes and
closing tag and substitute just the numeric content so memoryRe, currentMemRe
and vcpuRe rewrite only the inner value (still using memoryKiB and vcpus) rather
than overwriting the whole element.
In `@internal/repositories/noop/adapters.go`:
- Around line 404-410: The NoopTaskQueue implementation removed the
durable-queue methods and therefore no longer satisfies ports.DurableTaskQueue;
restore the missing methods (EnsureGroup, Receive, Ack, Nack, ReclaimStale) on
the NoopTaskQueue type so it can be injected where durable behavior is expected
(e.g., where provision_worker.go calls those methods). Implement these methods
as no-ops or simple stubs consistent with existing Enqueue/Dequeue behavior
(return appropriate zero values and nil errors) and ensure method signatures
exactly match those declared in ports.DurableTaskQueue so the type again
satisfies the interface.
In `@tests/compute_e2e_test.go`:
- Around line 257-316: TestResizeInstanceDownsize currently launches the
instance with "instance_type": "basic-2" and then resizes to the same "basic-2",
so it never verifies an actual downsize; update the LaunchInstance payload in
TestResizeInstanceDownsize to use a larger type (e.g., "standard-1") and keep
the Resize payload targeting "basic-2", so the test actually exercises a
downsize flow for the instance created by instanceID and the existing
waitForInstanceStatus/Resize/Terminate steps.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 97154a85-0867-404e-b52b-ee76f338259e
📒 Files selected for processing (34)
docs/FEATURES.md
docs/adr/ADR-025-instance-resize.md
docs/api-reference.md
docs/swagger/docs.go
docs/swagger/swagger.json
docs/swagger/swagger.yaml
internal/api/setup/router.go
internal/core/domain/rbac.go
internal/core/ports/compute.go
internal/core/ports/instance.go
internal/core/services/instance.go
internal/core/services/instance_unit_test.go
internal/core/services/mock_compute_test.go
internal/handlers/instance_handler.go
internal/handlers/instance_handler_test.go
internal/platform/resilient_compute.go
internal/platform/resilient_compute_test.go
internal/repositories/docker/adapter.go
internal/repositories/docker/fakes_test.go
internal/repositories/firecracker/adapter.go
internal/repositories/firecracker/adapter_noop.go
internal/repositories/k8s/kubeadm_provisioner_test.go
internal/repositories/k8s/mocks_test.go
internal/repositories/libvirt/adapter.go
internal/repositories/libvirt/adapter_unit_test.go
internal/repositories/libvirt/lb_proxy_test.go
internal/repositories/noop/adapters.go
internal/repositories/postgres/migrations/107_seed_instance_resize_permission.down.sql
internal/repositories/postgres/migrations/107_seed_instance_resize_permission.up.sql
internal/workers/database_failover_worker_test.go
internal/workers/healing_worker_test.go
internal/workers/pipeline_worker_test.go
pkg/httputil/response.go
tests/compute_e2e_test.go
> ### Quota Enforcement
>
> The service calculates a **delta** between old and new instance types:
>
> - If `deltaCPU > 0` or `deltaMem > 0` (upsize): calls `tenantSvc.CheckQuota` before proceeding
> - If downsize (`delta < 0`): quota check is skipped (releasing resources back to the pool)
> - If same size (`delta == 0`): no quota interaction
>
> After a successful resize, usage counters are updated with the delta (`IncrementUsage` for upsize, `DecrementUsage` for downsize). Failures in usage updates are logged but not propagated — a future background reconciliation worker could correct drift.
ADR and implementation disagree on quota-update failures.
This section says quota usage update failures are only logged, but internal/core/services/instance.go currently returns an internal error after a successful resize when quotaErrs is non-empty. The ADR should match the actual response behavior, or the service should be updated to match the ADR.
```yaml
  /instances/{id}/resize:
    post:
      consumes:
      - application/json
      description: Change the instance type (CPU/memory) of an existing instance
      parameters:
      - description: Instance ID
        in: path
        name: id
        required: true
        type: string
      - description: Resize request
        in: body
        name: request
        required: true
        schema:
          $ref: '#/definitions/httphandlers.ResizeInstanceRequest'
      produces:
      - application/json
      responses:
        "200":
          description: OK
          schema:
            $ref: '#/definitions/httputil.Response'
        "400":
          description: Bad Request
          schema:
            $ref: '#/definitions/httputil.Response'
        "404":
          description: Not Found
          schema:
            $ref: '#/definitions/httputil.Response'
        "500":
          description: Internal Server Error
          schema:
            $ref: '#/definitions/httputil.Response'
      security:
      - APIKeyAuth: []
      summary: Resize an instance
      tags:
      - instances
```
Document auth failures for the resize endpoint.
This operation is secured, but 401/403 are not documented. Adding them improves client behavior and aligns with RBAC-protected endpoints.
Suggested diff
```diff
       responses:
         "200":
           description: OK
           schema:
             $ref: '#/definitions/httputil.Response'
+        "401":
+          description: Unauthorized
+          schema:
+            $ref: '#/definitions/httputil.Response'
+        "403":
+          description: Forbidden
+          schema:
+            $ref: '#/definitions/httputil.Response'
         "400":
           description: Bad Request
           schema:
             $ref: '#/definitions/httputil.Response'
```
Verify each finding against the current code and only fix it if needed.
In `@docs/swagger/swagger.yaml` around lines 5010 - 5050, The OpenAPI path
/instances/{id}/resize is missing documented 401/403 responses for auth
failures; update the operation to include "401" and "403" responses (both using
the existing schema reference '#/definitions/httputil.Response') so clients can
handle Unauthorized/Forbidden cases and align with RBAC-protected endpoints.
```go
t.Run("QuotaExceeded_Memory", func(t *testing.T) {
	repo := new(MockInstanceRepo)
	typeRepo := new(MockInstanceTypeRepo)
	rbacSvc := new(MockRBACService)
	tenantSvc := new(MockTenantService)
	compute := new(MockComputeBackend)

	svc := services.NewInstanceService(services.InstanceServiceParams{
		Repo:             repo,
		InstanceTypeRepo: typeRepo,
		RBAC:             rbacSvc,
		TenantSvc:        tenantSvc,
		Compute:          compute,
		Logger:           slog.Default(),
	})

	ctx := context.Background()
	userID := uuid.New()
	tenantID := uuid.New()
	instanceID := uuid.New()
	ctx = appcontext.WithUserID(ctx, userID)
	ctx = appcontext.WithTenantID(ctx, tenantID)

	inst := &domain.Instance{
		ID:           instanceID,
		UserID:       userID,
		TenantID:     tenantID,
		Status:       domain.StatusRunning,
		InstanceType: "basic-2",
		ContainerID:  "cid-1",
	}

	oldType := &domain.InstanceType{ID: "basic-2", VCPUs: 2, MemoryMB: 2048}
	newType := &domain.InstanceType{ID: "basic-4", VCPUs: 4, MemoryMB: 4096}

	rbacSvc.On("Authorize", mock.Anything, userID, tenantID, domain.PermissionInstanceResize, "test-inst").Return(nil).Once()
	repo.On("GetByName", mock.Anything, "test-inst").Return(inst, nil).Once()
	typeRepo.On("GetByID", mock.Anything, "basic-2").Return(oldType, nil).Once()
	typeRepo.On("GetByID", mock.Anything, "basic-4").Return(newType, nil).Once()
	compute.On("ResizeInstance", mock.Anything, "cid-1", int64(4e9), int64(4096*1024*1024)).Return(nil).Once()
	tenantSvc.On("CheckQuota", mock.Anything, tenantID, "vcpus", 2).Return(nil).Once()
	tenantSvc.On("IncrementUsage", mock.Anything, tenantID, "vcpus", 2).Return(nil).Once()
	tenantSvc.On("CheckQuota", mock.Anything, tenantID, "memory", 2048).Return(fmt.Errorf("insufficient memory quota")).Once()

	err := svc.ResizeInstance(ctx, "test-inst", "basic-4")

	require.Error(t, err)
	assert.Contains(t, err.Error(), "insufficient memory quota")
	mock.AssertExpectationsForObjects(t, repo, typeRepo, rbacSvc, tenantSvc, compute)
})
```
Test exposes missing vCPU rollback when memory quota check fails.
In this test, vCPU IncrementUsage succeeds (line 1779), then memory CheckQuota fails (line 1780). There's no DecrementUsage mock for vcpus, indicating the implementation doesn't roll back the vCPU increment when memory quota check fails.
Looking at context snippet 1 (completeResize), the rollback logic only handles IncrementUsage failures, not CheckQuota failures:
```go
if err := s.tenantSvc.CheckQuota(ctx, tenantID, "memory", deltaMemMB); err != nil {
	return err // No rollback of vCPU increment
}
```

The test correctly reflects implementation behavior, but this may be a quota accounting leak.
```go
inst, err := s.resolveInstance(ctx, idOrName)
if err != nil || inst == nil {
	return errors.New(errors.NotFound, "instance not found")
}
```
Don't collapse repository failures into NotFound.
This rewrites every resolveInstance failure as instance not found. If the repo returns an internal/storage error, the service will now hide it as a 404 instead of surfacing the real failure.
Suggested fix
```diff
 inst, err := s.resolveInstance(ctx, idOrName)
-if err != nil || inst == nil {
+if err != nil {
+	return err
+}
+if inst == nil {
 	return errors.New(errors.NotFound, "instance not found")
 }
```
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@internal/core/services/instance.go` around lines 732 - 735, The current code
collapses all errors from resolveInstance into a NotFound error; change the
logic in the caller of resolveInstance so that if err != nil you return the
original err (or wrap it preserving its type/message) and only return
errors.New(errors.NotFound, "instance not found") when err == nil and inst ==
nil. Reference resolveInstance and the surrounding function handling the inst,
err tuple and ensure you do not overwrite repository/internal errors with a 404.
```go
// Replace <memory unit="KiB">...</memory> or <memory>...</memory>
result = memoryRe.ReplaceAllString(result, fmt.Sprintf(`<memory unit="KiB">%d</memory>`, memoryKiB))

// Replace <currentMemory unit="KiB">...</currentMemory> or <currentMemory>...</currentMemory>
result = currentMemRe.ReplaceAllString(result, fmt.Sprintf(`<currentMemory unit="KiB">%d</currentMemory>`, memoryKiB))

// Replace <vcpu placement="static">...</vcpu> or <vcpu>...</vcpu>
result = vcpuRe.ReplaceAllString(result, fmt.Sprintf(`<vcpu>%d</vcpu>`, vcpus))
```
These replacements strip existing libvirt attributes.
Each ReplaceAllString rewrites the whole node, so attributes like placement, current, cpuset, or an existing unit are discarded. That can change semantics or make a previously valid domain XML fail to redefine. Update only the text node or preserve the matched attributes when rewriting these elements.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@internal/repositories/libvirt/adapter.go` around lines 253 - 260, The
ReplaceAllString calls for memoryRe, currentMemRe, and vcpuRe are replacing
entire elements (dropping attributes like placement, current, cpuset, unit);
update the replacements to preserve matched attributes and only change the
element text node — for example, use capture groups in the regexes (or
ReplaceAllStringFunc) to retain the opening tag/attributes and closing tag and
substitute just the numeric content so memoryRe, currentMemRe and vcpuRe rewrite
only the inner value (still using memoryKiB and vcpus) rather than overwriting
the whole element.
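The capture-group approach the prompt describes can be sketched as follows. The pattern and `rewriteMemory` helper are illustrative, not the adapter's actual regexes.

```go
package main

import (
	"fmt"
	"regexp"
)

// Capture the opening tag (including any attributes) and the closing tag,
// and rewrite only the text node between them. Illustrative pattern; the
// adapter's real memoryRe may differ.
var memoryRe = regexp.MustCompile(`(<memory(?:\s[^>]*)?>)[^<]*(</memory>)`)

func rewriteMemory(xml string, memoryKiB int64) string {
	// ${1} and ${2} re-insert the captured tags around the new value,
	// so attributes like unit="KiB" survive the rewrite.
	return memoryRe.ReplaceAllString(xml, fmt.Sprintf(`${1}%d${2}`, memoryKiB))
}

func main() {
	fmt.Println(rewriteMemory(`<memory unit="KiB">1048576</memory>`, 2097152)) // → <memory unit="KiB">2097152</memory>
	fmt.Println(rewriteMemory(`<memory>1048576</memory>`, 2097152))            // → <memory>2097152</memory>
}
```

The same shape applies to `currentMemRe` and `vcpuRe`, preserving `current`, `placement`, and `cpuset` attributes.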
```go
// NoopTaskQueue is a no-op task queue.
type NoopTaskQueue struct{}

func (q *NoopTaskQueue) Enqueue(ctx context.Context, queue string, payload interface{}) error {
	return nil
}
func (q *NoopTaskQueue) Dequeue(ctx context.Context, queue string) (string, error) { return "", nil }
```
NoopTaskQueue no longer satisfies the durable queue contract.
ports.DurableTaskQueue in internal/core/ports/task_queue.go:33-61 still requires EnsureGroup, Receive, Ack, Nack, and ReclaimStale, and internal/workers/provision_worker.go:65-245 still calls those methods on the injected queue. Dropping them here means this noop can no longer stand in for the durable queue in tests or system wiring.
Proposed fix
```diff
 type NoopTaskQueue struct{}

 func (q *NoopTaskQueue) Enqueue(ctx context.Context, queue string, payload interface{}) error {
 	return nil
 }
 func (q *NoopTaskQueue) Dequeue(ctx context.Context, queue string) (string, error) { return "", nil }
+func (q *NoopTaskQueue) EnsureGroup(ctx context.Context, queueName, groupName string) error {
+	return nil
+}
+func (q *NoopTaskQueue) Receive(ctx context.Context, queueName, groupName, consumerName string) (*ports.DurableMessage, error) {
+	return nil, nil
+}
+func (q *NoopTaskQueue) Ack(ctx context.Context, queueName, groupName, messageID string) error {
+	return nil
+}
+func (q *NoopTaskQueue) Nack(ctx context.Context, queueName, groupName, messageID string) error {
+	return nil
+}
+func (q *NoopTaskQueue) ReclaimStale(ctx context.Context, queueName, groupName, consumerName string, minIdleMs int64, count int64) ([]ports.DurableMessage, error) {
+	return nil, nil
+}
```

📝 Committable suggestion
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@internal/repositories/noop/adapters.go` around lines 404 - 410, The
NoopTaskQueue implementation removed the durable-queue methods and therefore no
longer satisfies ports.DurableTaskQueue; restore the missing methods
(EnsureGroup, Receive, Ack, Nack, ReclaimStale) on the NoopTaskQueue type so it
can be injected where durable behavior is expected (e.g., where
provision_worker.go calls those methods). Implement these methods as no-ops or
simple stubs consistent with existing Enqueue/Dequeue behavior (return
appropriate zero values and nil errors) and ensure method signatures exactly
match those declared in ports.DurableTaskQueue so the type again satisfies the
interface.
```go
func TestResizeInstanceDownsize(t *testing.T) {
	if err := waitForServer(); err != nil {
		t.Fatalf("Failing Resize Downsize E2E test: %v", err)
	}

	client := &http.Client{Timeout: 60 * time.Second}
	token := registerAndLogin(t, client, "resize-down-tester@thecloud.local", "Resize Down Tester")

	var instanceID string
	instanceName := fmt.Sprintf("e2e-resize-down-%d-%s", time.Now().UnixNano()%1000, uuid.New().String())

	// 1. Launch Instance with basic-2 type
	t.Run("LaunchInstance", func(t *testing.T) {
		payload := map[string]string{
			"name":          instanceName,
			"image":         "nginx:alpine",
			"instance_type": "basic-2",
			"ports":         "0:80",
		}
		resp := postRequest(t, client, testutil.TestBaseURL+testutil.TestRouteInstances, token, payload)
		defer func() { _ = resp.Body.Close() }()

		require.Equal(t, http.StatusAccepted, resp.StatusCode)

		var res struct {
			Data struct {
				ID string `json:"id"`
			} `json:"data"`
		}
		require.NoError(t, json.NewDecoder(resp.Body).Decode(&res))
		instanceID = res.Data.ID
	})

	// 2. Wait for Running
	t.Run("WaitForRunning", func(t *testing.T) {
		lastStatus := waitForInstanceStatus(t, client, token, instanceID)
		if lastStatus != domain.StatusRunning {
			t.Skipf("Instance did not reach running state within timeout. Last status: %s", lastStatus)
		}
	})

	// 3. Downsize to basic-2
	t.Run("Resize", func(t *testing.T) {
		payload := map[string]string{
			"instance_type": "basic-2",
		}
		resp := postRequest(t, client, fmt.Sprintf("%s%s/%s/resize", testutil.TestBaseURL, testutil.TestRouteInstances, instanceID), token, payload)
		defer func() { _ = resp.Body.Close() }()

		assert.Equal(t, http.StatusOK, resp.StatusCode)
	})

	// 4. Terminate
	t.Run("TerminateInstance", func(t *testing.T) {
		resp := deleteRequest(t, client, fmt.Sprintf(testutil.TestRouteFormat, testutil.TestBaseURL, testutil.TestRouteInstances, instanceID), token)
		defer func() { _ = resp.Body.Close() }()

		assert.Equal(t, http.StatusOK, resp.StatusCode)
	})
}
```
Test does not actually test a downsize: the instance type is unchanged.
The test is named TestResizeInstanceDownsize but:

- the instance is launched with `"instance_type": "basic-2"` (line 273)
- the resize requests `"instance_type": "basic-2"` (line 301)

This tests a same-size no-op, not a downsize. For a proper downsize test, launch with a larger type (e.g., standard-1) and resize to basic-2.
Proposed fix for actual downsize test
```diff
 func TestResizeInstanceDownsize(t *testing.T) {
 	// ...
-	// 1. Launch Instance with basic-2 type
+	// 1. Launch Instance with standard-1 type (larger)
 	t.Run("LaunchInstance", func(t *testing.T) {
 		payload := map[string]string{
 			"name":          instanceName,
 			"image":         "nginx:alpine",
-			"instance_type": "basic-2",
+			"instance_type": "standard-1",
 			"ports":         "0:80",
 		}
 		// ...
 	})
 	// ...
-	// 3. Downsize to basic-2
+	// 3. Downsize from standard-1 to basic-2
 	t.Run("Resize", func(t *testing.T) {
```

📝 Committable suggestion
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/compute_e2e_test.go` around lines 257 - 316, TestResizeInstanceDownsize
currently launches the instance with "instance_type": "basic-2" and then resizes
to the same "basic-2", so it never verifies an actual downsize; update the
LaunchInstance payload in TestResizeInstanceDownsize to use a larger type (e.g.,
"standard-1") and keep the Resize payload targeting "basic-2", so the test
actually exercises a downsize flow for the instance created by instanceID and
the existing waitForInstanceStatus/Resize/Terminate steps.
Add CreateSnapshot, RestoreSnapshot, and DeleteSnapshot methods to ComputeBackend interface. Libvirt adapter implements these via QCOW2 volume snapshots. Docker/Firecracker backends implement as no-ops since hot resize doesn't need snapshots.

ResizeInstance in Libvirt adapter now:
1. Creates pre-resize snapshot before making changes
2. On failure: rolls back from snapshot and redefines domain
3. On success: deletes snapshot to avoid accumulation

Also update ResilientCompute wrapper, all mock backends in tests, and fix nil pointer dereference bug in rollback path.
- rollbackFromSnapshot now calls DomainDefineXML after RestoreSnapshot to properly redefine the domain with original XML after DomainUndefine
- Use UUID-based snapshot names to avoid second-precision collisions
- Add unit tests for CreateSnapshot, RestoreSnapshot, DeleteSnapshot
- Add unit tests for ResizeInstance rollback on DomainDefineXML and DomainUndefine failures
- Add unit tests for rollbackFromSnapshot success and failure paths
…aky tests
- Check rollbackFromSnapshot errors at all 3 call sites in ResizeInstance (was ignoring errors at applyDomainResize and DomainUndefine failure paths)
- Skip RollbackFromSnapshot/success and RollbackFromSnapshot/redefine_fails_after_restore since they require real tar archive contents that the execCommand mock cannot provide (rollback path is covered by ResizeInstance_RollbackOnFailure tests)
…regexes
- Line 232-233 (DomainCreate failure): check rollbackFromSnapshot error and include it in the returned error. This is the 4th rollback call site in ResizeInstance — all now checked consistently.
- Pre-compile 3 regexes (memoryRe, currentMemRe, vcpuRe) at package level instead of re-compiling on every applyDomainResize call.
snapshotPath(id, name string) now returns the snapshot archive path, replacing 3 identical inline format strings in CreateSnapshot, RestoreSnapshot, and DeleteSnapshot.
Document the snapshot creation step (step 4) and the rollback mechanism that restores volume and redefines domain on failure. Also document edge cases: snapshot creation failure is non-fatal, and rollback itself can fail.
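The `snapshotPath` helper mentioned two commits above might look like the sketch below. The base directory and archive naming are assumptions for illustration, not taken from the repository.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// snapshotDir is an assumed base directory; the adapter's real layout
// is not shown in this PR.
const snapshotDir = "/var/lib/thecloud/snapshots"

// snapshotPath centralizes the archive path format that was previously
// duplicated inline in CreateSnapshot, RestoreSnapshot, and DeleteSnapshot.
func snapshotPath(id, name string) string {
	return filepath.Join(snapshotDir, fmt.Sprintf("%s-%s.tar.gz", id, name))
}

func main() {
	fmt.Println(snapshotPath("vm-123", "pre-resize-9f2c"))
	// → /var/lib/thecloud/snapshots/vm-123-pre-resize-9f2c.tar.gz
}
```

Centralizing the format means a future change to the layout touches one function instead of three call sites.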
- TestLibvirtAdapter_ApplyDomainResize: 5 subtests covering regex replacement (both replaced, missing both elements, memory without unit attr, resize-to-same-values error, vcpu with placement attr)
- TestLibvirtAdapter_ResizeInstance_RollbackOnFailure: add rollback_on_DomainCreate_failure subtest (was previously untested)

Removed incorrect SnapshotCreationFails_ResizeProceeds test from instance_unit_test.go — InstanceService.ResizeInstance does not call CreateSnapshot (snapshot creation is adapter-internal).
d4ee00d to 043a052
Summary
Add snapshot-before-resize safety net for Libvirt/KVM cold resize operations.
- Add `CreateSnapshot`, `RestoreSnapshot`, `DeleteSnapshot` to the `ComputeBackend` interface
- `ResizeInstance` in the Libvirt adapter creates a pre-resize snapshot, rolls back on failure, deletes on success

Test plan
Related
ADR-025 flagged cold resize risk; this implements the snapshot-before-resize pattern.
Summary by CodeRabbit
New Features
- `POST /instances/:id/resize` endpoint

Documentation
Tests