
feat(instances): add pause and resume functionality#320

Merged
poyrazK merged 32 commits into main from feat/instance-pause-resume
Apr 30, 2026

Conversation


@poyrazK poyrazK commented Apr 28, 2026

Summary

  • Add POST /instances/:id/pause and POST /instances/:id/resume API endpoints
  • Implement pause/resume for Docker (ContainerPause/ContainerUnpause) and Libvirt (DomainSuspend/DomainResume) backends
  • Add StatusPaused to instance status enum
  • Add state validation (RUNNING→PAUSED, PAUSED→RUNNING)
  • Add RBAC authorization with PermissionInstanceUpdate
  • Add ResilientCompute wrapper with circuit breaker and bulkhead patterns
  • Add comprehensive test coverage
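The state validation bullet above (RUNNING→PAUSED, PAUSED→RUNNING) can be sketched as follows. This is a minimal, hypothetical illustration with simplified names; the real service uses its domain status enum and errors package rather than the ones shown here.

```go
package main

import (
	"errors"
	"fmt"
)

// Status stands in for the instance status enum; StatusPaused is the new state.
type Status string

const (
	StatusRunning Status = "RUNNING"
	StatusPaused  Status = "PAUSED"
)

var ErrInvalidTransition = errors.New("invalid state transition")

// validatePauseTransition allows only RUNNING->PAUSED and PAUSED->RUNNING,
// mirroring the transitions the PR enforces before touching the backend.
func validatePauseTransition(from, to Status) error {
	ok := (from == StatusRunning && to == StatusPaused) ||
		(from == StatusPaused && to == StatusRunning)
	if !ok {
		return fmt.Errorf("%w: %s -> %s", ErrInvalidTransition, from, to)
	}
	return nil
}

func main() {
	fmt.Println(validatePauseTransition(StatusRunning, StatusPaused)) // <nil>
	fmt.Println(validatePauseTransition(StatusPaused, StatusPaused))  // invalid state transition: PAUSED -> PAUSED
}
```

Rejecting the transition before calling the compute backend is what lets the handlers map the failure to a 409 Conflict rather than a 500.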

Test plan

  • go build ./... - compiles
  • go test ./internal/platform/... -v - platform tests pass
  • go test ./internal/repositories/libvirt/... -v - libvirt tests pass
  • go test ./internal/repositories/docker/... -v - docker tests pass
  • go test ./internal/handlers/... -v - handler tests pass
  • go vet ./... - no issues

Commits (11 total)

  1. feat(instances): add StatusPaused and pause/resume to interfaces
  2. feat(docker): implement PauseInstance and ResumeInstance
  3. feat(libvirt): implement PauseInstance and ResumeInstance via DomainSuspend/DomainResume
  4. feat(stubs): add PauseInstance, ResumeInstance, and ResizeInstance to Firecracker and Noop backends
  5. feat(services): implement PauseInstance and ResumeInstance in InstanceService
  6. feat(handlers): add POST /instances/:id/pause and /resume endpoints
  7. feat(router): wire /instances/:id/pause and /resume routes
  8. feat(platform): add Bulkhead concurrency limiter and enhance CircuitBreaker
  9. feat(platform): add ResilientCompute wrapper with circuit breaker and bulkhead
  10. test(mocks): add PauseInstance and ResumeInstance to K8s and libvirt mocks
  11. test(workers): add missing interface methods to worker test mocks

Summary by CodeRabbit

  • New Features

    • Added instance pause/resume endpoints, including a PAUSED lifecycle state and validated transitions; backends updated to support pause/resume where available.
  • Documentation

    • Updated API docs, feature guide, and roadmap to describe pause/resume behavior and backend differences.
  • Bug Fixes

    • Resilience layer now returns context cancellation/deadline errors (not the wait-timeout error) when appropriate.
  • Tests

    • Expanded tests and mocks to cover pause/resume flows, errors, and rollback paths.

poyrazK added 11 commits April 28, 2026 14:32
Add PAUSED status to InstanceStatus enum and extend both
ComputeBackend and InstanceService interfaces with PauseInstance
and ResumeInstance methods.
Use ContainerPause and ContainerUnpause Docker API calls to
freeze and unfreeze containers. Add ContainerUpdate for
ResizeInstance support.
…uspend/DomainResume

Add DomainSuspend and DomainResume to LibvirtClient interface and
implement them in RealLibvirtClient using virDomainSuspend and
virDomainResume libvirt calls.
… Firecracker and Noop backends

Firecracker returns nil (not supported). Noop returns nil (success).
These are placeholder implementations to satisfy the interface.
…eService

Add RBAC authorization with PermissionInstanceUpdate, validate state
transitions (RUNNING->PAUSED, PAUSED->RUNNING), update database record
with new status, and emit audit log events for pause/resume actions.
Wire Pause and Resume HTTP handlers with RBAC authorization,
swagger documentation, and corresponding handler tests.
Add POST routes for pause and resume actions to the instance
group with RBAC authorization middleware.
…reaker

Add Bulkhead type with semaphore-based concurrency limiting to prevent
cascading failures. Enhance CircuitBreaker with NewCircuitBreakerWithOpts,
StateChangeFunc callback, and improved half-open state handling.
… bulkhead

Wrap ComputeBackend with circuit breaker and bulkhead patterns for
resilience. Includes timeout-aware call protection and comprehensive
test coverage for circuit tripping, bulkhead limits, and timeouts.
…mocks

Update mock implementations to satisfy the updated ComputeBackend
interface with pause/resume methods.
Update mock compute backends in worker tests to implement
PauseInstance, ResumeInstance, and ResizeInstance to satisfy
the updated ports.ComputeBackend interface.
Copilot AI review requested due to automatic review settings April 28, 2026 11:36
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 28, 2026

Warning

Rate limit exceeded

@poyrazK has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 45 minutes and 1 second before requesting another review.

To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 45e4a828-549d-4f1d-8006-beb86482d3c6

📥 Commits

Reviewing files that changed from the base of the PR and between 5010b8e and a073124.

📒 Files selected for processing (7)
  • internal/core/services/function_internal_test.go
  • internal/core/services/instance.go
  • internal/core/services/instance_unit_test.go
  • internal/platform/bulkhead.go
  • internal/repositories/docker/adapter.go
  • internal/repositories/libvirt/adapter.go
  • internal/workers/healing_worker_test.go
📝 Walkthrough

Walkthrough

Adds PAUSED instance state and pause/resume lifecycle across stack: API routes and handlers, service logic with validation/audit/metrics, compute/backend interface extensions and implementations (Docker, Libvirt, Firecracker, noop), resilient/bulkhead plumbing, error definitions, and tests/docs updates.

Changes

Cohort / File(s) Summary
Domain & Errors
internal/core/domain/instance.go, internal/errors/errors.go
Added StatusPaused ("PAUSED") and conflict errors ErrInstanceNotPausable, ErrInstanceNotResumable.
API & Handlers
internal/api/setup/router.go, internal/handlers/instance_handler.go, internal/handlers/instance_handler_test.go
Registered POST /instances/:id/pause and /resume; added handler methods and handler tests covering success, not-found, and conflict cases.
Service Layer
internal/core/ports/instance.go, internal/core/services/instance.go, internal/core/services/instance_unit_test.go, internal/core/services/mock_compute_test.go
Extended InstanceService interface; implemented PauseInstance/ResumeInstance with RBAC, state validation, compute delegation, DB updates, audit logging, rollback on repo failures, metrics, and unit tests including rollback paths.
Compute Ports & Resilience
internal/core/ports/compute.go, internal/platform/resilient_compute.go, internal/platform/resilient_compute_test.go
Extended ComputeBackend interface; added resilient PauseInstance/ResumeInstance wrappers that run under bulkhead/circuit-breaker/timeouts; tests updated.
Backends: Docker / Firecracker / Libvirt / Noop
internal/repositories/docker/adapter.go, internal/repositories/docker/fakes_test.go, internal/repositories/firecracker/..., internal/repositories/libvirt/..., internal/repositories/noop/adapters.go
Added pause/resume methods across adapters: Docker delegates to ContainerPause/Unpause; Libvirt implements DomainSuspend/DomainResume with state checks; Firecracker no-op/stub implementations; noop backend returns success; corresponding test fakes/mocks added.
Libvirt Client & Mocks
internal/repositories/libvirt/libvirt_client.go, internal/repositories/libvirt/mock_client_test.go, internal/repositories/libvirt/real_client.go
Added DomainSuspend/DomainResume to libvirt client interface; real and mock client implementations added.
Platform Bulkhead behavior
internal/platform/bulkhead.go, internal/platform/bulkhead_test.go
Changed acquire semantics to return ctx.Err() on context cancellation (reserve ErrBulkheadFull for wait-timeout only); updated test expectation.
Tests & Mocks Across Repo
internal/repositories/k8s/..., internal/repositories/libvirt/..., internal/workers/..._test.go, internal/workers/...
Extended many test-local mocks and fakes to include PauseInstance/ResumeInstance stubs enabling broader test coverage.
Docs & API Spec
docs/api-reference.md, docs/FEATURES.md, docs/guides/libvirt-backend.md, docs/swagger/..., compute-roadmap.md
Documented new endpoints, updated InstanceStatus enum to include PAUSED, added feature/docs entries and Swagger/OpenAPI templates and generated specs.
Misc.
internal/core/ports/..., internal/repositories/...
Minor formatting/refactor in a few files to align signatures; many files extended consistently to support pause/resume functionality.

Sequence Diagram

sequenceDiagram
    participant Client as Client
    participant API as API Router/Handler
    participant Svc as Instance Service
    participant Backend as Compute Backend
    participant DB as Database
    participant Audit as Audit Logger

    Client->>API: POST /instances/:id/pause
    API->>Svc: PauseInstance(ctx, id)
    Svc->>Svc: RBAC + Load instance (repo)
    Svc->>Backend: PauseInstance(ctx, targetID)
    Backend-->>Svc: success / error
    alt success
        Svc->>DB: Update instance status -> PAUSED
        DB-->>Svc: success
        Svc->>Audit: Write audit (instance.pause)
        Svc-->>API: success
        API-->>Client: 200 "instance paused"
    else backend conflict
        Svc-->>API: error (Conflict)
        API-->>Client: 409
    else db update fails
        Svc->>Backend: ResumeInstance(ctx, targetID)  // best-effort rollback
        Svc-->>API: error
        API-->>Client: 500
    end

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 I paused my hop, then held my breath,
Buttons pressed to cheat quick death,
From Docker sigh to libvirt sleep,
State frozen safe, then wake from keep.
A little hop, then onward run—compute's new trick is done!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 21.21%, which is below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
✅ Passed checks (4 passed)
  • Description Check (✅ Passed): Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): The title 'feat(instances): add pause and resume functionality' accurately and concisely describes the main change: adding pause and resume capabilities to instances across the codebase.
  • Linked Issues Check (✅ Passed): Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes Check (✅ Passed): Check skipped because no linked issues were found for this pull request.


Copilot AI left a comment


Pull request overview

Adds pause/resume lifecycle support for instances across API, services, and compute backends, and introduces a resilience wrapper (circuit breaker + bulkhead + timeouts) for compute operations.

Changes:

  • Add POST /instances/:id/pause and POST /instances/:id/resume endpoints, routed with RBAC PermissionInstanceUpdate.
  • Implement pause/resume in service layer + Docker/Libvirt backends and extend ports/status enums.
  • Introduce platform.ResilientCompute, plus new/updated circuit breaker and bulkhead primitives (with tests).

Reviewed changes

Copilot reviewed 27 out of 27 changed files in this pull request and generated 12 comments.

Show a summary per file
File Description
internal/workers/pipeline_worker_test.go Extends worker test mock backend with new interface methods.
internal/workers/healing_worker_test.go Extends worker test service mock with pause/resume methods.
internal/workers/database_failover_worker_test.go Extends worker test mock backend with pause/resume/resize methods.
internal/repositories/noop/adapters.go Adds no-op pause/resume/resize to satisfy updated compute interface.
internal/repositories/libvirt/real_client.go Adds context-aware DomainSuspend/DomainResume wrappers.
internal/repositories/libvirt/mock_client_test.go Extends libvirt mock client with suspend/resume methods.
internal/repositories/libvirt/libvirt_client.go Extends LibvirtClient interface with suspend/resume.
internal/repositories/libvirt/lb_proxy_test.go Updates libvirt LB proxy test mock to satisfy interface.
internal/repositories/libvirt/adapter.go Implements Pause/Resume/Resize for libvirt backend.
internal/repositories/k8s/mocks_test.go Extends k8s test mock service with pause/resume.
internal/repositories/k8s/kubeadm_provisioner_test.go Extends kubeadm test mock service with pause/resume.
internal/repositories/firecracker/adapter_noop.go Adds unsupported-method stubs for pause/resume/resize.
internal/repositories/firecracker/adapter.go Adds pause/resume/resize methods (currently returning nil).
internal/repositories/docker/fakes_test.go Extends docker fake client for pause/unpause + container update.
internal/repositories/docker/adapter.go Implements Docker pause/resume and container resource update (resize).
internal/platform/resilient_compute.go New compute decorator adding bulkhead/CB/timeouts to backend calls.
internal/platform/resilient_compute_test.go Tests resilient compute passthrough, CB, bulkhead, timeouts, unwrap.
internal/platform/circuit_breaker.go Enhances circuit breaker (opts, state strings, half-open single-flight).
internal/platform/bulkhead.go New bulkhead (semaphore) concurrency limiter with wait timeouts.
internal/handlers/instance_handler_test.go Extends handler test mock service with pause/resume methods.
internal/handlers/instance_handler.go Adds pause/resume HTTP handlers + swagger annotations.
internal/core/services/mock_compute_test.go Extends service mocks to satisfy updated interfaces.
internal/core/services/instance.go Adds PauseInstance/ResumeInstance with RBAC + state validation + audit/metrics.
internal/core/ports/instance.go Extends InstanceService interface with pause/resume.
internal/core/ports/compute.go Extends ComputeBackend with pause/resume and resize.
internal/core/domain/instance.go Adds StatusPaused to instance status enum.
internal/api/setup/router.go Wires pause/resume routes under /instances/:id.


Ping(ctx context.Context) error
// Type returns a string identifier of the backend (e.g., "docker", "kvm").
Type() string
// ResizeInstance updates the CPU and memory limits of a running or stopped instance.

Copilot AI Apr 28, 2026


ResizeInstance's signature uses raw int64s but the interface comment doesn't define units. In this PR, ResilientCompute/Docker expect NanoCPUs + bytes while Libvirt converts cpu/1e9 to vCPU count. Please clarify the contract in the interface doc comment (e.g., cpu in NanoCPUs, memory in bytes) to avoid backend-specific interpretations.

Suggested change
- // ResizeInstance updates the CPU and memory limits of a running or stopped instance.
+ // ResizeInstance updates the CPU and memory limits of a running or stopped instance.
+ // cpu is specified in NanoCPUs (1 vCPU = 1e9 NanoCPUs), and memory is specified in bytes.

Copilot uses AI. Check for mistakes.
Comment on lines +156 to +163
	return nil // Firecracker does not support pause/resume
}

func (a *FirecrackerAdapter) ResumeInstance(ctx context.Context, id string) error {
	if a.cfg.MockMode {
		return nil
	}
	return nil // Firecracker does not support pause/resume

Copilot AI Apr 28, 2026


ResumeInstance returns nil even when Firecracker pause/resume isn't supported, which can lead to incorrect instance status transitions (PAUSED→RUNNING) without any backend change. Return a not-supported error when not in MockMode.

Suggested change
- 	return nil // Firecracker does not support pause/resume
- }
-
- func (a *FirecrackerAdapter) ResumeInstance(ctx context.Context, id string) error {
- 	if a.cfg.MockMode {
- 		return nil
- 	}
- 	return nil // Firecracker does not support pause/resume
+ 	return fmt.Errorf("pause instance %s: firecracker does not support pause/resume", id)
+ }
+
+ func (a *FirecrackerAdapter) ResumeInstance(ctx context.Context, id string) error {
+ 	if a.cfg.MockMode {
+ 		return nil
+ 	}
+ 	return fmt.Errorf("resume instance %s: firecracker does not support pause/resume", id)

Comment on lines +646 to +647
s.logger.Error("failed to pause container", "container_id", target, "error", err)
return errors.Wrap(errors.Internal, "failed to pause container", err)

Copilot AI Apr 28, 2026


These error/log messages hardcode "container" (and container_id) for pause/resume failures, but the backend can be libvirt where the target is a domain name. Consider using backend-neutral wording ("instance"/"resource") to keep logs accurate across backends.

Suggested change
- s.logger.Error("failed to pause container", "container_id", target, "error", err)
- return errors.Wrap(errors.Internal, "failed to pause container", err)
+ s.logger.Error("failed to pause instance", "target", target, "error", err)
+ return errors.Wrap(errors.Internal, "failed to pause instance", err)

Comment on lines +219 to +221
// Modify memory (in KiB) and vCPU in the XML
newDOMXML := a.applyDomainResize(domXML, int(memory/1024), int(cpu/1e9))


Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Libvirt vCPU count is derived as int(cpu/1e9). If callers provide fractional CPU (e.g., 0.5 core => 5e8 NanoCPUs), this becomes 0 and produces invalid domain XML. Validate inputs (cpu/memory > 0) and define a rounding strategy (e.g., ceil to at least 1 vCPU) or reject unsupported fractional values with a clear error.

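The rounding strategy this comment asks for could look like the sketch below: reject non-positive input and round fractional NanoCPU requests up so they never produce an invalid 0 vCPU count. The function name and behavior are a suggestion, not the code in this PR.

```go
package main

import (
	"errors"
	"fmt"
)

const nanoPerCPU = int64(1_000_000_000) // 1 vCPU expressed in NanoCPUs

// vcpusFromNano converts NanoCPUs to a whole vCPU count for libvirt,
// rejecting non-positive input and rounding up (ceil) so a 0.5-core
// request becomes 1 vCPU instead of an invalid 0.
func vcpusFromNano(cpuNano int64) (int, error) {
	if cpuNano <= 0 {
		return 0, errors.New("cpu must be > 0 NanoCPUs")
	}
	return int((cpuNano + nanoPerCPU - 1) / nanoPerCPU), nil
}

func main() {
	v, _ := vcpusFromNano(500_000_000) // 0.5 core
	fmt.Println(v)                     // 1
	v, _ = vcpusFromNano(2 * nanoPerCPU)
	fmt.Println(v) // 2
}
```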
Comment on lines +241 to +250
// applyDomainResize updates vCPU and memory in domain XML.
func (a *LibvirtAdapter) applyDomainResize(xml string, memoryKiB, vcpus int) string {
	// Replace memory allocation
	xml = regexp.MustCompile(`<memory unit="KiB">\d+</memory>`).
		ReplaceAllString(xml, fmt.Sprintf(`<memory unit="KiB">%d</memory>`, memoryKiB))
	xml = regexp.MustCompile(`<currentMemory unit="KiB">\d+</currentMemory>`).
		ReplaceAllString(xml, fmt.Sprintf(`<currentMemory unit="KiB">%d</currentMemory>`, memoryKiB))
	// Replace vCPU count
	xml = regexp.MustCompile(`<vcpu[^>]*>\d+</vcpu>`).
		ReplaceAllString(xml, fmt.Sprintf(`<vcpu>%d</vcpu>`, vcpus))

Copilot AI Apr 28, 2026


applyDomainResize recompiles regexps on every call via regexp.MustCompile. Since ResizeInstance can be called frequently, consider precompiling these regexps once (package-level vars) to avoid repeated compilation overhead.

Suggested change
- // applyDomainResize updates vCPU and memory in domain XML.
- func (a *LibvirtAdapter) applyDomainResize(xml string, memoryKiB, vcpus int) string {
- 	// Replace memory allocation
- 	xml = regexp.MustCompile(`<memory unit="KiB">\d+</memory>`).
- 		ReplaceAllString(xml, fmt.Sprintf(`<memory unit="KiB">%d</memory>`, memoryKiB))
- 	xml = regexp.MustCompile(`<currentMemory unit="KiB">\d+</currentMemory>`).
- 		ReplaceAllString(xml, fmt.Sprintf(`<currentMemory unit="KiB">%d</currentMemory>`, memoryKiB))
- 	// Replace vCPU count
- 	xml = regexp.MustCompile(`<vcpu[^>]*>\d+</vcpu>`).
- 		ReplaceAllString(xml, fmt.Sprintf(`<vcpu>%d</vcpu>`, vcpus))
+ var (
+ 	domainMemoryXMLRegexp        = regexp.MustCompile(`<memory unit="KiB">\d+</memory>`)
+ 	domainCurrentMemoryXMLRegexp = regexp.MustCompile(`<currentMemory unit="KiB">\d+</currentMemory>`)
+ 	domainVCPUXMLRegexp          = regexp.MustCompile(`<vcpu[^>]*>\d+</vcpu>`)
+ )
+
+ // applyDomainResize updates vCPU and memory in domain XML.
+ func (a *LibvirtAdapter) applyDomainResize(xml string, memoryKiB, vcpus int) string {
+ 	// Replace memory allocation
+ 	xml = domainMemoryXMLRegexp.ReplaceAllString(
+ 		xml,
+ 		fmt.Sprintf(`<memory unit="KiB">%d</memory>`, memoryKiB),
+ 	)
+ 	xml = domainCurrentMemoryXMLRegexp.ReplaceAllString(
+ 		xml,
+ 		fmt.Sprintf(`<currentMemory unit="KiB">%d</currentMemory>`, memoryKiB),
+ 	)
+ 	// Replace vCPU count
+ 	xml = domainVCPUXMLRegexp.ReplaceAllString(
+ 		xml,
+ 		fmt.Sprintf(`<vcpu>%d</vcpu>`, vcpus),
+ 	)

Comment thread internal/repositories/docker/adapter.go Outdated
Comment on lines +381 to +410
func (a *DockerAdapter) PauseInstance(ctx context.Context, name string) error {
	if err := a.cli.ContainerPause(ctx, name); err != nil {
		return fmt.Errorf("failed to pause container %s: %w", name, err)
	}
	return nil
}

func (a *DockerAdapter) ResumeInstance(ctx context.Context, name string) error {
	if err := a.cli.ContainerUnpause(ctx, name); err != nil {
		return fmt.Errorf("failed to resume container %s: %w", name, err)
	}
	return nil
}

// ResizeInstance updates the CPU and memory limits of a container.
func (a *DockerAdapter) ResizeInstance(ctx context.Context, id string, cpuNanoCPUs, memoryBytes int64) error {
	resp, err := a.cli.ContainerUpdate(ctx, id, container.UpdateConfig{
		Resources: container.Resources{
			NanoCPUs:   cpuNanoCPUs,
			Memory:     memoryBytes,
			MemorySwap: memoryBytes, // Must be >= Memory; setting equal disables swap while allowing memory update
		},
	})
	if err != nil {
		return fmt.Errorf("failed to update container %s: %w", id, err)
	}
	if resp.Warnings != nil {
		a.logger.Warn("container update warnings", "container_id", id, "warnings", resp.Warnings)
	}
	return nil

Copilot AI Apr 28, 2026


DockerAdapter gained PauseInstance/ResumeInstance/ResizeInstance but docker/adapter_unit_test.go has no tests for these new methods. Add unit tests that verify the adapter forwards to ContainerPause/ContainerUnpause/ContainerUpdate and wraps errors consistently (similar to existing Start/Stop tests).

Comment on lines +156 to +163
	return nil // Firecracker does not support pause/resume
}

func (a *FirecrackerAdapter) ResumeInstance(ctx context.Context, id string) error {
	if a.cfg.MockMode {
		return nil
	}
	return nil // Firecracker does not support pause/resume

Copilot AI Apr 28, 2026


PauseInstance returns nil even when Firecracker pause/resume isn't supported. This will make callers think the instance was paused successfully and may update persisted status/metrics incorrectly. Return a clear error (e.g., NotImplemented/InvalidOperation) when not in MockMode, consistent with adapter_noop.go and other unimplemented Firecracker operations.

Suggested change
- 	return nil // Firecracker does not support pause/resume
- }
-
- func (a *FirecrackerAdapter) ResumeInstance(ctx context.Context, id string) error {
- 	if a.cfg.MockMode {
- 		return nil
- 	}
- 	return nil // Firecracker does not support pause/resume
+ 	return fmt.Errorf("pause instance is not supported by Firecracker")
+ }
+
+ func (a *FirecrackerAdapter) ResumeInstance(ctx context.Context, id string) error {
+ 	if a.cfg.MockMode {
+ 		return nil
+ 	}
+ 	return fmt.Errorf("resume instance is not supported by Firecracker")

if a.cfg.MockMode {
	return nil
}
return nil // Firecracker does not support resize

Copilot AI Apr 28, 2026


ResizeInstance returns nil even though the comment says Firecracker doesn't support resize. This silently reports success to callers and can desync desired vs actual resources. Return an explicit not-supported error when not in MockMode.

Suggested change
- return nil // Firecracker does not support resize
+ return fmt.Errorf("resize is not supported by Firecracker")

Comment on lines +650 to +663
	inst.Status = domain.StatusPaused
	if err := s.repo.Update(ctx, inst); err != nil {
		return err
	}

	if err := s.auditSvc.Log(ctx, inst.UserID, "instance.pause", "instance", inst.ID.String(), map[string]interface{}{
		"name": inst.Name,
	}); err != nil {
		s.logger.Warn("failed to log audit event", "action", "instance.pause", "instance_id", inst.ID, "error", err)
	}

	platform.InstanceOperationsTotal.WithLabelValues("pause", "success").Inc()
	s.logger.Info("instance paused", "instance_id", inst.ID)
	return nil

Copilot AI Apr 28, 2026


PauseInstance changes the instance status to PAUSED but does not update InstancesTotal gauges (e.g., decrement running / increment paused). StartInstance/StopInstance maintain these metrics, so pause/resume will currently leave instance counts incorrect in Prometheus.

Comment on lines +695 to +708
	inst.Status = domain.StatusRunning
	if err := s.repo.Update(ctx, inst); err != nil {
		return err
	}

	if err := s.auditSvc.Log(ctx, inst.UserID, "instance.resume", "instance", inst.ID.String(), map[string]interface{}{
		"name": inst.Name,
	}); err != nil {
		s.logger.Warn("failed to log audit event", "action", "instance.resume", "instance_id", inst.ID, "error", err)
	}

	platform.InstanceOperationsTotal.WithLabelValues("resume", "success").Inc()
	s.logger.Info("instance resumed", "instance_id", inst.ID)
	return nil

Copilot AI Apr 28, 2026


ResumeInstance sets status back to RUNNING but does not update InstancesTotal gauges for a PAUSED→RUNNING transition. If you introduce the PAUSED status label, make sure the gauges are kept consistent here as well.

…ause/resume tests

- Move regex compilation in applyDomainResize to struct fields to avoid
  recompilation on every call (Must Fix from PR #320 review)
- Add ErrNotSupported sentinel error in Firecracker adapter and return it
  from PauseInstance/ResumeInstance/ResizeInstance stubs (Consider item)
- Add unit tests for pause/resume state transitions in instance service
  covering: success paths, wrong state errors, and compute backend errors
Copilot AI review requested due to automatic review settings April 28, 2026 14:33

Copilot AI left a comment


Pull request overview

Copilot reviewed 28 out of 28 changed files in this pull request and generated 8 comments.



// Type returns a string identifier of the backend (e.g., "docker", "kvm").
Type() string
// ResizeInstance updates the CPU and memory limits of a running or stopped instance.
ResizeInstance(ctx context.Context, id string, cpu, memory int64) error

Copilot AI Apr 28, 2026


ResizeInstance's cpu/memory parameters are undocumented in the interface, but implementations treat them as nano-CPUs and bytes (e.g., Docker uses NanoCPUs/Memory; libvirt divides cpu by 1e9 and memory by 1024). Please document the required units in the ComputeBackend interface comment (and consider renaming params to cpuNano/memoryBytes for clarity) to prevent callers from passing vCPU counts or MB and getting invalid 0 vCPU / incorrect memory updates.

Suggested change
- ResizeInstance(ctx context.Context, id string, cpu, memory int64) error
+ // cpuNano must be specified in nano-CPUs (for example, 1 vCPU = 1_000_000_000).
+ // memoryBytes must be specified in bytes (for example, 512 MiB = 536_870_912).
+ // Callers must not pass vCPU counts or MB/MiB values directly.
+ ResizeInstance(ctx context.Context, id string, cpuNano, memoryBytes int64) error

Comment on lines +650 to +707
	inst.Status = domain.StatusPaused
	if err := s.repo.Update(ctx, inst); err != nil {
		return err
	}

	if err := s.auditSvc.Log(ctx, inst.UserID, "instance.pause", "instance", inst.ID.String(), map[string]interface{}{
		"name": inst.Name,
	}); err != nil {
		s.logger.Warn("failed to log audit event", "action", "instance.pause", "instance_id", inst.ID, "error", err)
	}

	platform.InstanceOperationsTotal.WithLabelValues("pause", "success").Inc()
	s.logger.Info("instance paused", "instance_id", inst.ID)
	return nil
}

// ResumeInstance resumes a paused instance back to running state.
func (s *InstanceService) ResumeInstance(ctx context.Context, idOrName string) error {
	userID := appcontext.UserIDFromContext(ctx)
	tenantID := appcontext.TenantIDFromContext(ctx)

	if err := s.rbacSvc.Authorize(ctx, userID, tenantID, domain.PermissionInstanceUpdate, idOrName); err != nil {
		return err
	}

	inst, err := s.GetInstance(ctx, idOrName)
	if err != nil {
		return err
	}

	if inst.Status != domain.StatusPaused {
		return errors.New(errors.Conflict, "instance must be PAUSED to resume, got: "+string(inst.Status))
	}

	target := inst.ContainerID
	if target == "" {
		target = s.formatContainerName(inst.ID)
	}

	if err := s.compute.ResumeInstance(ctx, target); err != nil {
		platform.InstanceOperationsTotal.WithLabelValues("resume", "failure").Inc()
		s.logger.Error("failed to resume container", "container_id", target, "error", err)
		return errors.Wrap(errors.Internal, "failed to resume container", err)
	}

	inst.Status = domain.StatusRunning
	if err := s.repo.Update(ctx, inst); err != nil {
		return err
	}

	if err := s.auditSvc.Log(ctx, inst.UserID, "instance.resume", "instance", inst.ID.String(), map[string]interface{}{
		"name": inst.Name,
	}); err != nil {
		s.logger.Warn("failed to log audit event", "action", "instance.resume", "instance_id", inst.ID, "error", err)
	}

	platform.InstanceOperationsTotal.WithLabelValues("resume", "success").Inc()
	s.logger.Info("instance resumed", "instance_id", inst.ID)

Copilot AI Apr 28, 2026


PauseInstance/ResumeInstance update the instance status in the DB but don't update the InstancesTotal gauge (unlike StartInstance/StopInstance). This will leave Prometheus instance counts inconsistent once instances move into/out of PAUSED. Consider decrementing running/incrementing paused on pause, and decrementing paused/incrementing running on resume (using the same backend label).

Copilot uses AI. Check for mistakes.
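The gauge fix the comment suggests could be sketched as follows; `statusGauge` is a hypothetical stand-in for the PR's Prometheus `InstancesTotal` gauge vector, so treat this as an illustration of the bookkeeping rather than the actual metrics API:

```go
package main

import "fmt"

// statusGauge stands in for a Prometheus GaugeVec keyed by (status, backend).
var statusGauge = map[string]int{}

func adjust(status, backend string, delta int) {
	statusGauge[status+"/"+backend] += delta
}

// onPause mirrors the suggested fix: decrement running, increment paused.
func onPause(backend string) {
	adjust("running", backend, -1)
	adjust("paused", backend, +1)
}

// onResume is the inverse transition.
func onResume(backend string) {
	adjust("paused", backend, -1)
	adjust("running", backend, +1)
}

func main() {
	adjust("running", "docker", 3) // three running instances on this backend
	onPause("docker")
	fmt.Println(statusGauge["running/docker"], statusGauge["paused/docker"]) // 2 1
	onResume("docker")
	fmt.Println(statusGauge["running/docker"], statusGauge["paused/docker"]) // 3 0
}
```

On pause the running count drops and the paused count rises by the same amount under the same backend label, so the per-backend totals stay consistent as instances move through PAUSED.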
Comment thread internal/repositories/docker/adapter.go Outdated
if err != nil {
	return fmt.Errorf("failed to update container %s: %w", id, err)
}
if resp.Warnings != nil {

Copilot AI Apr 28, 2026

In ResizeInstance, resp.Warnings is a slice; it can be non-nil but empty. Logging a warning when len(resp.Warnings)==0 can create noisy logs. Consider checking len(resp.Warnings) > 0 instead of resp.Warnings != nil.

Suggested change
if resp.Warnings != nil {
if len(resp.Warnings) > 0 {

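For context, a Go slice can be non-nil yet empty, which is exactly the case where the nil check misfires; a minimal sketch:

```go
package main

import "fmt"

func main() {
	var fromAbsent []string // nil slice: no warnings field at all
	fromEmpty := []string{} // non-nil but empty: zero warnings present

	fmt.Println(fromAbsent != nil, len(fromAbsent) > 0) // false false
	fmt.Println(fromEmpty != nil, len(fromEmpty) > 0)   // true false — the nil check would log here
}
```

The `len(resp.Warnings) > 0` form covers both cases and only logs when there is actually something to report.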
Comment on lines +158 to +176
func (r *ResilientCompute) GetInstanceLogs(ctx context.Context, id string) (io.ReadCloser, error) {
	var rc io.ReadCloser
	err := r.callProtected(ctx, r.opts.CallTimeout, func(ctx context.Context) error {
		var e error
		rc, e = r.inner.GetInstanceLogs(ctx, id)
		return e
	})
	return rc, err
}

func (r *ResilientCompute) GetInstanceStats(ctx context.Context, id string) (io.ReadCloser, error) {
	var rc io.ReadCloser
	err := r.callProtected(ctx, r.opts.CallTimeout, func(ctx context.Context) error {
		var e error
		rc, e = r.inner.GetInstanceStats(ctx, id)
		return e
	})
	return rc, err
}

Copilot AI Apr 28, 2026

ResilientCompute's timeout context is canceled before the caller can consume the returned stream. For GetInstanceLogs/GetInstanceStats, callProtected creates ctx2 and defers cancel(), so ctx2 is canceled as soon as the backend returns the ReadCloser, which can prematurely terminate Docker/libvirt log/stats streams. Consider either (1) not using callProtected for streaming methods, or (2) returning a wrapped ReadCloser whose Close() triggers cancel, so the context stays alive while the stream is read.

Comment on lines +98 to +103
// callProtected runs fn through bulkhead → circuit breaker → timeout.
func (r *ResilientCompute) callProtected(ctx context.Context, timeout time.Duration, fn func(ctx context.Context) error) error {
	return r.bulkhead.Execute(ctx, func() error {
		return r.cb.Execute(func() error {
			ctx2, cancel := context.WithTimeout(ctx, timeout)
			defer cancel()

Copilot AI Apr 28, 2026

callProtected applies the per-call timeout only inside the circuit-breaker closure, after a bulkhead slot is acquired. This means total request time can exceed CallTimeout by up to BulkheadWait (or indefinitely if BulkheadWait=0 and ctx has no deadline). If the intent is a true end-to-end per-call timeout, create the timeout context before bulkhead acquisition and use that context throughout (bulkhead + circuit breaker + backend call).

Suggested change
// callProtected runs fn through bulkhead → circuit breaker → timeout.
func (r *ResilientCompute) callProtected(ctx context.Context, timeout time.Duration, fn func(ctx context.Context) error) error {
return r.bulkhead.Execute(ctx, func() error {
return r.cb.Execute(func() error {
ctx2, cancel := context.WithTimeout(ctx, timeout)
defer cancel()
// callProtected runs fn through timeout → bulkhead → circuit breaker so the
// per-call timeout applies to the full operation, including bulkhead wait time.
func (r *ResilientCompute) callProtected(ctx context.Context, timeout time.Duration, fn func(ctx context.Context) error) error {
ctx2, cancel := context.WithTimeout(ctx, timeout)
defer cancel()
return r.bulkhead.Execute(ctx2, func() error {
return r.cb.Execute(func() error {

Comment on lines +52 to +77
func (b *Bulkhead) acquire(ctx context.Context) error {
	select {
	case <-ctx.Done():
		return ErrBulkheadFull
	default:
	}

	if b.timeout > 0 {
		timer := time.NewTimer(b.timeout)
		defer timer.Stop()
		select {
		case b.sem <- struct{}{}:
			return nil
		case <-timer.C:
			return ErrBulkheadFull
		case <-ctx.Done():
			return ErrBulkheadFull
		}
	}
	// No explicit timeout — rely on context.
	select {
	case b.sem <- struct{}{}:
		return nil
	case <-ctx.Done():
		return ErrBulkheadFull
	}

Copilot AI Apr 28, 2026

Bulkhead.acquire returns ErrBulkheadFull for all ctx.Done() cases (including caller cancellation/deadline exceeded). This loses the original cancellation reason and can misclassify failures as bulkhead saturation. Consider returning ctx.Err() when the context is already done, and reserve ErrBulkheadFull for actual semaphore wait timeouts (timer.C) or when ctx expires specifically while waiting for a slot (wrapping ctx.Err() if you still want a sentinel).

Comment on lines +130 to +131
reMemory: regexp.MustCompile(`<memory unit="KiB">\d+</memory>`),
reCurrentMemory: regexp.MustCompile(`<currentMemory unit="KiB">\d+</currentMemory>`),

Copilot AI Apr 28, 2026

The precompiled memory/currentMemory regexes only match double-quoted attributes (unit="KiB"), but the domain XML generated in templates.go uses single quotes (unit='KiB') and does not include a <currentMemory> tag. As a result, ResizeInstance will not update memory at all for domains created by this adapter. Update the regexes to handle the actual template output (both quote styles) or, preferably, parse/modify the XML structurally instead of regex substitution.

Suggested change
reMemory: regexp.MustCompile(`<memory unit="KiB">\d+</memory>`),
reCurrentMemory: regexp.MustCompile(`<currentMemory unit="KiB">\d+</currentMemory>`),
reMemory: regexp.MustCompile(`<memory unit=['"]KiB['"]>\d+</memory>`),
reCurrentMemory: regexp.MustCompile(`<currentMemory unit=['"]KiB['"]>\d+</currentMemory>`),

Comment on lines +249 to +255
// applyDomainResize updates vCPU and memory in domain XML.
func (a *LibvirtAdapter) applyDomainResize(xml string, memoryKiB, vcpus int) string {
	// Replace memory allocation
	xml = a.reMemory.ReplaceAllString(xml, fmt.Sprintf(`<memory unit="KiB">%d</memory>`, memoryKiB))
	xml = a.reCurrentMemory.ReplaceAllString(xml, fmt.Sprintf(`<currentMemory unit="KiB">%d</currentMemory>`, memoryKiB))
	// Replace vCPU count
	xml = a.reVCPU.ReplaceAllString(xml, fmt.Sprintf(`<vcpu>%d</vcpu>`, vcpus))

Copilot AI Apr 28, 2026

applyDomainResize replaces the entire <vcpu ...>... element with a bare %d, which drops existing attributes like placement='static' from the template and may change libvirt behavior. Consider preserving attributes via capturing groups (e.g., keep the opening tag) or using an XML parser to update only the numeric text content.

Suggested change
// applyDomainResize updates vCPU and memory in domain XML.
func (a *LibvirtAdapter) applyDomainResize(xml string, memoryKiB, vcpus int) string {
// Replace memory allocation
xml = a.reMemory.ReplaceAllString(xml, fmt.Sprintf(`<memory unit="KiB">%d</memory>`, memoryKiB))
xml = a.reCurrentMemory.ReplaceAllString(xml, fmt.Sprintf(`<currentMemory unit="KiB">%d</currentMemory>`, memoryKiB))
// Replace vCPU count
xml = a.reVCPU.ReplaceAllString(xml, fmt.Sprintf(`<vcpu>%d</vcpu>`, vcpus))
// replaceVCPUCountPreservingAttributes updates the vCPU count while preserving any existing
// attributes on the <vcpu> element, such as placement="static".
func (a *LibvirtAdapter) replaceVCPUCountPreservingAttributes(xml string, vcpus int) string {
reVCPUContent := regexp.MustCompile(`(?s)^(<vcpu\b[^>]*>)[^<]*(</vcpu>)$`)
return a.reVCPU.ReplaceAllStringFunc(xml, func(match string) string {
parts := reVCPUContent.FindStringSubmatch(match)
if len(parts) != 3 {
return match
}
return fmt.Sprintf(`%s%d%s`, parts[1], vcpus, parts[2])
})
}
// applyDomainResize updates vCPU and memory in domain XML.
func (a *LibvirtAdapter) applyDomainResize(xml string, memoryKiB, vcpus int) string {
// Replace memory allocation
xml = a.reMemory.ReplaceAllString(xml, fmt.Sprintf(`<memory unit="KiB">%d</memory>`, memoryKiB))
xml = a.reCurrentMemory.ReplaceAllString(xml, fmt.Sprintf(`<currentMemory unit="KiB">%d</currentMemory>`, memoryKiB))
// Replace vCPU count while preserving existing <vcpu> attributes.
xml = a.replaceVCPUCountPreservingAttributes(xml, vcpus)


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 13

🧹 Nitpick comments (4)
internal/platform/circuit_breaker.go (1)

84-92: Replace hard-coded default values with named constants

Lines 85, 88, and 91 hard-code breaker defaults. Promote them to constants to keep defaults centralized and avoid scattered literals.

Proposed refactor
+const (
+	defaultFailureThreshold = 5
+	defaultResetTimeout     = 30 * time.Second
+	defaultSuccessRequired  = 1
+)
+
 func NewCircuitBreakerWithOpts(opts CircuitBreakerOpts) *CircuitBreaker {
 	if opts.Threshold <= 0 {
-		opts.Threshold = 5
+		opts.Threshold = defaultFailureThreshold
 	}
 	if opts.ResetTimeout <= 0 {
-		opts.ResetTimeout = 30 * time.Second
+		opts.ResetTimeout = defaultResetTimeout
 	}
 	if opts.SuccessRequired <= 0 {
-		opts.SuccessRequired = 1
+		opts.SuccessRequired = defaultSuccessRequired
 	}

As per coding guidelines **/*.go: Do not use magic numbers - use named constants instead.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/platform/circuit_breaker.go` around lines 84 - 92, Replace the
hard-coded magic numbers in the circuit breaker defaults with named constants:
declare DefaultThreshold (int = 5), DefaultResetTimeout (time.Duration = 30 *
time.Second) and DefaultSuccessRequired (int = 1) near the top of the circuit
breaker implementation, then use these constants in the initialization block
that currently checks opts.Threshold, opts.ResetTimeout and opts.SuccessRequired
(replace the literals 5, 30 * time.Second and 1 with the corresponding
constants) so defaults are centralized and self-documented.
internal/repositories/docker/fakes_test.go (1)

98-104: Track pause/unpause calls in the fake for better test assertions

ContainerPause and ContainerUnpause currently return success without recording calls, unlike other fake methods. Incrementing counters here improves verification of pause/resume flows.

Proposed refactor
 func (f *fakeDockerClient) ContainerPause(ctx context.Context, containerID string) error {
+	f.inc("ContainerPause")
 	return nil
 }

 func (f *fakeDockerClient) ContainerUnpause(ctx context.Context, containerID string) error {
+	f.inc("ContainerUnpause")
 	return nil
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/repositories/docker/fakes_test.go` around lines 98 - 104, The fake
Docker client methods ContainerPause and ContainerUnpause don't record calls;
update the fakeDockerClient to track these invocations (e.g., add
pauseCount/unpauseCount or record entries in an operations slice and protect
with the same mutex used by other fake methods) and increment the appropriate
counter/append an entry inside ContainerPause and ContainerUnpause so tests can
assert pause/unpause behavior; ensure you follow the existing fake's concurrency
pattern and naming conventions used by other methods in fakeDockerClient.
internal/core/services/instance_unit_test.go (1)

981-1096: Consider table-driving the pause/resume cases.

The new suite repeats the same setup and assertion pattern several times. A table would cut the duplication and make it easier to add more pause/resume permutations later.

As per coding guidelines, use table-driven tests in test files.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/core/services/instance_unit_test.go` around lines 981 - 1096, The
tests for PauseInstance_* and ResumeInstance_* duplicate setup/expectation
logic; convert them into a single table-driven loop that defines cases (name,
method: "PauseInstance" or "ResumeInstance", initial Status, compute error or
nil, expected final Status or error substring, whether Update and Audit should
be asserted). In the loop call svc.PauseInstance or svc.ResumeInstance based on
the case, set up common mocks once per case (repo.GetByName Maybe, repo.GetByID
-> inst, rbacSvc.Authorize for PermissionInstanceUpdate and
PermissionInstanceRead), conditionally set
compute.On("PauseInstance"/"ResumeInstance", mock.Anything, "cid-1").Return(...)
and compute.On("Type").Return("docker").Maybe(), and conditionally assert
repo.On("Update", ...) and auditSvc.On("Log", ...).Return(...).Use
mock.MatchedBy for status checks and assert error/no-error and inst.Status per
case; this collapses the repeated logic while keeping behavior for PauseInstance
and ResumeInstance.
internal/platform/resilient_compute.go (1)

33-52: Extract the default limits into named constants.

The constructor hard-codes several fallback durations and thresholds, which makes the defaults harder to audit. As per coding guidelines, Do not use magic numbers - use named constants instead.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/platform/resilient_compute.go` around lines 33 - 52, Extract the
hard-coded defaults in ResilientComputeOpts.withDefaults into named
package-level constants (e.g. defaultCallTimeout, defaultLongCallTimeout,
defaultCBThreshold, defaultCBResetTimeout, defaultBulkheadMaxConc,
defaultBulkheadWait) declared near the top of the file; then replace the literal
values used when setting o.CallTimeout, o.LongCallTimeout, o.CBThreshold,
o.CBResetTimeout, o.BulkheadMaxConc and o.BulkheadWait with those constants to
remove magic numbers and make the defaults easy to audit.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/core/services/instance_unit_test.go`:
- Line 303: The test expectations that call repo.On("GetByName", mock.Anything,
instanceID.String()).Return(...).Maybe() should be required instead of optional;
locate the GetByName expectations in instance_unit_test.go (the repo.On calls
that use "GetByName" with instanceID.String()) and remove the .Maybe() so the
mock requires that GetByName is invoked; apply the same change for each
occurrence noted (lines corresponding to the other GetByName expectations) to
ensure the name-first resolution path exercised by InstanceService is enforced
by the tests.

In `@internal/core/services/instance.go`:
- Around line 644-653: When s.compute.PauseInstance (and symmetrically
s.compute.ResumeInstance) succeeds but s.repo.Update fails, add a best-effort
compensating rollback before returning to avoid state drift: after setting
inst.Status and detecting s.repo.Update error inside the methods that call
s.compute.PauseInstance/ResumeInstance (e.g., the current Pause flow and the
Resume flow), call the opposite compute operation (call s.compute.ResumeInstance
on Pause failure path, call s.compute.PauseInstance on Resume failure path) and
log both the repo update error and any rollback error (using s.logger and
platform.InstanceOperationsTotal metrics); ensure the original repo update error
is still returned but include rollback outcome in logs so the system records
both attempts.

In `@internal/platform/bulkhead.go`:
- Around line 22-38: The Bulkhead default WaitTimeout from BulkheadOpts isn't
applied in NewBulkhead, so callers get a zero timeout and acquire may block
until context cancel; modify NewBulkhead to set opts.WaitTimeout to
5*time.Second when it's zero (e.g., if opts.WaitTimeout == 0 { opts.WaitTimeout
= 5 * time.Second }) and assign that value to the Bulkhead.timeout field so
functions like acquire honor the documented 5s default.

In `@internal/platform/circuit_breaker.go`:
- Around line 195-206: The recordSuccess logic currently treats any non-HalfOpen
state the same and resets failureCount/state, which lets a late success close an
already-OPEN breaker; modify the switch in recordSuccess so only an actual
StateClosed path clears failureCount and sets state (i.e., add an explicit case
StateClosed that resets failureCount/state) and do nothing when cb.state ==
StateOpen so stale successes cannot transition an open breaker to closed; keep
the HalfOpen handling (successCount, successRequired and
transitionLocked(StateClosed)) as-is and reference cb.successCount,
cb.failureCount, cb.successRequired, cb.transitionLocked, and the state
constants StateHalfOpen/StateClosed/StateOpen.

In `@internal/platform/resilient_compute_test.go`:
- Around line 128-330: Multiple tests in resilient_compute_test.go repeat the
same setup and assertions; refactor them into table-driven subtests: consolidate
scenarios like TestResilientComputePassthrough,
TestResilientComputeCircuitTrips, TestResilientComputeBulkheadLimits,
TestResilientComputeTimeout, TestResilientComputePingBypassesBulkhead and the
ResizeInstance subtests into one or more table-driven tests where each case
defines a name, mock (mockCompute with delay/err), ResilientComputeOpts, the
invoked method(s) (e.g., StartInstance, Ping, ResizeInstance,
LaunchInstanceWithOptions, Exec), and expected outcomes (error type like
ErrCircuitOpen/ErrBulkheadFull or expected return values/call counts); iterate
cases using t.Run and reuse NewResilientCompute and common ctx/setup code,
asserting results per-case so duplication is removed while keeping checks for
callCount, timing expectations, and bypass behavior.
- Around line 19-125: Replace the hand-rolled mockCompute with a testify-based
mock by defining a MockCompute struct that embeds testify/mock.Mock and
implementing each interface method to call m.Called(...) and return the expected
typed results (implement LaunchInstanceWithOptions, StartInstance, StopInstance,
DeleteInstance, GetInstanceLogs, GetInstanceStats, GetInstancePort,
GetInstanceIP, GetConsoleURL, Exec, RunTask, WaitTask, CreateNetwork,
DeleteNetwork, AttachVolume, DetachVolume, Ping, ResizeInstance, PauseInstance,
ResumeInstance, and Type). Remove atomic.Int64 callCount and delay fields from
the mock; if tests need call counting or simulated delay use testify's
mock.AssertNumberOfCalls and the Run(func(args mock.Arguments){...}) hook on
specific method expectations to increment counters or sleep. Ensure each method
converts mock.Called return values to the correct concrete types (string,
[]string, int, io.ReadCloser, error, etc.) to match the original signatures.

In `@internal/platform/resilient_compute.go`:
- Around line 152-156: The ResizeInstance method currently uses r.callProtected
with r.opts.CallTimeout which is too short for destructive/long-running
operations; change the call to use r.opts.LongCallTimeout (or a dedicated resize
timeout field) so ResilientCompute.ResizeInstance calls r.callProtected(ctx,
r.opts.LongCallTimeout, ...) instead of r.opts.CallTimeout to avoid backend
cancellations during VM resize transitions.

In `@internal/repositories/firecracker/adapter.go`:
- Around line 154-173: The MockMode branches in PauseInstance, ResumeInstance
and ResizeInstance currently return nil and silently report success; change them
to return ErrNotSupported (or a specific ErrMockModeNotSupported if you prefer)
when a.cfg.MockMode is true so callers get a non-nil error for unsupported
operations; update the functions PauseInstance, ResumeInstance and
ResizeInstance to return the error rather than nil and ensure ErrNotSupported is
the error used so behavior is consistent with non-mock paths.
- Around line 27-29: Remove the package-level ErrNotSupported sentinel from
adapter.go and stop exporting a global error; instead return a locally created
error where needed (e.g., use fmt.Errorf or errors.New inside the
function/method that currently returns ErrNotSupported) or define a
function-scoped or method-scoped typed error if callers need type assertions;
update all references to ErrNotSupported to use the new local error creation
(search for ErrNotSupported usages) so no package-level global var remains
alongside idRegex.

In `@internal/repositories/libvirt/adapter.go`:
- Around line 221-243: The code currently undefines the live domain
(DomainUndefine) before ensuring the new XML can be successfully defined,
risking the VM being left offline; instead, validate and create the replacement
definition first by calling DomainDefineXML with a temporary/unique name/UUID
(generate a temp name or modify the <name> or <uuid> in newDOMXML returned from
applyDomainResize) to ensure the XML parses and a definition can be created,
then if that succeeds, undefine the original domain (DomainUndefine on dom) and
redefine the final newDOMXML (or rename the temp definition to the original
name/uuid as needed), and only then start the domain (DomainCreate) if
wasRunning; reference functions/methods: DomainGetXMLDesc, applyDomainResize,
DomainDefineXML, DomainUndefine, DomainCreate.
- Around line 249-257: The applyDomainResize function currently mutates domain
XML with regex (see applyDomainResize and the regexes reMemory, reCurrentMemory,
reVCPU), losing vCPU attributes (placement/cpuset) and mishandling memory units;
replace the regex approach by parsing the XML into an element tree (e.g., using
Go's encoding/xml or an XML DOM library), locate the <vcpu> element and update
only its inner text while preserving all attributes, locate <memory> and
<currentMemory> elements, detect/convert their existing unit attribute to the
desired KiB value (or set the unit attribute to "KiB" after converting) instead
of hardcoding unit="KiB", and then marshal the modified XML back to string;
update applyDomainResize to use these parsed-element updates and remove reliance
on reMemory/reCurrentMemory/reVCPU regex replacements.

In `@internal/workers/pipeline_worker_test.go`:
- Around line 161-169: The three mock methods on mockComputeBackendExtended
(PauseInstance, ResumeInstance, ResizeInstance) currently return nil and
silently succeed; update each to record the call via m.Called(...) (e.g., for
ResizeInstance pass ctx, id, cpu, memory) and return the mocked error/result
from the call (use args.Error(0) / args.Get(0) as appropriate) so tests can
assert invocations and errors instead of masking unexpected calls.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b1fd7891-edeb-490e-ac8b-384c0d8e1d6d

📥 Commits

Reviewing files that changed from the base of the PR and between ec3e19c and b11d04d.

📒 Files selected for processing (28)
  • internal/api/setup/router.go
  • internal/core/domain/instance.go
  • internal/core/ports/compute.go
  • internal/core/ports/instance.go
  • internal/core/services/instance.go
  • internal/core/services/instance_unit_test.go
  • internal/core/services/mock_compute_test.go
  • internal/handlers/instance_handler.go
  • internal/handlers/instance_handler_test.go
  • internal/platform/bulkhead.go
  • internal/platform/circuit_breaker.go
  • internal/platform/resilient_compute.go
  • internal/platform/resilient_compute_test.go
  • internal/repositories/docker/adapter.go
  • internal/repositories/docker/fakes_test.go
  • internal/repositories/firecracker/adapter.go
  • internal/repositories/firecracker/adapter_noop.go
  • internal/repositories/k8s/kubeadm_provisioner_test.go
  • internal/repositories/k8s/mocks_test.go
  • internal/repositories/libvirt/adapter.go
  • internal/repositories/libvirt/lb_proxy_test.go
  • internal/repositories/libvirt/libvirt_client.go
  • internal/repositories/libvirt/mock_client_test.go
  • internal/repositories/libvirt/real_client.go
  • internal/repositories/noop/adapters.go
  • internal/workers/database_failover_worker_test.go
  • internal/workers/healing_worker_test.go
  • internal/workers/pipeline_worker_test.go

Comment on lines +36 to +37
// StatusPaused indicates the instance is paused (frozen CPU, retained memory/network).
StatusPaused InstanceStatus = "PAUSED"

⚠️ Potential issue | 🟠 Major

StatusPaused introduction is not reflected in existing status switch paths

Adding PAUSED at Line 37 needs corresponding handling in current status-based switches; provided snippets show gaps in:

  • internal/core/services/instance.go:868-879 (termination metrics),
  • internal/core/services/dashboard.go:55-66 (running/stopped summary),
  • internal/autoscaler/server.go:248-260 (proto state mapping).

Without those updates, paused instances are silently miscounted or unmapped.
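One way to keep those switch paths honest once PAUSED exists is a mapping helper whose default arm reports unmapped states instead of silently dropping them. The sketch below uses illustrative names, not the repository's actual enums or dashboard code:

```go
package main

import "fmt"

type InstanceStatus string

const (
	StatusRunning InstanceStatus = "RUNNING"
	StatusStopped InstanceStatus = "STOPPED"
	StatusPaused  InstanceStatus = "PAUSED"
)

// summaryBucket maps a status to the dashboard bucket it should count under.
// The explicit PAUSED arm keeps the summary from miscounting paused instances,
// and the error default surfaces any future status that lacks a mapping.
func summaryBucket(s InstanceStatus) (string, error) {
	switch s {
	case StatusRunning:
		return "running", nil
	case StatusStopped:
		return "stopped", nil
	case StatusPaused:
		return "paused", nil
	default:
		return "", fmt.Errorf("unmapped instance status %q", s)
	}
}

func main() {
	for _, s := range []InstanceStatus{StatusRunning, StatusPaused, "ZOMBIE"} {
		bucket, err := summaryBucket(s)
		fmt.Println(s, bucket, err)
	}
}
```

The same pattern applies to the termination-metrics and proto-state switches the comment lists: each one needs an explicit PAUSED arm, and an error (or log) in the default keeps the next status addition from slipping through unnoticed.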

Comment thread internal/core/services/instance_unit_test.go
Comment thread internal/core/services/instance.go
Comment thread internal/platform/bulkhead.go
Comment on lines +195 to +206
switch cb.state {
case StateHalfOpen:
	cb.successCount++
	if cb.successCount >= cb.successRequired {
		cb.failureCount = 0
		cb.successCount = 0
		cbFunc, name, from, to, changed = cb.transitionLocked(StateClosed)
	}
default:
	cb.failureCount = 0
	cb.state = StateClosed
}

⚠️ Potential issue | 🔴 Critical

Do not close an OPEN breaker from stale successes

At Line 205, recordSuccess can set StateOpen -> StateClosed in the default branch. Under concurrency, a request admitted while closed can finish successfully after another request has already opened the breaker, and this code will close it prematurely.

Proposed fix
 func (cb *CircuitBreaker) recordSuccess() {
@@
 	switch cb.state {
 	case StateHalfOpen:
 		cb.successCount++
 		if cb.successCount >= cb.successRequired {
 			cb.failureCount = 0
 			cb.successCount = 0
 			cbFunc, name, from, to, changed = cb.transitionLocked(StateClosed)
 		}
-	default:
+	case StateClosed:
 		cb.failureCount = 0
-		cb.state = StateClosed
+	case StateOpen:
+		// Ignore stale successes from requests admitted before breaker opened.
 	}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
switch cb.state {
case StateHalfOpen:
cb.successCount++
if cb.successCount >= cb.successRequired {
cb.failureCount = 0
cb.successCount = 0
cbFunc, name, from, to, changed = cb.transitionLocked(StateClosed)
}
default:
cb.failureCount = 0
cb.state = StateClosed
}
switch cb.state {
case StateHalfOpen:
cb.successCount++
if cb.successCount >= cb.successRequired {
cb.failureCount = 0
cb.successCount = 0
cbFunc, name, from, to, changed = cb.transitionLocked(StateClosed)
}
case StateClosed:
cb.failureCount = 0
case StateOpen:
// Ignore stale successes from requests admitted before breaker opened.
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/platform/circuit_breaker.go` around lines 195 - 206, The
recordSuccess logic currently treats any non-HalfOpen state the same and resets
failureCount/state, which lets a late success close an already-OPEN breaker;
modify the switch in recordSuccess so only an actual StateClosed path clears
failureCount and sets state (i.e., add an explicit case StateClosed that resets
failureCount/state) and do nothing when cb.state == StateOpen so stale successes
cannot transition an open breaker to closed; keep the HalfOpen handling
(successCount, successRequired and transitionLocked(StateClosed)) as-is and
reference cb.successCount, cb.failureCount, cb.successRequired,
cb.transitionLocked, and the state constants
StateHalfOpen/StateClosed/StateOpen.

Comment thread internal/repositories/firecracker/adapter.go Outdated
Comment on lines +154 to +173
func (a *FirecrackerAdapter) PauseInstance(ctx context.Context, id string) error {
	if a.cfg.MockMode {
		return nil
	}
	return ErrNotSupported
}

func (a *FirecrackerAdapter) ResumeInstance(ctx context.Context, id string) error {
	if a.cfg.MockMode {
		return nil
	}
	return ErrNotSupported
}

func (a *FirecrackerAdapter) ResizeInstance(ctx context.Context, id string, cpu, memory int64) error {
	if a.cfg.MockMode {
		return nil
	}
	return ErrNotSupported
}

⚠️ Potential issue | 🟠 Major

MockMode should not silently return success for unsupported operations.

Returning nil for pause/resume/resize in MockMode can falsely mark instances as paused/resumed/resized even though no backend action exists.

♻️ Proposed fix
 func (a *FirecrackerAdapter) PauseInstance(ctx context.Context, id string) error {
 	if a.cfg.MockMode {
-		return nil
+		return errNotSupported()
 	}
-	return ErrNotSupported
+	return errNotSupported()
 }

 func (a *FirecrackerAdapter) ResumeInstance(ctx context.Context, id string) error {
 	if a.cfg.MockMode {
-		return nil
+		return errNotSupported()
 	}
-	return ErrNotSupported
+	return errNotSupported()
 }

 func (a *FirecrackerAdapter) ResizeInstance(ctx context.Context, id string, cpu, memory int64) error {
 	if a.cfg.MockMode {
-		return nil
+		return errNotSupported()
 	}
-	return ErrNotSupported
+	return errNotSupported()
 }

As per coding guidelines **/*.go: Do not use silent failures - avoid blank identifier assignment like _ = someFunc().

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/repositories/firecracker/adapter.go` around lines 154 - 173, The
MockMode branches in PauseInstance, ResumeInstance and ResizeInstance currently
return nil and silently report success; change them to return ErrNotSupported
(or a specific ErrMockModeNotSupported if you prefer) when a.cfg.MockMode is
true so callers get a non-nil error for unsupported operations; update the
functions PauseInstance, ResumeInstance and ResizeInstance to return the error
rather than nil and ensure ErrNotSupported is the error used so behavior is
consistent with non-mock paths.

Comment thread internal/repositories/libvirt/adapter.go
Comment on lines +249 to +257
// applyDomainResize updates vCPU and memory in domain XML.
func (a *LibvirtAdapter) applyDomainResize(xml string, memoryKiB, vcpus int) string {
// Replace memory allocation
xml = a.reMemory.ReplaceAllString(xml, fmt.Sprintf(`<memory unit="KiB">%d</memory>`, memoryKiB))
xml = a.reCurrentMemory.ReplaceAllString(xml, fmt.Sprintf(`<currentMemory unit="KiB">%d</currentMemory>`, memoryKiB))
// Replace vCPU count
xml = a.reVCPU.ReplaceAllString(xml, fmt.Sprintf(`<vcpu>%d</vcpu>`, vcpus))
return xml
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat -n internal/repositories/libvirt/adapter.go | head -300 | tail -80

Repository: poyrazK/thecloud

Length of output: 3345


🏁 Script executed:

rg "reMemory|reCurrentMemory|reVCPU" internal/repositories/libvirt/adapter.go -A 1 -B 1

Repository: poyrazK/thecloud

Length of output: 885


🌐 Web query:

libvirt XML memory unit variants KiB B MiB specification

💡 Result:

In libvirt domain XML, memory-related elements (such as `<memory>`, `<currentMemory>`, `<maxMemory>`, and sizes in memory devices) support an optional `unit` attribute that specifies the unit for the numeric value. The default unit is "KiB" (kibibytes, 2^10 bytes). Valid unit variants include: "b"/"bytes" (1 byte), "KB" (10^3 bytes), "k"/"KiB" (2^10 bytes), "MB" (10^6 bytes), "M"/"MiB" (2^20 bytes), "GB" (10^9 bytes), "G"/"GiB" (2^30 bytes), "TB" (10^12 bytes), and "T"/"TiB" (2^40 bytes). Values are rounded up to the nearest kibibyte by libvirt and may be further adjusted by the hypervisor. These units apply consistently across memory allocation, backing, devices, and related elements like memtune. Examples: `<memory unit="KiB">1048576</memory>`, `<memory unit="MiB">1024</memory>`, and `<memory unit="GiB">1</memory>` all specify 1 GiB.


🌐 Web query:

libvirt vcpu tag attributes placement cpuset specification

💡 Result:

In libvirt domain XML configuration, the `<vcpu>` element supports two key attributes related to CPU placement and cpuset specification: `placement` and `cpuset`. The `placement` attribute (since libvirt 0.9.11, QEMU/KVM only) indicates the CPU placement mode for the domain process: 'static' (the default if a cpuset is specified) pins the domain process to the cpuset or all available CPUs; 'auto' pins to the advisory nodeset from the numad daemon, ignoring cpuset. The `cpuset` attribute (since 0.4.4) is a comma-separated list of host physical CPU numbers or ranges for default pinning of the domain process and vCPUs. Syntax: single numbers (e.g., '0'), ranges (e.g., '1-4'), exclusions (e.g., '^3' excludes CPU 3 from a prior range). Example: cpuset="1-4,^3,6" pins to CPUs 1, 2, 4, 6. Overrides: the cpuset is ignored if cputune/emulatorpin is specified (for the process) or vcpupin for specific vCPUs; unspecified vCPUs without vcpupin use this cpuset. Example XML: `<vcpu placement="static" cpuset="1-4,^3,6">2</vcpu>`. For finer control, use `<cputune>` with `<vcpupin>` and `<emulatorpin>`. These apply primarily to QEMU/KVM; check hypervisor support.


Use proper XML parsing instead of regex-based mutation for domain XML modifications.

The applyDomainResize function has two problems:

  1. The <vcpu> regex matches attributes like placement and cpuset (<vcpu[^>]*>), but the replacement <vcpu>%d</vcpu> strips them entirely, causing data loss.
  2. The memory and currentMemory regexes only match unit="KiB", but libvirt supports multiple units (KiB, MiB, GiB, bytes, etc.). Existing XML with different units won't be replaced, and the hardcoded unit="KiB" replacement assumes a specific variant.

Parsing the XML tree would safely preserve attributes and handle unit variations correctly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/repositories/libvirt/adapter.go` around lines 249 - 257, The
applyDomainResize function currently mutates domain XML with regex (see
applyDomainResize and the regexes reMemory, reCurrentMemory, reVCPU), losing
vCPU attributes (placement/cpuset) and mishandling memory units; replace the
regex approach by parsing the XML into an element tree (e.g., using Go's
encoding/xml or an XML DOM library), locate the <vcpu> element and update only
its inner text while preserving all attributes, locate <memory> and
<currentMemory> elements, detect/convert their existing unit attribute to the
desired KiB value (or set the unit attribute to "KiB" after converting) instead
of hardcoding unit="KiB", and then marshal the modified XML back to string;
update applyDomainResize to use these parsed-element updates and remove reliance
on reMemory/reCurrentMemory/reVCPU regex replacements.

Comment on lines +161 to +169
func (m *mockComputeBackendExtended) PauseInstance(ctx context.Context, id string) error {
return nil
}
func (m *mockComputeBackendExtended) ResumeInstance(ctx context.Context, id string) error {
return nil
}
func (m *mockComputeBackendExtended) ResizeInstance(ctx context.Context, id string, cpu, memory int64) error {
return nil
}

⚠️ Potential issue | 🟡 Minor

New mock methods silently succeed and can mask unexpected calls

At Lines 162, 165, and 168 the methods return nil without recording invocations. Wire them through m.Called(...) so tests can assert expected behavior and catch accidental usage.

Proposed fix
 func (m *mockComputeBackendExtended) PauseInstance(ctx context.Context, id string) error {
-	return nil
+	return m.Called(ctx, id).Error(0)
 }
 func (m *mockComputeBackendExtended) ResumeInstance(ctx context.Context, id string) error {
-	return nil
+	return m.Called(ctx, id).Error(0)
 }
 func (m *mockComputeBackendExtended) ResizeInstance(ctx context.Context, id string, cpu, memory int64) error {
-	return nil
+	return m.Called(ctx, id, cpu, memory).Error(0)
 }

As per coding guidelines, "**/*.go: Do not use silent failures - avoid blank identifier assignment like _ = someFunc()".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/workers/pipeline_worker_test.go` around lines 161 - 169, The three
mock methods on mockComputeBackendExtended (PauseInstance, ResumeInstance,
ResizeInstance) currently return nil and silently succeed; update each to record
the call via m.Called(...) (e.g., for ResizeInstance pass ctx, id, cpu, memory)
and return the mocked error/result from the call (use args.Error(0) /
args.Get(0) as appropriate) so tests can assert invocations and errors instead
of masking unexpected calls.

poyrazK added 2 commits April 28, 2026 17:46
Add DomainGetState check to PauseInstance and ResumeInstance in
the libvirt adapter, consistent with existing patterns in other
methods (ResizeInstance, stopDomainIfRunning). This prevents
silently succeeding when the domain is already in the target state.
- Bulkhead.acquire now returns ctx.Err() when context is cancelled,
  instead of always returning ErrBulkheadFull, making the error
  type distinguishable from capacity exhaustion
- ResumeInstance error log now explicitly notes the instance is left
  in PAUSED state so operators can detect and recover from partial
  failures
Copilot AI review requested due to automatic review settings April 28, 2026 15:05
Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

poyrazK added 2 commits April 29, 2026 18:32
Resolve merge conflicts:
- Add PauseInstance/ResumeInstance to NoopComputeBackend
- Remove duplicate ResizeInstance in docker/adapter.go
- Add PauseInstance/ResumeInstance to ResilientCompute
- Add PauseInstance/ResumeInstance to mockCompute test struct
- Remove duplicate ResizeInstance in mock_compute_test.go
- Fix bulkhead test to expect ctx.Err() instead of ErrBulkheadFull
- Add domainStatePaused constant to libvirt adapter
- Update bulkhead acquire to return ctx.Err() on context cancellation
…tations

- Remove duplicate ResizeInstance from firecracker/adapter.go (origin/main has correct version)
- Remove ErrNotSupported from firecracker PauseInstance/ResumeInstance (no-op backend)
- Ensure firecracker adapter_noop.go is correct
Copilot AI review requested due to automatic review settings April 29, 2026 15:39
Copilot AI left a comment


Pull request overview

Copilot reviewed 27 out of 27 changed files in this pull request and generated 9 comments.



Comment on lines +181 to +223
// PauseInstance suspends a running domain (freezes CPU, retains memory/network).
func (a *LibvirtAdapter) PauseInstance(ctx context.Context, id string) error {
dom, err := a.client.DomainLookupByName(ctx, id)
if err != nil {
return fmt.Errorf(errDomainNotFound, err)
}

state, _, err := a.client.DomainGetState(ctx, dom, 0)
if err != nil {
return fmt.Errorf("failed to get domain state: %w", err)
}
if state != domainStateRunning {
return fmt.Errorf("domain is not running (state=%d), cannot pause", state)
}

if err := a.client.DomainSuspend(ctx, dom); err != nil {
return fmt.Errorf("failed to suspend domain: %w", err)
}
a.logger.Info("domain paused", "domain", id)
return nil
}

// ResumeInstance resumes a paused domain back to running state.
func (a *LibvirtAdapter) ResumeInstance(ctx context.Context, id string) error {
dom, err := a.client.DomainLookupByName(ctx, id)
if err != nil {
return fmt.Errorf(errDomainNotFound, err)
}

state, _, err := a.client.DomainGetState(ctx, dom, 0)
if err != nil {
return fmt.Errorf("failed to get domain state: %w", err)
}
if state != domainStatePaused {
return fmt.Errorf("domain is not paused (state=%d), cannot resume", state)
}

if err := a.client.DomainResume(ctx, dom); err != nil {
return fmt.Errorf("failed to resume domain: %w", err)
}
a.logger.Info("domain resumed", "domain", id)
return nil
}
Comment thread internal/repositories/docker/adapter.go Outdated
ContainerExecInspect(ctx context.Context, execID string) (container.ExecInspect, error)
ContainerRename(ctx context.Context, containerID string, newName string) error
ContainerUpdate(ctx context.Context, containerID string, updateConfig container.UpdateConfig) (container.UpdateResponse, error)
Comment on lines +150 to 154
func (b *NoopComputeBackend) Ping(ctx context.Context) error { return nil }
func (b *NoopComputeBackend) Type() string { return "noop" }
func (b *NoopComputeBackend) PauseInstance(ctx context.Context, id string) error { return nil }
func (b *NoopComputeBackend) ResumeInstance(ctx context.Context, id string) error { return nil }
func (b *NoopComputeBackend) ResizeInstance(ctx context.Context, id string, cpu, memory int64) error { return nil }
Comment thread internal/core/services/instance.go Outdated
Comment on lines +698 to +699
s.logger.Error("failed to resume container", "container_id", target, "error", err)
return errors.Wrap(errors.Internal, "failed to resume container", err)
Comment thread compute-roadmap.md Outdated
Comment on lines +7 to +16
| PR | Title | Implementation Status |
|----|-------|----------------------|
| PR 1 | Image Management | ✅ COMPLETED |
| PR 2 | Instance Resize/Scale | ✅ COMPLETED |
| PR 3 | Instance Pause/Resume | ❌ NOT IMPLEMENTED |
| PR 4 | Provision Retry & Error Handling | ❌ NOT IMPLEMENTED |
| PR 5 | Instance Tags | ❌ NOT IMPLEMENTED |
| PR 6 | WebSocket Progress Events | ⚠️ PARTIALLY IMPLEMENTED |
| PR 7 | Instance Snapshots | ❌ NOT IMPLEMENTED |
| PR 8 | Instance Live Migration | ❌ NOT IMPLEMENTED |
Comment on lines +183 to +209
dom, err := a.client.DomainLookupByName(ctx, id)
if err != nil {
return fmt.Errorf(errDomainNotFound, err)
}

state, _, err := a.client.DomainGetState(ctx, dom, 0)
if err != nil {
return fmt.Errorf("failed to get domain state: %w", err)
}
if state != domainStateRunning {
return fmt.Errorf("domain is not running (state=%d), cannot pause", state)
}

if err := a.client.DomainSuspend(ctx, dom); err != nil {
return fmt.Errorf("failed to suspend domain: %w", err)
}
a.logger.Info("domain paused", "domain", id)
return nil
}

// ResumeInstance resumes a paused domain back to running state.
func (a *LibvirtAdapter) ResumeInstance(ctx context.Context, id string) error {
dom, err := a.client.DomainLookupByName(ctx, id)
if err != nil {
return fmt.Errorf(errDomainNotFound, err)
}

Comment on lines +152 to +164
func (a *FirecrackerAdapter) PauseInstance(ctx context.Context, id string) error {
if a.cfg.MockMode {
return nil
}
return nil // Firecracker does not support pause/resume
}

func (a *FirecrackerAdapter) ResumeInstance(ctx context.Context, id string) error {
if a.cfg.MockMode {
return nil
}
return nil // Firecracker does not support pause/resume
}
Comment on lines 52 to +66
@@ -65,15 +63,15 @@ func (b *Bulkhead) acquire(ctx context.Context) error {
case <-timer.C:
return ErrBulkheadFull
case <-ctx.Done():
return ErrBulkheadFull
return ctx.Err()
Comment on lines +657 to +715
inst.Status = domain.StatusPaused
if err := s.repo.Update(ctx, inst); err != nil {
return err
}

if err := s.auditSvc.Log(ctx, inst.UserID, "instance.pause", "instance", inst.ID.String(), map[string]interface{}{
"name": inst.Name,
}); err != nil {
s.logger.Warn("failed to log audit event", "action", "instance.pause", "instance_id", inst.ID, "error", err)
}

platform.InstanceOperationsTotal.WithLabelValues("pause", "success").Inc()
s.logger.Info("instance paused", "instance_id", inst.ID)
return nil
}

// ResumeInstance resumes a paused instance back to running state.
func (s *InstanceService) ResumeInstance(ctx context.Context, idOrName string) error {
userID := appcontext.UserIDFromContext(ctx)
tenantID := appcontext.TenantIDFromContext(ctx)

if err := s.rbacSvc.Authorize(ctx, userID, tenantID, domain.PermissionInstanceUpdate, idOrName); err != nil {
return err
}

inst, err := s.GetInstance(ctx, idOrName)
if err != nil {
return err
}

if inst.Status != domain.StatusPaused {
return errors.New(errors.Conflict, "instance must be PAUSED to resume, got: "+string(inst.Status))
}

target := inst.ContainerID
if target == "" {
target = s.formatContainerName(inst.ID)
}

if err := s.compute.ResumeInstance(ctx, target); err != nil {
platform.InstanceOperationsTotal.WithLabelValues("resume", "failure").Inc()
s.logger.Error("failed to resume container", "container_id", target, "error", err)
return errors.Wrap(errors.Internal, "failed to resume container", err)
}

inst.Status = domain.StatusRunning
if err := s.repo.Update(ctx, inst); err != nil {
return err
}

if err := s.auditSvc.Log(ctx, inst.UserID, "instance.resume", "instance", inst.ID.String(), map[string]interface{}{
"name": inst.Name,
}); err != nil {
s.logger.Warn("failed to log audit event", "action", "instance.resume", "instance_id", inst.ID, "error", err)
}

platform.InstanceOperationsTotal.WithLabelValues("resume", "success").Inc()
s.logger.Info("instance resumed", "instance_id", inst.ID)
return nil
}
poyrazK added 2 commits April 29, 2026 19:54
- ResumeInstance: roll back to RUNNING state if backend resume fails,
  preventing instances from being left in inconsistent PAUSED state
- PauseInstance: detect and return Conflict error when backend reports
  state violation (instance not in RUNNING state) instead of Internal
- Add sentinel errors ErrInstanceNotPausable and ErrInstanceNotResumable
  for state-based failures in libvirt adapter

Fixes review findings from PR #320
Copilot AI review requested due to automatic review settings April 29, 2026 17:34
- Add pause and resume endpoint documentation to api-reference.md
- Update FEATURES.md compute section with pause/resume and resilient compute details
- Add Pause/Resume row to Docker vs Libvirt comparison table
Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

- TestInstanceHandlerPause: success case
- TestInstanceHandlerPauseNotFound: 404 case
- TestInstanceHandlerPauseConflict: 409 case for wrong state
- TestInstanceHandlerResume: success case
- TestInstanceHandlerResumeNotFound: 404 case
- TestInstanceHandlerResumeConflict: 409 case for wrong state
@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
internal/platform/bulkhead.go (1)

9-11: ⚠️ Potential issue | 🟡 Minor

Update comments to match the new error contract.

ErrBulkheadFull is now only returned on bulkhead wait-timeout, but Lines 9-11 and 41-44 still say that context expiry also returns ErrBulkheadFull. Please align the comments with the current behavior to prevent misuse.

✏️ Suggested doc-only patch
-// ErrBulkheadFull is returned when the bulkhead's concurrency limit is reached
-// and the caller's timeout/context expires before a slot opens.
+// ErrBulkheadFull is returned when the bulkhead's concurrency limit is reached
+// and the bulkhead wait timeout expires before a slot opens.
 var ErrBulkheadFull = errors.New("bulkhead: concurrency limit reached")

 // Execute runs fn within the bulkhead's concurrency limit.
-// If the bulkhead is full and the wait timeout (or context) expires,
-// ErrBulkheadFull is returned without calling fn.
+// If the caller context is canceled or its deadline is exceeded while waiting,
+// the corresponding context error is returned.
+// If the bulkhead wait timeout expires first, ErrBulkheadFull is returned
+// without calling fn.
 func (b *Bulkhead) Execute(ctx context.Context, fn func() error) error {

Also applies to: 41-44

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/platform/bulkhead.go` around lines 9 - 11, Update the comment for
ErrBulkheadFull to reflect the new contract: state that ErrBulkheadFull is
returned only when the bulkhead's wait times out (wait-timeout) and not when the
caller's context is cancelled or expires; remove or change any text that says
context expiry/cancellation also returns ErrBulkheadFull (including the other
comment block referencing the same behavior). Ensure the comment clearly
distinguishes context cancellation errors from ErrBulkheadFull so callers won't
mistake context expiry for a wait-timeout.
♻️ Duplicate comments (1)
internal/core/services/instance.go (1)

661-664: ⚠️ Potential issue | 🟠 Major

Add compensating backend rollback when DB update fails after successful pause/resume.

After a successful backend transition, returning on repo.Update failure can leave runtime and DB state inconsistent. Use best-effort inverse backend action before returning the DB error.

Also applies to: 712-715

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/core/services/instance.go` around lines 661 - 664, After performing
the backend state transition (e.g., setting inst.Status = domain.StatusPaused /
domain.StatusRunning) but before returning when s.repo.Update(ctx, inst) fails,
invoke the inverse backend operation as a best-effort compensating rollback
(e.g., call the service method that undoes the pause when repo.Update fails
after pausing, or undoes the resume when repo.Update fails after resuming),
swallow and log any errors from that rollback, then return the original DB
error; apply the same pattern for the other identical block (around the
inst.Status assignment and s.repo.Update at the second location).
🧹 Nitpick comments (1)
compute-roadmap.md (1)

90-90: Optional: Address markdown formatting issues.

Static analysis detected minor formatting issues that should be fixed for better markdown rendering:

  1. Tables missing surrounding blank lines (lines 90, 118, 213, 266, 321, 387, 464, 527): Add blank lines before and after each table per MD058.
  2. Fenced code block missing language specifier (line 378): Add a language identifier (e.g., text or leave empty as ) to the code fence.

These changes improve rendering consistency across different markdown parsers.

Also applies to: 118-118, 213-213, 266-266, 321-321, 387-387, 464-464, 527-527, 378-378

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@compute-roadmap.md` at line 90, Add blank lines before and after each
markdown table (for example the table starting with "| Action | File |") at the
occurrences noted (lines flagged in the review) so each table is separated by a
blank line above and below per MD058, and update the fenced code block at the
flagged location (the fence around the snippet at the review’s line 378) to
include a language specifier such as ```text (or an explicit empty fence ```) to
satisfy the code-block rule; make these edits across the noted occurrences
(tables at the flagged lines and the single fenced block) to fix rendering.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@compute-roadmap.md`:
- Line 11: The roadmap incorrectly marks "PR 3 | Instance Pause/Resume" as "❌
NOT IMPLEMENTED"; update the status table entry for PR 3 to "✅ COMPLETED" and
rework the section currently titled "What needs to be built" into "What was
built" for PR 3: list the implemented artifacts (StatusPaused in InstanceStatus
enum, POST /instances/:id/pause and POST /instances/:id/resume endpoints,
service-layer validation for RUNNING↔PAUSED transitions, ComputeBackend
implementations for Docker/Libvirt/Firecracker/Noop with
ContainerPause/ContainerUnpause and DomainSuspend/DomainResume, audit events,
resume rollback behavior, and tests) following the same structure/pattern used
for PR 1 and PR 2.

In `@docs/FEATURES.md`:
- Line 35: The docs state uses STARTING but the domain model defines
INITIALIZING; update the transition line to use INITIALIZING (so the sequence
reads INITIALIZING → RUNNING → STOPPED / PAUSED → DELETED) to match the instance
status constants declared in the domain (e.g., the INITIALIZING/ RUNNING /
STOPPED / PAUSED / DELETED status constants in instance.go); ensure the term
exactly matches the domain constant name to avoid contract drift.

In `@docs/swagger/swagger.json`:
- Around line 3969-4020: The pause/resume instance endpoint docs (e.g., the
"/instances/{id}/pause" POST operation and the corresponding
"/instances/{id}/resume" operation) are missing authentication/authorization
failure responses; add 401 and 403 response entries to each protected
operation’s "responses" object (using the same schema ref
"#/definitions/httputil.Response") so generated clients/docs include
unauthorized and forbidden cases.

In `@internal/core/services/instance.go`:
- Around line 700-709: Before calling s.compute.ResumeInstance capture the
instance's current status (e.g., oldStatus := inst.Status) and on resume failure
restore inst.Status = oldStatus instead of forcibly setting
domain.StatusRunning; then call s.repo.Update(ctx, inst) and include both the
resume error and any repo error in s.logger.Error; update any related
logs/metrics (platform.InstanceOperationsTotal) accordingly and return the
wrapped error as before.

In `@internal/handlers/instance_handler.go`:
- Around line 254-257: The Swagger comment blocks above the instance handler(s)
are missing a 400 response even though the handler returns errors.InvalidInput
when the request id is empty; update the comment blocks that currently list
`@Success` 200, `@Failure` 404, `@Failure` 409, `@Failure` 500 to also include `@Failure`
400 {object} httputil.Response so the generated docs reflect the
errors.InvalidInput path (apply the same change to the other similar comment
block that validates id and returns errors.InvalidInput).

In `@internal/repositories/libvirt/adapter.go`:
- Around line 217-219: The resume path currently returns a plain fmt.Errorf when
the domain is not in domainStatePaused; replace that plain error with a wrapped
typed error using apierrors.ErrInstanceNotResumable (mirroring the pause path)
so callers can detect Conflict; e.g. change the return to wrap the message with
apierrors.ErrInstanceNotResumable (using the project’s usual Wrapf/Wrap pattern)
and include the same descriptive text and state value.

---

Outside diff comments:
In `@internal/platform/bulkhead.go`:
- Around line 9-11: Update the comment for ErrBulkheadFull to reflect the new
contract: state that ErrBulkheadFull is returned only when the bulkhead's wait
times out (wait-timeout) and not when the caller's context is cancelled or
expires; remove or change any text that says context expiry/cancellation also
returns ErrBulkheadFull (including the other comment block referencing the same
behavior). Ensure the comment clearly distinguishes context cancellation errors
from ErrBulkheadFull so callers won't mistake context expiry for a wait-timeout.

---

Duplicate comments:
In `@internal/core/services/instance.go`:
- Around line 661-664: After performing the backend state transition (e.g.,
setting inst.Status = domain.StatusPaused / domain.StatusRunning) but before
returning when s.repo.Update(ctx, inst) fails, invoke the inverse backend
operation as a best-effort compensating rollback (e.g., call the service method
that undoes the pause when repo.Update fails after pausing, or undoes the resume
when repo.Update fails after resuming), swallow and log any errors from that
rollback, then return the original DB error; apply the same pattern for the
other identical block (around the inst.Status assignment and s.repo.Update at
the second location).

---

Nitpick comments:
In `@compute-roadmap.md`:
- Line 90: Add blank lines before and after each markdown table (for example the
table starting with "| Action | File |") at the occurrences noted (lines flagged
in the review) so each table is separated by a blank line above and below per
MD058, and update the fenced code block at the flagged location (the fence
around the snippet at the review’s line 378) to include a language specifier
such as ```text (or an explicit empty fence ```) to satisfy the code-block rule;
make these edits across the noted occurrences (tables at the flagged lines and
the single fenced block) to fix rendering.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 19fda502-3ac2-4bf7-9a7a-57b157afdcca

📥 Commits

Reviewing files that changed from the base of the PR and between b11d04d and a999e74.

📒 Files selected for processing (32)
  • compute-roadmap.md
  • docs/FEATURES.md
  • docs/api-reference.md
  • docs/guides/libvirt-backend.md
  • docs/swagger/docs.go
  • docs/swagger/swagger.json
  • docs/swagger/swagger.yaml
  • internal/api/setup/router.go
  • internal/core/ports/compute.go
  • internal/core/ports/instance.go
  • internal/core/services/instance.go
  • internal/core/services/instance_unit_test.go
  • internal/core/services/mock_compute_test.go
  • internal/errors/errors.go
  • internal/handlers/instance_handler.go
  • internal/handlers/instance_handler_test.go
  • internal/platform/bulkhead.go
  • internal/platform/bulkhead_test.go
  • internal/platform/resilient_compute.go
  • internal/platform/resilient_compute_test.go
  • internal/repositories/docker/adapter.go
  • internal/repositories/docker/fakes_test.go
  • internal/repositories/firecracker/adapter.go
  • internal/repositories/firecracker/adapter_noop.go
  • internal/repositories/k8s/kubeadm_provisioner_test.go
  • internal/repositories/k8s/mocks_test.go
  • internal/repositories/libvirt/adapter.go
  • internal/repositories/libvirt/lb_proxy_test.go
  • internal/repositories/noop/adapters.go
  • internal/workers/database_failover_worker_test.go
  • internal/workers/healing_worker_test.go
  • internal/workers/pipeline_worker_test.go
✅ Files skipped from review due to trivial changes (7)
  • internal/workers/pipeline_worker_test.go
  • internal/errors/errors.go
  • internal/platform/resilient_compute_test.go
  • docs/guides/libvirt-backend.md
  • internal/platform/bulkhead_test.go
  • docs/api-reference.md
  • internal/repositories/k8s/kubeadm_provisioner_test.go
🚧 Files skipped from review as they are similar to previous changes (10)
  • internal/repositories/docker/fakes_test.go
  • internal/repositories/libvirt/lb_proxy_test.go
  • internal/api/setup/router.go
  • internal/workers/database_failover_worker_test.go
  • internal/repositories/firecracker/adapter.go
  • internal/platform/resilient_compute.go
  • internal/repositories/firecracker/adapter_noop.go
  • internal/repositories/docker/adapter.go
  • internal/core/services/mock_compute_test.go
  • internal/core/services/instance_unit_test.go

- compute-roadmap.md: update PR 3 status to COMPLETED with implemented artifacts
- docs/FEATURES.md: fix instance state name from STARTING to INITIALIZING
- docs/swagger: add 400/401/403 responses to pause/resume endpoints
- internal/handlers/instance_handler.go: add 400/401/403 to swagger comments
- internal/core/services/instance.go:
  - resume rollback now preserves old status instead of hardcoding RUNNING
  - add compensating rollback when repo.Update fails after pause/resume
  - rollback now attempts both repo Update (to restore status) and backend undo
- internal/platform/bulkhead.go: clarify ErrBulkheadFull only on wait timeout
- internal/repositories/libvirt/adapter.go: add ErrInstanceNotResumable to resume path
- instance_unit_test.go: add RepoError_Rollback test cases for pause and resume

Fixes all code review findings.
Copilot AI review requested due to automatic review settings April 29, 2026 18:51

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 5

♻️ Duplicate comments (3)
internal/platform/bulkhead.go (1)

27-39: ⚠️ Potential issue | 🟠 Major

Wait-timeout contract is still inconsistent with constructor behavior.

Line 27 says WaitTimeout defaults to 5s, but Lines 31-39 do not apply that default, so zero means “wait on context only.” This mismatch can still cause unbounded waits when callers use a non-deadline context.

♻️ Proposed fix
+const (
+	defaultBulkheadMaxConc     = 10
+	defaultBulkheadWaitTimeout = 5 * time.Second
+)
+
 type BulkheadOpts struct {
 	Name        string        // Identifier for logging/metrics.
-	MaxConc     int           // Maximum concurrent requests. Default 10.
-	WaitTimeout time.Duration // How long to wait for a slot. Default 5s. 0 means use context deadline.
+	MaxConc     int           // Maximum concurrent requests. Default 10.
+	WaitTimeout time.Duration // How long to wait for a slot. Default 5s.
 }
 
 // NewBulkhead creates a new concurrency-limiting bulkhead.
 func NewBulkhead(opts BulkheadOpts) *Bulkhead {
 	if opts.MaxConc <= 0 {
-		opts.MaxConc = 10
+		opts.MaxConc = defaultBulkheadMaxConc
+	}
+	if opts.WaitTimeout <= 0 {
+		opts.WaitTimeout = defaultBulkheadWaitTimeout
 	}
 	return &Bulkhead{
 		name:    opts.Name,
 		sem:     make(chan struct{}, opts.MaxConc),
 		timeout: opts.WaitTimeout,
 	}
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/platform/bulkhead.go` around lines 27 - 39, The WaitTimeout field on
BulkheadOpts is documented to default to 5s but NewBulkhead currently treats
zero as “use context deadline”; update NewBulkhead to apply the documented
default: if opts.WaitTimeout == 0 set it to 5*time.Second before assigning to
the Bulkhead.timeout field (and ensure the comment on WaitTimeout matches this
behavior); reference the WaitTimeout field on BulkheadOpts, the NewBulkhead
function, and the Bulkhead.timeout field when making the change.
internal/core/services/instance_unit_test.go (1)

304-304: ⚠️ Potential issue | 🟡 Minor

Keep these GetByName expectations required.

TerminateInstance and Exec still try GetByName before falling back to GetByID for UUID inputs. Leaving these expectations optional lets that resolution path disappear without failing the tests.

Also applies to: 345-356, 676-676, 731-731, 781-781, 836-836, 896-896, 956-956

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/core/services/instance_unit_test.go` at line 304, The mock
expectation for repo.On("GetByName", mock.Anything, instanceID.String()) is
marked optional with .Maybe(), which allows the name-resolution path to
disappear and hide regressions; change those expectations to required (remove
.Maybe() and use the default required expectation or call .Once()/.Times(1)) so
TerminateInstance and Exec tests must exercise GetByName before they fall back
to GetByID; update the same pattern at the other occurrences that reference
repo.On("GetByName", mock.Anything, ...) in the test (lines noted in the review)
to ensure the name lookup path is enforced.
internal/core/services/instance.go (1)

711-721: ⚠️ Potential issue | 🟠 Major

Don't collapse backend resume conflicts into 500s.

ComputeBackend.ResumeInstance can return a Conflict-typed error such as internal/errors.ErrInstanceNotResumable. This branch always wraps Internal, so /instances/{id}/resume returns 500 instead of the documented 409.

🧩 Suggested fix
 	if err := s.compute.ResumeInstance(ctx, target); err != nil {
 		platform.InstanceOperationsTotal.WithLabelValues("resume", "failure").Inc()
+		if errors.Is(err, errors.Conflict) {
+			s.logger.Warn("resume not possible in current state", "container_id", target, "error", err)
+			return errors.New(errors.Conflict, err.Error())
+		}
 		oldStatus := inst.Status
 		s.logger.Error("failed to resume container, rolling back",
 			"container_id", target, "instance_id", inst.ID, "old_status", oldStatus, "error", err)
 		inst.Status = oldStatus

Please add a unit case for a Conflict returned from ResumeInstance so this stays covered.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/core/services/instance.go` around lines 711 - 721, The resume error
handling currently always wraps errors.Internal, collapsing conflict errors into
500s; change the s.compute.ResumeInstance error branch to detect conflict-type
errors (e.g., internal/errors.ErrInstanceNotResumable or using errors.Is/As) and
handle them specially: increment the platform.InstanceOperationsTotal metric
with "resume","conflict", log a clear conflict message including "container_id"
and "instance_id", set inst.Status back to oldStatus and attempt repo.Update as
before, but return a wrapped conflict error (not errors.Internal). Also add a
unit test case for s.compute.ResumeInstance returning the conflict error to
assert a 409-equivalent response and the "conflict" metric increment.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@compute-roadmap.md`:
- Around line 90-98: Several tables in the markdown are missing surrounding
blank lines which triggers MD058; locate each table starting with the header row
"| Action | File |" (and other similar pipe-delimited tables) and add a single
blank line before and after each table block (e.g., the tables near the shown
ranges and any other occurrences), ensuring there is an empty line above the
header row and an empty line after the final table row so markdownlint MD058 is
satisfied.
- Around line 24-50: The PR summary is inconsistent: the top section claims
ImportImage and six methods (including ImportImage) are implemented, but the
"What exists" list shows only five methods and ImageHandler with five endpoints
while "What needs to be built" still lists ImportImage; align the document by
choosing the true current state and updating the lists accordingly—either remove
ImportImage from the initial "All features implemented" list or update "What
exists" to include ImportImage across domain.Image, ports.ImageService (include
ImportImage), internal/core/services/image.go (ImageService implementation
containing ImportImage), internal/handlers/image_handler.go (ImageHandler with
POST /api/v1/images/import), and tests (TestImageService_Unit); also ensure RBAC
and router notes (PermissionImageCreate, PermissionImageRead*, router wiring)
are consistent with the chosen state.
- Around line 334-340: The fenced code block using triple backticks in the
Markdown (the block that starts with ``` and contains the flow "LaunchInstance →
enqueue job → return 202" through "Client connects to
/ws/instances/{id}/events") needs a language identifier to satisfy markdownlint
MD040; update the opening fence from ``` to ```text (or another appropriate
language) so the block becomes a labeled fenced code block.

In `@internal/core/services/instance_unit_test.go`:
- Around line 982-1155: The test testInstanceServicePauseResumeUnit reuses the
same mock objects across subtests which accumulates expectations; refactor into
a table-driven test where each table entry describes the case (name, initial
instance state, expected compute/repo/rbac/audit behaviors and expected
error/state), and inside the loop create fresh mocks (new(MockInstanceRepo),
new(MockComputeBackend), new(MockRBACService), new(MockAuditService)), construct
svc with those mocks, set up per-case repo.On/compute.On/rbacSvc.On/auditSvc.On
expectations (using Once/Maybe as appropriate), call svc.PauseInstance or
svc.ResumeInstance, and assert results; ensure each case sets its own instance
value (domain.Instance) and that mock.AssertExpectationsForObjects is called for
the mocks created in that iteration to guarantee isolation and avoid cross-test
expectation leakage.

In `@internal/core/services/instance.go`:
- Around line 663-673: The repo update rollback currently restores inst.Status
(oldStatus) before attempting the compute backend compensating call, which can
leave DB and backend inconsistent; change the order in both rollback branches so
you first call the compensating backend method (use
s.compute.ResumeInstance(ctx, target) in the pause-failure path and
s.compute.PauseInstance(ctx, target) in the resume-failure path), check/log any
backend error, and only after a successful or best-effort backend call attempt
to write the old status back via s.repo.Update(ctx, inst) (use the existing
error logging pattern for repo update failures); apply the same ordering change
to the symmetric rollback block that uses s.compute.PauseInstance and
s.repo.Update.
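The ordering change asked for above can be modeled minimally: when the repo update fails after a successful backend pause, undo the backend operation first, then best-effort restore the old status. All types here are hypothetical stand-ins for the service's real repo and compute backend.

```go
package main

import (
	"errors"
	"fmt"
)

type instance struct{ Status string }

type backend struct{ paused bool }

func (b *backend) Pause() error  { b.paused = true; return nil }
func (b *backend) Resume() error { b.paused = false; return nil }

type repo struct{ failNext bool }

func (r *repo) Update(inst *instance) error {
	if r.failNext {
		r.failNext = false
		return errors.New("repo: update failed")
	}
	return nil
}

func pauseInstance(b *backend, r *repo, inst *instance) error {
	oldStatus := inst.Status
	if err := b.Pause(); err != nil {
		return err
	}
	inst.Status = "PAUSED"
	if err := r.Update(inst); err != nil {
		// Compensate on the backend FIRST, so the real system and the DB
		// cannot diverge if the restore write below also fails.
		if undoErr := b.Resume(); undoErr != nil {
			fmt.Println("rollback: backend resume failed:", undoErr)
		}
		// Then best-effort restore of the old status in the repo.
		inst.Status = oldStatus
		if repoErr := r.Update(inst); repoErr != nil {
			fmt.Println("rollback: repo restore failed:", repoErr)
		}
		return err
	}
	return nil
}

func main() {
	b, r := &backend{}, &repo{failNext: true}
	inst := &instance{Status: "RUNNING"}
	err := pauseInstance(b, r, inst)
	fmt.Println(err, b.paused, inst.Status)
}
```

The symmetric resume-failure path swaps the compensating call (`Pause` instead of `Resume`) but keeps the same ordering: backend undo before the status write.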


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 66a88b06-6736-4d62-b629-2e550acf8a50

📥 Commits

Reviewing files that changed from the base of the PR and between a999e74 and 5010b8e.

📒 Files selected for processing (11)
  • compute-roadmap.md
  • docs/FEATURES.md
  • docs/swagger/docs.go
  • docs/swagger/swagger.json
  • docs/swagger/swagger.yaml
  • internal/core/services/instance.go
  • internal/core/services/instance_unit_test.go
  • internal/handlers/instance_handler.go
  • internal/handlers/instance_handler_test.go
  • internal/platform/bulkhead.go
  • internal/repositories/libvirt/adapter.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • docs/FEATURES.md
  • internal/handlers/instance_handler_test.go
  • docs/swagger/swagger.yaml


Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

poyrazK added 3 commits April 30, 2026 12:26
- bulkhead: apply 5s default WaitTimeout when zero (matches docs)
- instance: add conflict detection in ResumeInstance (matches PauseInstance)
- instance: fix rollback order - compensating backend call first, then repo
- instance_unit_test: refactor pause/resume tests to table-driven with fresh
  mocks per case; add ResumeInstance_ConflictError test case
- docker adapter: fix indentation regression in ContainerUpdate interface method
- libvirt adapter: improve error messages with human-readable state names
  instead of raw numeric values (e.g., "domain is PAUSED" not "domain is 3")
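The state-name improvement in the last commit can be sketched as a simple lookup. The numeric values below mirror libvirt's `virDomainState` enum (e.g. `VIR_DOMAIN_PAUSED == 3`); the map and `stateName` helper are illustrative, not the adapter's actual code.

```go
package main

import "fmt"

// domainStateNames maps libvirt's virDomainState values to readable names.
var domainStateNames = map[int]string{
	0: "NOSTATE",
	1: "RUNNING",
	2: "BLOCKED",
	3: "PAUSED",
	4: "SHUTDOWN",
	5: "SHUTOFF",
	6: "CRASHED",
	7: "PMSUSPENDED",
}

// stateName returns a human-readable name for a domain state, falling back
// to the raw value for states this table does not know about.
func stateName(s int) string {
	if name, ok := domainStateNames[s]; ok {
		return name
	}
	return fmt.Sprintf("UNKNOWN(%d)", s)
}

func main() {
	// "domain is PAUSED" instead of "domain is 3"
	fmt.Printf("cannot pause: domain is %s\n", stateName(3))
}
```

Error messages built this way stay meaningful in logs even for operators who do not have the libvirt enum memorized.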
Copilot AI review requested due to automatic review settings April 30, 2026 10:03
Fixes lint failure:
  internal/core/services/instance_unit_test.go:1012:12: test helper
  function should start from t.Helper() (thelper)

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

poyrazK added 4 commits April 30, 2026 13:35
Fixes lint failure across all 9 assert closures in the pause/resume
table-driven test cases (lines 1012, 1028, 1047, 1071, 1095, 1111, 1131, 1148, 1172).
After merging main, testComputeBackend was missing the new interface methods.
Copilot AI review requested due to automatic review settings April 30, 2026 11:08

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

The previous log said "failed to resume container, rolling back"
which was misleading since no backend rollback occurs when
ResumeInstance fails - the container stays PAUSED. Updated to
"failed to resume container, instance left in PAUSED state".

@poyrazK poyrazK left a comment


It's ready to merge

@poyrazK poyrazK merged commit b4819c0 into main Apr 30, 2026
22 checks passed
