
feat(instances): add pause and resume functionality#320

Merged
poyrazK merged 32 commits into main from feat/instance-pause-resume
Apr 30, 2026

Conversation


@poyrazK poyrazK commented Apr 28, 2026

Summary

  • Add POST /instances/:id/pause and POST /instances/:id/resume API endpoints
  • Implement pause/resume for Docker (ContainerPause/ContainerUnpause) and Libvirt (DomainSuspend/DomainResume) backends
  • Add StatusPaused to instance status enum
  • Add state validation (RUNNING→PAUSED, PAUSED→RUNNING)
  • Add RBAC authorization with PermissionInstanceUpdate
  • Add ResilientCompute wrapper with circuit breaker and bulkhead patterns
  • Add comprehensive test coverage
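The state validation bullet above (RUNNING→PAUSED, PAUSED→RUNNING) can be sketched as follows. This is a minimal, hypothetical illustration with simplified names; the real service uses its domain status enum and errors package rather than the ones shown here.

```go
package main

import (
	"errors"
	"fmt"
)

// Status stands in for the instance status enum; StatusPaused is the new state.
type Status string

const (
	StatusRunning Status = "RUNNING"
	StatusPaused  Status = "PAUSED"
)

var ErrInvalidTransition = errors.New("invalid state transition")

// validatePauseTransition allows only RUNNING->PAUSED and PAUSED->RUNNING,
// mirroring the transitions the PR enforces before touching the backend.
func validatePauseTransition(from, to Status) error {
	ok := (from == StatusRunning && to == StatusPaused) ||
		(from == StatusPaused && to == StatusRunning)
	if !ok {
		return fmt.Errorf("%w: %s -> %s", ErrInvalidTransition, from, to)
	}
	return nil
}

func main() {
	fmt.Println(validatePauseTransition(StatusRunning, StatusPaused)) // <nil>
	fmt.Println(validatePauseTransition(StatusPaused, StatusPaused))  // invalid state transition: PAUSED -> PAUSED
}
```

Rejecting the transition before calling the compute backend is what lets the handlers map the failure to a 409 Conflict rather than a 500.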

Test plan

  • go build ./... - compiles
  • go test ./internal/platform/... -v - platform tests pass
  • go test ./internal/repositories/libvirt/... -v - libvirt tests pass
  • go test ./internal/repositories/docker/... -v - docker tests pass
  • go test ./internal/handlers/... -v - handler tests pass
  • go vet ./... - no issues

Commits (11 total)

  1. feat(instances): add StatusPaused and pause/resume to interfaces
  2. feat(docker): implement PauseInstance and ResumeInstance
  3. feat(libvirt): implement PauseInstance and ResumeInstance via DomainSuspend/DomainResume
  4. feat(stubs): add PauseInstance, ResumeInstance, and ResizeInstance to Firecracker and Noop backends
  5. feat(services): implement PauseInstance and ResumeInstance in InstanceService
  6. feat(handlers): add POST /instances/:id/pause and /resume endpoints
  7. feat(router): wire /instances/:id/pause and /resume routes
  8. feat(platform): add Bulkhead concurrency limiter and enhance CircuitBreaker
  9. feat(platform): add ResilientCompute wrapper with circuit breaker and bulkhead
  10. test(mocks): add PauseInstance and ResumeInstance to K8s and libvirt mocks
  11. test(workers): add missing interface methods to worker test mocks

Summary by CodeRabbit

  • New Features

    • Added instance pause/resume endpoints, including a PAUSED lifecycle state and validated transitions; backends updated to support pause/resume where available.
  • Documentation

    • Updated API docs, feature guide, and roadmap to describe pause/resume behavior and backend differences.
  • Bug Fixes

    • Resilience layer now returns context cancellation/deadline errors (not the wait-timeout error) when appropriate.
  • Tests

    • Expanded tests and mocks to cover pause/resume flows, errors, and rollback paths.

poyrazK added 11 commits April 28, 2026 14:32
Add PAUSED status to InstanceStatus enum and extend both
ComputeBackend and InstanceService interfaces with PauseInstance
and ResumeInstance methods.
Use ContainerPause and ContainerUnpause Docker API calls to
freeze and unfreeze containers. Add ContainerUpdate for
ResizeInstance support.
…uspend/DomainResume

Add DomainSuspend and DomainResume to LibvirtClient interface and
implement them in RealLibvirtClient using virDomainSuspend and
virDomainResume libvirt calls.
… Firecracker and Noop backends

Firecracker returns nil (not supported). Noop returns nil (success).
These are placeholder implementations to satisfy the interface.
…eService

Add RBAC authorization with PermissionInstanceUpdate, validate state
transitions (RUNNING->PAUSED, PAUSED->RUNNING), update database record
with new status, and emit audit log events for pause/resume actions.
Wire Pause and Resume HTTP handlers with RBAC authorization,
swagger documentation, and corresponding handler tests.
Add POST routes for pause and resume actions to the instance
group with RBAC authorization middleware.
…reaker

Add Bulkhead type with semaphore-based concurrency limiting to prevent
cascading failures. Enhance CircuitBreaker with NewCircuitBreakerWithOpts,
StateChangeFunc callback, and improved half-open state handling.
… bulkhead

Wrap ComputeBackend with circuit breaker and bulkhead patterns for
resilience. Includes timeout-aware call protection and comprehensive
test coverage for circuit tripping, bulkhead limits, and timeouts.
…mocks

Update mock implementations to satisfy the updated ComputeBackend
interface with pause/resume methods.
Update mock compute backends in worker tests to implement
PauseInstance, ResumeInstance, and ResizeInstance to satisfy
the updated ports.ComputeBackend interface.
Copilot AI review requested due to automatic review settings April 28, 2026 11:36
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 28, 2026

Warning

Rate limit exceeded

@poyrazK has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 45 minutes and 1 second before requesting another review.

To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 45e4a828-549d-4f1d-8006-beb86482d3c6

📥 Commits

Reviewing files that changed from the base of the PR and between 5010b8e and a073124.

📒 Files selected for processing (7)
  • internal/core/services/function_internal_test.go
  • internal/core/services/instance.go
  • internal/core/services/instance_unit_test.go
  • internal/platform/bulkhead.go
  • internal/repositories/docker/adapter.go
  • internal/repositories/libvirt/adapter.go
  • internal/workers/healing_worker_test.go
📝 Walkthrough

Walkthrough

Adds PAUSED instance state and pause/resume lifecycle across stack: API routes and handlers, service logic with validation/audit/metrics, compute/backend interface extensions and implementations (Docker, Libvirt, Firecracker, noop), resilient/bulkhead plumbing, error definitions, and tests/docs updates.

Changes

Cohort / File(s) Summary
Domain & Errors
internal/core/domain/instance.go, internal/errors/errors.go
Added StatusPaused ("PAUSED") and conflict errors ErrInstanceNotPausable, ErrInstanceNotResumable.
API & Handlers
internal/api/setup/router.go, internal/handlers/instance_handler.go, internal/handlers/instance_handler_test.go
Registered POST /instances/:id/pause and /resume; added handler methods and handler tests covering success, not-found, and conflict cases.
Service Layer
internal/core/ports/instance.go, internal/core/services/instance.go, internal/core/services/instance_unit_test.go, internal/core/services/mock_compute_test.go
Extended InstanceService interface; implemented PauseInstance/ResumeInstance with RBAC, state validation, compute delegation, DB updates, audit logging, rollback on repo failures, metrics, and unit tests including rollback paths.
Compute Ports & Resilience
internal/core/ports/compute.go, internal/platform/resilient_compute.go, internal/platform/resilient_compute_test.go
Extended ComputeBackend interface; added resilient PauseInstance/ResumeInstance wrappers that run under bulkhead/circuit-breaker/timeouts; tests updated.
Backends: Docker / Firecracker / Libvirt / Noop
internal/repositories/docker/adapter.go, internal/repositories/docker/fakes_test.go, internal/repositories/firecracker/..., internal/repositories/libvirt/..., internal/repositories/noop/adapters.go
Added pause/resume methods across adapters: Docker delegates to ContainerPause/Unpause; Libvirt implements DomainSuspend/DomainResume with state checks; Firecracker no-op/stub implementations; noop backend returns success; corresponding test fakes/mocks added.
Libvirt Client & Mocks
internal/repositories/libvirt/libvirt_client.go, internal/repositories/libvirt/mock_client_test.go, internal/repositories/libvirt/real_client.go
Added DomainSuspend/DomainResume to libvirt client interface; real and mock client implementations added.
Platform Bulkhead behavior
internal/platform/bulkhead.go, internal/platform/bulkhead_test.go
Changed acquire semantics to return ctx.Err() on context cancellation (reserve ErrBulkheadFull for wait-timeout only); updated test expectation.
Tests & Mocks Across Repo
internal/repositories/k8s/..., internal/repositories/libvirt/..., internal/workers/..._test.go, internal/workers/...
Extended many test-local mocks and fakes to include PauseInstance/ResumeInstance stubs enabling broader test coverage.
Docs & API Spec
docs/api-reference.md, docs/FEATURES.md, docs/guides/libvirt-backend.md, docs/swagger/..., compute-roadmap.md
Documented new endpoints, updated InstanceStatus enum to include PAUSED, added feature/docs entries and Swagger/OpenAPI templates and generated specs.
Misc.
internal/core/ports/..., internal/repositories/...
Minor formatting/refactor in a few files to align signatures; many files extended consistently to support pause/resume functionality.

Sequence Diagram

sequenceDiagram
    participant Client as Client
    participant API as API Router/Handler
    participant Svc as Instance Service
    participant Backend as Compute Backend
    participant DB as Database
    participant Audit as Audit Logger

    Client->>API: POST /instances/:id/pause
    API->>Svc: PauseInstance(ctx, id)
    Svc->>Svc: RBAC + Load instance (repo)
    Svc->>Backend: PauseInstance(ctx, targetID)
    Backend-->>Svc: success / error
    alt success
        Svc->>DB: Update instance status -> PAUSED
        DB-->>Svc: success
        Svc->>Audit: Write audit (instance.pause)
        Svc-->>API: success
        API-->>Client: 200 "instance paused"
    else backend conflict
        Svc-->>API: error (Conflict)
        API-->>Client: 409
    else db update fails
        Svc->>Backend: ResumeInstance(ctx, targetID)  // best-effort rollback
        Svc-->>API: error
        API-->>Client: 500
    end

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 I paused my hop, then held my breath,
Buttons pressed to cheat quick death,
From Docker sigh to libvirt sleep,
State frozen safe, then wake from keep.
A little hop, then onward run—compute's new trick is done!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 21.21%, which is below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
✅ Passed checks (4 passed)
  • Description Check (✅ Passed): Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): The title 'feat(instances): add pause and resume functionality' accurately and concisely describes the main change: adding pause and resume capabilities to instances across the codebase.
  • Linked Issues Check (✅ Passed): Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes Check (✅ Passed): Check skipped because no linked issues were found for this pull request.


Copilot AI left a comment


Pull request overview

Adds pause/resume lifecycle support for instances across API, services, and compute backends, and introduces a resilience wrapper (circuit breaker + bulkhead + timeouts) for compute operations.

Changes:

  • Add POST /instances/:id/pause and POST /instances/:id/resume endpoints, routed with RBAC PermissionInstanceUpdate.
  • Implement pause/resume in service layer + Docker/Libvirt backends and extend ports/status enums.
  • Introduce platform.ResilientCompute, plus new/updated circuit breaker and bulkhead primitives (with tests).

Reviewed changes

Copilot reviewed 27 out of 27 changed files in this pull request and generated 12 comments.

Show a summary per file
File Description
internal/workers/pipeline_worker_test.go Extends worker test mock backend with new interface methods.
internal/workers/healing_worker_test.go Extends worker test service mock with pause/resume methods.
internal/workers/database_failover_worker_test.go Extends worker test mock backend with pause/resume/resize methods.
internal/repositories/noop/adapters.go Adds no-op pause/resume/resize to satisfy updated compute interface.
internal/repositories/libvirt/real_client.go Adds context-aware DomainSuspend/DomainResume wrappers.
internal/repositories/libvirt/mock_client_test.go Extends libvirt mock client with suspend/resume methods.
internal/repositories/libvirt/libvirt_client.go Extends LibvirtClient interface with suspend/resume.
internal/repositories/libvirt/lb_proxy_test.go Updates libvirt LB proxy test mock to satisfy interface.
internal/repositories/libvirt/adapter.go Implements Pause/Resume/Resize for libvirt backend.
internal/repositories/k8s/mocks_test.go Extends k8s test mock service with pause/resume.
internal/repositories/k8s/kubeadm_provisioner_test.go Extends kubeadm test mock service with pause/resume.
internal/repositories/firecracker/adapter_noop.go Adds unsupported-method stubs for pause/resume/resize.
internal/repositories/firecracker/adapter.go Adds pause/resume/resize methods (currently returning nil).
internal/repositories/docker/fakes_test.go Extends docker fake client for pause/unpause + container update.
internal/repositories/docker/adapter.go Implements Docker pause/resume and container resource update (resize).
internal/platform/resilient_compute.go New compute decorator adding bulkhead/CB/timeouts to backend calls.
internal/platform/resilient_compute_test.go Tests resilient compute passthrough, CB, bulkhead, timeouts, unwrap.
internal/platform/circuit_breaker.go Enhances circuit breaker (opts, state strings, half-open single-flight).
internal/platform/bulkhead.go New bulkhead (semaphore) concurrency limiter with wait timeouts.
internal/handlers/instance_handler_test.go Extends handler test mock service with pause/resume methods.
internal/handlers/instance_handler.go Adds pause/resume HTTP handlers + swagger annotations.
internal/core/services/mock_compute_test.go Extends service mocks to satisfy updated interfaces.
internal/core/services/instance.go Adds PauseInstance/ResumeInstance with RBAC + state validation + audit/metrics.
internal/core/ports/instance.go Extends InstanceService interface with pause/resume.
internal/core/ports/compute.go Extends ComputeBackend with pause/resume and resize.
internal/core/domain/instance.go Adds StatusPaused to instance status enum.
internal/api/setup/router.go Wires pause/resume routes under /instances/:id.


Ping(ctx context.Context) error
// Type returns a string identifier of the backend (e.g., "docker", "kvm").
Type() string
// ResizeInstance updates the CPU and memory limits of a running or stopped instance.

Copilot AI Apr 28, 2026


ResizeInstance's signature uses raw int64s but the interface comment doesn't define units. In this PR, ResilientCompute/Docker expect NanoCPUs + bytes while Libvirt converts cpu/1e9 to vCPU count. Please clarify the contract in the interface doc comment (e.g., cpu in NanoCPUs, memory in bytes) to avoid backend-specific interpretations.

Suggested change
- // ResizeInstance updates the CPU and memory limits of a running or stopped instance.
+ // ResizeInstance updates the CPU and memory limits of a running or stopped instance.
+ // cpu is specified in NanoCPUs (1 vCPU = 1e9 NanoCPUs), and memory is specified in bytes.

Copilot uses AI. Check for mistakes.
Comment on lines +156 to +163
	return nil // Firecracker does not support pause/resume
}

func (a *FirecrackerAdapter) ResumeInstance(ctx context.Context, id string) error {
	if a.cfg.MockMode {
		return nil
	}
	return nil // Firecracker does not support pause/resume

Copilot AI Apr 28, 2026


ResumeInstance returns nil even when Firecracker pause/resume isn't supported, which can lead to incorrect instance status transitions (PAUSED→RUNNING) without any backend change. Return a not-supported error when not in MockMode.

Suggested change
- 	return nil // Firecracker does not support pause/resume
- }
-
- func (a *FirecrackerAdapter) ResumeInstance(ctx context.Context, id string) error {
- 	if a.cfg.MockMode {
- 		return nil
- 	}
- 	return nil // Firecracker does not support pause/resume
+ 	return fmt.Errorf("pause instance %s: firecracker does not support pause/resume", id)
+ }
+
+ func (a *FirecrackerAdapter) ResumeInstance(ctx context.Context, id string) error {
+ 	if a.cfg.MockMode {
+ 		return nil
+ 	}
+ 	return fmt.Errorf("resume instance %s: firecracker does not support pause/resume", id)

Comment on lines +646 to +647
s.logger.Error("failed to pause container", "container_id", target, "error", err)
return errors.Wrap(errors.Internal, "failed to pause container", err)

Copilot AI Apr 28, 2026


These error/log messages hardcode "container" (and container_id) for pause/resume failures, but the backend can be libvirt where the target is a domain name. Consider using backend-neutral wording ("instance"/"resource") to keep logs accurate across backends.

Suggested change
- s.logger.Error("failed to pause container", "container_id", target, "error", err)
- return errors.Wrap(errors.Internal, "failed to pause container", err)
+ s.logger.Error("failed to pause instance", "target", target, "error", err)
+ return errors.Wrap(errors.Internal, "failed to pause instance", err)

Comment on lines +219 to +221
// Modify memory (in KiB) and vCPU in the XML
newDOMXML := a.applyDomainResize(domXML, int(memory/1024), int(cpu/1e9))


Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Libvirt vCPU count is derived as int(cpu/1e9). If callers provide fractional CPU (e.g., 0.5 core => 5e8 NanoCPUs), this becomes 0 and produces invalid domain XML. Validate inputs (cpu/memory > 0) and define a rounding strategy (e.g., ceil to at least 1 vCPU) or reject unsupported fractional values with a clear error.

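The rounding strategy this comment asks for could look like the sketch below: reject non-positive input and round fractional NanoCPU requests up so they never produce an invalid 0 vCPU count. The function name and behavior are a suggestion, not the code in this PR.

```go
package main

import (
	"errors"
	"fmt"
)

const nanoPerCPU = int64(1_000_000_000) // 1 vCPU expressed in NanoCPUs

// vcpusFromNano converts NanoCPUs to a whole vCPU count for libvirt,
// rejecting non-positive input and rounding up (ceil) so a 0.5-core
// request becomes 1 vCPU instead of an invalid 0.
func vcpusFromNano(cpuNano int64) (int, error) {
	if cpuNano <= 0 {
		return 0, errors.New("cpu must be > 0 NanoCPUs")
	}
	return int((cpuNano + nanoPerCPU - 1) / nanoPerCPU), nil
}

func main() {
	v, _ := vcpusFromNano(500_000_000) // 0.5 core
	fmt.Println(v)                     // 1
	v, _ = vcpusFromNano(2 * nanoPerCPU)
	fmt.Println(v) // 2
}
```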
Comment on lines +241 to +250
// applyDomainResize updates vCPU and memory in domain XML.
func (a *LibvirtAdapter) applyDomainResize(xml string, memoryKiB, vcpus int) string {
	// Replace memory allocation
	xml = regexp.MustCompile(`<memory unit="KiB">\d+</memory>`).
		ReplaceAllString(xml, fmt.Sprintf(`<memory unit="KiB">%d</memory>`, memoryKiB))
	xml = regexp.MustCompile(`<currentMemory unit="KiB">\d+</currentMemory>`).
		ReplaceAllString(xml, fmt.Sprintf(`<currentMemory unit="KiB">%d</currentMemory>`, memoryKiB))
	// Replace vCPU count
	xml = regexp.MustCompile(`<vcpu[^>]*>\d+</vcpu>`).
		ReplaceAllString(xml, fmt.Sprintf(`<vcpu>%d</vcpu>`, vcpus))

Copilot AI Apr 28, 2026


applyDomainResize recompiles regexps on every call via regexp.MustCompile. Since ResizeInstance can be called frequently, consider precompiling these regexps once (package-level vars) to avoid repeated compilation overhead.

Suggested change
- // applyDomainResize updates vCPU and memory in domain XML.
- func (a *LibvirtAdapter) applyDomainResize(xml string, memoryKiB, vcpus int) string {
- 	// Replace memory allocation
- 	xml = regexp.MustCompile(`<memory unit="KiB">\d+</memory>`).
- 		ReplaceAllString(xml, fmt.Sprintf(`<memory unit="KiB">%d</memory>`, memoryKiB))
- 	xml = regexp.MustCompile(`<currentMemory unit="KiB">\d+</currentMemory>`).
- 		ReplaceAllString(xml, fmt.Sprintf(`<currentMemory unit="KiB">%d</currentMemory>`, memoryKiB))
- 	// Replace vCPU count
- 	xml = regexp.MustCompile(`<vcpu[^>]*>\d+</vcpu>`).
- 		ReplaceAllString(xml, fmt.Sprintf(`<vcpu>%d</vcpu>`, vcpus))
+ var (
+ 	domainMemoryXMLRegexp        = regexp.MustCompile(`<memory unit="KiB">\d+</memory>`)
+ 	domainCurrentMemoryXMLRegexp = regexp.MustCompile(`<currentMemory unit="KiB">\d+</currentMemory>`)
+ 	domainVCPUXMLRegexp          = regexp.MustCompile(`<vcpu[^>]*>\d+</vcpu>`)
+ )
+
+ // applyDomainResize updates vCPU and memory in domain XML.
+ func (a *LibvirtAdapter) applyDomainResize(xml string, memoryKiB, vcpus int) string {
+ 	// Replace memory allocation
+ 	xml = domainMemoryXMLRegexp.ReplaceAllString(
+ 		xml,
+ 		fmt.Sprintf(`<memory unit="KiB">%d</memory>`, memoryKiB),
+ 	)
+ 	xml = domainCurrentMemoryXMLRegexp.ReplaceAllString(
+ 		xml,
+ 		fmt.Sprintf(`<currentMemory unit="KiB">%d</currentMemory>`, memoryKiB),
+ 	)
+ 	// Replace vCPU count
+ 	xml = domainVCPUXMLRegexp.ReplaceAllString(
+ 		xml,
+ 		fmt.Sprintf(`<vcpu>%d</vcpu>`, vcpus),
+ 	)

Comment thread internal/repositories/docker/adapter.go Outdated
Comment on lines +381 to +410
func (a *DockerAdapter) PauseInstance(ctx context.Context, name string) error {
	if err := a.cli.ContainerPause(ctx, name); err != nil {
		return fmt.Errorf("failed to pause container %s: %w", name, err)
	}
	return nil
}

func (a *DockerAdapter) ResumeInstance(ctx context.Context, name string) error {
	if err := a.cli.ContainerUnpause(ctx, name); err != nil {
		return fmt.Errorf("failed to resume container %s: %w", name, err)
	}
	return nil
}

// ResizeInstance updates the CPU and memory limits of a container.
func (a *DockerAdapter) ResizeInstance(ctx context.Context, id string, cpuNanoCPUs, memoryBytes int64) error {
	resp, err := a.cli.ContainerUpdate(ctx, id, container.UpdateConfig{
		Resources: container.Resources{
			NanoCPUs:   cpuNanoCPUs,
			Memory:     memoryBytes,
			MemorySwap: memoryBytes, // Must be >= Memory; setting equal disables swap while allowing memory update
		},
	})
	if err != nil {
		return fmt.Errorf("failed to update container %s: %w", id, err)
	}
	if resp.Warnings != nil {
		a.logger.Warn("container update warnings", "container_id", id, "warnings", resp.Warnings)
	}
	return nil

Copilot AI Apr 28, 2026


DockerAdapter gained PauseInstance/ResumeInstance/ResizeInstance but docker/adapter_unit_test.go has no tests for these new methods. Add unit tests that verify the adapter forwards to ContainerPause/ContainerUnpause/ContainerUpdate and wraps errors consistently (similar to existing Start/Stop tests).

Comment on lines +156 to +163
	return nil // Firecracker does not support pause/resume
}

func (a *FirecrackerAdapter) ResumeInstance(ctx context.Context, id string) error {
	if a.cfg.MockMode {
		return nil
	}
	return nil // Firecracker does not support pause/resume

Copilot AI Apr 28, 2026


PauseInstance returns nil even when Firecracker pause/resume isn't supported. This will make callers think the instance was paused successfully and may update persisted status/metrics incorrectly. Return a clear error (e.g., NotImplemented/InvalidOperation) when not in MockMode, consistent with adapter_noop.go and other unimplemented Firecracker operations.

Suggested change
- 	return nil // Firecracker does not support pause/resume
- }
-
- func (a *FirecrackerAdapter) ResumeInstance(ctx context.Context, id string) error {
- 	if a.cfg.MockMode {
- 		return nil
- 	}
- 	return nil // Firecracker does not support pause/resume
+ 	return fmt.Errorf("pause instance is not supported by Firecracker")
+ }
+
+ func (a *FirecrackerAdapter) ResumeInstance(ctx context.Context, id string) error {
+ 	if a.cfg.MockMode {
+ 		return nil
+ 	}
+ 	return fmt.Errorf("resume instance is not supported by Firecracker")

if a.cfg.MockMode {
	return nil
}
return nil // Firecracker does not support resize

Copilot AI Apr 28, 2026


ResizeInstance returns nil even though the comment says Firecracker doesn't support resize. This silently reports success to callers and can desync desired vs actual resources. Return an explicit not-supported error when not in MockMode.

Suggested change
- return nil // Firecracker does not support resize
+ return fmt.Errorf("resize is not supported by Firecracker")

Comment on lines +650 to +663
	inst.Status = domain.StatusPaused
	if err := s.repo.Update(ctx, inst); err != nil {
		return err
	}

	if err := s.auditSvc.Log(ctx, inst.UserID, "instance.pause", "instance", inst.ID.String(), map[string]interface{}{
		"name": inst.Name,
	}); err != nil {
		s.logger.Warn("failed to log audit event", "action", "instance.pause", "instance_id", inst.ID, "error", err)
	}

	platform.InstanceOperationsTotal.WithLabelValues("pause", "success").Inc()
	s.logger.Info("instance paused", "instance_id", inst.ID)
	return nil

Copilot AI Apr 28, 2026


PauseInstance changes the instance status to PAUSED but does not update InstancesTotal gauges (e.g., decrement running / increment paused). StartInstance/StopInstance maintain these metrics, so pause/resume will currently leave instance counts incorrect in Prometheus.

Comment on lines +695 to +708
	inst.Status = domain.StatusRunning
	if err := s.repo.Update(ctx, inst); err != nil {
		return err
	}

	if err := s.auditSvc.Log(ctx, inst.UserID, "instance.resume", "instance", inst.ID.String(), map[string]interface{}{
		"name": inst.Name,
	}); err != nil {
		s.logger.Warn("failed to log audit event", "action", "instance.resume", "instance_id", inst.ID, "error", err)
	}

	platform.InstanceOperationsTotal.WithLabelValues("resume", "success").Inc()
	s.logger.Info("instance resumed", "instance_id", inst.ID)
	return nil

Copilot AI Apr 28, 2026


ResumeInstance sets status back to RUNNING but does not update InstancesTotal gauges for a PAUSED→RUNNING transition. If you introduce the PAUSED status label, make sure the gauges are kept consistent here as well.

…ause/resume tests

- Move regex compilation in applyDomainResize to struct fields to avoid
  recompilation on every call (Must Fix from PR #320 review)
- Add ErrNotSupported sentinel error in Firecracker adapter and return it
  from PauseInstance/ResumeInstance/ResizeInstance stubs (Consider item)
- Add unit tests for pause/resume state transitions in instance service
  covering: success paths, wrong state errors, and compute backend errors
Copilot AI review requested due to automatic review settings April 28, 2026 14:33

Copilot AI left a comment


Pull request overview

Copilot reviewed 28 out of 28 changed files in this pull request and generated 8 comments.



// Type returns a string identifier of the backend (e.g., "docker", "kvm").
Type() string
// ResizeInstance updates the CPU and memory limits of a running or stopped instance.
ResizeInstance(ctx context.Context, id string, cpu, memory int64) error

Copilot AI Apr 28, 2026


ResizeInstance's cpu/memory parameters are undocumented in the interface, but implementations treat them as nano-CPUs and bytes (e.g., Docker uses NanoCPUs/Memory; libvirt divides cpu by 1e9 and memory by 1024). Please document the required units in the ComputeBackend interface comment (and consider renaming params to cpuNano/memoryBytes for clarity) to prevent callers from passing vCPU counts or MB and getting invalid 0 vCPU / incorrect memory updates.

Suggested change
- ResizeInstance(ctx context.Context, id string, cpu, memory int64) error
+ // cpuNano must be specified in nano-CPUs (for example, 1 vCPU = 1_000_000_000).
+ // memoryBytes must be specified in bytes (for example, 512 MiB = 536_870_912).
+ // Callers must not pass vCPU counts or MB/MiB values directly.
+ ResizeInstance(ctx context.Context, id string, cpuNano, memoryBytes int64) error

Comment on lines +650 to +707
	inst.Status = domain.StatusPaused
	if err := s.repo.Update(ctx, inst); err != nil {
		return err
	}

	if err := s.auditSvc.Log(ctx, inst.UserID, "instance.pause", "instance", inst.ID.String(), map[string]interface{}{
		"name": inst.Name,
	}); err != nil {
		s.logger.Warn("failed to log audit event", "action", "instance.pause", "instance_id", inst.ID, "error", err)
	}

	platform.InstanceOperationsTotal.WithLabelValues("pause", "success").Inc()
	s.logger.Info("instance paused", "instance_id", inst.ID)
	return nil
}

// ResumeInstance resumes a paused instance back to running state.
func (s *InstanceService) ResumeInstance(ctx context.Context, idOrName string) error {
	userID := appcontext.UserIDFromContext(ctx)
	tenantID := appcontext.TenantIDFromContext(ctx)

	if err := s.rbacSvc.Authorize(ctx, userID, tenantID, domain.PermissionInstanceUpdate, idOrName); err != nil {
		return err
	}

	inst, err := s.GetInstance(ctx, idOrName)
	if err != nil {
		return err
	}

	if inst.Status != domain.StatusPaused {
		return errors.New(errors.Conflict, "instance must be PAUSED to resume, got: "+string(inst.Status))
	}

	target := inst.ContainerID
	if target == "" {
		target = s.formatContainerName(inst.ID)
	}

	if err := s.compute.ResumeInstance(ctx, target); err != nil {
		platform.InstanceOperationsTotal.WithLabelValues("resume", "failure").Inc()
		s.logger.Error("failed to resume container", "container_id", target, "error", err)
		return errors.Wrap(errors.Internal, "failed to resume container", err)
	}

	inst.Status = domain.StatusRunning
	if err := s.repo.Update(ctx, inst); err != nil {
		return err
	}

	if err := s.auditSvc.Log(ctx, inst.UserID, "instance.resume", "instance", inst.ID.String(), map[string]interface{}{
		"name": inst.Name,
	}); err != nil {
		s.logger.Warn("failed to log audit event", "action", "instance.resume", "instance_id", inst.ID, "error", err)
	}

	platform.InstanceOperationsTotal.WithLabelValues("resume", "success").Inc()
	s.logger.Info("instance resumed", "instance_id", inst.ID)

Copilot AI Apr 28, 2026


PauseInstance/ResumeInstance update the instance status in the DB but don't update the InstancesTotal gauge (unlike StartInstance/StopInstance). This will leave Prometheus instance counts inconsistent once instances move into/out of PAUSED. Consider decrementing running/incrementing paused on pause, and decrementing paused/incrementing running on resume (using the same backend label).

Copilot uses AI. Check for mistakes.
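The gauge fix the comment suggests could be sketched as follows; `statusGauge` is a hypothetical stand-in for the PR's Prometheus `InstancesTotal` gauge vector, so treat this as an illustration of the bookkeeping rather than the actual metrics API:

```go
package main

import "fmt"

// statusGauge stands in for a Prometheus GaugeVec keyed by (status, backend).
var statusGauge = map[string]int{}

func adjust(status, backend string, delta int) {
	statusGauge[status+"/"+backend] += delta
}

// onPause mirrors the suggested fix: decrement running, increment paused.
func onPause(backend string) {
	adjust("running", backend, -1)
	adjust("paused", backend, +1)
}

// onResume is the inverse transition.
func onResume(backend string) {
	adjust("paused", backend, -1)
	adjust("running", backend, +1)
}

func main() {
	adjust("running", "docker", 3) // three running instances on this backend
	onPause("docker")
	fmt.Println(statusGauge["running/docker"], statusGauge["paused/docker"]) // 2 1
	onResume("docker")
	fmt.Println(statusGauge["running/docker"], statusGauge["paused/docker"]) // 3 0
}
```

On pause the running count drops and the paused count rises by the same amount under the same backend label, so the per-backend totals stay consistent as instances move through PAUSED.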
Comment thread internal/repositories/docker/adapter.go Outdated
if err != nil {
	return fmt.Errorf("failed to update container %s: %w", id, err)
}
if resp.Warnings != nil {

Copilot AI Apr 28, 2026

In ResizeInstance, resp.Warnings is a slice; it can be non-nil but empty. Logging a warning when len(resp.Warnings)==0 can create noisy logs. Consider checking len(resp.Warnings) > 0 instead of resp.Warnings != nil.

Suggested change
if resp.Warnings != nil {
if len(resp.Warnings) > 0 {

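For context, a Go slice can be non-nil yet empty, which is exactly the case where the nil check misfires; a minimal sketch:

```go
package main

import "fmt"

func main() {
	var fromAbsent []string // nil slice: no warnings field at all
	fromEmpty := []string{} // non-nil but empty: zero warnings present

	fmt.Println(fromAbsent != nil, len(fromAbsent) > 0) // false false
	fmt.Println(fromEmpty != nil, len(fromEmpty) > 0)   // true false — the nil check would log here
}
```

The `len(resp.Warnings) > 0` form covers both cases and only logs when there is actually something to report.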
Comment on lines +158 to +176
func (r *ResilientCompute) GetInstanceLogs(ctx context.Context, id string) (io.ReadCloser, error) {
	var rc io.ReadCloser
	err := r.callProtected(ctx, r.opts.CallTimeout, func(ctx context.Context) error {
		var e error
		rc, e = r.inner.GetInstanceLogs(ctx, id)
		return e
	})
	return rc, err
}

func (r *ResilientCompute) GetInstanceStats(ctx context.Context, id string) (io.ReadCloser, error) {
	var rc io.ReadCloser
	err := r.callProtected(ctx, r.opts.CallTimeout, func(ctx context.Context) error {
		var e error
		rc, e = r.inner.GetInstanceStats(ctx, id)
		return e
	})
	return rc, err
}

Copilot AI Apr 28, 2026

ResilientCompute's timeout context is canceled before the caller can consume the returned stream. For GetInstanceLogs/GetInstanceStats, callProtected creates ctx2 and defers cancel(), so ctx2 is canceled as soon as the backend returns the ReadCloser, which can prematurely terminate Docker/libvirt log/stats streams. Consider either (1) not using callProtected for streaming methods, or (2) returning a wrapped ReadCloser whose Close() triggers cancel, so the context stays alive while the stream is read.

Comment on lines +98 to +103
// callProtected runs fn through bulkhead → circuit breaker → timeout.
func (r *ResilientCompute) callProtected(ctx context.Context, timeout time.Duration, fn func(ctx context.Context) error) error {
	return r.bulkhead.Execute(ctx, func() error {
		return r.cb.Execute(func() error {
			ctx2, cancel := context.WithTimeout(ctx, timeout)
			defer cancel()

Copilot AI Apr 28, 2026

callProtected applies the per-call timeout only inside the circuit-breaker closure, after a bulkhead slot is acquired. This means total request time can exceed CallTimeout by up to BulkheadWait (or indefinitely if BulkheadWait=0 and ctx has no deadline). If the intent is a true end-to-end per-call timeout, create the timeout context before bulkhead acquisition and use that context throughout (bulkhead + circuit breaker + backend call).

Suggested change
// callProtected runs fn through bulkhead → circuit breaker → timeout.
func (r *ResilientCompute) callProtected(ctx context.Context, timeout time.Duration, fn func(ctx context.Context) error) error {
return r.bulkhead.Execute(ctx, func() error {
return r.cb.Execute(func() error {
ctx2, cancel := context.WithTimeout(ctx, timeout)
defer cancel()
// callProtected runs fn through timeout → bulkhead → circuit breaker so the
// per-call timeout applies to the full operation, including bulkhead wait time.
func (r *ResilientCompute) callProtected(ctx context.Context, timeout time.Duration, fn func(ctx context.Context) error) error {
ctx2, cancel := context.WithTimeout(ctx, timeout)
defer cancel()
return r.bulkhead.Execute(ctx2, func() error {
return r.cb.Execute(func() error {

Comment on lines +52 to +77
func (b *Bulkhead) acquire(ctx context.Context) error {
	select {
	case <-ctx.Done():
		return ErrBulkheadFull
	default:
	}

	if b.timeout > 0 {
		timer := time.NewTimer(b.timeout)
		defer timer.Stop()
		select {
		case b.sem <- struct{}{}:
			return nil
		case <-timer.C:
			return ErrBulkheadFull
		case <-ctx.Done():
			return ErrBulkheadFull
		}
	}
	// No explicit timeout — rely on context.
	select {
	case b.sem <- struct{}{}:
		return nil
	case <-ctx.Done():
		return ErrBulkheadFull
	}

Copilot AI Apr 28, 2026

Bulkhead.acquire returns ErrBulkheadFull for all ctx.Done() cases (including caller cancellation/deadline exceeded). This loses the original cancellation reason and can misclassify failures as bulkhead saturation. Consider returning ctx.Err() when the context is already done, and reserve ErrBulkheadFull for actual semaphore wait timeouts (timer.C) or when ctx expires specifically while waiting for a slot (wrapping ctx.Err() if you still want a sentinel).

Comment on lines +130 to +131
reMemory: regexp.MustCompile(`<memory unit="KiB">\d+</memory>`),
reCurrentMemory: regexp.MustCompile(`<currentMemory unit="KiB">\d+</currentMemory>`),

Copilot AI Apr 28, 2026

The precompiled memory/currentMemory regexes only match double-quoted attributes (unit="KiB"), but the domain XML generated in templates.go uses single quotes (unit='KiB') and does not include a <currentMemory> tag. As a result, ResizeInstance will not update memory at all for domains created by this adapter. Update the regexes to handle the actual template output (both quote styles) or, preferably, parse/modify the XML structurally instead of regex substitution.

Suggested change
reMemory: regexp.MustCompile(`<memory unit="KiB">\d+</memory>`),
reCurrentMemory: regexp.MustCompile(`<currentMemory unit="KiB">\d+</currentMemory>`),
reMemory: regexp.MustCompile(`<memory unit=['"]KiB['"]>\d+</memory>`),
reCurrentMemory: regexp.MustCompile(`<currentMemory unit=['"]KiB['"]>\d+</currentMemory>`),

Comment on lines +249 to +255
// applyDomainResize updates vCPU and memory in domain XML.
func (a *LibvirtAdapter) applyDomainResize(xml string, memoryKiB, vcpus int) string {
	// Replace memory allocation
	xml = a.reMemory.ReplaceAllString(xml, fmt.Sprintf(`<memory unit="KiB">%d</memory>`, memoryKiB))
	xml = a.reCurrentMemory.ReplaceAllString(xml, fmt.Sprintf(`<currentMemory unit="KiB">%d</currentMemory>`, memoryKiB))
	// Replace vCPU count
	xml = a.reVCPU.ReplaceAllString(xml, fmt.Sprintf(`<vcpu>%d</vcpu>`, vcpus))

Copilot AI Apr 28, 2026

applyDomainResize replaces the entire <vcpu ...>... element with a bare %d, which drops existing attributes like placement='static' from the template and may change libvirt behavior. Consider preserving attributes via capturing groups (e.g., keep the opening tag) or using an XML parser to update only the numeric text content.

Suggested change
// applyDomainResize updates vCPU and memory in domain XML.
func (a *LibvirtAdapter) applyDomainResize(xml string, memoryKiB, vcpus int) string {
// Replace memory allocation
xml = a.reMemory.ReplaceAllString(xml, fmt.Sprintf(`<memory unit="KiB">%d</memory>`, memoryKiB))
xml = a.reCurrentMemory.ReplaceAllString(xml, fmt.Sprintf(`<currentMemory unit="KiB">%d</currentMemory>`, memoryKiB))
// Replace vCPU count
xml = a.reVCPU.ReplaceAllString(xml, fmt.Sprintf(`<vcpu>%d</vcpu>`, vcpus))
// replaceVCPUCountPreservingAttributes updates the vCPU count while preserving any existing
// attributes on the <vcpu> element, such as placement="static".
func (a *LibvirtAdapter) replaceVCPUCountPreservingAttributes(xml string, vcpus int) string {
reVCPUContent := regexp.MustCompile(`(?s)^(<vcpu\b[^>]*>)[^<]*(</vcpu>)$`)
return a.reVCPU.ReplaceAllStringFunc(xml, func(match string) string {
parts := reVCPUContent.FindStringSubmatch(match)
if len(parts) != 3 {
return match
}
return fmt.Sprintf(`%s%d%s`, parts[1], vcpus, parts[2])
})
}
// applyDomainResize updates vCPU and memory in domain XML.
func (a *LibvirtAdapter) applyDomainResize(xml string, memoryKiB, vcpus int) string {
// Replace memory allocation
xml = a.reMemory.ReplaceAllString(xml, fmt.Sprintf(`<memory unit="KiB">%d</memory>`, memoryKiB))
xml = a.reCurrentMemory.ReplaceAllString(xml, fmt.Sprintf(`<currentMemory unit="KiB">%d</currentMemory>`, memoryKiB))
// Replace vCPU count while preserving existing <vcpu> attributes.
xml = a.replaceVCPUCountPreservingAttributes(xml, vcpus)


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 13

🧹 Nitpick comments (4)
internal/platform/circuit_breaker.go (1)

84-92: Replace hard-coded default values with named constants

Lines 85, 88, and 91 hard-code breaker defaults. Promote them to constants to keep defaults centralized and avoid scattered literals.

Proposed refactor
+const (
+	defaultFailureThreshold = 5
+	defaultResetTimeout     = 30 * time.Second
+	defaultSuccessRequired  = 1
+)
+
 func NewCircuitBreakerWithOpts(opts CircuitBreakerOpts) *CircuitBreaker {
 	if opts.Threshold <= 0 {
-		opts.Threshold = 5
+		opts.Threshold = defaultFailureThreshold
 	}
 	if opts.ResetTimeout <= 0 {
-		opts.ResetTimeout = 30 * time.Second
+		opts.ResetTimeout = defaultResetTimeout
 	}
 	if opts.SuccessRequired <= 0 {
-		opts.SuccessRequired = 1
+		opts.SuccessRequired = defaultSuccessRequired
 	}

As per coding guidelines **/*.go: Do not use magic numbers - use named constants instead.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/platform/circuit_breaker.go` around lines 84 - 92, Replace the
hard-coded magic numbers in the circuit breaker defaults with named constants:
declare DefaultThreshold (int = 5), DefaultResetTimeout (time.Duration = 30 *
time.Second) and DefaultSuccessRequired (int = 1) near the top of the circuit
breaker implementation, then use these constants in the initialization block
that currently checks opts.Threshold, opts.ResetTimeout and opts.SuccessRequired
(replace the literals 5, 30 * time.Second and 1 with the corresponding
constants) so defaults are centralized and self-documented.
internal/repositories/docker/fakes_test.go (1)

98-104: Track pause/unpause calls in the fake for better test assertions

ContainerPause and ContainerUnpause currently return success without recording calls, unlike other fake methods. Incrementing counters here improves verification of pause/resume flows.

Proposed refactor
 func (f *fakeDockerClient) ContainerPause(ctx context.Context, containerID string) error {
+	f.inc("ContainerPause")
 	return nil
 }

 func (f *fakeDockerClient) ContainerUnpause(ctx context.Context, containerID string) error {
+	f.inc("ContainerUnpause")
 	return nil
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/repositories/docker/fakes_test.go` around lines 98 - 104, The fake
Docker client methods ContainerPause and ContainerUnpause don't record calls;
update the fakeDockerClient to track these invocations (e.g., add
pauseCount/unpauseCount or record entries in an operations slice and protect
with the same mutex used by other fake methods) and increment the appropriate
counter/append an entry inside ContainerPause and ContainerUnpause so tests can
assert pause/unpause behavior; ensure you follow the existing fake's concurrency
pattern and naming conventions used by other methods in fakeDockerClient.
internal/core/services/instance_unit_test.go (1)

981-1096: Consider table-driving the pause/resume cases.

The new suite repeats the same setup and assertion pattern several times. A table would cut the duplication and make it easier to add more pause/resume permutations later.

As per coding guidelines, use table-driven tests in test files.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/core/services/instance_unit_test.go` around lines 981 - 1096, The
tests for PauseInstance_* and ResumeInstance_* duplicate setup/expectation
logic; convert them into a single table-driven loop that defines cases (name,
method: "PauseInstance" or "ResumeInstance", initial Status, compute error or
nil, expected final Status or error substring, whether Update and Audit should
be asserted). In the loop call svc.PauseInstance or svc.ResumeInstance based on
the case, set up common mocks once per case (repo.GetByName Maybe, repo.GetByID
-> inst, rbacSvc.Authorize for PermissionInstanceUpdate and
PermissionInstanceRead), conditionally set
compute.On("PauseInstance"/"ResumeInstance", mock.Anything, "cid-1").Return(...)
and compute.On("Type").Return("docker").Maybe(), and conditionally assert
repo.On("Update", ...) and auditSvc.On("Log", ...).Return(...).Use
mock.MatchedBy for status checks and assert error/no-error and inst.Status per
case; this collapses the repeated logic while keeping behavior for PauseInstance
and ResumeInstance.
internal/platform/resilient_compute.go (1)

33-52: Extract the default limits into named constants.

The constructor hard-codes several fallback durations and thresholds, which makes the defaults harder to audit. As per coding guidelines, Do not use magic numbers - use named constants instead.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/platform/resilient_compute.go` around lines 33 - 52, Extract the
hard-coded defaults in ResilientComputeOpts.withDefaults into named
package-level constants (e.g. defaultCallTimeout, defaultLongCallTimeout,
defaultCBThreshold, defaultCBResetTimeout, defaultBulkheadMaxConc,
defaultBulkheadWait) declared near the top of the file; then replace the literal
values used when setting o.CallTimeout, o.LongCallTimeout, o.CBThreshold,
o.CBResetTimeout, o.BulkheadMaxConc and o.BulkheadWait with those constants to
remove magic numbers and make the defaults easy to audit.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/core/services/instance_unit_test.go`:
- Line 303: The test expectations that call repo.On("GetByName", mock.Anything,
instanceID.String()).Return(...).Maybe() should be required instead of optional;
locate the GetByName expectations in instance_unit_test.go (the repo.On calls
that use "GetByName" with instanceID.String()) and remove the .Maybe() so the
mock requires that GetByName is invoked; apply the same change for each
occurrence noted (lines corresponding to the other GetByName expectations) to
ensure the name-first resolution path exercised by InstanceService is enforced
by the tests.

In `@internal/core/services/instance.go`:
- Around line 644-653: When s.compute.PauseInstance (and symmetrically
s.compute.ResumeInstance) succeeds but s.repo.Update fails, add a best-effort
compensating rollback before returning to avoid state drift: after setting
inst.Status and detecting s.repo.Update error inside the methods that call
s.compute.PauseInstance/ResumeInstance (e.g., the current Pause flow and the
Resume flow), call the opposite compute operation (call s.compute.ResumeInstance
on Pause failure path, call s.compute.PauseInstance on Resume failure path) and
log both the repo update error and any rollback error (using s.logger and
platform.InstanceOperationsTotal metrics); ensure the original repo update error
is still returned but include rollback outcome in logs so the system records
both attempts.

In `@internal/platform/bulkhead.go`:
- Around line 22-38: The Bulkhead default WaitTimeout from BulkheadOpts isn't
applied in NewBulkhead, so callers get a zero timeout and acquire may block
until context cancel; modify NewBulkhead to set opts.WaitTimeout to
5*time.Second when it's zero (e.g., if opts.WaitTimeout == 0 { opts.WaitTimeout
= 5 * time.Second }) and assign that value to the Bulkhead.timeout field so
functions like acquire honor the documented 5s default.

In `@internal/platform/circuit_breaker.go`:
- Around line 195-206: The recordSuccess logic currently treats any non-HalfOpen
state the same and resets failureCount/state, which lets a late success close an
already-OPEN breaker; modify the switch in recordSuccess so only an actual
StateClosed path clears failureCount and sets state (i.e., add an explicit case
StateClosed that resets failureCount/state) and do nothing when cb.state ==
StateOpen so stale successes cannot transition an open breaker to closed; keep
the HalfOpen handling (successCount, successRequired and
transitionLocked(StateClosed)) as-is and reference cb.successCount,
cb.failureCount, cb.successRequired, cb.transitionLocked, and the state
constants StateHalfOpen/StateClosed/StateOpen.

In `@internal/platform/resilient_compute_test.go`:
- Around line 128-330: Multiple tests in resilient_compute_test.go repeat the
same setup and assertions; refactor them into table-driven subtests: consolidate
scenarios like TestResilientComputePassthrough,
TestResilientComputeCircuitTrips, TestResilientComputeBulkheadLimits,
TestResilientComputeTimeout, TestResilientComputePingBypassesBulkhead and the
ResizeInstance subtests into one or more table-driven tests where each case
defines a name, mock (mockCompute with delay/err), ResilientComputeOpts, the
invoked method(s) (e.g., StartInstance, Ping, ResizeInstance,
LaunchInstanceWithOptions, Exec), and expected outcomes (error type like
ErrCircuitOpen/ErrBulkheadFull or expected return values/call counts); iterate
cases using t.Run and reuse NewResilientCompute and common ctx/setup code,
asserting results per-case so duplication is removed while keeping checks for
callCount, timing expectations, and bypass behavior.
- Around line 19-125: Replace the hand-rolled mockCompute with a testify-based
mock by defining a MockCompute struct that embeds testify/mock.Mock and
implementing each interface method to call m.Called(...) and return the expected
typed results (implement LaunchInstanceWithOptions, StartInstance, StopInstance,
DeleteInstance, GetInstanceLogs, GetInstanceStats, GetInstancePort,
GetInstanceIP, GetConsoleURL, Exec, RunTask, WaitTask, CreateNetwork,
DeleteNetwork, AttachVolume, DetachVolume, Ping, ResizeInstance, PauseInstance,
ResumeInstance, and Type). Remove atomic.Int64 callCount and delay fields from
the mock; if tests need call counting or simulated delay use testify's
mock.AssertNumberOfCalls and the Run(func(args mock.Arguments){...}) hook on
specific method expectations to increment counters or sleep. Ensure each method
converts mock.Called return values to the correct concrete types (string,
[]string, int, io.ReadCloser, error, etc.) to match the original signatures.

In `@internal/platform/resilient_compute.go`:
- Around line 152-156: The ResizeInstance method currently uses r.callProtected
with r.opts.CallTimeout which is too short for destructive/long-running
operations; change the call to use r.opts.LongCallTimeout (or a dedicated resize
timeout field) so ResilientCompute.ResizeInstance calls r.callProtected(ctx,
r.opts.LongCallTimeout, ...) instead of r.opts.CallTimeout to avoid backend
cancellations during VM resize transitions.

In `@internal/repositories/firecracker/adapter.go`:
- Around line 154-173: The MockMode branches in PauseInstance, ResumeInstance
and ResizeInstance currently return nil and silently report success; change them
to return ErrNotSupported (or a specific ErrMockModeNotSupported if you prefer)
when a.cfg.MockMode is true so callers get a non-nil error for unsupported
operations; update the functions PauseInstance, ResumeInstance and
ResizeInstance to return the error rather than nil and ensure ErrNotSupported is
the error used so behavior is consistent with non-mock paths.
- Around line 27-29: Remove the package-level ErrNotSupported sentinel from
adapter.go and stop exporting a global error; instead return a locally created
error where needed (e.g., use fmt.Errorf or errors.New inside the
function/method that currently returns ErrNotSupported) or define a
function-scoped or method-scoped typed error if callers need type assertions;
update all references to ErrNotSupported to use the new local error creation
(search for ErrNotSupported usages) so no package-level global var remains
alongside idRegex.

In `@internal/repositories/libvirt/adapter.go`:
- Around line 221-243: The code currently undefines the live domain
(DomainUndefine) before ensuring the new XML can be successfully defined,
risking the VM being left offline; instead, validate and create the replacement
definition first by calling DomainDefineXML with a temporary/unique name/UUID
(generate a temp name or modify the <name> or <uuid> in newDOMXML returned from
applyDomainResize) to ensure the XML parses and a definition can be created,
then if that succeeds, undefine the original domain (DomainUndefine on dom) and
redefine the final newDOMXML (or rename the temp definition to the original
name/uuid as needed), and only then start the domain (DomainCreate) if
wasRunning; reference functions/methods: DomainGetXMLDesc, applyDomainResize,
DomainDefineXML, DomainUndefine, DomainCreate.
- Around line 249-257: The applyDomainResize function currently mutates domain
XML with regex (see applyDomainResize and the regexes reMemory, reCurrentMemory,
reVCPU), losing vCPU attributes (placement/cpuset) and mishandling memory units;
replace the regex approach by parsing the XML into an element tree (e.g., using
Go's encoding/xml or an XML DOM library), locate the <vcpu> element and update
only its inner text while preserving all attributes, locate <memory> and
<currentMemory> elements, detect/convert their existing unit attribute to the
desired KiB value (or set the unit attribute to "KiB" after converting) instead
of hardcoding unit="KiB", and then marshal the modified XML back to string;
update applyDomainResize to use these parsed-element updates and remove reliance
on reMemory/reCurrentMemory/reVCPU regex replacements.

In `@internal/workers/pipeline_worker_test.go`:
- Around line 161-169: The three mock methods on mockComputeBackendExtended
(PauseInstance, ResumeInstance, ResizeInstance) currently return nil and
silently succeed; update each to record the call via m.Called(...) (e.g., for
ResizeInstance pass ctx, id, cpu, memory) and return the mocked error/result
from the call (use args.Error(0) / args.Get(0) as appropriate) so tests can
assert invocations and errors instead of masking unexpected calls.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b1fd7891-edeb-490e-ac8b-384c0d8e1d6d

📥 Commits

Reviewing files that changed from the base of the PR and between ec3e19c and b11d04d.

📒 Files selected for processing (28)
  • internal/api/setup/router.go
  • internal/core/domain/instance.go
  • internal/core/ports/compute.go
  • internal/core/ports/instance.go
  • internal/core/services/instance.go
  • internal/core/services/instance_unit_test.go
  • internal/core/services/mock_compute_test.go
  • internal/handlers/instance_handler.go
  • internal/handlers/instance_handler_test.go
  • internal/platform/bulkhead.go
  • internal/platform/circuit_breaker.go
  • internal/platform/resilient_compute.go
  • internal/platform/resilient_compute_test.go
  • internal/repositories/docker/adapter.go
  • internal/repositories/docker/fakes_test.go
  • internal/repositories/firecracker/adapter.go
  • internal/repositories/firecracker/adapter_noop.go
  • internal/repositories/k8s/kubeadm_provisioner_test.go
  • internal/repositories/k8s/mocks_test.go
  • internal/repositories/libvirt/adapter.go
  • internal/repositories/libvirt/lb_proxy_test.go
  • internal/repositories/libvirt/libvirt_client.go
  • internal/repositories/libvirt/mock_client_test.go
  • internal/repositories/libvirt/real_client.go
  • internal/repositories/noop/adapters.go
  • internal/workers/database_failover_worker_test.go
  • internal/workers/healing_worker_test.go
  • internal/workers/pipeline_worker_test.go

Comment on lines +36 to +37
// StatusPaused indicates the instance is paused (frozen CPU, retained memory/network).
StatusPaused InstanceStatus = "PAUSED"

⚠️ Potential issue | 🟠 Major

StatusPaused introduction is not reflected in existing status switch paths

Adding PAUSED at Line 37 needs corresponding handling in current status-based switches; provided snippets show gaps in:

  • internal/core/services/instance.go:868-879 (termination metrics),
  • internal/core/services/dashboard.go:55-66 (running/stopped summary),
  • internal/autoscaler/server.go:248-260 (proto state mapping).

Without those updates, paused instances are silently miscounted or unmapped.
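One way to keep those switch paths honest once PAUSED exists is a mapping helper whose default arm reports unmapped states instead of silently dropping them. The sketch below uses illustrative names, not the repository's actual enums or dashboard code:

```go
package main

import "fmt"

type InstanceStatus string

const (
	StatusRunning InstanceStatus = "RUNNING"
	StatusStopped InstanceStatus = "STOPPED"
	StatusPaused  InstanceStatus = "PAUSED"
)

// summaryBucket maps a status to the dashboard bucket it should count under.
// The explicit PAUSED arm keeps the summary from miscounting paused instances,
// and the error default surfaces any future status that lacks a mapping.
func summaryBucket(s InstanceStatus) (string, error) {
	switch s {
	case StatusRunning:
		return "running", nil
	case StatusStopped:
		return "stopped", nil
	case StatusPaused:
		return "paused", nil
	default:
		return "", fmt.Errorf("unmapped instance status %q", s)
	}
}

func main() {
	for _, s := range []InstanceStatus{StatusRunning, StatusPaused, "ZOMBIE"} {
		bucket, err := summaryBucket(s)
		fmt.Println(s, bucket, err)
	}
}
```

The same pattern applies to the termination-metrics and proto-state switches the comment lists: each one needs an explicit PAUSED arm, and an error (or log) in the default keeps the next status addition from slipping through unnoticed.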

Comment thread internal/core/services/instance_unit_test.go
Comment thread internal/core/services/instance.go
Comment thread internal/platform/bulkhead.go
Comment on lines +195 to +206
switch cb.state {
case StateHalfOpen:
	cb.successCount++
	if cb.successCount >= cb.successRequired {
		cb.failureCount = 0
		cb.successCount = 0
		cbFunc, name, from, to, changed = cb.transitionLocked(StateClosed)
	}
default:
	cb.failureCount = 0
	cb.state = StateClosed
}

⚠️ Potential issue | 🔴 Critical

Do not close an OPEN breaker from stale successes

At Line 205, recordSuccess can set StateOpen -> StateClosed in the default branch. Under concurrency, a request admitted while closed can finish successfully after another request has already opened the breaker, and this code will close it prematurely.

Proposed fix
 func (cb *CircuitBreaker) recordSuccess() {
@@
 	switch cb.state {
 	case StateHalfOpen:
 		cb.successCount++
 		if cb.successCount >= cb.successRequired {
 			cb.failureCount = 0
 			cb.successCount = 0
 			cbFunc, name, from, to, changed = cb.transitionLocked(StateClosed)
 		}
-	default:
+	case StateClosed:
 		cb.failureCount = 0
-		cb.state = StateClosed
+	case StateOpen:
+		// Ignore stale successes from requests admitted before breaker opened.
 	}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
switch cb.state {
case StateHalfOpen:
cb.successCount++
if cb.successCount >= cb.successRequired {
cb.failureCount = 0
cb.successCount = 0
cbFunc, name, from, to, changed = cb.transitionLocked(StateClosed)
}
default:
cb.failureCount = 0
cb.state = StateClosed
}
switch cb.state {
case StateHalfOpen:
cb.successCount++
if cb.successCount >= cb.successRequired {
cb.failureCount = 0
cb.successCount = 0
cbFunc, name, from, to, changed = cb.transitionLocked(StateClosed)
}
case StateClosed:
cb.failureCount = 0
case StateOpen:
// Ignore stale successes from requests admitted before breaker opened.
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/platform/circuit_breaker.go` around lines 195 - 206, The
recordSuccess logic currently treats any non-HalfOpen state the same and resets
failureCount/state, which lets a late success close an already-OPEN breaker;
modify the switch in recordSuccess so only an actual StateClosed path clears
failureCount and sets state (i.e., add an explicit case StateClosed that resets
failureCount/state) and do nothing when cb.state == StateOpen so stale successes
cannot transition an open breaker to closed; keep the HalfOpen handling
(successCount, successRequired and transitionLocked(StateClosed)) as-is and
reference cb.successCount, cb.failureCount, cb.successRequired,
cb.transitionLocked, and the state constants
StateHalfOpen/StateClosed/StateOpen.

Comment thread internal/repositories/firecracker/adapter.go Outdated
Comment on lines +154 to +173
func (a *FirecrackerAdapter) PauseInstance(ctx context.Context, id string) error {
	if a.cfg.MockMode {
		return nil
	}
	return ErrNotSupported
}

func (a *FirecrackerAdapter) ResumeInstance(ctx context.Context, id string) error {
	if a.cfg.MockMode {
		return nil
	}
	return ErrNotSupported
}

func (a *FirecrackerAdapter) ResizeInstance(ctx context.Context, id string, cpu, memory int64) error {
	if a.cfg.MockMode {
		return nil
	}
	return ErrNotSupported
}

⚠️ Potential issue | 🟠 Major

MockMode should not silently return success for unsupported operations.

Returning nil for pause/resume/resize in MockMode can falsely mark instances as paused/resumed/resized even though no backend action exists.

♻️ Proposed fix
 func (a *FirecrackerAdapter) PauseInstance(ctx context.Context, id string) error {
 	if a.cfg.MockMode {
-		return nil
+		return errNotSupported()
 	}
-	return ErrNotSupported
+	return errNotSupported()
 }

 func (a *FirecrackerAdapter) ResumeInstance(ctx context.Context, id string) error {
 	if a.cfg.MockMode {
-		return nil
+		return errNotSupported()
 	}
-	return ErrNotSupported
+	return errNotSupported()
 }

 func (a *FirecrackerAdapter) ResizeInstance(ctx context.Context, id string, cpu, memory int64) error {
 	if a.cfg.MockMode {
-		return nil
+		return errNotSupported()
 	}
-	return ErrNotSupported
+	return errNotSupported()
 }

As per coding guidelines **/*.go: Do not use silent failures - avoid blank identifier assignment like _ = someFunc().

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/repositories/firecracker/adapter.go` around lines 154 - 173, The
MockMode branches in PauseInstance, ResumeInstance and ResizeInstance currently
return nil and silently report success; change them to return ErrNotSupported
(or a specific ErrMockModeNotSupported if you prefer) when a.cfg.MockMode is
true so callers get a non-nil error for unsupported operations; update the
functions PauseInstance, ResumeInstance and ResizeInstance to return the error
rather than nil and ensure ErrNotSupported is the error used so behavior is
consistent with non-mock paths.

Comment thread internal/repositories/libvirt/adapter.go
Comment on lines +249 to +257
// applyDomainResize updates vCPU and memory in domain XML.
func (a *LibvirtAdapter) applyDomainResize(xml string, memoryKiB, vcpus int) string {
// Replace memory allocation
xml = a.reMemory.ReplaceAllString(xml, fmt.Sprintf(`<memory unit="KiB">%d</memory>`, memoryKiB))
xml = a.reCurrentMemory.ReplaceAllString(xml, fmt.Sprintf(`<currentMemory unit="KiB">%d</currentMemory>`, memoryKiB))
// Replace vCPU count
xml = a.reVCPU.ReplaceAllString(xml, fmt.Sprintf(`<vcpu>%d</vcpu>`, vcpus))
return xml
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat -n internal/repositories/libvirt/adapter.go | head -300 | tail -80

Repository: poyrazK/thecloud

Length of output: 3345


🏁 Script executed:

rg "reMemory|reCurrentMemory|reVCPU" internal/repositories/libvirt/adapter.go -A 1 -B 1

Repository: poyrazK/thecloud

Length of output: 885


🌐 Web query:

libvirt XML memory unit variants KiB B MiB specification

💡 Result:

In libvirt domain XML, memory-related elements (such as `<memory>`, `<currentMemory>`, `<maxMemory>`, and sizes in memory devices) support an optional `unit` attribute that specifies the unit for the numeric value. The default unit is "KiB" (kibibytes, 2^10 bytes). Valid unit variants include: "b"/"bytes" (1 byte), "KB" (10^3 bytes), "k"/"KiB" (2^10 bytes), "MB" (10^6 bytes), "M"/"MiB" (2^20 bytes), "GB" (10^9 bytes), "G"/"GiB" (2^30 bytes), "TB" (10^12 bytes), and "T"/"TiB" (2^40 bytes). Values are rounded up to the nearest kibibyte by libvirt and may be further adjusted by the hypervisor. These units apply consistently across memory allocation, backing, devices, and related elements like memtune. Examples: `<memory unit="KiB">1048576</memory>`, `<memory unit="MiB">1024</memory>`, and `<memory unit="GiB">1</memory>` all specify 1 GiB.


🌐 Web query:

libvirt vcpu tag attributes placement cpuset specification

💡 Result:

In libvirt domain XML configuration, the `<vcpu>` element supports two key attributes related to CPU placement and cpuset specification: `placement` and `cpuset`. The `placement` attribute (since libvirt 0.9.11, QEMU/KVM only) indicates the CPU placement mode for the domain process: 'static' (the default if a cpuset is specified) pins the domain process to the cpuset or all available CPUs; 'auto' pins to the advisory nodeset from the numad daemon, ignoring cpuset. The `cpuset` attribute (since 0.4.4) is a comma-separated list of host physical CPU numbers or ranges for default pinning of the domain process and vCPUs. Syntax: single numbers (e.g., '0'), ranges (e.g., '1-4'), exclusions (e.g., '^3' excludes CPU 3 from a prior range). Example: cpuset="1-4,^3,6" pins to CPUs 1, 2, 4, 6. Overrides: the cpuset is ignored if cputune/emulatorpin is specified (for the process) or vcpupin for specific vCPUs; unspecified vCPUs without vcpupin use this cpuset. Example XML: `<vcpu placement="static" cpuset="1-4,^3,6">2</vcpu>`. For finer control, use `<cputune>` with `<vcpupin>` and `<emulatorpin>`. These apply primarily to QEMU/KVM; check hypervisor support.


Use proper XML parsing instead of regex-based mutation for domain XML modifications.

The applyDomainResize function has two problems:

  1. The <vcpu> regex matches attributes like placement and cpuset (<vcpu[^>]*>), but the replacement <vcpu>%d</vcpu> strips them entirely, causing data loss.
  2. The memory and currentMemory regexes only match unit="KiB", but libvirt supports multiple units (KiB, MiB, GiB, bytes, etc.). Existing XML with different units won't be replaced, and the hardcoded unit="KiB" replacement assumes a specific variant.

Parsing the XML tree would safely preserve attributes and handle unit variations correctly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/repositories/libvirt/adapter.go` around lines 249 - 257, The
applyDomainResize function currently mutates domain XML with regex (see
applyDomainResize and the regexes reMemory, reCurrentMemory, reVCPU), losing
vCPU attributes (placement/cpuset) and mishandling memory units; replace the
regex approach by parsing the XML into an element tree (e.g., using Go's
encoding/xml or an XML DOM library), locate the <vcpu> element and update only
its inner text while preserving all attributes, locate <memory> and
<currentMemory> elements, detect/convert their existing unit attribute to the
desired KiB value (or set the unit attribute to "KiB" after converting) instead
of hardcoding unit="KiB", and then marshal the modified XML back to string;
update applyDomainResize to use these parsed-element updates and remove reliance
on reMemory/reCurrentMemory/reVCPU regex replacements.

Comment on lines +161 to +169
func (m *mockComputeBackendExtended) PauseInstance(ctx context.Context, id string) error {
return nil
}
func (m *mockComputeBackendExtended) ResumeInstance(ctx context.Context, id string) error {
return nil
}
func (m *mockComputeBackendExtended) ResizeInstance(ctx context.Context, id string, cpu, memory int64) error {
return nil
}

⚠️ Potential issue | 🟡 Minor

New mock methods silently succeed and can mask unexpected calls

At Lines 162, 165, and 168 the methods return nil without recording invocations. Wire them through m.Called(...) so tests can assert expected behavior and catch accidental usage.

Proposed fix
 func (m *mockComputeBackendExtended) PauseInstance(ctx context.Context, id string) error {
-	return nil
+	return m.Called(ctx, id).Error(0)
 }
 func (m *mockComputeBackendExtended) ResumeInstance(ctx context.Context, id string) error {
-	return nil
+	return m.Called(ctx, id).Error(0)
 }
 func (m *mockComputeBackendExtended) ResizeInstance(ctx context.Context, id string, cpu, memory int64) error {
-	return nil
+	return m.Called(ctx, id, cpu, memory).Error(0)
 }

As per coding guidelines, "**/*.go: Do not use silent failures - avoid blank identifier assignment like _ = someFunc()".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/workers/pipeline_worker_test.go` around lines 161 - 169, The three
mock methods on mockComputeBackendExtended (PauseInstance, ResumeInstance,
ResizeInstance) currently return nil and silently succeed; update each to record
the call via m.Called(...) (e.g., for ResizeInstance pass ctx, id, cpu, memory)
and return the mocked error/result from the call (use args.Error(0) /
args.Get(0) as appropriate) so tests can assert invocations and errors instead
of masking unexpected calls.

poyrazK added 2 commits April 28, 2026 17:46
Add DomainGetState check to PauseInstance and ResumeInstance in
the libvirt adapter, consistent with existing patterns in other
methods (ResizeInstance, stopDomainIfRunning). This prevents
silently succeeding when the domain is already in the target state.
- Bulkhead.acquire now returns ctx.Err() when context is cancelled,
  instead of always returning ErrBulkheadFull, making the error
  type distinguishable from capacity exhaustion
- ResumeInstance error log now explicitly notes the instance is left
  in PAUSED state so operators can detect and recover from partial
  failures
Copilot AI review requested due to automatic review settings April 28, 2026 15:05
Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

poyrazK added 2 commits April 29, 2026 18:32
Resolve merge conflicts:
- Add PauseInstance/ResumeInstance to NoopComputeBackend
- Remove duplicate ResizeInstance in docker/adapter.go
- Add PauseInstance/ResumeInstance to ResilientCompute
- Add PauseInstance/ResumeInstance to mockCompute test struct
- Remove duplicate ResizeInstance in mock_compute_test.go
- Fix bulkhead test to expect ctx.Err() instead of ErrBulkheadFull
- Add domainStatePaused constant to libvirt adapter
- Update bulkhead acquire to return ctx.Err() on context cancellation
…tations

- Remove duplicate ResizeInstance from firecracker/adapter.go (origin/main has correct version)
- Remove ErrNotSupported from firecracker PauseInstance/ResumeInstance (no-op backend)
- Ensure firecracker adapter_noop.go is correct
Copilot AI review requested due to automatic review settings April 29, 2026 15:39
Copilot AI left a comment


Pull request overview

Copilot reviewed 27 out of 27 changed files in this pull request and generated 9 comments.



Comment on lines +181 to +223
// PauseInstance suspends a running domain (freezes CPU, retains memory/network).
func (a *LibvirtAdapter) PauseInstance(ctx context.Context, id string) error {
dom, err := a.client.DomainLookupByName(ctx, id)
if err != nil {
return fmt.Errorf(errDomainNotFound, err)
}

state, _, err := a.client.DomainGetState(ctx, dom, 0)
if err != nil {
return fmt.Errorf("failed to get domain state: %w", err)
}
if state != domainStateRunning {
return fmt.Errorf("domain is not running (state=%d), cannot pause", state)
}

if err := a.client.DomainSuspend(ctx, dom); err != nil {
return fmt.Errorf("failed to suspend domain: %w", err)
}
a.logger.Info("domain paused", "domain", id)
return nil
}

// ResumeInstance resumes a paused domain back to running state.
func (a *LibvirtAdapter) ResumeInstance(ctx context.Context, id string) error {
dom, err := a.client.DomainLookupByName(ctx, id)
if err != nil {
return fmt.Errorf(errDomainNotFound, err)
}

state, _, err := a.client.DomainGetState(ctx, dom, 0)
if err != nil {
return fmt.Errorf("failed to get domain state: %w", err)
}
if state != domainStatePaused {
return fmt.Errorf("domain is not paused (state=%d), cannot resume", state)
}

if err := a.client.DomainResume(ctx, dom); err != nil {
return fmt.Errorf("failed to resume domain: %w", err)
}
a.logger.Info("domain resumed", "domain", id)
return nil
}
Comment thread internal/repositories/docker/adapter.go Outdated
ContainerExecInspect(ctx context.Context, execID string) (container.ExecInspect, error)
ContainerRename(ctx context.Context, containerID string, newName string) error
ContainerUpdate(ctx context.Context, containerID string, updateConfig container.UpdateConfig) (container.UpdateResponse, error)
Comment on lines +150 to 154
func (b *NoopComputeBackend) Ping(ctx context.Context) error { return nil }
func (b *NoopComputeBackend) Type() string { return "noop" }
func (b *NoopComputeBackend) PauseInstance(ctx context.Context, id string) error { return nil }
func (b *NoopComputeBackend) ResumeInstance(ctx context.Context, id string) error { return nil }
func (b *NoopComputeBackend) ResizeInstance(ctx context.Context, id string, cpu, memory int64) error { return nil }
Comment thread internal/core/services/instance.go Outdated
Comment on lines +698 to +699
s.logger.Error("failed to resume container", "container_id", target, "error", err)
return errors.Wrap(errors.Internal, "failed to resume container", err)
Comment thread compute-roadmap.md Outdated
Comment on lines +7 to +16
| PR | Title | Implementation Status |
|----|-------|----------------------|
| PR 1 | Image Management | ✅ COMPLETED |
| PR 2 | Instance Resize/Scale | ✅ COMPLETED |
| PR 3 | Instance Pause/Resume | ❌ NOT IMPLEMENTED |
| PR 4 | Provision Retry & Error Handling | ❌ NOT IMPLEMENTED |
| PR 5 | Instance Tags | ❌ NOT IMPLEMENTED |
| PR 6 | WebSocket Progress Events | ⚠️ PARTIALLY IMPLEMENTED |
| PR 7 | Instance Snapshots | ❌ NOT IMPLEMENTED |
| PR 8 | Instance Live Migration | ❌ NOT IMPLEMENTED |
Comment on lines +183 to +209
dom, err := a.client.DomainLookupByName(ctx, id)
if err != nil {
return fmt.Errorf(errDomainNotFound, err)
}

state, _, err := a.client.DomainGetState(ctx, dom, 0)
if err != nil {
return fmt.Errorf("failed to get domain state: %w", err)
}
if state != domainStateRunning {
return fmt.Errorf("domain is not running (state=%d), cannot pause", state)
}

if err := a.client.DomainSuspend(ctx, dom); err != nil {
return fmt.Errorf("failed to suspend domain: %w", err)
}
a.logger.Info("domain paused", "domain", id)
return nil
}

// ResumeInstance resumes a paused domain back to running state.
func (a *LibvirtAdapter) ResumeInstance(ctx context.Context, id string) error {
dom, err := a.client.DomainLookupByName(ctx, id)
if err != nil {
return fmt.Errorf(errDomainNotFound, err)
}

Comment on lines +152 to +164
func (a *FirecrackerAdapter) PauseInstance(ctx context.Context, id string) error {
if a.cfg.MockMode {
return nil
}
return nil // Firecracker does not support pause/resume
}

func (a *FirecrackerAdapter) ResumeInstance(ctx context.Context, id string) error {
if a.cfg.MockMode {
return nil
}
return nil // Firecracker does not support pause/resume
}
Comment on lines 52 to +66
@@ -65,15 +63,15 @@ func (b *Bulkhead) acquire(ctx context.Context) error {
case <-timer.C:
return ErrBulkheadFull
case <-ctx.Done():
return ErrBulkheadFull
return ctx.Err()
Comment on lines +657 to +715
inst.Status = domain.StatusPaused
if err := s.repo.Update(ctx, inst); err != nil {
return err
}

if err := s.auditSvc.Log(ctx, inst.UserID, "instance.pause", "instance", inst.ID.String(), map[string]interface{}{
"name": inst.Name,
}); err != nil {
s.logger.Warn("failed to log audit event", "action", "instance.pause", "instance_id", inst.ID, "error", err)
}

platform.InstanceOperationsTotal.WithLabelValues("pause", "success").Inc()
s.logger.Info("instance paused", "instance_id", inst.ID)
return nil
}

// ResumeInstance resumes a paused instance back to running state.
func (s *InstanceService) ResumeInstance(ctx context.Context, idOrName string) error {
userID := appcontext.UserIDFromContext(ctx)
tenantID := appcontext.TenantIDFromContext(ctx)

if err := s.rbacSvc.Authorize(ctx, userID, tenantID, domain.PermissionInstanceUpdate, idOrName); err != nil {
return err
}

inst, err := s.GetInstance(ctx, idOrName)
if err != nil {
return err
}

if inst.Status != domain.StatusPaused {
return errors.New(errors.Conflict, "instance must be PAUSED to resume, got: "+string(inst.Status))
}

target := inst.ContainerID
if target == "" {
target = s.formatContainerName(inst.ID)
}

if err := s.compute.ResumeInstance(ctx, target); err != nil {
platform.InstanceOperationsTotal.WithLabelValues("resume", "failure").Inc()
s.logger.Error("failed to resume container", "container_id", target, "error", err)
return errors.Wrap(errors.Internal, "failed to resume container", err)
}

inst.Status = domain.StatusRunning
if err := s.repo.Update(ctx, inst); err != nil {
return err
}

if err := s.auditSvc.Log(ctx, inst.UserID, "instance.resume", "instance", inst.ID.String(), map[string]interface{}{
"name": inst.Name,
}); err != nil {
s.logger.Warn("failed to log audit event", "action", "instance.resume", "instance_id", inst.ID, "error", err)
}

platform.InstanceOperationsTotal.WithLabelValues("resume", "success").Inc()
s.logger.Info("instance resumed", "instance_id", inst.ID)
return nil
}
poyrazK added 2 commits April 29, 2026 19:54
- ResumeInstance: roll back to RUNNING state if backend resume fails,
  preventing instances from being left in inconsistent PAUSED state
- PauseInstance: detect and return Conflict error when backend reports
  state violation (instance not in RUNNING state) instead of Internal
- Add sentinel errors ErrInstanceNotPausable and ErrInstanceNotResumable
  for state-based failures in libvirt adapter

Fixes review findings from PR #320
Copilot AI review requested due to automatic review settings April 29, 2026 17:34
- Add pause and resume endpoint documentation to api-reference.md
- Update FEATURES.md compute section with pause/resume and resilient compute details
- Add Pause/Resume row to Docker vs Libvirt comparison table
Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

- TestInstanceHandlerPause: success case
- TestInstanceHandlerPauseNotFound: 404 case
- TestInstanceHandlerPauseConflict: 409 case for wrong state
- TestInstanceHandlerResume: success case
- TestInstanceHandlerResumeNotFound: 404 case
- TestInstanceHandlerResumeConflict: 409 case for wrong state
@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
internal/platform/bulkhead.go (1)

9-11: ⚠️ Potential issue | 🟡 Minor

Update comments to match the new error contract.

ErrBulkheadFull is now only returned on bulkhead wait-timeout, but Lines 9-11 and 41-44 still say that context expiry also returns ErrBulkheadFull. Please align the comments with the current behavior to prevent misuse.

✏️ Suggested doc-only patch
-// ErrBulkheadFull is returned when the bulkhead's concurrency limit is reached
-// and the caller's timeout/context expires before a slot opens.
+// ErrBulkheadFull is returned when the bulkhead's concurrency limit is reached
+// and the bulkhead wait timeout expires before a slot opens.
 var ErrBulkheadFull = errors.New("bulkhead: concurrency limit reached")

 // Execute runs fn within the bulkhead's concurrency limit.
-// If the bulkhead is full and the wait timeout (or context) expires,
-// ErrBulkheadFull is returned without calling fn.
+// If the caller context is canceled or its deadline is exceeded while waiting,
+// the corresponding context error is returned.
+// If the bulkhead wait timeout expires first, ErrBulkheadFull is returned
+// without calling fn.
 func (b *Bulkhead) Execute(ctx context.Context, fn func() error) error {

Also applies to: 41-44

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/platform/bulkhead.go` around lines 9 - 11, Update the comment for
ErrBulkheadFull to reflect the new contract: state that ErrBulkheadFull is
returned only when the bulkhead's wait times out (wait-timeout) and not when the
caller's context is cancelled or expires; remove or change any text that says
context expiry/cancellation also returns ErrBulkheadFull (including the other
comment block referencing the same behavior). Ensure the comment clearly
distinguishes context cancellation errors from ErrBulkheadFull so callers won't
mistake context expiry for a wait-timeout.
♻️ Duplicate comments (1)
internal/core/services/instance.go (1)

661-664: ⚠️ Potential issue | 🟠 Major

Add compensating backend rollback when DB update fails after successful pause/resume.

After a successful backend transition, returning on repo.Update failure can leave runtime and DB state inconsistent. Use best-effort inverse backend action before returning the DB error.

Also applies to: 712-715

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/core/services/instance.go` around lines 661 - 664, After performing
the backend state transition (e.g., setting inst.Status = domain.StatusPaused /
domain.StatusRunning) but before returning when s.repo.Update(ctx, inst) fails,
invoke the inverse backend operation as a best-effort compensating rollback
(e.g., call the service method that undoes the pause when repo.Update fails
after pausing, or undoes the resume when repo.Update fails after resuming),
swallow and log any errors from that rollback, then return the original DB
error; apply the same pattern for the other identical block (around the
inst.Status assignment and s.repo.Update at the second location).
🧹 Nitpick comments (1)
compute-roadmap.md (1)

90-90: Optional: Address markdown formatting issues.

Static analysis detected minor formatting issues that should be fixed for better markdown rendering:

  1. Tables missing surrounding blank lines (lines 90, 118, 213, 266, 321, 387, 464, 527): Add blank lines before and after each table per MD058.
  2. Fenced code block missing language specifier (line 378): Add a language identifier (e.g., text or leave empty as ) to the code fence.

These changes improve rendering consistency across different markdown parsers.

Also applies to: 118-118, 213-213, 266-266, 321-321, 387-387, 464-464, 527-527, 378-378

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@compute-roadmap.md` at line 90, Add blank lines before and after each
markdown table (for example the table starting with "| Action | File |") at the
occurrences noted (lines flagged in the review) so each table is separated by a
blank line above and below per MD058, and update the fenced code block at the
flagged location (the fence around the snippet at the review’s line 378) to
include a language specifier such as ```text (or an explicit empty fence ```) to
satisfy the code-block rule; make these edits across the noted occurrences
(tables at the flagged lines and the single fenced block) to fix rendering.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@compute-roadmap.md`:
- Line 11: The roadmap incorrectly marks "PR 3 | Instance Pause/Resume" as "❌
NOT IMPLEMENTED"; update the status table entry for PR 3 to "✅ COMPLETED" and
rework the section currently titled "What needs to be built" into "What was
built" for PR 3: list the implemented artifacts (StatusPaused in InstanceStatus
enum, POST /instances/:id/pause and POST /instances/:id/resume endpoints,
service-layer validation for RUNNING↔PAUSED transitions, ComputeBackend
implementations for Docker/Libvirt/Firecracker/Noop with
ContainerPause/ContainerUnpause and DomainSuspend/DomainResume, audit events,
resume rollback behavior, and tests) following the same structure/pattern used
for PR 1 and PR 2.

In `@docs/FEATURES.md`:
- Line 35: The docs state uses STARTING but the domain model defines
INITIALIZING; update the transition line to use INITIALIZING (so the sequence
reads INITIALIZING → RUNNING → STOPPED / PAUSED → DELETED) to match the instance
status constants declared in the domain (e.g., the INITIALIZING/ RUNNING /
STOPPED / PAUSED / DELETED status constants in instance.go); ensure the term
exactly matches the domain constant name to avoid contract drift.

In `@docs/swagger/swagger.json`:
- Around line 3969-4020: The pause/resume instance endpoint docs (e.g., the
"/instances/{id}/pause" POST operation and the corresponding
"/instances/{id}/resume" operation) are missing authentication/authorization
failure responses; add 401 and 403 response entries to each protected
operation’s "responses" object (using the same schema ref
"#/definitions/httputil.Response") so generated clients/docs include
unauthorized and forbidden cases.

In `@internal/core/services/instance.go`:
- Around line 700-709: Before calling s.compute.ResumeInstance capture the
instance's current status (e.g., oldStatus := inst.Status) and on resume failure
restore inst.Status = oldStatus instead of forcibly setting
domain.StatusRunning; then call s.repo.Update(ctx, inst) and include both the
resume error and any repo error in s.logger.Error; update any related
logs/metrics (platform.InstanceOperationsTotal) accordingly and return the
wrapped error as before.

In `@internal/handlers/instance_handler.go`:
- Around line 254-257: The Swagger comment blocks above the instance handler(s)
are missing a 400 response even though the handler returns errors.InvalidInput
when the request id is empty; update the comment blocks that currently list
`@Success` 200, `@Failure` 404, `@Failure` 409, `@Failure` 500 to also include `@Failure`
400 {object} httputil.Response so the generated docs reflect the
errors.InvalidInput path (apply the same change to the other similar comment
block that validates id and returns errors.InvalidInput).

In `@internal/repositories/libvirt/adapter.go`:
- Around line 217-219: The resume path currently returns a plain fmt.Errorf when
the domain is not in domainStatePaused; replace that plain error with a wrapped
typed error using apierrors.ErrInstanceNotResumable (mirroring the pause path)
so callers can detect Conflict; e.g. change the return to wrap the message with
apierrors.ErrInstanceNotResumable (using the project’s usual Wrapf/Wrap pattern)
and include the same descriptive text and state value.

---

Outside diff comments:
In `@internal/platform/bulkhead.go`:
- Around line 9-11: Update the comment for ErrBulkheadFull to reflect the new
contract: state that ErrBulkheadFull is returned only when the bulkhead's wait
times out (wait-timeout) and not when the caller's context is cancelled or
expires; remove or change any text that says context expiry/cancellation also
returns ErrBulkheadFull (including the other comment block referencing the same
behavior). Ensure the comment clearly distinguishes context cancellation errors
from ErrBulkheadFull so callers won't mistake context expiry for a wait-timeout.

---

Duplicate comments:
In `@internal/core/services/instance.go`:
- Around line 661-664: After performing the backend state transition (e.g.,
setting inst.Status = domain.StatusPaused / domain.StatusRunning) but before
returning when s.repo.Update(ctx, inst) fails, invoke the inverse backend
operation as a best-effort compensating rollback (e.g., call the service method
that undoes the pause when repo.Update fails after pausing, or undoes the resume
when repo.Update fails after resuming), swallow and log any errors from that
rollback, then return the original DB error; apply the same pattern for the
other identical block (around the inst.Status assignment and s.repo.Update at
the second location).

---

Nitpick comments:
In `@compute-roadmap.md`:
- Line 90: Add blank lines before and after each markdown table (for example the
table starting with "| Action | File |") at the occurrences noted (lines flagged
in the review) so each table is separated by a blank line above and below per
MD058, and update the fenced code block at the flagged location (the fence
around the snippet at the review’s line 378) to include a language specifier
such as ```text (or an explicit empty fence ```) to satisfy the code-block rule;
make these edits across the noted occurrences (tables at the flagged lines and
the single fenced block) to fix rendering.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 19fda502-3ac2-4bf7-9a7a-57b157afdcca

📥 Commits

Reviewing files that changed from the base of the PR and between b11d04d and a999e74.

📒 Files selected for processing (32)
  • compute-roadmap.md
  • docs/FEATURES.md
  • docs/api-reference.md
  • docs/guides/libvirt-backend.md
  • docs/swagger/docs.go
  • docs/swagger/swagger.json
  • docs/swagger/swagger.yaml
  • internal/api/setup/router.go
  • internal/core/ports/compute.go
  • internal/core/ports/instance.go
  • internal/core/services/instance.go
  • internal/core/services/instance_unit_test.go
  • internal/core/services/mock_compute_test.go
  • internal/errors/errors.go
  • internal/handlers/instance_handler.go
  • internal/handlers/instance_handler_test.go
  • internal/platform/bulkhead.go
  • internal/platform/bulkhead_test.go
  • internal/platform/resilient_compute.go
  • internal/platform/resilient_compute_test.go
  • internal/repositories/docker/adapter.go
  • internal/repositories/docker/fakes_test.go
  • internal/repositories/firecracker/adapter.go
  • internal/repositories/firecracker/adapter_noop.go
  • internal/repositories/k8s/kubeadm_provisioner_test.go
  • internal/repositories/k8s/mocks_test.go
  • internal/repositories/libvirt/adapter.go
  • internal/repositories/libvirt/lb_proxy_test.go
  • internal/repositories/noop/adapters.go
  • internal/workers/database_failover_worker_test.go
  • internal/workers/healing_worker_test.go
  • internal/workers/pipeline_worker_test.go
✅ Files skipped from review due to trivial changes (7)
  • internal/workers/pipeline_worker_test.go
  • internal/errors/errors.go
  • internal/platform/resilient_compute_test.go
  • docs/guides/libvirt-backend.md
  • internal/platform/bulkhead_test.go
  • docs/api-reference.md
  • internal/repositories/k8s/kubeadm_provisioner_test.go
🚧 Files skipped from review as they are similar to previous changes (10)
  • internal/repositories/docker/fakes_test.go
  • internal/repositories/libvirt/lb_proxy_test.go
  • internal/api/setup/router.go
  • internal/workers/database_failover_worker_test.go
  • internal/repositories/firecracker/adapter.go
  • internal/platform/resilient_compute.go
  • internal/repositories/firecracker/adapter_noop.go
  • internal/repositories/docker/adapter.go
  • internal/core/services/mock_compute_test.go
  • internal/core/services/instance_unit_test.go

- compute-roadmap.md: update PR 3 status to COMPLETED with implemented artifacts
- docs/FEATURES.md: fix instance state name from STARTING to INITIALIZING
- docs/swagger: add 400/401/403 responses to pause/resume endpoints
- internal/handlers/instance_handler.go: add 400/401/403 to swagger comments
- internal/core/services/instance.go:
  - resume rollback now preserves old status instead of hardcoding RUNNING
  - add compensating rollback when repo.Update fails after pause/resume
  - rollback now attempts both repo Update (to restore status) and backend undo
- internal/platform/bulkhead.go: clarify ErrBulkheadFull only on wait timeout
- internal/repositories/libvirt/adapter.go: add ErrInstanceNotResumable to resume path
- instance_unit_test.go: add RepoError_Rollback test cases for pause and resume

Fixes all code review findings.
Copilot AI review requested due to automatic review settings April 29, 2026 18:51

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 5

♻️ Duplicate comments (3)
internal/platform/bulkhead.go (1)

27-39: ⚠️ Potential issue | 🟠 Major

Wait-timeout contract is still inconsistent with constructor behavior.

Line 27 says WaitTimeout defaults to 5s, but Lines 31-39 do not apply that default, so zero means “wait on context only.” This mismatch can still cause unbounded waits when callers use a non-deadline context.

♻️ Proposed fix
+const (
+	defaultBulkheadMaxConc     = 10
+	defaultBulkheadWaitTimeout = 5 * time.Second
+)
+
 type BulkheadOpts struct {
 	Name        string        // Identifier for logging/metrics.
-	MaxConc     int           // Maximum concurrent requests. Default 10.
-	WaitTimeout time.Duration // How long to wait for a slot. Default 5s. 0 means use context deadline.
+	MaxConc     int           // Maximum concurrent requests. Default 10.
+	WaitTimeout time.Duration // How long to wait for a slot. Default 5s.
 }
 
 // NewBulkhead creates a new concurrency-limiting bulkhead.
 func NewBulkhead(opts BulkheadOpts) *Bulkhead {
 	if opts.MaxConc <= 0 {
-		opts.MaxConc = 10
+		opts.MaxConc = defaultBulkheadMaxConc
+	}
+	if opts.WaitTimeout <= 0 {
+		opts.WaitTimeout = defaultBulkheadWaitTimeout
 	}
 	return &Bulkhead{
 		name:    opts.Name,
 		sem:     make(chan struct{}, opts.MaxConc),
 		timeout: opts.WaitTimeout,
 	}
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/platform/bulkhead.go` around lines 27 - 39, The WaitTimeout field on
BulkheadOpts is documented to default to 5s but NewBulkhead currently treats
zero as “use context deadline”; update NewBulkhead to apply the documented
default: if opts.WaitTimeout == 0 set it to 5*time.Second before assigning to
the Bulkhead.timeout field (and ensure the comment on WaitTimeout matches this
behavior); reference the WaitTimeout field on BulkheadOpts, the NewBulkhead
function, and the Bulkhead.timeout field when making the change.
internal/core/services/instance_unit_test.go (1)

304-304: ⚠️ Potential issue | 🟡 Minor

Keep these GetByName expectations required.

TerminateInstance and Exec still try GetByName before falling back to GetByID for UUID inputs. Leaving these expectations optional lets that resolution path disappear without failing the tests.

Also applies to: 345-356, 676-676, 731-731, 781-781, 836-836, 896-896, 956-956

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/core/services/instance_unit_test.go` at line 304, The mock
expectation for repo.On("GetByName", mock.Anything, instanceID.String()) is
marked optional with .Maybe(), which allows the name-resolution path to
disappear and hide regressions; change those expectations to required (remove
.Maybe() and use the default required expectation or call .Once()/.Times(1)) so
TerminateInstance and Exec tests must exercise GetByName before they fall back
to GetByID; update the same pattern at the other occurrences that reference
repo.On("GetByName", mock.Anything, ...) in the test (lines noted in the review)
to ensure the name lookup path is enforced.
internal/core/services/instance.go (1)

711-721: ⚠️ Potential issue | 🟠 Major

Don't collapse backend resume conflicts into 500s.

ComputeBackend.ResumeInstance can return a Conflict-typed error such as internal/errors.ErrInstanceNotResumable. This branch always wraps Internal, so /instances/{id}/resume returns 500 instead of the documented 409.

🧩 Suggested fix
 	if err := s.compute.ResumeInstance(ctx, target); err != nil {
 		platform.InstanceOperationsTotal.WithLabelValues("resume", "failure").Inc()
+		if errors.Is(err, errors.Conflict) {
+			s.logger.Warn("resume not possible in current state", "container_id", target, "error", err)
+			return errors.New(errors.Conflict, err.Error())
+		}
 		oldStatus := inst.Status
 		s.logger.Error("failed to resume container, rolling back",
 			"container_id", target, "instance_id", inst.ID, "old_status", oldStatus, "error", err)
 		inst.Status = oldStatus

Please add a unit case for a Conflict returned from ResumeInstance so this stays covered.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/core/services/instance.go` around lines 711 - 721, The resume error
handling currently always wraps errors.Internal, collapsing conflict errors into
500s; change the s.compute.ResumeInstance error branch to detect conflict-type
errors (e.g., internal/errors.ErrInstanceNotResumable or using errors.Is/As) and
handle them specially: increment the platform.InstanceOperationsTotal metric
with "resume","conflict", log a clear conflict message including "container_id"
and "instance_id", set inst.Status back to oldStatus and attempt repo.Update as
before, but return a wrapped conflict error (not errors.Internal). Also add a
unit test case for s.compute.ResumeInstance returning the conflict error to
assert a 409-equivalent response and the "conflict" metric increment.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@compute-roadmap.md`:
- Around line 90-98: Several tables in the markdown are missing surrounding
blank lines which triggers MD058; locate each table starting with the header row
"| Action | File |" (and other similar pipe-delimited tables) and add a single
blank line before and after each table block (e.g., the tables near the shown
ranges and any other occurrences), ensuring there is an empty line above the
header row and an empty line after the final table row so markdownlint MD058 is
satisfied.
- Around line 24-50: The PR summary is inconsistent: the top section claims
ImportImage and six methods (including ImportImage) are implemented, but the
"What exists" list shows only five methods and ImageHandler with five endpoints
while "What needs to be built" still lists ImportImage; align the document by
choosing the true current state and updating the lists accordingly—either remove
ImportImage from the initial "All features implemented" list or update "What
exists" to include ImportImage across domain.Image, ports.ImageService (include
ImportImage), internal/core/services/image.go (ImageService implementation
containing ImportImage), internal/handlers/image_handler.go (ImageHandler with
POST /api/v1/images/import), and tests (TestImageService_Unit); also ensure RBAC
and router notes (PermissionImageCreate, PermissionImageRead*, router wiring)
are consistent with the chosen state.
- Around line 334-340: The fenced code block using triple backticks in the
Markdown (the block that starts with ``` and contains the flow "LaunchInstance →
enqueue job → return 202" through "Client connects to
/ws/instances/{id}/events") needs a language identifier to satisfy markdownlint
MD040; update the opening fence from ``` to ```text (or another appropriate
language) so the block becomes a labeled fenced code block.

In `@internal/core/services/instance_unit_test.go`:
- Around line 982-1155: The test testInstanceServicePauseResumeUnit reuses the
same mock objects across subtests which accumulates expectations; refactor into
a table-driven test where each table entry describes the case (name, initial
instance state, expected compute/repo/rbac/audit behaviors and expected
error/state), and inside the loop create fresh mocks (new(MockInstanceRepo),
new(MockComputeBackend), new(MockRBACService), new(MockAuditService)), construct
svc with those mocks, set up per-case repo.On/compute.On/rbacSvc.On/auditSvc.On
expectations (using Once/Maybe as appropriate), call svc.PauseInstance or
svc.ResumeInstance, and assert results; ensure each case sets its own instance
value (domain.Instance) and that mock.AssertExpectationsForObjects is called for
the mocks created in that iteration to guarantee isolation and avoid cross-test
expectation leakage.

In `@internal/core/services/instance.go`:
- Around line 663-673: The repo update rollback currently restores inst.Status
(oldStatus) before attempting the compute backend compensating call, which can
leave DB and backend inconsistent; change the order in both rollback branches so
you first call the compensating backend method (use
s.compute.ResumeInstance(ctx, target) in the pause-failure path and
s.compute.PauseInstance(ctx, target) in the resume-failure path), check/log any
backend error, and only after a successful or best-effort backend call attempt
to write the old status back via s.repo.Update(ctx, inst) (use the existing
error logging pattern for repo update failures); apply the same ordering change
to the symmetric rollback block that uses s.compute.PauseInstance and
s.repo.Update.
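The ordering change asked for above can be modeled minimally: when the repo update fails after a successful backend pause, undo the backend operation first, then best-effort restore the old status. All types here are hypothetical stand-ins for the service's real repo and compute backend.

```go
package main

import (
	"errors"
	"fmt"
)

type instance struct{ Status string }

type backend struct{ paused bool }

func (b *backend) Pause() error  { b.paused = true; return nil }
func (b *backend) Resume() error { b.paused = false; return nil }

type repo struct{ failNext bool }

func (r *repo) Update(inst *instance) error {
	if r.failNext {
		r.failNext = false
		return errors.New("repo: update failed")
	}
	return nil
}

func pauseInstance(b *backend, r *repo, inst *instance) error {
	oldStatus := inst.Status
	if err := b.Pause(); err != nil {
		return err
	}
	inst.Status = "PAUSED"
	if err := r.Update(inst); err != nil {
		// Compensate on the backend FIRST, so the real system and the DB
		// cannot diverge if the restore write below also fails.
		if undoErr := b.Resume(); undoErr != nil {
			fmt.Println("rollback: backend resume failed:", undoErr)
		}
		// Then best-effort restore of the old status in the repo.
		inst.Status = oldStatus
		if repoErr := r.Update(inst); repoErr != nil {
			fmt.Println("rollback: repo restore failed:", repoErr)
		}
		return err
	}
	return nil
}

func main() {
	b, r := &backend{}, &repo{failNext: true}
	inst := &instance{Status: "RUNNING"}
	err := pauseInstance(b, r, inst)
	fmt.Println(err, b.paused, inst.Status)
}
```

The symmetric resume-failure path swaps the compensating call (`Pause` instead of `Resume`) but keeps the same ordering: backend undo before the status write.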


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 66a88b06-6736-4d62-b629-2e550acf8a50

📥 Commits

Reviewing files that changed from the base of the PR and between a999e74 and 5010b8e.

📒 Files selected for processing (11)
  • compute-roadmap.md
  • docs/FEATURES.md
  • docs/swagger/docs.go
  • docs/swagger/swagger.json
  • docs/swagger/swagger.yaml
  • internal/core/services/instance.go
  • internal/core/services/instance_unit_test.go
  • internal/handlers/instance_handler.go
  • internal/handlers/instance_handler_test.go
  • internal/platform/bulkhead.go
  • internal/repositories/libvirt/adapter.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • docs/FEATURES.md
  • internal/handlers/instance_handler_test.go
  • docs/swagger/swagger.yaml


Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

poyrazK added 3 commits April 30, 2026 12:26
- bulkhead: apply 5s default WaitTimeout when zero (matches docs)
- instance: add conflict detection in ResumeInstance (matches PauseInstance)
- instance: fix rollback order - compensating backend call first, then repo
- instance_unit_test: refactor pause/resume tests to table-driven with fresh
  mocks per case; add ResumeInstance_ConflictError test case
- docker adapter: fix indentation regression in ContainerUpdate interface method
- libvirt adapter: improve error messages with human-readable state names
  instead of raw numeric values (e.g., "domain is PAUSED" not "domain is 3")
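The state-name improvement in the last commit can be sketched as a simple lookup. The numeric values below mirror libvirt's `virDomainState` enum (e.g. `VIR_DOMAIN_PAUSED == 3`); the map and `stateName` helper are illustrative, not the adapter's actual code.

```go
package main

import "fmt"

// domainStateNames maps libvirt's virDomainState values to readable names.
var domainStateNames = map[int]string{
	0: "NOSTATE",
	1: "RUNNING",
	2: "BLOCKED",
	3: "PAUSED",
	4: "SHUTDOWN",
	5: "SHUTOFF",
	6: "CRASHED",
	7: "PMSUSPENDED",
}

// stateName returns a human-readable name for a domain state, falling back
// to the raw value for states this table does not know about.
func stateName(s int) string {
	if name, ok := domainStateNames[s]; ok {
		return name
	}
	return fmt.Sprintf("UNKNOWN(%d)", s)
}

func main() {
	// "domain is PAUSED" instead of "domain is 3"
	fmt.Printf("cannot pause: domain is %s\n", stateName(3))
}
```

Error messages built this way stay meaningful in logs even for operators who do not have the libvirt enum memorized.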
Copilot AI review requested due to automatic review settings April 30, 2026 10:03
Fixes lint failure:
  internal/core/services/instance_unit_test.go:1012:12: test helper
  function should start from t.Helper() (thelper)

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

poyrazK added 4 commits April 30, 2026 13:35
Fixes lint failure across all 9 assert closures in the pause/resume
table-driven test cases (lines 1012, 1028, 1047, 1071, 1095, 1111, 1131, 1148, 1172).
After merging main, testComputeBackend was missing the new interface methods.
Copilot AI review requested due to automatic review settings April 30, 2026 11:08

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

The previous log said "failed to resume container, rolling back"
which was misleading since no backend rollback occurs when
ResumeInstance fails - the container stays PAUSED. Updated to
"failed to resume container, instance left in PAUSED state".

@poyrazK poyrazK left a comment


It's ready to merge

@poyrazK poyrazK merged commit b4819c0 into main Apr 30, 2026
22 checks passed
