fix(orthrus): stop Dockhand flapping and wire uptime to session liveness#1039
Conversation
feat: Implement Proxy Groups for Better Organization
…dule entries in go.sum
…rsion 4.0.4, and other dependencies
…servers Uptime monitoring incorrectly reported Orthrus-managed remote servers as DOWN because SyncMonitors() created TCP monitors using the ExternalProxyPort (e.g. 2375) as the target on the remote machine. That port is bound on the Charon server's loopback, not on the remote machine, so every TCP dial resulted in ECONNREFUSED and a false failure. Once the host was marked DOWN, all co-located TCP monitors (e.g. Dockhand on port 3001) were short-circuited without being checked, cascading the false DOWN state across all services at that Tailscale IP. Introduces a first-class "orthrus" monitor type that replaces TCP dials with an in-memory WebSocket session state lookup. A monitor is UP if and only if the Orthrus agent's WebSocket tunnel is connected and its local Docker proxy listener is operational. No network I/O is required for the check. Adds a pre-flight fast-path in checkHost() that exits immediately for Orthrus-only hosts, eliminating wasted retry-loop sleep time proportional to MaxRetries × 2s per poll cycle. Existing stale TCP monitors for Orthrus servers are migrated automatically to type="orthrus" on the next SyncMonitors() run — no database migration required. Fixes uptime reporting for Orthrus remote server hosts and all co-located TCP service monitors (resolves Dockhand port 3001 false-DOWN cascade).
The Muzzle security filter was blocking all Docker client connections because /_ping was missing from the allowlist. Docker clients universally send GET /_ping or HEAD /_ping as the first request to verify daemon reachability — without it, every client reports a failed connection even though the proxy itself was running correctly. Added /_ping to the Muzzle path allowlist and added explicit HEAD method support for /_ping, which is the standard Docker client health check variant. All write operations remain blocked (POST/DELETE/PUT), and all paths not on the allowlist continue to return 403 Forbidden. This preserves Orthrus's read-only security posture while enabling Docker clients to establish connections.
The Orthrus agent's Muzzle filter did not include /_ping or /v*/_ping in its allowedPatterns, causing every Docker client health check from Dockhand to be blocked with a bare HTTP/1.1 403 (no Content-Type, Content-Length: 0). This was the root cause of Dockhand's failure to connect through the Charon proxy — the server-side Muzzle was already correct, but the agent-side Muzzle silently rejected the request before it reached the Docker socket. Agent Muzzle: - Add /_ping and /v*/_ping so Docker health checks pass end-to-end - Add /v*/volumes, /v*/volumes/*, /v*/networks, /v*/networks/* for full read-only Dockhand support (listing and inspection) Server Muzzle: - Add /events so Dockhand's ~3s event polling is no longer blocked at the first layer before the request even reaches the agent - Add /volumes and /networks to the exact-match set - Add path.Match pattern checking for /containers/*/json, /volumes/*, and /networks/* to handle dynamic Docker resource IDs Both Muzzles must remain in sync: the server strips the version prefix and uses exact + pattern matching; the agent uses path.Match with /v* prefixes.
…containers, images, volumes, networks, and system disk usage
The Performance Regression Check was failing with "could not read Username for 'https://github.com': terminal prompts disabled" on every PR build. For pull_request events github.sha resolves to the ephemeral merge commit GitHub creates at refs/pull/N/merge. This ref is not advertised in the normal ref listing, so git fetch --depth=1 origin <SHA> cannot resolve it. Git then attempts interactive credential prompting, which is disabled in CI, producing a misleading authentication error. Fixed by using github.event.pull_request.head.sha as the primary ref for PR events, which points to the actual branch HEAD commit that is directly fetchable. The workflow_run and push/dispatch fallbacks are unchanged.
Dockhand's Go subprocess worker sends Docker API requests without the version prefix (e.g., /containers/json instead of /v1.47/containers/json) and its Node.js events client sends bare /events. The agent-side muzzle only matched versioned /v*/... patterns, so every container list poll and event stream subscription was blocked with 403. Adds unversioned equivalents for all 15 Docker API path patterns in the agent muzzle, mirroring the allowlist already present on the server-side muzzle. Both forms are now accepted; write methods and unsupported paths remain blocked (fail-closed). Also adds 62-case muzzle unit test suite covering GET on every versioned and unversioned path, HEAD restricted to _ping only, all write methods blocked, and path.Clean traversal normalisation. Two Playwright specs (33 tests total) lock in regression coverage for the Orthrus agent management UI and Docker proxy connection states, running against mocked API responses so they are independent of a live tunnel.
…skip-TCP branch - Add TestHandleWebSocket_DisplacesExistingSession to cover the type-assertion block in HandleWebSocket that closes an old AgentSession when a new connection displaces it (server.go lines 98-100) - Add TestCheckHost_NonOrthrusMonitorNoPort_SkipsTCPDial to cover the early- return guard in checkHost when all non-orthrus monitors have no extractable port (uptime_service.go lines 611-612) - Add TestSyncMonitors_OrthrusRemoteServer_NilAgentUUID_SkipsMonitorCreation and TestCheckMonitor_OrthrusType_EmptyURL_ReturnsDown for orthrus edge cases Achieves 100% patch coverage on feature/hecate (25/25 changed lines)
…mination fix(orthrus): disable keep-alives in external proxy transport for better resource management test(orthrus): add tests for external proxy transport to verify keep-alive behavior
- Ensure pull request image reference is produced through a strict producer contract with deterministic matching and fail-fast behavior - Prevent downstream security scan execution when producer output is missing, so failures surface at the true origin - Remove order-dependent tag selection in artifact paths to avoid metadata ordering drift and intermittent CI regressions - Enforce PR vulnerability gate behavior while still preserving SARIF upload and traceability summaries for triage - Add focused QA reporting and manual validation scenarios to verify pull request, push, and workflow-run behavior remains correct
… version consistency
…nhanced reliability
Hardened the PR security scan flow to consume only the build output digest from the same workflow run. This removes mutable tag fallback behavior and prevents stale image scans when outputs are missing or delayed. Behavior changes: - PR scan now requires a non-empty, valid sha256 digest before scanning. - PR image reference is constructed strictly as an immutable digest reference. - Freshness validation now fails if image revision metadata is missing or does not match the expected PR head SHA. - PR scan summary now always records digest and revision label SHA for traceable provenance. Why this was necessary: - Previous fallback behavior could scan an older mutable PR tag, causing commit-to-image mismatch and misleading vulnerability results. - Failing early on contract violations is safer than silently scanning a potentially stale artifact. Side effects: - PR jobs now hard-fail when digest output propagation breaks, which is intentional to preserve scan integrity and reproducibility. - Non-PR scanning behavior remains unchanged.
Added a shared x/crypto version pin and applied it directly in the Caddy builder patch flow so Caddy artifacts are built with the intended patched crypto module version. Behavior changes: - Caddy builder now consumes a dedicated shared x/crypto pin argument. - Caddy Stage 2 dependency patching explicitly applies golang.org/x/crypto at the pinned version. - Resulting Caddy binary metadata now resolves x/crypto to the pinned target. Why this was necessary: - Security findings were showing crypto CVE drift despite existing patching in another component path. - Caddy and CrowdSec have independent dependency graphs, so each path must pin and verify security-sensitive modules explicitly. Important side effects: - Build behavior remains deterministic and scoped; CrowdSec logic is unchanged. - This improves reproducibility for scanner validation by ensuring Caddy binary provenance includes the expected crypto module version.
…cement Improved PR security scan observability by introducing a diagnostics step that parses SARIF output and reports unsuppressed HIGH/CRITICAL blocker identifiers before the enforcement gate runs. Behavior changes: - When PR image scan SARIF exists, the workflow now prints blocker IDs and a concise count in the job summary. - The summary now provides direct evidence of what is still blocking merge, reducing ambiguity during CI triage. Why this was necessary: - Recent runs failed the PR security gate without clear, immediate visibility into the exact unsuppressed findings. - Faster root-cause identification is required to distinguish real new regressions from expected temporary upstream exceptions. Important considerations: - Enforcement policy is unchanged; failures still block exactly as before. - Severity thresholds and ignore policy were not relaxed. - This is diagnostics-only hardening to make CI outcomes actionable.
Improve CI security-scan behavior by enforcing deterministic traceability output, making diagnostics resilient, and removing legacy contract ambiguity. - Traceability now executes the expected runtime binary directly and falls back safely when the absolute entrypoint is unavailable. - Diagnostics now run in a best-effort, non-blocking mode and always emit a summary section, including explicit fallback reasons when scan artifacts are missing or malformed. - Legacy output-contract semantics were removed in favor of digest/image reference language to prevent confusion and reduce false contract expectations. This change was necessary to stop noisy or misleading PR scan output, prevent diagnostic parser failures from obscuring true gate results, and keep the security gate decision path clear and reliable.--- fix: harden PR image scan traceability and diagnostics behavior Improve CI security-scan behavior by enforcing deterministic traceability output, making diagnostics resilient, and removing legacy contract ambiguity. - Traceability now executes the expected runtime binary directly and falls back safely when the absolute entrypoint is unavailable. - Diagnostics now run in a best-effort, non-blocking mode and always emit a summary section, including explicit fallback reasons when scan artifacts are missing or malformed. - Legacy output-contract semantics were removed in favor of digest/image reference language to prevent confusion and reduce false contract expectations. This change was necessary to stop noisy or misleading PR scan output, prevent diagnostic parser failures from obscuring true gate results, and keep the security gate decision path clear and reliable.
fix: harden PR Trivy SARIF diagnostics classification and parsing Improve PR image scan diagnostics so fallback states are precise and actionable without altering enforcement authority. - Added explicit fallback categories for missing, unreadable, invalid, schema, and parser-failure SARIF conditions - Standardized SARIF validity checks to treat runs-array payloads as parseable - Ensured valid scans with empty results report parsed count zero instead of degrading to fallback - Strengthened blocker identifier extraction to improve reliability across result/rule shapes - Preserved enforcement behavior so gate outcomes continue to depend only on the Trivy scan result and not diagnostics parsing This change was necessary to remove ambiguous fallback reporting, improve triage quality for CI failures, and keep security gate decisions deterministic and trustworthy.
Normalize PR image security scanning so table and SARIF paths evaluate the same vulnerability surface and produce consistent suppression behavior. - Enabled vulnerability scanner scope explicitly in both PR table and PR SARIF scan steps to prevent divergent results for the same image - Improved diagnostics fallback observability by surfacing parser exit code and first parser hint when SARIF parsing fails - Preserved enforcement authority so merge blocking remains controlled only by the Trivy blocking step outcome This change was necessary because repeated CI failures showed ignore rules were loaded, but scan outputs still diverged and diagnostics lacked actionable parser context, making it difficult to distinguish true unsuppressed blockers from parsing-path noise.
fix(orthrus): stop Dockhand flapping and wire uptime to session liveness
There was a problem hiding this comment.
Pull request overview
This pull request stabilizes Orthrus-backed Docker forwarding to prevent Dockhand online/offline flapping, and updates uptime semantics so Orthrus monitors reflect agent session liveness (not raw TCP reachability).
Changes:
- Harden Orthrus proxying (transport configuration + expanded Docker API allowlists) and improve agent-side Docker stream lifecycle handling to reduce timeout-driven disconnects.
- Wire uptime checks for
orthrusmonitor type to agent/session liveness via an injected Orthrus status resolver, with unit coverage for sync/check behavior. - Add Playwright coverage for Orthrus agents UI, Orthrus proxy-path behavior, and Orthrus uptime monitor UX/data flow.
Reviewed changes
Copilot reviewed 23 out of 24 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| backend/internal/orthrus/session.go | Configures the Orthrus external reverse-proxy transport to avoid keep-alive reuse that contributed to flapping. |
| backend/internal/orthrus/muzzle.go | Expands/normalizes Orthrus Docker API allowlists for forwarded Docker requests. |
| backend/internal/api/handlers/docker_handler.go | Adds Orthrus resolver wiring so Docker endpoints can route via Orthrus agent proxy state. |
| backend/internal/api/handlers/docker_handler_test.go | Adds regression/unit coverage for resolver wiring and Orthrus-specific Docker handler behavior (including typed-nil guard). |
| backend/internal/api/routes/routes.go | Wires Orthrus resolver injection from route setup into dependent services/handlers. |
| backend/internal/services/uptime_service.go | Introduces Orthrus status resolver integration so uptime uses session liveness for Orthrus monitors. |
| backend/internal/services/uptime_service_test.go | Adds unit tests for Orthrus monitor sync + check semantics and resolver edge cases. |
| agent/muzzle/muzzle.go | Adjusts agent-side proxy/muzzle behavior to better handle connection lifecycle and streaming. |
| agent/muzzle/muzzle_test.go | Adds/updates tests for allowlist filtering and edge cases like traversal normalization. |
| tests/uptime-orthrus.spec.ts | Adds E2E coverage for Orthrus uptime monitor display/state behavior. |
| tests/orthrus-proxy-paths.spec.ts | Adds E2E coverage for Orthrus Docker source selection and proxy-path-driven container listing UX. |
| tests/orthrus-agents.spec.ts | Adds E2E coverage for Orthrus agents UI rendering, accessible labels, and key interactions. |
| .trivyignore | Adds/updates vulnerability suppressions relevant to the updated runtime surface and CI scanning. |
| SECURITY.md | Updates the security ledger with current known/patched vulnerability tracking and review date. |
| Dockerfile | Adjusts build-time pinning/verification (including dependency version constraints) supporting stable releases and supply-chain hygiene. |
|
You are seeing this message because GitHub Code Scanning has recently been set up for this repository, or this pull request contains the workflow file for the Code Scanning tool. What Enabling Code Scanning Means:
For more information about GitHub Code Scanning, check out the documentation. |
✅ Supply Chain Verification Results✅ PASSED 📦 SBOM Summary
🔍 Vulnerability Scan
📎 Artifacts
Generated by Supply Chain Verification workflow • View Details |
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
Summary
This PR replaces the placeholder description with the actual Orthrus stabilization work and uptime wiring.
It addresses repeated Dockhand online/offline flapping caused by tunnel request timeouts and ensures Orthrus-backed uptime checks use agent session liveness instead of raw TCP reachability.
Root cause addressed
Dockhand polls every 30 seconds. External proxy keep-alive reuse combined with a long-lived Docker socket response path could leave subsequent requests blocked until timeout (
context deadline exceeded).What changed
Orthrus proxy and tunnel hardening
Agent-side Docker stream handling
Uptime service wiring
E2E coverage
Why this matters
Validation
Rollout notes
PR context
This branch also contains supporting hardening and regression coverage added while diagnosing and stabilizing Orthrus reconnect/session behavior.