fix(orthrus): stop Dockhand flapping and wire uptime to session liveness#1035
Conversation
feat: Implement Proxy Groups for Better Organization
|
You are seeing this message because GitHub Code Scanning has recently been set up for this repository, or this pull request contains the workflow file for the Code Scanning tool. What Enabling Code Scanning Means:
For more information about GitHub Code Scanning, check out the documentation. |
✅ Supply Chain Verification Results✅ PASSED 📦 SBOM Summary
🔍 Vulnerability Scan
📎 Artifacts
Generated by Supply Chain Verification workflow • View Details |
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
…dule entries in go.sum
…rsion 4.0.4, and other dependencies
…servers Uptime monitoring incorrectly reported Orthrus-managed remote servers as DOWN because SyncMonitors() created TCP monitors using the ExternalProxyPort (e.g. 2375) as the target on the remote machine. That port is bound on the Charon server's loopback, not on the remote machine, so every TCP dial resulted in ECONNREFUSED and a false failure. Once the host was marked DOWN, all co-located TCP monitors (e.g. Dockhand on port 3001) were short-circuited without being checked, cascading the false DOWN state across all services at that Tailscale IP. Introduces a first-class "orthrus" monitor type that replaces TCP dials with an in-memory WebSocket session state lookup. A monitor is UP if and only if the Orthrus agent's WebSocket tunnel is connected and its local Docker proxy listener is operational. No network I/O is required for the check. Adds a pre-flight fast-path in checkHost() that exits immediately for Orthrus-only hosts, eliminating wasted retry-loop sleep time proportional to MaxRetries × 2s per poll cycle. Existing stale TCP monitors for Orthrus servers are migrated automatically to type="orthrus" on the next SyncMonitors() run — no database migration required. Fixes uptime reporting for Orthrus remote server hosts and all co-located TCP service monitors (resolves Dockhand port 3001 false-DOWN cascade).
The Muzzle security filter was blocking all Docker client connections because /_ping was missing from the allowlist. Docker clients universally send GET /_ping or HEAD /_ping as the first request to verify daemon reachability — without it, every client reports a failed connection even though the proxy itself was running correctly. Added /_ping to the Muzzle path allowlist and added explicit HEAD method support for /_ping, which is the standard Docker client health check variant. All write operations remain blocked (POST/DELETE/PUT), and all paths not on the allowlist continue to return 403 Forbidden. This preserves Orthrus's read-only security posture while enabling Docker clients to establish connections.
The Orthrus agent's Muzzle filter did not include /_ping or /v*/_ping in its allowedPatterns, causing every Docker client health check from Dockhand to be blocked with a bare HTTP/1.1 403 (no Content-Type, Content-Length: 0). This was the root cause of Dockhand's failure to connect through the Charon proxy — the server-side Muzzle was already correct, but the agent-side Muzzle silently rejected the request before it reached the Docker socket. Agent Muzzle: - Add /_ping and /v*/_ping so Docker health checks pass end-to-end - Add /v*/volumes, /v*/volumes/*, /v*/networks, /v*/networks/* for full read-only Dockhand support (listing and inspection) Server Muzzle: - Add /events so Dockhand's ~3s event polling is no longer blocked at the first layer before the request even reaches the agent - Add /volumes and /networks to the exact-match set - Add path.Match pattern checking for /containers/*/json, /volumes/*, and /networks/* to handle dynamic Docker resource IDs Both Muzzles must remain in sync: the server strips the version prefix and uses exact + pattern matching; the agent uses path.Match with /v* prefixes.
…containers, images, volumes, networks, and system disk usage
The Performance Regression Check was failing with "could not read Username for 'https://github.com': terminal prompts disabled" on every PR build. For pull_request events github.sha resolves to the ephemeral merge commit GitHub creates at refs/pull/N/merge. This ref is not advertised in the normal ref listing, so git fetch --depth=1 origin <SHA> cannot resolve it. Git then attempts interactive credential prompting, which is disabled in CI, producing a misleading authentication error. Fixed by using github.event.pull_request.head.sha as the primary ref for PR events, which points to the actual branch HEAD commit that is directly fetchable. The workflow_run and push/dispatch fallbacks are unchanged.
Dockhand's Go subprocess worker sends Docker API requests without the version prefix (e.g., /containers/json instead of /v1.47/containers/json) and its Node.js events client sends bare /events. The agent-side muzzle only matched versioned /v*/... patterns, so every container list poll and event stream subscription was blocked with 403. Adds unversioned equivalents for all 15 Docker API path patterns in the agent muzzle, mirroring the allowlist already present on the server-side muzzle. Both forms are now accepted; write methods and unsupported paths remain blocked (fail-closed). Also adds 62-case muzzle unit test suite covering GET on every versioned and unversioned path, HEAD restricted to _ping only, all write methods blocked, and path.Clean traversal normalisation. Two Playwright specs (33 tests total) lock in regression coverage for the Orthrus agent management UI and Docker proxy connection states, running against mocked API responses so they are independent of a live tunnel.
…skip-TCP branch - Add TestHandleWebSocket_DisplacesExistingSession to cover the type-assertion block in HandleWebSocket that closes an old AgentSession when a new connection displaces it (server.go lines 98-100) - Add TestCheckHost_NonOrthrusMonitorNoPort_SkipsTCPDial to cover the early- return guard in checkHost when all non-orthrus monitors have no extractable port (uptime_service.go lines 611-612) - Add TestSyncMonitors_OrthrusRemoteServer_NilAgentUUID_SkipsMonitorCreation and TestCheckMonitor_OrthrusType_EmptyURL_ReturnsDown for orthrus edge cases Achieves 100% patch coverage on feature/hecate (25/25 changed lines)
…mination fix(orthrus): disable keep-alives in external proxy transport for better resource management test(orthrus): add tests for external proxy transport to verify keep-alive behavior
- Ensure pull request image reference is produced through a strict producer contract with deterministic matching and fail-fast behavior - Prevent downstream security scan execution when producer output is missing, so failures surface at the true origin - Remove order-dependent tag selection in artifact paths to avoid metadata ordering drift and intermittent CI regressions - Enforce PR vulnerability gate behavior while still preserving SARIF upload and traceability summaries for triage - Add focused QA reporting and manual validation scenarios to verify pull request, push, and workflow-run behavior remains correct
… version consistency
…nhanced reliability
Hardened the PR security scan flow to consume only the build output digest from the same workflow run. This removes mutable tag fallback behavior and prevents stale image scans when outputs are missing or delayed. Behavior changes: - PR scan now requires a non-empty, valid sha256 digest before scanning. - PR image reference is constructed strictly as an immutable digest reference. - Freshness validation now fails if image revision metadata is missing or does not match the expected PR head SHA. - PR scan summary now always records digest and revision label SHA for traceable provenance. Why this was necessary: - Previous fallback behavior could scan an older mutable PR tag, causing commit-to-image mismatch and misleading vulnerability results. - Failing early on contract violations is safer than silently scanning a potentially stale artifact. Side effects: - PR jobs now hard-fail when digest output propagation breaks, which is intentional to preserve scan integrity and reproducibility. - Non-PR scanning behavior remains unchanged.
Added a shared x/crypto version pin and applied it directly in the Caddy builder patch flow so Caddy artifacts are built with the intended patched crypto module version. Behavior changes: - Caddy builder now consumes a dedicated shared x/crypto pin argument. - Caddy Stage 2 dependency patching explicitly applies golang.org/x/crypto at the pinned version. - Resulting Caddy binary metadata now resolves x/crypto to the pinned target. Why this was necessary: - Security findings were showing crypto CVE drift despite existing patching in another component path. - Caddy and CrowdSec have independent dependency graphs, so each path must pin and verify security-sensitive modules explicitly. Important side effects: - Build behavior remains deterministic and scoped; CrowdSec logic is unchanged. - This improves reproducibility for scanner validation by ensuring Caddy binary provenance includes the expected crypto module version.
…cement Improved PR security scan observability by introducing a diagnostics step that parses SARIF output and reports unsuppressed HIGH/CRITICAL blocker identifiers before the enforcement gate runs. Behavior changes: - When PR image scan SARIF exists, the workflow now prints blocker IDs and a concise count in the job summary. - The summary now provides direct evidence of what is still blocking merge, reducing ambiguity during CI triage. Why this was necessary: - Recent runs failed the PR security gate without clear, immediate visibility into the exact unsuppressed findings. - Faster root-cause identification is required to distinguish real new regressions from expected temporary upstream exceptions. Important considerations: - Enforcement policy is unchanged; failures still block exactly as before. - Severity thresholds and ignore policy were not relaxed. - This is diagnostics-only hardening to make CI outcomes actionable.
Improve CI security-scan behavior by enforcing deterministic traceability output, making diagnostics resilient, and removing legacy contract ambiguity. - Traceability now executes the expected runtime binary directly and falls back safely when the absolute entrypoint is unavailable. - Diagnostics now run in a best-effort, non-blocking mode and always emit a summary section, including explicit fallback reasons when scan artifacts are missing or malformed. - Legacy output-contract semantics were removed in favor of digest/image reference language to prevent confusion and reduce false contract expectations. This change was necessary to stop noisy or misleading PR scan output, prevent diagnostic parser failures from obscuring true gate results, and keep the security gate decision path clear and reliable.--- fix: harden PR image scan traceability and diagnostics behavior Improve CI security-scan behavior by enforcing deterministic traceability output, making diagnostics resilient, and removing legacy contract ambiguity. - Traceability now executes the expected runtime binary directly and falls back safely when the absolute entrypoint is unavailable. - Diagnostics now run in a best-effort, non-blocking mode and always emit a summary section, including explicit fallback reasons when scan artifacts are missing or malformed. - Legacy output-contract semantics were removed in favor of digest/image reference language to prevent confusion and reduce false contract expectations. This change was necessary to stop noisy or misleading PR scan output, prevent diagnostic parser failures from obscuring true gate results, and keep the security gate decision path clear and reliable.
fix: harden PR Trivy SARIF diagnostics classification and parsing Improve PR image scan diagnostics so fallback states are precise and actionable without altering enforcement authority. - Added explicit fallback categories for missing, unreadable, invalid, schema, and parser-failure SARIF conditions - Standardized SARIF validity checks to treat runs-array payloads as parseable - Ensured valid scans with empty results report parsed count zero instead of degrading to fallback - Strengthened blocker identifier extraction to improve reliability across result/rule shapes - Preserved enforcement behavior so gate outcomes continue to depend only on the Trivy scan result and not diagnostics parsing This change was necessary to remove ambiguous fallback reporting, improve triage quality for CI failures, and keep security gate decisions deterministic and trustworthy.
Normalize PR image security scanning so table and SARIF paths evaluate the same vulnerability surface and produce consistent suppression behavior. - Enabled vulnerability scanner scope explicitly in both PR table and PR SARIF scan steps to prevent divergent results for the same image - Improved diagnostics fallback observability by surfacing parser exit code and first parser hint when SARIF parsing fails - Preserved enforcement authority so merge blocking remains controlled only by the Trivy blocking step outcome This change was necessary because repeated CI failures showed ignore rules were loaded, but scan outputs still diverged and diagnostics lacked actionable parser context, making it difficult to distinguish true unsuppressed blockers from parsing-path noise.
There was a problem hiding this comment.
Pull request overview
This PR stabilizes Orthrus Docker forwarding (reducing Dockhand online/offline flapping) and updates uptime monitoring so Orthrus-backed targets report availability based on agent session liveness rather than raw TCP reachability. It also adds substantial regression coverage across backend, agent, and Playwright E2E paths.
Changes:
- Harden Orthrus session/proxy behavior (disable external-proxy keep-alives, displace old sessions on reconnect, prevent stale heartbeat goroutines from evicting new sessions, expand Docker API allowlists).
- Wire Orthrus session resolution into
UptimeServiceand introduce an"orthrus"monitor type checked via live session state. - Add/extend tests (Go unit tests + Playwright suites) for Orthrus agents, proxy paths, and uptime UI/behavior; plus supporting security/build metadata tweaks.
Reviewed changes
Copilot reviewed 23 out of 24 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/uptime-orthrus.spec.ts | Playwright coverage for Orthrus uptime monitor UI/status semantics. |
| tests/orthrus-proxy-paths.spec.ts | Playwright coverage for Orthrus remote-server selection and Docker source behavior in ProxyHostForm. |
| tests/orthrus-agents.spec.ts | Playwright coverage for Orthrus Agent Manager UI states and accessibility. |
| SECURITY.md | Updates “Last reviewed” and adds a new known vulnerability entry. |
| scripts/caddy-compat-matrix.sh | Bumps default candidate Caddy version used for compatibility matrix runs. |
| docs/reports/qa_report.md | Rewrites QA report content (currently workflow-scoped content). |
| Dockerfile | Adjusts Go module pinning/patching in the Caddy build flow and adds an assertion for resolved Caddy version. |
| backend/internal/services/uptime_service.go | Adds Orthrus resolver injection and implements "orthrus" monitor behavior and TCP pre-check bypass rules. |
| backend/internal/services/uptime_service_test.go | Adds unit tests for Orthrus monitor sync/check behavior and host pre-check logic. |
| backend/internal/orthrus/session.go | Disables keep-alive reuse for the external reverse-proxy transport. |
| backend/internal/orthrus/session_test.go | Regression test verifying keep-alives are not reused by the external proxy. |
| backend/internal/orthrus/server.go | Fixes session replacement + heartbeat race behavior (displacement before binding; CompareAndDelete on heartbeat). |
| backend/internal/orthrus/server_test.go | Regression tests covering stale heartbeat goroutine behavior and current-session eviction/offline marking. |
| backend/internal/orthrus/server_coverage_test.go | Coverage test ensuring reconnect displaces (closes) the prior session. |
| backend/internal/orthrus/muzzle.go | Expands Docker API allowlists (static + dynamic paths) and allows HEAD /_ping. |
| backend/internal/orthrus/muzzle_test.go | Tests for new allowlist paths, HEAD /_ping, and dynamic-path matching. |
| backend/internal/api/routes/routes.go | Injects Orthrus resolver into UptimeService from route setup when Orthrus is enabled. |
| agent/muzzle/muzzle.go | Expands agent-side Docker allowlist and forces Connection: close to avoid hanging on keep-alives. |
| agent/muzzle/muzzle_test.go | Adds allowlist + proxy lifecycle/streaming tests for agent muzzle behavior. |
| .trivyignore | Adds suppression entry for the new CVE/GHSA noted in SECURITY.md. |
| .gitignore | Ignores new .tmp/ artifacts related to caddy pin cleanup. |
| .github/workflows/benchmark.yml | Ensures checkout uses PR head SHA for pull_request events (not ephemeral merge SHA). |
Summary
This PR replaces the placeholder description with the actual Orthrus stabilization work and uptime wiring.
It addresses repeated Dockhand online/offline flapping caused by tunnel request timeouts and ensures Orthrus-backed uptime checks use agent session liveness instead of raw TCP reachability.
Root cause addressed
Dockhand polls every 30 seconds. External proxy keep-alive reuse combined with a long-lived Docker socket response path could leave subsequent requests blocked until timeout (
context deadline exceeded).What changed
Orthrus proxy and tunnel hardening
Agent-side Docker stream handling
Uptime service wiring
E2E coverage
Why this matters
Validation
Rollout notes
PR context
This branch also contains supporting hardening and regression coverage added while diagnosing and stabilizing Orthrus reconnect/session behavior.