Skip to content

fix(orthrus): stop Dockhand flapping and wire uptime to session liveness#1035

Merged
Wikid82 merged 38 commits into
developmentfrom
feature/hecate
May 25, 2026
Merged

fix(orthrus): stop Dockhand flapping and wire uptime to session liveness#1035
Wikid82 merged 38 commits into
developmentfrom
feature/hecate

Conversation

@Wikid82
Copy link
Copy Markdown
Owner

@Wikid82 Wikid82 commented May 23, 2026

Summary

This PR replaces the placeholder description with the actual Orthrus stabilization work and uptime wiring.

It addresses repeated Dockhand online/offline flapping caused by tunnel request timeouts and ensures Orthrus-backed uptime checks use agent session liveness instead of raw TCP reachability.

Root cause addressed

Dockhand polls every 30 seconds. External proxy keep-alive reuse combined with a long-lived Docker socket response path could leave subsequent requests blocked until timeout (context deadline exceeded).

What changed

Orthrus proxy and tunnel hardening

  • Disabled keep-alive reuse on the external reverse-proxy transport used for Orthrus Docker forwarding.
  • Expanded and normalized Orthrus Docker API allowlists (versioned and unversioned paths, dynamic container/network/volume paths, ping handling).
  • Added regression coverage around session displacement and heartbeat race behavior.

Agent-side Docker stream handling

  • Updated the agent-side muzzle/proxy behavior and tests to better cover connection lifecycle, request filtering, and streaming edge cases.

Uptime service wiring

  • Added an Orthrus status resolver interface in uptime service.
  • Wired resolver injection from route setup so uptime checks can evaluate Orthrus monitors by agent/session state.
  • Added unit tests for Orthrus monitor sync and check behavior.

E2E coverage

  • Added Playwright suites for Orthrus agents, Orthrus proxy path behavior, and Orthrus uptime monitor UX/data flow.

Why this matters

  • Stops false DOWN transitions caused by tunnel-path timeouts.
  • Reduces noisy/incorrect availability notifications.
  • Aligns uptime semantics for Orthrus-managed targets with how the tunnel actually works.

Validation

  • Backend unit/integration tests updated and passing for affected Orthrus and uptime paths.
  • New and existing Playwright scenarios cover Orthrus agent visibility, proxy-path resolution, and uptime monitor behavior.

Rollout notes

  • Server-side changes improve behavior immediately after deploy.
  • Agent-side improvements require rolling out the rebuilt Orthrus agent image to HomeLab to take full effect.

PR context

This branch also contains supporting hardening and regression coverage added while diagnosing and stabilizing Orthrus reconnect/session behavior.

@Wikid82 Wikid82 self-assigned this May 23, 2026
@Wikid82 Wikid82 added this to Charon May 23, 2026
@github-project-automation github-project-automation Bot moved this to Backlog in Charon May 23, 2026
@Wikid82 Wikid82 moved this from Backlog to In Progress in Charon May 23, 2026
@github-advanced-security
Copy link
Copy Markdown
Contributor

You are seeing this message because GitHub Code Scanning has recently been set up for this repository, or this pull request contains the workflow file for the Code Scanning tool.

What Enabling Code Scanning Means:

  • The 'Security' tab will display more code scanning analysis results (e.g., for the default branch).
  • Depending on your configuration and choice of analysis tool, future pull requests will be annotated with code scanning analysis results.
  • You will be able to see the analysis results for the pull request's branch on this overview once the scans have completed and the checks have passed.

For more information about GitHub Code Scanning, check out the documentation.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 23, 2026

✅ Supply Chain Verification Results

PASSED

📦 SBOM Summary

  • Components: 1487

🔍 Vulnerability Scan

Severity Count
🔴 Critical 0
🟠 High 0
🟡 Medium 4
🟢 Low 2
Total 6

📎 Artifacts

  • SBOM (CycloneDX JSON) and Grype results available in workflow artifacts

Generated by Supply Chain Verification workflow • View Details

@codecov
Copy link
Copy Markdown

codecov Bot commented May 23, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@Wikid82 Wikid82 changed the title fix: Orthrus uptime reporting down fix(uptime): add OrthrusStatusChecker interface and DI wiring May 23, 2026
Wikid82 and others added 18 commits May 22, 2026 21:18
…servers

Uptime monitoring incorrectly reported Orthrus-managed remote servers as
DOWN because SyncMonitors() created TCP monitors using the ExternalProxyPort
(e.g. 2375) as the target on the remote machine. That port is bound on the
Charon server's loopback, not on the remote machine, so every TCP dial
resulted in ECONNREFUSED and a false failure.

Once the host was marked DOWN, all co-located TCP monitors (e.g. Dockhand
on port 3001) were short-circuited without being checked, cascading the
false DOWN state across all services at that Tailscale IP.

Introduces a first-class "orthrus" monitor type that replaces TCP dials with
an in-memory WebSocket session state lookup. A monitor is UP if and only if
the Orthrus agent's WebSocket tunnel is connected and its local Docker proxy
listener is operational. No network I/O is required for the check.

Adds a pre-flight fast-path in checkHost() that exits immediately for
Orthrus-only hosts, eliminating wasted retry-loop sleep time proportional
to MaxRetries × 2s per poll cycle.

Existing stale TCP monitors for Orthrus servers are migrated automatically
to type="orthrus" on the next SyncMonitors() run — no database migration
required.

Fixes uptime reporting for Orthrus remote server hosts and all co-located
TCP service monitors (resolves Dockhand port 3001 false-DOWN cascade).
The Muzzle security filter was blocking all Docker client connections
because /_ping was missing from the allowlist. Docker clients universally
send GET /_ping or HEAD /_ping as the first request to verify daemon
reachability — without it, every client reports a failed connection even
though the proxy itself was running correctly.

Added /_ping to the Muzzle path allowlist and added explicit HEAD method
support for /_ping, which is the standard Docker client health check
variant. All write operations remain blocked (POST/DELETE/PUT), and all
paths not on the allowlist continue to return 403 Forbidden. This
preserves Orthrus's read-only security posture while enabling Docker
clients to establish connections.
The Orthrus agent's Muzzle filter did not include /_ping or /v*/_ping in
its allowedPatterns, causing every Docker client health check from Dockhand
to be blocked with a bare HTTP/1.1 403 (no Content-Type, Content-Length: 0).
This was the root cause of Dockhand's failure to connect through the Charon
proxy — the server-side Muzzle was already correct, but the agent-side
Muzzle silently rejected the request before it reached the Docker socket.

Agent Muzzle:
- Add /_ping and /v*/_ping so Docker health checks pass end-to-end
- Add /v*/volumes, /v*/volumes/*, /v*/networks, /v*/networks/* for full
  read-only Dockhand support (listing and inspection)

Server Muzzle:
- Add /events so Dockhand's ~3s event polling is no longer blocked at the
  first layer before the request even reaches the agent
- Add /volumes and /networks to the exact-match set
- Add path.Match pattern checking for /containers/*/json, /volumes/*, and
  /networks/* to handle dynamic Docker resource IDs

Both Muzzles must remain in sync: the server strips the version prefix and
uses exact + pattern matching; the agent uses path.Match with /v* prefixes.
…containers, images, volumes, networks, and system disk usage
The Performance Regression Check was failing with "could not read Username
for 'https://github.com': terminal prompts disabled" on every PR build.

For pull_request events github.sha resolves to the ephemeral merge commit
GitHub creates at refs/pull/N/merge. This ref is not advertised in the
normal ref listing, so git fetch --depth=1 origin <SHA> cannot resolve it.
Git then attempts interactive credential prompting, which is disabled in CI,
producing a misleading authentication error.

Fixed by using github.event.pull_request.head.sha as the primary ref for
PR events, which points to the actual branch HEAD commit that is directly
fetchable. The workflow_run and push/dispatch fallbacks are unchanged.
Dockhand's Go subprocess worker sends Docker API requests without the
version prefix (e.g., /containers/json instead of /v1.47/containers/json)
and its Node.js events client sends bare /events. The agent-side muzzle
only matched versioned /v*/... patterns, so every container list poll and
event stream subscription was blocked with 403.

Adds unversioned equivalents for all 15 Docker API path patterns in the
agent muzzle, mirroring the allowlist already present on the server-side
muzzle. Both forms are now accepted; write methods and unsupported paths
remain blocked (fail-closed).

Also adds 62-case muzzle unit test suite covering GET on every versioned
and unversioned path, HEAD restricted to _ping only, all write methods
blocked, and path.Clean traversal normalisation. Two Playwright specs
(33 tests total) lock in regression coverage for the Orthrus agent
management UI and Docker proxy connection states, running against mocked
API responses so they are independent of a live tunnel.
…skip-TCP branch

- Add TestHandleWebSocket_DisplacesExistingSession to cover the type-assertion
  block in HandleWebSocket that closes an old AgentSession when a new connection
  displaces it (server.go lines 98-100)
- Add TestCheckHost_NonOrthrusMonitorNoPort_SkipsTCPDial to cover the early-
  return guard in checkHost when all non-orthrus monitors have no extractable
  port (uptime_service.go lines 611-612)
- Add TestSyncMonitors_OrthrusRemoteServer_NilAgentUUID_SkipsMonitorCreation
  and TestCheckMonitor_OrthrusType_EmptyURL_ReturnsDown for orthrus edge cases

Achieves 100% patch coverage on feature/hecate (25/25 changed lines)
…mination

fix(orthrus): disable keep-alives in external proxy transport for better resource management
test(orthrus): add tests for external proxy transport to verify keep-alive behavior
@Wikid82 Wikid82 moved this from In Progress to In Review in Charon May 24, 2026
@Wikid82 Wikid82 changed the title fix(uptime): add OrthrusStatusChecker interface and DI wiring fix(orthrus): stop Dockhand flapping and wire uptime to session liveness May 24, 2026
- Ensure pull request image reference is produced through a strict producer contract with deterministic matching and fail-fast behavior
- Prevent downstream security scan execution when producer output is missing, so failures surface at the true origin
- Remove order-dependent tag selection in artifact paths to avoid metadata ordering drift and intermittent CI regressions
- Enforce PR vulnerability gate behavior while still preserving SARIF upload and traceability summaries for triage
- Add focused QA reporting and manual validation scenarios to verify pull request, push, and workflow-run behavior remains correct
Hardened the PR security scan flow to consume only the build output digest from the same workflow run.
This removes mutable tag fallback behavior and prevents stale image scans when outputs are missing or delayed.

Behavior changes:
- PR scan now requires a non-empty, valid sha256 digest before scanning.
- PR image reference is constructed strictly as an immutable digest reference.
- Freshness validation now fails if image revision metadata is missing or does not match the expected PR head SHA.
- PR scan summary now always records digest and revision label SHA for traceable provenance.

Why this was necessary:
- Previous fallback behavior could scan an older mutable PR tag, causing commit-to-image mismatch and misleading vulnerability results.
- Failing early on contract violations is safer than silently scanning a potentially stale artifact.

Side effects:
- PR jobs now hard-fail when digest output propagation breaks, which is intentional to preserve scan integrity and reproducibility.
- Non-PR scanning behavior remains unchanged.
Added a shared x/crypto version pin and applied it directly in the Caddy builder patch flow so Caddy artifacts are built with the intended patched crypto module version.

Behavior changes:
- Caddy builder now consumes a dedicated shared x/crypto pin argument.
- Caddy Stage 2 dependency patching explicitly applies golang.org/x/crypto at the pinned version.
- Resulting Caddy binary metadata now resolves x/crypto to the pinned target.

Why this was necessary:
- Security findings were showing crypto CVE drift despite existing patching in another component path.
- Caddy and CrowdSec have independent dependency graphs, so each path must pin and verify security-sensitive modules explicitly.

Important side effects:
- Build behavior remains deterministic and scoped; CrowdSec logic is unchanged.
- This improves reproducibility for scanner validation by ensuring Caddy binary provenance includes the expected crypto module version.
…cement

Improved PR security scan observability by introducing a diagnostics step that parses SARIF output and reports unsuppressed HIGH/CRITICAL blocker identifiers before the enforcement gate runs.

Behavior changes:
- When PR image scan SARIF exists, the workflow now prints blocker IDs and a concise count in the job summary.
- The summary now provides direct evidence of what is still blocking merge, reducing ambiguity during CI triage.

Why this was necessary:
- Recent runs failed the PR security gate without clear, immediate visibility into the exact unsuppressed findings.
- Faster root-cause identification is required to distinguish real new regressions from expected temporary upstream exceptions.

Important considerations:
- Enforcement policy is unchanged; failures still block exactly as before.
- Severity thresholds and ignore policy were not relaxed.
- This is diagnostics-only hardening to make CI outcomes actionable.
Improve CI security-scan behavior by enforcing deterministic traceability output,
making diagnostics resilient, and removing legacy contract ambiguity.

- Traceability now executes the expected runtime binary directly and falls back
  safely when the absolute entrypoint is unavailable.
- Diagnostics now run in a best-effort, non-blocking mode and always emit a
  summary section, including explicit fallback reasons when scan artifacts are
  missing or malformed.
- Legacy output-contract semantics were removed in favor of digest/image
  reference language to prevent confusion and reduce false contract expectations.

This change was necessary to stop noisy or misleading PR scan output, prevent
diagnostic parser failures from obscuring true gate results, and keep the
security gate decision path clear and reliable.---

fix: harden PR image scan traceability and diagnostics behavior

Improve CI security-scan behavior by enforcing deterministic traceability output,
making diagnostics resilient, and removing legacy contract ambiguity.

- Traceability now executes the expected runtime binary directly and falls back
  safely when the absolute entrypoint is unavailable.
- Diagnostics now run in a best-effort, non-blocking mode and always emit a
  summary section, including explicit fallback reasons when scan artifacts are
  missing or malformed.
- Legacy output-contract semantics were removed in favor of digest/image
  reference language to prevent confusion and reduce false contract expectations.

This change was necessary to stop noisy or misleading PR scan output, prevent
diagnostic parser failures from obscuring true gate results, and keep the
security gate decision path clear and reliable.
fix: harden PR Trivy SARIF diagnostics classification and parsing

Improve PR image scan diagnostics so fallback states are precise and actionable
without altering enforcement authority.

- Added explicit fallback categories for missing, unreadable, invalid, schema,
  and parser-failure SARIF conditions
- Standardized SARIF validity checks to treat runs-array payloads as parseable
- Ensured valid scans with empty results report parsed count zero instead of
  degrading to fallback
- Strengthened blocker identifier extraction to improve reliability across
  result/rule shapes
- Preserved enforcement behavior so gate outcomes continue to depend only on the
  Trivy scan result and not diagnostics parsing

This change was necessary to remove ambiguous fallback reporting, improve
triage quality for CI failures, and keep security gate decisions deterministic
and trustworthy.
Normalize PR image security scanning so table and SARIF paths evaluate the same
vulnerability surface and produce consistent suppression behavior.

- Enabled vulnerability scanner scope explicitly in both PR table and PR SARIF
  scan steps to prevent divergent results for the same image
- Improved diagnostics fallback observability by surfacing parser exit code and
  first parser hint when SARIF parsing fails
- Preserved enforcement authority so merge blocking remains controlled only by
  the Trivy blocking step outcome

This change was necessary because repeated CI failures showed ignore rules were
loaded, but scan outputs still diverged and diagnostics lacked actionable parser
context, making it difficult to distinguish true unsuppressed blockers from
parsing-path noise.
@Wikid82 Wikid82 marked this pull request as ready for review May 25, 2026 15:25
Copilot AI review requested due to automatic review settings May 25, 2026 15:25
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR stabilizes Orthrus Docker forwarding (reducing Dockhand online/offline flapping) and updates uptime monitoring so Orthrus-backed targets report availability based on agent session liveness rather than raw TCP reachability. It also adds substantial regression coverage across backend, agent, and Playwright E2E paths.

Changes:

  • Harden Orthrus session/proxy behavior (disable external-proxy keep-alives, displace old sessions on reconnect, prevent stale heartbeat goroutines from evicting new sessions, expand Docker API allowlists).
  • Wire Orthrus session resolution into UptimeService and introduce an "orthrus" monitor type checked via live session state.
  • Add/extend tests (Go unit tests + Playwright suites) for Orthrus agents, proxy paths, and uptime UI/behavior; plus supporting security/build metadata tweaks.

Reviewed changes

Copilot reviewed 23 out of 24 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
tests/uptime-orthrus.spec.ts Playwright coverage for Orthrus uptime monitor UI/status semantics.
tests/orthrus-proxy-paths.spec.ts Playwright coverage for Orthrus remote-server selection and Docker source behavior in ProxyHostForm.
tests/orthrus-agents.spec.ts Playwright coverage for Orthrus Agent Manager UI states and accessibility.
SECURITY.md Updates “Last reviewed” and adds a new known vulnerability entry.
scripts/caddy-compat-matrix.sh Bumps default candidate Caddy version used for compatibility matrix runs.
docs/reports/qa_report.md Rewrites QA report content (currently workflow-scoped content).
Dockerfile Adjusts Go module pinning/patching in the Caddy build flow and adds an assertion for resolved Caddy version.
backend/internal/services/uptime_service.go Adds Orthrus resolver injection and implements "orthrus" monitor behavior and TCP pre-check bypass rules.
backend/internal/services/uptime_service_test.go Adds unit tests for Orthrus monitor sync/check behavior and host pre-check logic.
backend/internal/orthrus/session.go Disables keep-alive reuse for the external reverse-proxy transport.
backend/internal/orthrus/session_test.go Regression test verifying keep-alives are not reused by the external proxy.
backend/internal/orthrus/server.go Fixes session replacement + heartbeat race behavior (displacement before binding; CompareAndDelete on heartbeat).
backend/internal/orthrus/server_test.go Regression tests covering stale heartbeat goroutine behavior and current-session eviction/offline marking.
backend/internal/orthrus/server_coverage_test.go Coverage test ensuring reconnect displaces (closes) the prior session.
backend/internal/orthrus/muzzle.go Expands Docker API allowlists (static + dynamic paths) and allows HEAD /_ping.
backend/internal/orthrus/muzzle_test.go Tests for new allowlist paths, HEAD /_ping, and dynamic-path matching.
backend/internal/api/routes/routes.go Injects Orthrus resolver into UptimeService from route setup when Orthrus is enabled.
agent/muzzle/muzzle.go Expands agent-side Docker allowlist and forces Connection: close to avoid hanging on keep-alives.
agent/muzzle/muzzle_test.go Adds allowlist + proxy lifecycle/streaming tests for agent muzzle behavior.
.trivyignore Adds suppression entry for the new CVE/GHSA noted in SECURITY.md.
.gitignore Ignores new .tmp/ artifacts related to caddy pin cleanup.
.github/workflows/benchmark.yml Ensures checkout uses PR head SHA for pull_request events (not ephemeral merge SHA).

Comment thread backend/internal/services/uptime_service.go
Comment thread docs/reports/qa_report.md
@Wikid82 Wikid82 merged commit f57b630 into development May 25, 2026
40 checks passed
@github-project-automation github-project-automation Bot moved this from In Review to Done in Charon May 25, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

4 participants