fix(orthrus): stop Dockhand flapping and wire uptime to session liveness#1039

Merged: Wikid82 merged 42 commits into main from development on May 26, 2026

Conversation

Wikid82 (Owner) commented May 25, 2026

Summary

This description replaces the earlier placeholder and summarizes the Orthrus stabilization work and uptime wiring in this PR.

It addresses repeated Dockhand online/offline flapping caused by tunnel request timeouts and ensures Orthrus-backed uptime checks use agent session liveness instead of raw TCP reachability.

Root cause addressed

Dockhand polls every 30 seconds. External proxy keep-alive reuse combined with a long-lived Docker socket response path could leave subsequent requests blocked until timeout (context deadline exceeded).

What changed

Orthrus proxy and tunnel hardening

  • Disabled keep-alive reuse on the external reverse-proxy transport used for Orthrus Docker forwarding.
  • Expanded and normalized Orthrus Docker API allowlists (versioned and unversioned paths, dynamic container/network/volume paths, ping handling).
  • Added regression coverage around session displacement and heartbeat race behavior.

Agent-side Docker stream handling

  • Updated the agent-side muzzle/proxy behavior and tests to better cover connection lifecycle, request filtering, and streaming edge cases.

Uptime service wiring

  • Added an Orthrus status resolver interface in uptime service.
  • Wired resolver injection from route setup so uptime checks can evaluate Orthrus monitors by agent/session state.
  • Added unit tests for Orthrus monitor sync and check behavior.
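The wiring above could look roughly like the following sketch. The interface name, the `IsAgentOnline` method, and the setter are assumptions made for illustration; the typed-nil guard mirrors the one mentioned for the Docker handler tests and may not exist in this form in the uptime service.

```go
package main

import (
	"fmt"
	"reflect"
)

// OrthrusStatusResolver is a hypothetical shape for the resolver interface.
type OrthrusStatusResolver interface {
	IsAgentOnline(agentUUID string) bool
}

// UptimeService holds the injected resolver (other fields omitted).
type UptimeService struct {
	orthrus OrthrusStatusResolver
}

// SetOrthrusResolver is called from route setup. It ignores nil and
// typed-nil pointers so a missing Orthrus server cannot cause a panic later.
func (s *UptimeService) SetOrthrusResolver(r OrthrusStatusResolver) {
	if r == nil {
		return
	}
	if v := reflect.ValueOf(r); v.Kind() == reflect.Ptr && v.IsNil() {
		return
	}
	s.orthrus = r
}

// checkOrthrus evaluates an "orthrus" monitor from session state only.
// Without a resolver or an agent UUID the monitor is reported as down.
func (s *UptimeService) checkOrthrus(agentUUID string) string {
	if s.orthrus == nil || agentUUID == "" {
		return "down"
	}
	if s.orthrus.IsAgentOnline(agentUUID) {
		return "up"
	}
	return "down"
}

// staticResolver is a test double keyed by agent UUID.
type staticResolver map[string]bool

func (m staticResolver) IsAgentOnline(id string) bool { return m[id] }

// ptrResolver exists only to demonstrate the typed-nil case.
type ptrResolver struct{}

func (p *ptrResolver) IsAgentOnline(string) bool { return true }

func main() {
	svc := &UptimeService{}
	svc.SetOrthrusResolver(staticResolver{"agent-1": true})
	fmt.Println(svc.checkOrthrus("agent-1"), svc.checkOrthrus("agent-2"))
}
```

Keeping the dependency behind a small interface lets the uptime tests substitute a map-backed double instead of a live tunnel.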

E2E coverage

  • Added Playwright suites for Orthrus agents, Orthrus proxy path behavior, and Orthrus uptime monitor UX/data flow.

Why this matters

  • Stops false DOWN transitions caused by tunnel-path timeouts.
  • Reduces noisy/incorrect availability notifications.
  • Aligns uptime semantics for Orthrus-managed targets with how the tunnel actually works.

Validation

  • Backend unit/integration tests updated and passing for affected Orthrus and uptime paths.
  • New and existing Playwright scenarios cover Orthrus agent visibility, proxy-path resolution, and uptime monitor behavior.

Rollout notes

  • Server-side changes improve behavior immediately after deploy.
  • Agent-side improvements require rolling out the rebuilt Orthrus agent image to HomeLab to take full effect.

PR context

This branch also contains supporting hardening and regression coverage added while diagnosing and stabilizing Orthrus reconnect/session behavior.

Wikid82 and others added 30 commits May 22, 2026 20:27
feat: Implement Proxy Groups for Better Organization
…servers

Uptime monitoring incorrectly reported Orthrus-managed remote servers as
DOWN because SyncMonitors() created TCP monitors using the ExternalProxyPort
(e.g. 2375) as the target on the remote machine. That port is bound on the
Charon server's loopback, not on the remote machine, so every TCP dial
resulted in ECONNREFUSED and a false failure.

Once the host was marked DOWN, all co-located TCP monitors (e.g. Dockhand
on port 3001) were short-circuited without being checked, cascading the
false DOWN state across all services at that Tailscale IP.

Introduces a first-class "orthrus" monitor type that replaces TCP dials with
an in-memory WebSocket session state lookup. A monitor is UP if and only if
the Orthrus agent's WebSocket tunnel is connected and its local Docker proxy
listener is operational. No network I/O is required for the check.
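As a rough illustration of the "UP if and only if" rule, the sketch below keeps agent state in a mutex-guarded map and answers a check without any network I/O. The type and field names are hypothetical and are not taken from the Orthrus code.

```go
package main

import (
	"fmt"
	"sync"
)

// sessionState is a hypothetical snapshot of one agent's tunnel.
type sessionState struct {
	tunnelConnected bool // WebSocket tunnel is connected
	proxyListening  bool // local Docker proxy listener is operational
}

// sessionRegistry keeps agent state in memory, keyed by agent UUID.
type sessionRegistry struct {
	mu       sync.RWMutex
	sessions map[string]sessionState
}

func newSessionRegistry() *sessionRegistry {
	return &sessionRegistry{sessions: make(map[string]sessionState)}
}

func (r *sessionRegistry) set(agentUUID string, st sessionState) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.sessions[agentUUID] = st
}

// isUp reports UP only when both conditions hold; unknown agents are DOWN.
func (r *sessionRegistry) isUp(agentUUID string) bool {
	r.mu.RLock()
	defer r.mu.RUnlock()
	st, ok := r.sessions[agentUUID]
	return ok && st.tunnelConnected && st.proxyListening
}

func main() {
	reg := newSessionRegistry()
	reg.set("agent-1", sessionState{tunnelConnected: true, proxyListening: true})
	reg.set("agent-2", sessionState{tunnelConnected: true, proxyListening: false})
	for _, id := range []string{"agent-1", "agent-2", "agent-3"} {
		fmt.Printf("%s up=%v\n", id, reg.isUp(id))
	}
}
```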

Adds a pre-flight fast-path in checkHost() that exits immediately for
Orthrus-only hosts, eliminating wasted retry-loop sleep time proportional
to MaxRetries × 2s per poll cycle.

Existing stale TCP monitors for Orthrus servers are migrated automatically
to type="orthrus" on the next SyncMonitors() run — no database migration
required.

Fixes uptime reporting for Orthrus remote server hosts and all co-located
TCP service monitors (resolves Dockhand port 3001 false-DOWN cascade).

The Muzzle security filter was blocking all Docker client connections
because /_ping was missing from the allowlist. Docker clients universally
send GET /_ping or HEAD /_ping as the first request to verify daemon
reachability — without it, every client reports a failed connection even
though the proxy itself was running correctly.

Added /_ping to the Muzzle path allowlist and added explicit HEAD method
support for /_ping, which is the standard Docker client health check
variant. All write operations remain blocked (POST/DELETE/PUT), and all
paths not on the allowlist continue to return 403 Forbidden. This
preserves Orthrus's read-only security posture while enabling Docker
clients to establish connections.

The Orthrus agent's Muzzle filter did not include /_ping or /v*/_ping in
its allowedPatterns, causing every Docker client health check from Dockhand
to be blocked with a bare HTTP/1.1 403 (no Content-Type, Content-Length: 0).
This was the root cause of Dockhand's failure to connect through the Charon
proxy — the server-side Muzzle was already correct, but the agent-side
Muzzle silently rejected the request before it reached the Docker socket.

Agent Muzzle:
- Add /_ping and /v*/_ping so Docker health checks pass end-to-end
- Add /v*/volumes, /v*/volumes/*, /v*/networks, /v*/networks/* for full
  read-only Dockhand support (listing and inspection)

Server Muzzle:
- Add /events so Dockhand's ~3s event polling is no longer blocked at the
  first layer before the request even reaches the agent
- Add /volumes and /networks to the exact-match set
- Add path.Match pattern checking for /containers/*/json, /volumes/*, and
  /networks/* to handle dynamic Docker resource IDs

Both Muzzles must remain in sync: the server strips the version prefix and
uses exact + pattern matching; the agent uses path.Match with /v* prefixes.
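To make the two matching strategies concrete, here is a sketch that compares them on a few paths. The allowlists are illustrative subsets, the version-prefix regular expression is an assumption about how stripping might be done, and none of the function names come from the Orthrus code.

```go
package main

import (
	"fmt"
	"path"
	"regexp"
)

// versionPrefix matches a leading Docker API version segment such as "/v1.47/".
var versionPrefix = regexp.MustCompile(`^/v[0-9][0-9.]*/`)

// Server style: strip the version, then exact match plus path.Match patterns.
var serverExact = map[string]bool{
	"/_ping": true, "/events": true, "/volumes": true,
	"/networks": true, "/containers/json": true,
}
var serverPatterns = []string{"/containers/*/json", "/volumes/*", "/networks/*"}

// Agent style: path.Match against patterns that carry the /v* prefix.
var agentPatterns = []string{
	"/v*/_ping", "/v*/events", "/v*/volumes", "/v*/networks",
	"/v*/containers/json", "/v*/containers/*/json",
	"/v*/volumes/*", "/v*/networks/*",
}

// matchAny reports whether p matches any pattern; "*" never crosses a "/".
func matchAny(patterns []string, p string) bool {
	for _, pat := range patterns {
		if ok, err := path.Match(pat, p); err == nil && ok {
			return true
		}
	}
	return false
}

func serverAllowsPath(p string) bool {
	stripped := versionPrefix.ReplaceAllString(p, "/")
	return serverExact[stripped] || matchAny(serverPatterns, stripped)
}

func agentAllowsPath(p string) bool { return matchAny(agentPatterns, p) }

func main() {
	for _, p := range []string{
		"/v1.47/_ping",
		"/v1.47/containers/abc123/json",
		"/v1.47/containers/abc123/exec",
		"/containers/json", // unversioned: passes the server, not a /v*-only agent list
	} {
		fmt.Printf("%-34s server=%v agent=%v\n", p, serverAllowsPath(p), agentAllowsPath(p))
	}
}
```

A comparison like this over a shared table of paths is one way to catch drift between the two allowlists before it reaches a client.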
…containers, images, volumes, networks, and system disk usage

The Performance Regression Check was failing with "could not read Username
for 'https://github.com': terminal prompts disabled" on every PR build.

For pull_request events github.sha resolves to the ephemeral merge commit
GitHub creates at refs/pull/N/merge. This ref is not advertised in the
normal ref listing, so git fetch --depth=1 origin <SHA> cannot resolve it.
Git then attempts interactive credential prompting, which is disabled in CI,
producing a misleading authentication error.

Fixed by using github.event.pull_request.head.sha as the primary ref for
PR events, which points to the actual branch HEAD commit that is directly
fetchable. The workflow_run and push/dispatch fallbacks are unchanged.

Dockhand's Go subprocess worker sends Docker API requests without the
version prefix (e.g., /containers/json instead of /v1.47/containers/json)
and its Node.js events client sends bare /events. The agent-side muzzle
only matched versioned /v*/... patterns, so every container list poll and
event stream subscription was blocked with 403.

Adds unversioned equivalents for all 15 Docker API path patterns in the
agent muzzle, mirroring the allowlist already present on the server-side
muzzle. Both forms are now accepted; write methods and unsupported paths
remain blocked (fail-closed).

Also adds a 62-case muzzle unit test suite covering GET on every versioned
and unversioned path, HEAD restricted to _ping only, all write methods
blocked, and path.Clean traversal normalisation. Two Playwright specs
(33 tests total) lock in regression coverage for the Orthrus agent
management UI and Docker proxy connection states, running against mocked
API responses so they are independent of a live tunnel.
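The behaviors the unit suite covers (both path forms accepted, HEAD limited to `_ping`, writes blocked, traversal normalised) can be sketched as a single fail-closed check. The pattern list here is a small illustrative subset of the 15 patterns, and `allow` is a hypothetical name.

```go
package main

import (
	"fmt"
	"net/http"
	"path"
)

// allowed holds versioned and unversioned forms side by side (subset only).
var allowed = []string{
	"/_ping", "/v*/_ping",
	"/events", "/v*/events",
	"/containers/json", "/v*/containers/json",
	"/containers/*/json", "/v*/containers/*/json",
}

var pingOnly = []string{"/_ping", "/v*/_ping"}

func matchAny(patterns []string, p string) bool {
	for _, pat := range patterns {
		if ok, err := path.Match(pat, p); err == nil && ok {
			return true
		}
	}
	return false
}

// allow normalises the path first, then applies the method gate.
// Anything not explicitly permitted is rejected.
func allow(method, rawPath string) bool {
	p := path.Clean(rawPath) // collapses "..", "." and duplicate slashes
	switch method {
	case http.MethodGet:
		return matchAny(allowed, p)
	case http.MethodHead:
		return matchAny(pingOnly, p)
	default:
		return false // POST, PUT, DELETE and everything else
	}
}

func main() {
	cases := []struct{ method, path string }{
		{"GET", "/containers/json"},
		{"GET", "/v1.47/containers/json"},
		{"HEAD", "/_ping"},
		{"HEAD", "/containers/json"},
		{"POST", "/v1.47/containers/json"},
		{"GET", "/containers/abc/json/../exec"},
	}
	for _, c := range cases {
		fmt.Printf("%-5s %-32s allowed=%v\n", c.method, c.path, allow(c.method, c.path))
	}
}
```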
…skip-TCP branch

- Add TestHandleWebSocket_DisplacesExistingSession to cover the type-assertion
  block in HandleWebSocket that closes an old AgentSession when a new connection
  displaces it (server.go lines 98-100)
- Add TestCheckHost_NonOrthrusMonitorNoPort_SkipsTCPDial to cover the early-
  return guard in checkHost when all non-orthrus monitors have no extractable
  port (uptime_service.go lines 611-612)
- Add TestSyncMonitors_OrthrusRemoteServer_NilAgentUUID_SkipsMonitorCreation
  and TestCheckMonitor_OrthrusType_EmptyURL_ReturnsDown for orthrus edge cases

Achieves 100% patch coverage on feature/hecate (25/25 changed lines)
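For readers unfamiliar with the displacement path, the sketch below shows one way a new connection can replace and close an older session for the same agent. It assumes sessions are stored in a `sync.Map` keyed by agent UUID (suggested by the type assertion mentioned above, but not confirmed), uses `Swap` and `CompareAndDelete` from Go 1.20+, and all names other than `AgentSession` are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// AgentSession is a minimal stand-in for a connected agent tunnel.
type AgentSession struct {
	name      string
	closeOnce sync.Once
	closed    chan struct{}
}

func NewAgentSession(name string) *AgentSession {
	return &AgentSession{name: name, closed: make(chan struct{})}
}

// Close is safe to call more than once.
func (s *AgentSession) Close() { s.closeOnce.Do(func() { close(s.closed) }) }

// Closed reports whether Close has been called.
func (s *AgentSession) Closed() bool {
	select {
	case <-s.closed:
		return true
	default:
		return false
	}
}

// Server maps agent UUID to *AgentSession.
type Server struct {
	sessions sync.Map
}

// register stores the new session atomically and closes any session it displaced.
func (srv *Server) register(agentUUID string, next *AgentSession) {
	if prev, loaded := srv.sessions.Swap(agentUUID, next); loaded {
		if old, ok := prev.(*AgentSession); ok && old != next {
			old.Close()
		}
	}
}

// unregister removes the entry only if it still points at s, so cleanup
// from a displaced session cannot evict its replacement.
func (srv *Server) unregister(agentUUID string, s *AgentSession) {
	srv.sessions.CompareAndDelete(agentUUID, s)
}

func main() {
	srv := &Server{}
	first, second := NewAgentSession("first"), NewAgentSession("second")
	srv.register("agent-1", first)
	srv.register("agent-1", second) // displaces first
	srv.unregister("agent-1", first) // stale cleanup, must be a no-op
	_, stillThere := srv.sessions.Load("agent-1")
	fmt.Println(first.Closed(), second.Closed(), stillThere)
}
```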
…mination

fix(orthrus): disable keep-alives in external proxy transport for better resource management
test(orthrus): add tests for external proxy transport to verify keep-alive behavior

- Ensure pull request image reference is produced through a strict producer contract with deterministic matching and fail-fast behavior
- Prevent downstream security scan execution when producer output is missing, so failures surface at the true origin
- Remove order-dependent tag selection in artifact paths to avoid metadata ordering drift and intermittent CI regressions
- Enforce PR vulnerability gate behavior while still preserving SARIF upload and traceability summaries for triage
- Add focused QA reporting and manual validation scenarios to verify pull request, push, and workflow-run behavior remains correct

Hardened the PR security scan flow to consume only the build output digest from the same workflow run.
This removes mutable tag fallback behavior and prevents stale image scans when outputs are missing or delayed.

Behavior changes:
- PR scan now requires a non-empty, valid sha256 digest before scanning.
- PR image reference is constructed strictly as an immutable digest reference.
- Freshness validation now fails if image revision metadata is missing or does not match the expected PR head SHA.
- PR scan summary now always records digest and revision label SHA for traceable provenance.

Why this was necessary:
- Previous fallback behavior could scan an older mutable PR tag, causing commit-to-image mismatch and misleading vulnerability results.
- Failing early on contract violations is safer than silently scanning a potentially stale artifact.

Side effects:
- PR jobs now hard-fail when digest output propagation breaks, which is intentional to preserve scan integrity and reproducibility.
- Non-PR scanning behavior remains unchanged.

actions-user and others added 9 commits May 24, 2026 22:55

Added a shared x/crypto version pin and applied it directly in the Caddy builder patch flow so Caddy artifacts are built with the intended patched crypto module version.

Behavior changes:
- Caddy builder now consumes a dedicated shared x/crypto pin argument.
- Caddy Stage 2 dependency patching explicitly applies golang.org/x/crypto at the pinned version.
- Resulting Caddy binary metadata now resolves x/crypto to the pinned target.

Why this was necessary:
- Security findings were showing crypto CVE drift despite existing patching in another component path.
- Caddy and CrowdSec have independent dependency graphs, so each path must pin and verify security-sensitive modules explicitly.

Important side effects:
- Build behavior remains deterministic and scoped; CrowdSec logic is unchanged.
- This improves reproducibility for scanner validation by ensuring Caddy binary provenance includes the expected crypto module version.

…cement

Improved PR security scan observability by introducing a diagnostics step that parses SARIF output and reports unsuppressed HIGH/CRITICAL blocker identifiers before the enforcement gate runs.

Behavior changes:
- When PR image scan SARIF exists, the workflow now prints blocker IDs and a concise count in the job summary.
- The summary now provides direct evidence of what is still blocking merge, reducing ambiguity during CI triage.

Why this was necessary:
- Recent runs failed the PR security gate without clear, immediate visibility into the exact unsuppressed findings.
- Faster root-cause identification is required to distinguish real new regressions from expected temporary upstream exceptions.

Important considerations:
- Enforcement policy is unchanged; failures still block exactly as before.
- Severity thresholds and ignore policy were not relaxed.
- This is diagnostics-only hardening to make CI outcomes actionable.

fix: harden PR image scan traceability and diagnostics behavior

Improve CI security-scan behavior by enforcing deterministic traceability output,
making diagnostics resilient, and removing legacy contract ambiguity.

- Traceability now executes the expected runtime binary directly and falls back
  safely when the absolute entrypoint is unavailable.
- Diagnostics now run in a best-effort, non-blocking mode and always emit a
  summary section, including explicit fallback reasons when scan artifacts are
  missing or malformed.
- Legacy output-contract semantics were removed in favor of digest/image
  reference language to prevent confusion and reduce false contract expectations.

This change was necessary to stop noisy or misleading PR scan output, prevent
diagnostic parser failures from obscuring true gate results, and keep the
security gate decision path clear and reliable.

fix: harden PR Trivy SARIF diagnostics classification and parsing

Improve PR image scan diagnostics so fallback states are precise and actionable
without altering enforcement authority.

- Added explicit fallback categories for missing, unreadable, invalid, schema,
  and parser-failure SARIF conditions
- Standardized SARIF validity checks to treat runs-array payloads as parseable
- Ensured valid scans with empty results report parsed count zero instead of
  degrading to fallback
- Strengthened blocker identifier extraction to improve reliability across
  result/rule shapes
- Preserved enforcement behavior so gate outcomes continue to depend only on the
  Trivy scan result and not diagnostics parsing

This change was necessary to remove ambiguous fallback reporting, improve
triage quality for CI failures, and keep security gate decisions deterministic
and trustworthy.
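As an illustration of the fallback categories, the sketch below classifies a SARIF payload without affecting any gate decision. The category strings and function names are hypothetical, the workflow's own parser may be written in a different language, and the mapping of JSON type errors to "schema" is an assumption.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io/fs"
	"os"
)

// sarifDoc models only the fields needed for counting results.
// Runs is a pointer so an absent "runs" key can be told apart from "[]".
type sarifDoc struct {
	Runs *[]struct {
		Results []struct {
			RuleID string `json:"ruleId"`
		} `json:"results"`
	} `json:"runs"`
}

// classify returns a category and the number of results found.
// A valid scan with no results is "parsed" with count zero, not a fallback.
func classify(data []byte) (string, int) {
	var doc sarifDoc
	if err := json.Unmarshal(data, &doc); err != nil {
		var typeErr *json.UnmarshalTypeError
		if errors.As(err, &typeErr) {
			return "schema", 0 // well-formed JSON, wrong shape
		}
		return "invalid", 0 // not JSON at all
	}
	if doc.Runs == nil {
		return "schema", 0 // no runs array
	}
	count := 0
	for _, run := range *doc.Runs {
		count += len(run.Results)
	}
	return "parsed", count
}

// classifyFile adds the file-level categories on top of classify.
func classifyFile(name string) (string, int) {
	data, err := os.ReadFile(name)
	switch {
	case errors.Is(err, fs.ErrNotExist):
		return "missing", 0
	case err != nil:
		return "unreadable", 0
	}
	return classify(data)
}

func main() {
	for _, payload := range []string{
		`{"runs":[{"results":[]}]}`,
		`{"runs":[{"results":[{"ruleId":"EXAMPLE-1"}]}]}`,
		`{}`,
		`not json`,
	} {
		status, n := classify([]byte(payload))
		fmt.Printf("%-50s %s %d\n", payload, status, n)
	}
}
```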
Normalize PR image security scanning so table and SARIF paths evaluate the same
vulnerability surface and produce consistent suppression behavior.

- Enabled vulnerability scanner scope explicitly in both PR table and PR SARIF
  scan steps to prevent divergent results for the same image
- Improved diagnostics fallback observability by surfacing parser exit code and
  first parser hint when SARIF parsing fails
- Preserved enforcement authority so merge blocking remains controlled only by
  the Trivy blocking step outcome

This change was necessary because repeated CI failures showed ignore rules were
loaded, but scan outputs still diverged and diagnostics lacked actionable parser
context, making it difficult to distinguish true unsuppressed blockers from
parsing-path noise.

fix(orthrus): stop Dockhand flapping and wire uptime to session liveness
Copilot AI review requested due to automatic review settings May 25, 2026 17:23

Copilot AI (Contributor) left a comment

Pull request overview

This pull request stabilizes Orthrus-backed Docker forwarding to prevent Dockhand online/offline flapping, and updates uptime semantics so Orthrus monitors reflect agent session liveness (not raw TCP reachability).

Changes:

  • Harden Orthrus proxying (transport configuration + expanded Docker API allowlists) and improve agent-side Docker stream lifecycle handling to reduce timeout-driven disconnects.
  • Wire uptime checks for orthrus monitor type to agent/session liveness via an injected Orthrus status resolver, with unit coverage for sync/check behavior.
  • Add Playwright coverage for Orthrus agents UI, Orthrus proxy-path behavior, and Orthrus uptime monitor UX/data flow.

Reviewed changes

Copilot reviewed 23 out of 24 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| backend/internal/orthrus/session.go | Configures the Orthrus external reverse-proxy transport to avoid keep-alive reuse that contributed to flapping. |
| backend/internal/orthrus/muzzle.go | Expands/normalizes Orthrus Docker API allowlists for forwarded Docker requests. |
| backend/internal/api/handlers/docker_handler.go | Adds Orthrus resolver wiring so Docker endpoints can route via Orthrus agent proxy state. |
| backend/internal/api/handlers/docker_handler_test.go | Adds regression/unit coverage for resolver wiring and Orthrus-specific Docker handler behavior (including typed-nil guard). |
| backend/internal/api/routes/routes.go | Wires Orthrus resolver injection from route setup into dependent services/handlers. |
| backend/internal/services/uptime_service.go | Introduces Orthrus status resolver integration so uptime uses session liveness for Orthrus monitors. |
| backend/internal/services/uptime_service_test.go | Adds unit tests for Orthrus monitor sync + check semantics and resolver edge cases. |
| agent/muzzle/muzzle.go | Adjusts agent-side proxy/muzzle behavior to better handle connection lifecycle and streaming. |
| agent/muzzle/muzzle_test.go | Adds/updates tests for allowlist filtering and edge cases like traversal normalization. |
| tests/uptime-orthrus.spec.ts | Adds E2E coverage for Orthrus uptime monitor display/state behavior. |
| tests/orthrus-proxy-paths.spec.ts | Adds E2E coverage for Orthrus Docker source selection and proxy-path-driven container listing UX. |
| tests/orthrus-agents.spec.ts | Adds E2E coverage for Orthrus agents UI rendering, accessible labels, and key interactions. |
| .trivyignore | Adds/updates vulnerability suppressions relevant to the updated runtime surface and CI scanning. |
| SECURITY.md | Updates the security ledger with current known/patched vulnerability tracking and review date. |
| Dockerfile | Adjusts build-time pinning/verification (including dependency version constraints) supporting stable releases and supply-chain hygiene. |

Comment thread backend/internal/orthrus/muzzle.go
Comment thread backend/internal/orthrus/session.go
Comment thread .trivyignore

github-actions Bot (Contributor) commented May 25, 2026

✅ Supply Chain Verification Results

PASSED

📦 SBOM Summary

  • Components: 1487

🔍 Vulnerability Scan

| Severity | Count |
| --- | --- |
| 🔴 Critical | 0 |
| 🟠 High | 0 |
| 🟡 Medium | 4 |
| 🟢 Low | 2 |
| Total | 6 |

📎 Artifacts

  • SBOM (CycloneDX JSON) and Grype results available in workflow artifacts

Generated by Supply Chain Verification workflow

codecov Bot commented May 25, 2026

Codecov Report

❌ Patch coverage is 96.61017% with 2 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| backend/internal/services/uptime_service.go | 95.12% | 1 Missing and 1 partial ⚠️ |

Wikid82 merged commit 0a8c856 into main on May 26, 2026; 43 of 45 checks passed.
4 participants