feat(daemon): terminate TLS on the API listener (HTTP, OTLP, gRPC) (#128) #146

Merged
mostafa merged 11 commits into main from feat/daemon-tls on May 22, 2026

@mostafa mostafa commented May 22, 2026

Closes #128. Adds in-process server-side TLS termination for the engine daemon API listener.

Surface

All three protocols that already share --api-addr (Axum HTTP REST, OTLP/HTTP, OTLP/gRPC) now negotiate TLS on one socket via tokio-rustls with the aws-lc-rs provider. ALPN advertises both h2 and http/1.1, so REST and gRPC keep sharing the listener.

Six new flags, all gated behind the optional daemon-tls Cargo feature:

| Flag | Purpose |
| --- | --- |
| --tls-cert / --tls-key | Terminate TLS in-process. |
| --tls-client-ca | Enable mTLS verification of inbound clients. |
| --tls-min-version (1.2/1.3, default 1.3) | Floor the negotiated TLS version. |
| --tls-key-password / RSIGMA_TLS_KEY_PASSWORD | Reserved; currently rejected with an openssl rsa hint for offline decryption. |
| --allow-plaintext | Opt-in for plaintext on a non-loopback --api-addr. Loopback always allows plaintext. |

When daemon-tls is built in, the daemon refuses to start on any non-loopback address unless TLS or --allow-plaintext is supplied.
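
A minimal sketch of the startup check described above, using only the standard library. The names `ListenerMode` and `check_listener` are hypothetical and not taken from the rsigma source; the loopback test relies on `IpAddr::is_loopback`, which covers `127.0.0.0/8` and `::1`.

```rust
use std::net::SocketAddr;

/// How the API listener ends up serving traffic (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum ListenerMode {
    Tls,
    Plaintext,
}

/// Decide whether the listener may start, and in which mode.
/// `tls_configured` stands for "--tls-cert and --tls-key were supplied".
pub fn check_listener(
    addr: SocketAddr,
    tls_configured: bool,
    allow_plaintext: bool,
) -> Result<ListenerMode, String> {
    if tls_configured {
        return Ok(ListenerMode::Tls);
    }
    // Loopback always allows plaintext; other addresses need the opt-in flag.
    if addr.ip().is_loopback() || allow_plaintext {
        return Ok(ListenerMode::Plaintext);
    }
    // The wording of this message is illustrative, not the daemon's actual output.
    Err(format!(
        "refusing plaintext on non-loopback address {addr}: supply TLS flags or --allow-plaintext"
    ))
}
```

With this shape, a bind to `0.0.0.0` without TLS or the opt-in flag yields an `Err`, while `127.0.0.1` and `[::1]` fall through to plaintext.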

Cross-platform cert hot-reload

The reload path funnels through the daemon's central debounced reload task. POST /api/v1/reload (works on every platform, including Windows), SIGHUP (Unix), and YAML file-watcher events all atomically swap the active rustls::ServerConfig via Arc<ArcSwap<…>> without dropping inflight connections. Failed reloads keep the previous cert active and bump rsigma_reloads_failed_total.
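
The swap semantics can be illustrated with a standard-library stand-in. This sketch deliberately substitutes `RwLock<Arc<…>>` for the `arc-swap` crate so that it compiles without dependencies; `ServerConfig` here is a placeholder struct, not `rustls::ServerConfig`, and `TlsSlot` is a hypothetical name.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, RwLock};

/// Placeholder for `rustls::ServerConfig` (assumption for illustration).
#[derive(Debug)]
pub struct ServerConfig {
    pub cert_label: String,
}

/// Holds the active config plus a failure counter, in the spirit of
/// `rsigma_reloads_failed_total`.
pub struct TlsSlot {
    active: RwLock<Arc<ServerConfig>>,
    reloads_failed: AtomicU64,
}

impl TlsSlot {
    pub fn new(initial: ServerConfig) -> Self {
        Self {
            active: RwLock::new(Arc::new(initial)),
            reloads_failed: AtomicU64::new(0),
        }
    }

    /// Each new handshake takes a snapshot; connections already in flight
    /// keep the `Arc` they cloned earlier.
    pub fn current(&self) -> Arc<ServerConfig> {
        let guard = self.active.read().unwrap();
        Arc::clone(&*guard)
    }

    /// Swap in a freshly loaded config, or keep the previous one on failure.
    pub fn reload(&self, loaded: Result<ServerConfig, String>) -> Result<(), String> {
        match loaded {
            Ok(cfg) => {
                *self.active.write().unwrap() = Arc::new(cfg);
                Ok(())
            }
            Err(e) => {
                // Failed reload: previous config stays active, counter is bumped.
                self.reloads_failed.fetch_add(1, Ordering::Relaxed);
                Err(e)
            }
        }
    }

    pub fn reloads_failed(&self) -> u64 {
        self.reloads_failed.load(Ordering::Relaxed)
    }
}
```

A connection that called `current()` before a reload continues to hold the old `Arc`, which is why rotation does not disturb inflight traffic in this model.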

Observability

Two new Prometheus gauges, both registered as soon as TLS is configured:

  • rsigma_tls_certificate_expiry_seconds (signed; negative once expired)
  • rsigma_tls_active_connections

A single WARN is logged at startup and after every reload when the active cert expires within 30 days.
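
As a rough illustration of the expiry gauge and the 30-day warning, the arithmetic might look like the sketch below. The function names are hypothetical, timestamps are plain Unix seconds rather than parsed X.509 fields, and treating the 30-day boundary as inclusive (and already-expired certificates as warnable) is an assumption.

```rust
/// Warn when the certificate expires within this many days (per the PR text).
pub const WARN_WINDOW_DAYS: i64 = 30;
const SECONDS_PER_DAY: i64 = 86_400;

/// Signed seconds until `not_after`; negative once the certificate has expired.
pub fn expiry_seconds(not_after_unix: i64, now_unix: i64) -> i64 {
    not_after_unix - now_unix
}

/// True when the remaining lifetime is inside the warning window.
/// Assumption: an already-expired certificate also triggers the warning.
pub fn should_warn(expiry_seconds: i64) -> bool {
    expiry_seconds <= WARN_WINDOW_DAYS * SECONDS_PER_DAY
}
```

Keeping the gauge signed means an alert rule can compare against zero to detect an expired certificate and against a positive threshold to detect one that is about to expire.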

Out of scope

ACME / Let's Encrypt automation and encrypted private keys. Operators rotate cert files (cert-manager, certbot, Vault PKI, …) and trigger a reload.

Tests (12, all in cli_daemon_tls.rs)

  • Plaintext refusal on 0.0.0.0, loopback bypass, --allow-plaintext opt-in.
  • HTTPS happy path: /healthz and POST /api/v1/events.
  • mTLS reject without client cert; accept with CA-signed cert.
  • /metrics exposes the two new gauges once TLS is configured.
  • Missing cert file and --tls-key-password produce clear startup errors.
  • Cross-platform cert rotation via POST /api/v1/reload flips rsigma_tls_certificate_expiry_seconds from ~30 to ~365 days.
  • A reload with a corrupted cert file keeps the previous chain live.

All certs minted at test time with rcgen.

Verification

  • cargo test -p rsigma --no-default-features --features daemon-tls --test cli_daemon_tls -- --test-threads=1 (12 passed)
  • cargo clippy -p rsigma --all-features --all-targets -- -D warnings (macOS + x86_64-pc-windows-gnu cross-target)
  • cargo fmt --all -- --check
  • mkdocs build --strict
  • cargo audit confirms the swap from rustls-pemfile (RUSTSEC-2025-0134, unmaintained) to rustls-pki-types 1.9 PemObject

Docs

  • docs/cli/engine/daemon.md, docs/reference/security.md: TLS flag table + cross-platform hot-reload section.
  • docs/guide/otlp-integration.md: Alloy, Vector, Fluent Bit, OpenTelemetry Collector mTLS recipes.
  • docs/guide/streaming-detection.md, docs/getting-started/quick-start.md: the plaintext refusal startup check.
  • docs/reference/feature-flags.md: new daemon-tls row.
  • docs/reference/metrics.md: documents the two TLS metrics + two recommended alerts.
  • README.md and crates/rsigma-cli/README.md: TLS feature bullet, recipes, and metric rows.
  • CHANGELOG.md: Unreleased entry.

mostafa added 11 commits May 22, 2026 12:57
Introduces a new optional `daemon-tls` Cargo feature on `rsigma-cli`
that pulls in `rustls` (with the `aws-lc-rs` provider), `tokio-rustls`,
`rustls-pemfile`, `rustls-pki-types`, `x509-parser`, and `hyper`/
`hyper-util` for the upcoming server-side TLS termination support for
the daemon HTTP REST, OTLP/HTTP, and OTLP/gRPC endpoints.

Dev-only `rcgen` is added so the planned `cli_daemon_tls.rs` integration
test can mint a self-signed CA on the fly.

Implements part of #128.

Adds server-side TLS termination on the daemon's `--api-addr` listener
for the Axum HTTP REST API, the Prometheus `/metrics` endpoint, and
both OTLP/HTTP and OTLP/gRPC transports. All three protocols already
share one socket; this change wraps that socket in `tokio-rustls` (with
the `aws-lc-rs` provider) and unifies serving through `axum::serve`
plus `tonic::service::Routes::into_axum_router`, so the same router
dispatches both `/v1/logs` over HTTP/1.1 + HTTP/2 and gRPC via ALPN.

New flags on `engine daemon` (all gated behind the `daemon-tls`
feature):

- `--tls-cert` / `--tls-key` to terminate TLS in-process.
- `--tls-client-ca` to enable mTLS verification of inbound clients.
- `--tls-min-version` (`1.2` or `1.3`, default `1.3`).
- `--tls-key-password` / `RSIGMA_TLS_KEY_PASSWORD`, reserved for a
  future release; currently rejected with a clear error pointing at
  `openssl rsa` to decrypt offline.
- `--allow-plaintext` to keep the historical behavior on non-loopback
  addresses. Loopback (`127.0.0.0/8`, `::1`) always allows plaintext.
  Public binds without TLS or this flag now refuse to start.

Hot-reload of the certificate and key is wired into the existing SIGHUP
path via `Arc<ArcSwap<rustls::ServerConfig>>`, so cert rotation takes
effect on the next handshake without dropping inflight connections.

Observability gains two Prometheus metrics:

- `rsigma_tls_certificate_expiry_seconds` (seconds until `not_after`)
- `rsigma_tls_active_connections`

The daemon logs a single WARN at startup and after every reload if the
active cert expires within 30 days.

Implements #128.

Adds 10 E2E tests covering the new server-side TLS surface:

- Plaintext refusal: `0.0.0.0` bind without TLS or `--allow-plaintext`
  fails to start with a clear stderr message; loopback keeps working;
  `--allow-plaintext` flips the opt-in.
- HTTPS happy path: `/healthz` and `POST /api/v1/events` succeed over
  a TLS connection negotiated against a self-signed CA minted with
  rcgen at test setup.
- mTLS: a client that does not present a certificate is rejected at
  the TLS handshake; a client signed by the same CA succeeds.
- Metrics: the `/metrics` endpoint exposes both
  `rsigma_tls_certificate_expiry_seconds` and
  `rsigma_tls_active_connections` after TLS is configured.
- Misconfiguration: a missing cert file surfaces a clear startup
  error; `--tls-key-password` is rejected with a hint pointing at
  `openssl rsa` for offline decryption.

The TLS module also gains a small fix: `WebPkiClientVerifier` is now
built with an explicit aws-lc-rs `CryptoProvider`, which avoids a
panic when both `ring` and `aws-lc-rs` end up in the dependency tree
(via reqwest, jsonschema, etc).

Companion `spawn_expect_failure` helper in tests/common/mod.rs lets
other tests assert that a misconfigured daemon refuses to start with
a specific stderr line.

Implements the verification matrix from #128.

Adds the user-facing surface for the new `daemon-tls` feature:

- `docs/cli/engine/daemon.md`: full TLS flag table and a mutual-TLS
  example.
- `docs/reference/security.md`: replaces the "future work" pointer to
  #128 with the TLS termination section (in-process termination,
  mTLS, hot-reload, expiry observability, ACME and encrypted-key
  scope notes).
- `docs/guide/streaming-detection.md`: the HTTP API section calls
  out the TLS-or-`--allow-plaintext` startup check; the production
  checklist now reflects the in-process TLS option.
- `docs/guide/otlp-integration.md`: replaces the reverse-proxy-only
  TLS section with the new `--tls-*` flags plus Alloy, Vector,
  Fluent Bit, and OpenTelemetry Collector mTLS recipes.
- `docs/getting-started/quick-start.md`: short note about the
  loopback bypass and how to enable TLS for a public bind.
- `docs/reference/feature-flags.md`: new `daemon-tls` row in the
  feature matrix and CI matrix.
- `docs/reference/metrics.md`: documents
  `rsigma_tls_certificate_expiry_seconds` and
  `rsigma_tls_active_connections`, plus two recommended alerts.
- `crates/rsigma-cli/README.md`: TLS flag table and mTLS example
  block.

Verified with `mkdocs build --strict`.

Adds the Unreleased entry covering the new `daemon-tls` feature: flag
table, plaintext refusal policy, unified serving path, SIGHUP-triggered
cert hot-reload, the two new Prometheus metrics, and pointers to the
updated reference / guide / quick-start pages.

…types

`rustls-pemfile` was flagged as unmaintained by RUSTSEC-2025-0134 on
2025-11-28 (repo archived in August 2025). The advisory itself notes
that the crate is "in fact a thin wrapper around the same code used
in rustls-pki-types" and points consumers at the `PemObject` trait.

Migrates the daemon TLS module to call `CertificateDer::pem_file_iter`
and `PrivateKeyDer::from_pem_file` directly, bumps the
`rustls-pki-types` minimum to 1.9 (where the `pem` module landed),
and removes `rustls-pemfile` from both the `daemon-tls` feature
dependency list and the dev-dependencies (the tests already used
`rustls-pki-types::pem::PemObject`).

`cargo tree -i rustls-pemfile` now reports the crate is no longer in
the dependency graph; `cargo audit` no longer reports
RUSTSEC-2025-0134.

`TlsState::cli` and `TlsState::reload` are only exercised from the
SIGHUP listener in `daemon::reload`, which lives behind `#[cfg(unix)]`.
On Windows the non-Unix stub variant of `sighup_listener` accepts the
TLS state but never invokes `reload`, so the dead-code lint flags both
the field and the method.

Scopes the existing fix-by-design (keep the public type
platform-agnostic so the rest of the daemon code doesn't need extra
cfg gates) with a targeted `#[cfg_attr(not(unix), allow(dead_code))]`
on just the two items the lint complains about, leaving the rest of
the struct under normal dead-code analysis.
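
The targeted attribute could look roughly like this. The struct below is a simplified, hypothetical stand-in for `TlsState` (the real fields and signatures are not shown in this PR text), and `on_sighup` is an invented name for the platform-specific caller.

```rust
/// Simplified stand-in for the daemon's `TlsState` (fields are assumptions).
pub struct TlsState {
    /// Only read by `reload`, so it is dead wherever `reload` is dead.
    #[cfg_attr(not(unix), allow(dead_code))]
    cli: String,
    generation: u64,
}

impl TlsState {
    pub fn new(cli: String) -> Self {
        Self { cli, generation: 0 }
    }

    pub fn generation(&self) -> u64 {
        self.generation
    }

    /// Called only from the Unix signal path in this sketch.
    #[cfg_attr(not(unix), allow(dead_code))]
    fn reload(&mut self) -> &str {
        self.generation += 1;
        &self.cli
    }
}

/// Unix: the signal path actually invokes `reload`.
#[cfg(unix)]
pub fn on_sighup(state: &mut TlsState) -> String {
    state.reload().to_string()
}

/// Non-Unix stub: accepts the state but never reloads it.
#[cfg(not(unix))]
pub fn on_sighup(_state: &mut TlsState) -> String {
    String::new()
}
```

Scoping the `allow` to the two items keeps every other field and method under the normal lint on all platforms.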

Verified with `cargo clippy -p rsigma --all-features --all-targets
--target x86_64-pc-windows-gnu -- -D warnings`.

Drops the platform-specific TLS reload path that previously lived in the
SIGHUP listener and folds it into the same reload task that already
handles rules + enrichers. Every reload trigger (file watcher, SIGHUP,
`POST /api/v1/reload`) now funnels through one debounced task, so:

- Windows daemons can rotate certificates via the HTTP endpoint, with
  the exact same semantics as a Unix `kill -HUP`. This was the
  motivating gap; previously the `cfg(unix)` SIGHUP path was the only
  way to trigger `TlsState::reload`.
- The dead-code workaround on `TlsState::cli` and `TlsState::reload`
  goes away: both are now exercised cross-platform.
- A failing reload bumps `rsigma_reloads_failed_total`, matching the
  enrichers contract: the previous certificate stays active so a typo
  in the cert path cannot black-hole the listener.

The SIGHUP handler now just routes the signal into `reload_tx` like the
file watcher and HTTP handler do, with no TLS-specific knowledge.
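
The funnel can be sketched with a blocking standard-library channel. This is a simplified stand-in under stated assumptions: the daemon's real task is presumably asynchronous, and `ReloadTrigger`, `next_reload`, and the quiet-period parameter are hypothetical names chosen for illustration.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// The three trigger sources named in the commit message.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ReloadTrigger {
    FileWatcher,
    Sighup,
    HttpPost,
}

/// Block until a trigger arrives, then keep absorbing triggers until the
/// channel has been quiet for `quiet`. Returns the coalesced batch, or `None`
/// once every sender has been dropped and nothing is pending.
pub fn next_reload(
    rx: &Receiver<ReloadTrigger>,
    quiet: Duration,
) -> Option<Vec<ReloadTrigger>> {
    let first = rx.recv().ok()?;
    let mut batch = vec![first];
    loop {
        match rx.recv_timeout(quiet) {
            Ok(trigger) => batch.push(trigger),
            // Quiet period elapsed, or no senders remain: run one reload.
            Err(RecvTimeoutError::Timeout) | Err(RecvTimeoutError::Disconnected) => {
                return Some(batch);
            }
        }
    }
}
```

A caller would loop on `next_reload` and perform a single reload of rules, enrichers, and TLS material per returned batch, regardless of which source fired.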

Two new integration tests in `cli_daemon_tls`:

- `http_reload_endpoint_rotates_tls_certificate` mints a 30-day leaf,
  overwrites the cert/key files in place with a 365-day leaf signed by
  the same CA, POSTs `/api/v1/reload`, and asserts that
  `rsigma_tls_certificate_expiry_seconds` jumps from ~30 to ~365 days
  via `/metrics` polling.
- `http_reload_with_invalid_cert_keeps_previous_one` corrupts the cert
  file on disk and confirms a subsequent HTTPS GET still succeeds
  because the previous chain stayed live.

`time = "0.3"` is added to dev-dependencies so the reload test can set
explicit `not_before` / `not_after` on the rcgen-minted certificates.

Verified with:
- `cargo test -p rsigma --no-default-features --features daemon-tls
  --test cli_daemon_tls -- --test-threads=1` (12 passed)
- `cargo clippy -p rsigma --all-features --all-targets -- -D warnings`
- `cargo clippy -p rsigma --all-features --all-targets
  --target x86_64-pc-windows-gnu -- -D warnings`

Updates the TLS docs (CLI reference, security reference, CHANGELOG)
to reflect that cert rotation no longer requires SIGHUP. Every reload
trigger (HTTP POST, SIGHUP on Unix, file watcher) funnels through the
same debounced task, so the HTTP endpoint works as the Windows-friendly
equivalent of `kill -HUP`. Adds an explicit Windows-compatible rotation
recipe to the mutual-TLS example, and points the failed-reload note at
`rsigma_reloads_failed_total` so operators can wire an alert.

Top-level README:
- Adds in-process TLS termination (with cross-platform cert hot-reload)
  to the feature list so it shows up in the at-a-glance bullet section.
- New "TLS Termination" subsection alongside NATS / OTLP with a plain
  HTTPS and an mTLS recipe, plus a note about the cert hot-reload path
  and the two new Prometheus gauges.

CLI README:
- After the TLS flag table, documents the plaintext refusal policy,
  the shared HTTP/1 + HTTP/2 socket via ALPN, the aws-lc-rs provider,
  and the cross-platform hot-reload story.
- Updates the "Hot-reload triggers" section to call out that
  `POST /api/v1/reload` is the cross-platform recommendation
  (especially for Windows where SIGHUP does not exist) and that every
  trigger now also rotates the TLS certificate alongside rules,
  pipelines, and enrichers.
- Adds the two new TLS metrics (`rsigma_tls_certificate_expiry_seconds`
  and `rsigma_tls_active_connections`) to the Prometheus metrics
  table.
- Broadens the `/api/v1/reload` endpoint description from "manual
  rule reload" to the full set of components it now rotates.

The runtime crate README is intentionally unchanged: the `daemon-tls`
feature lives on `rsigma-cli`, and the runtime crate has no awareness
of TLS configuration.

Verified with `mkdocs build --strict`.

`public_bind_with_allow_plaintext_starts` failed on Windows because
the daemon binds to `0.0.0.0:0` and the test then tries to TCP-connect
back to the captured `0.0.0.0:<port>`. Linux and macOS silently route
that to loopback; Windows returns `WSAEADDRNOTAVAIL`, so the readiness
probe (and any subsequent `http_get` via `daemon.url(...)`) times out.

Adds a small `rewrite_wildcard_to_loopback` helper in the shared test
harness that converts `0.0.0.0:PORT` -> `127.0.0.1:PORT` and
`[::]:PORT` -> `[::1]:PORT` before storing the address. The daemon is
listening on every interface anyway, so loopback is always reachable.
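
One way such a helper could be written with the standard library is shown below. Only the name `rewrite_wildcard_to_loopback` comes from the commit message; taking and returning a `SocketAddr` is an assumption, since the actual signature in the test harness is not shown here.

```rust
use std::net::{IpAddr, Ipv4Addr, Ipv6Addr, SocketAddr};

/// Map a wildcard bind address to the matching loopback address so a test
/// client can connect back to it; every other address is returned unchanged.
pub fn rewrite_wildcard_to_loopback(addr: SocketAddr) -> SocketAddr {
    let ip = match addr.ip() {
        // 0.0.0.0 -> 127.0.0.1
        IpAddr::V4(v4) if v4.is_unspecified() => IpAddr::V4(Ipv4Addr::LOCALHOST),
        // [::] -> [::1]
        IpAddr::V6(v6) if v6.is_unspecified() => IpAddr::V6(Ipv6Addr::LOCALHOST),
        other => other,
    };
    SocketAddr::new(ip, addr.port())
}
```

The port is preserved, so the rewritten address still points at the ephemeral port the daemon reported.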

No production code changes.

Verified with:
- `cargo test -p rsigma --no-default-features --features daemon-tls
  --test cli_daemon_tls -- --test-threads=1` (12 passed)
- `cargo clippy -p rsigma --all-features --all-targets
  --target x86_64-pc-windows-gnu -- -D warnings`

@mostafa mostafa merged commit 5f6ba7d into main May 22, 2026
14 checks passed
@mostafa mostafa deleted the feat/daemon-tls branch May 22, 2026 18:08

Development

Successfully merging this pull request may close these issues.

daemon: TLS for HTTP REST, OTLP/HTTP, and OTLP/gRPC endpoints