feat(daemon): terminate TLS on the API listener (HTTP, OTLP, gRPC) (#128) #146
Merged
Introduces a new optional `daemon-tls` Cargo feature on `rsigma-cli` that pulls in `rustls` (with the `aws-lc-rs` provider), `tokio-rustls`, `rustls-pemfile`, `rustls-pki-types`, `x509-parser`, and `hyper`/`hyper-util` for the upcoming server-side TLS termination support for the daemon HTTP REST, OTLP/HTTP, and OTLP/gRPC endpoints. Dev-only `rcgen` is added so the planned `cli_daemon_tls.rs` integration test can mint a self-signed CA on the fly.

Implements part of #128.
Adds server-side TLS termination on the daemon's `--api-addr` listener for the Axum HTTP REST API, the Prometheus `/metrics` endpoint, and both OTLP/HTTP and OTLP/gRPC transports. All three protocols already share one socket; this change wraps that socket in `tokio-rustls` (with the `aws-lc-rs` provider) and unifies serving through `axum::serve` plus `tonic::service::Routes::into_axum_router`, so the same router dispatches both `/v1/logs` over HTTP/1.1 + HTTP/2 and gRPC via ALPN.

New flags on `engine daemon` (all gated behind the `daemon-tls` feature):

- `--tls-cert` / `--tls-key` to terminate TLS in-process.
- `--tls-client-ca` to enable mTLS verification of inbound clients.
- `--tls-min-version` (`1.2` or `1.3`, default `1.3`).
- `--tls-key-password` / `RSIGMA_TLS_KEY_PASSWORD`, reserved for a future release; currently rejected with a clear error pointing at `openssl rsa` to decrypt offline.
- `--allow-plaintext` to keep the historical behavior on non-loopback addresses. Loopback (`127.0.0.0/8`, `::1`) always allows plaintext. Public binds without TLS or this flag now refuse to start.

Hot-reload of the certificate and key is wired into the existing SIGHUP path via `Arc<ArcSwap<rustls::ServerConfig>>`, so cert rotation takes effect on the next handshake without dropping inflight connections.

Observability gains two Prometheus metrics:

- `rsigma_tls_certificate_expiry_seconds` (seconds until `not_after`)
- `rsigma_tls_active_connections`

The daemon logs a single WARN at startup and after every reload if the active cert expires within 30 days.

Implements #128.
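The plaintext refusal policy in the flag list above can be sketched with the standard library alone. This is a minimal illustration, not the daemon's actual code: the function name `check_bind_policy` and its parameters are hypothetical, and it assumes "loopback" means exactly what `IpAddr::is_loopback` reports (`127.0.0.0/8` and `::1`).

```rust
use std::net::SocketAddr;

/// Hypothetical startup check: decide whether a listener on `addr`
/// may start, given whether TLS is configured and whether the
/// operator passed `--allow-plaintext`.
fn check_bind_policy(
    addr: SocketAddr,
    tls_configured: bool,
    allow_plaintext: bool,
) -> Result<(), String> {
    // TLS on the socket satisfies the policy for any bind address.
    if tls_configured {
        return Ok(());
    }
    // Loopback (127.0.0.0/8, ::1) always allows plaintext.
    if addr.ip().is_loopback() {
        return Ok(());
    }
    // Explicit opt-in keeps the historical plaintext behavior.
    if allow_plaintext {
        return Ok(());
    }
    Err(format!(
        "refusing to serve plaintext on non-loopback address {addr}; \
         configure TLS or pass --allow-plaintext"
    ))
}
```

Under these assumptions, `0.0.0.0:9090` is rejected unless TLS or the opt-in flag is present, while `127.0.0.1:9090` and `[::1]:9090` are always accepted.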
Adds 10 E2E tests covering the new server-side TLS surface:

- Plaintext refusal: `0.0.0.0` bind without TLS or `--allow-plaintext` fails to start with a clear stderr message; loopback keeps working; `--allow-plaintext` flips the opt-in.
- HTTPS happy path: `/healthz` and `POST /api/v1/events` succeed over a TLS connection negotiated against a self-signed CA minted with `rcgen` at test setup.
- mTLS: a client that does not present a certificate is rejected at the TLS handshake; a client signed by the same CA succeeds.
- Metrics: the `/metrics` endpoint exposes both `rsigma_tls_certificate_expiry_seconds` and `rsigma_tls_active_connections` after TLS is configured.
- Misconfiguration: a missing cert file surfaces a clear startup error; `--tls-key-password` is rejected with a hint pointing at `openssl rsa` for offline decryption.

The TLS module also gains a small fix: `WebPkiClientVerifier` is now built with an explicit aws-lc-rs `CryptoProvider`, which avoids a panic when both `ring` and `aws-lc-rs` end up in the dependency tree (via reqwest, jsonschema, etc).

Companion `spawn_expect_failure` helper in `tests/common/mod.rs` lets other tests assert that a misconfigured daemon refuses to start with a specific stderr line.

Implements the verification matrix from #128.
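A helper in the spirit of `spawn_expect_failure` might look roughly like the following. The signature and error strings are assumptions; the real helper in the test harness is not shown in this PR text. The sketch also assumes the child exits on its own, so a production version would likely need a timeout for the case where the daemon starts successfully.

```rust
use std::process::{Command, Stdio};

/// Hypothetical stand-in for the `spawn_expect_failure` test helper:
/// run `program` with `args`, require a non-zero exit status, and
/// require that some stderr line contains `expected_stderr`.
fn spawn_expect_failure(
    program: &str,
    args: &[&str],
    expected_stderr: &str,
) -> Result<(), String> {
    // `output()` waits for exit and captures stdout and stderr.
    let output = Command::new(program)
        .args(args)
        .stdin(Stdio::null())
        .output()
        .map_err(|e| format!("failed to spawn {program}: {e}"))?;

    if output.status.success() {
        return Err(format!("{program} exited successfully but was expected to fail"));
    }

    let stderr = String::from_utf8_lossy(&output.stderr);
    if stderr.lines().any(|line| line.contains(expected_stderr)) {
        Ok(())
    } else {
        Err(format!("stderr did not contain {expected_stderr:?}; got: {stderr}"))
    }
}
```

A test would pass the daemon binary, a misconfigured flag set, and the stderr fragment it expects to see.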
Adds the user-facing surface for the new `daemon-tls` feature:

- `docs/cli/engine/daemon.md`: full TLS flag table and a mutual-TLS example.
- `docs/reference/security.md`: replaces the "future work" pointer to #128 with the TLS termination section (in-process termination, mTLS, hot-reload, expiry observability, ACME and encrypted-key scope notes).
- `docs/guide/streaming-detection.md`: the HTTP API section calls out the TLS-or-`--allow-plaintext` startup check; the production checklist now reflects the in-process TLS option.
- `docs/guide/otlp-integration.md`: replaces the reverse-proxy-only TLS section with the new `--tls-*` flags plus Alloy, Vector, Fluent Bit, and OpenTelemetry Collector mTLS recipes.
- `docs/getting-started/quick-start.md`: short note about the loopback bypass and how to enable TLS for a public bind.
- `docs/reference/feature-flags.md`: new `daemon-tls` row in the feature matrix and CI matrix.
- `docs/reference/metrics.md`: documents `rsigma_tls_certificate_expiry_seconds` and `rsigma_tls_active_connections`, plus two recommended alerts.
- `crates/rsigma-cli/README.md`: TLS flag table and mTLS example block.

Verified with `mkdocs build --strict`.
Adds the Unreleased entry covering the new `daemon-tls` feature: flag table, plaintext refusal policy, unified serving path, SIGHUP-triggered cert hot-reload, the two new Prometheus metrics, and pointers to the updated reference / guide / quick-start pages.
…types `rustls-pemfile` was flagged as unmaintained by RUSTSEC-2025-0134 on 2025-11-28 (repo archived in August 2025). The advisory itself notes that the crate is "in fact a thin wrapper around the same code used in rustls-pki-types" and points consumers at the `PemObject` trait. Migrates the daemon TLS module to call `CertificateDer::pem_file_iter` and `PrivateKeyDer::from_pem_file` directly, bumps the `rustls-pki-types` minimum to 1.9 (where the `pem` module landed), and removes `rustls-pemfile` from both the `daemon-tls` feature dependency list and the dev-dependencies (the tests already used `rustls-pki-types::pem::PemObject`). `cargo tree -i rustls-pemfile` now reports the crate is no longer in the dependency graph; `cargo audit` no longer reports RUSTSEC-2025-0134.
`TlsState::cli` and `TlsState::reload` are only exercised from the SIGHUP listener in `daemon::reload`, which lives behind `#[cfg(unix)]`. On Windows the non-Unix stub variant of `sighup_listener` accepts the TLS state but never invokes `reload`, so the dead-code lint flags both the field and the method. Scopes the existing fix-by-design (keep the public type platform-agnostic so the rest of the daemon code doesn't need extra cfg gates) with a targeted `#[cfg_attr(not(unix), allow(dead_code))]` on just the two items the lint complains about, leaving the rest of the struct under normal dead-code analysis. Verified with `cargo clippy -p rsigma --all-features --all-targets --target x86_64-pc-windows-gnu -- -D warnings`.
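The targeted lint suppression described above can be shown in miniature. The type and function names follow the commit message, but the fields' types and all bodies here are invented for illustration.

```rust
/// Illustrative stand-in for the daemon's TLS state; bodies are invented.
pub struct TlsState {
    /// Only read by `reload`, so it is unused wherever `reload` is unused.
    #[cfg_attr(not(unix), allow(dead_code))]
    cli: String,
    reloads: u32,
}

impl TlsState {
    pub fn new(cli: String) -> Self {
        Self { cli, reloads: 0 }
    }

    /// Only called from the Unix SIGHUP path below.
    #[cfg_attr(not(unix), allow(dead_code))]
    fn reload(&mut self) -> usize {
        self.reloads += 1;
        self.cli.len()
    }

    pub fn reloads(&self) -> u32 {
        self.reloads
    }
}

/// Unix variant: actually triggers a reload.
#[cfg(unix)]
pub fn sighup_listener(state: &mut TlsState) {
    state.reload();
}

/// Non-Unix stub: accepts the state but never calls `reload`.
#[cfg(not(unix))]
pub fn sighup_listener(_state: &mut TlsState) {}
```

Scoping the attribute to the two affected items keeps every other field and method under normal dead-code analysis on all platforms.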
Drops the platform-specific TLS reload path that previously lived in the SIGHUP listener and folds it into the same reload task that already handles rules + enrichers. Every reload trigger (file watcher, SIGHUP, `POST /api/v1/reload`) now funnels through one debounced task, so:

- Windows daemons can rotate certificates via the HTTP endpoint, with the exact same semantics as a Unix `kill -HUP`. This was the motivating gap; previously the `cfg(unix)` SIGHUP path was the only way to trigger `TlsState::reload`.
- The dead-code workaround on `TlsState::cli` and `TlsState::reload` goes away: both are now exercised cross-platform.
- A failing reload bumps `rsigma_reloads_failed_total`, matching the enrichers contract: the previous certificate stays active so a typo in the cert path cannot black-hole the listener.

The SIGHUP handler now just routes the signal into `reload_tx` like the file watcher and HTTP handler do, with no TLS-specific knowledge.

Two new integration tests in `cli_daemon_tls`:

- `http_reload_endpoint_rotates_tls_certificate` mints a 30-day leaf, overwrites the cert/key files in place with a 365-day leaf signed by the same CA, POSTs `/api/v1/reload`, and asserts that `rsigma_tls_certificate_expiry_seconds` jumps from ~30 to ~365 days via `/metrics` polling.
- `http_reload_with_invalid_cert_keeps_previous_one` corrupts the cert file on disk and confirms a subsequent HTTPS GET still succeeds because the previous chain stayed live.

`time = "0.3"` is added to dev-dependencies so the reload test can set explicit `not_before` / `not_after` on the rcgen-minted certificates.

Verified with:

- `cargo test -p rsigma --no-default-features --features daemon-tls --test cli_daemon_tls -- --test-threads=1` (12 passed)
- `cargo clippy -p rsigma --all-features --all-targets -- -D warnings`
- `cargo clippy -p rsigma --all-features --all-targets --target x86_64-pc-windows-gnu -- -D warnings`
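The failed-reload contract above (previous certificate stays active, failure counter bumps) can be illustrated with a small sketch. To stay dependency-free it swaps in `std::sync::RwLock<Arc<_>>` where the daemon uses `ArcSwap`, and uses a placeholder `ServerConfig` struct instead of `rustls::ServerConfig`; the method names and the `cert_label` field are hypothetical.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, RwLock};

/// Placeholder for `rustls::ServerConfig`; only records which cert is loaded.
#[derive(Debug, PartialEq)]
struct ServerConfig {
    cert_label: String,
}

struct TlsState {
    /// Stand-in for `Arc<ArcSwap<ServerConfig>>`.
    active: RwLock<Arc<ServerConfig>>,
    /// Stand-in for the `rsigma_reloads_failed_total` counter.
    reloads_failed: AtomicU64,
}

impl TlsState {
    fn new(initial: ServerConfig) -> Self {
        Self {
            active: RwLock::new(Arc::new(initial)),
            reloads_failed: AtomicU64::new(0),
        }
    }

    /// What each new handshake would call: clone the current `Arc`.
    fn current(&self) -> Arc<ServerConfig> {
        let guard = self.active.read().unwrap();
        Arc::clone(&*guard)
    }

    /// Swap in a freshly loaded config, or keep the old one on failure.
    fn reload(&self, loaded: Result<ServerConfig, String>) -> Result<(), String> {
        match loaded {
            Ok(cfg) => {
                *self.active.write().unwrap() = Arc::new(cfg);
                Ok(())
            }
            Err(e) => {
                // Previous config stays active; only the counter changes.
                self.reloads_failed.fetch_add(1, Ordering::Relaxed);
                Err(e)
            }
        }
    }

    fn failed_reloads(&self) -> u64 {
        self.reloads_failed.load(Ordering::Relaxed)
    }
}
```

Connections that already hold an `Arc` from `current()` keep using the config they started with, which is why a swap does not disturb inflight connections in this model.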
Updates the TLS docs (CLI reference, security reference, CHANGELOG) to reflect that cert rotation no longer requires SIGHUP. Every reload trigger (HTTP POST, SIGHUP on Unix, file watcher) funnels through the same debounced task, so the HTTP endpoint works as the Windows-friendly equivalent of `kill -HUP`. Adds an explicit Windows-compatible rotation recipe to the mutual-TLS example, and points the failed-reload note at `rsigma_reloads_failed_total` so operators can wire an alert.
Top-level README:

- Adds in-process TLS termination (with cross-platform cert hot-reload) to the feature list so it shows up in the at-a-glance bullet section.
- New "TLS Termination" subsection alongside NATS / OTLP with a plain HTTPS and an mTLS recipe, plus a note about the cert hot-reload path and the two new Prometheus gauges.

CLI README:

- After the TLS flag table, documents the plaintext refusal policy, the shared HTTP/1 + HTTP/2 socket via ALPN, the aws-lc-rs provider, and the cross-platform hot-reload story.
- Updates the "Hot-reload triggers" section to call out that `POST /api/v1/reload` is the cross-platform recommendation (especially for Windows where SIGHUP does not exist) and that every trigger now also rotates the TLS certificate alongside rules, pipelines, and enrichers.
- Adds the two new TLS metrics (`rsigma_tls_certificate_expiry_seconds` and `rsigma_tls_active_connections`) to the Prometheus metrics table.
- Broadens the `/api/v1/reload` endpoint description from "manual rule reload" to the full set of components it now rotates.

The runtime crate README is intentionally unchanged: the `daemon-tls` feature lives on `rsigma-cli`, and the runtime crate has no awareness of TLS configuration.

Verified with `mkdocs build --strict`.
`public_bind_with_allow_plaintext_starts` failed on Windows because the daemon binds to `0.0.0.0:0` and the test then tries to TCP-connect back to the captured `0.0.0.0:<port>`. Linux and macOS silently route that to loopback; Windows returns `WSAEADDRNOTAVAIL`, so the readiness probe (and any subsequent `http_get` via `daemon.url(...)`) times out.

Adds a small `rewrite_wildcard_to_loopback` helper in the shared test harness that converts `0.0.0.0:PORT` -> `127.0.0.1:PORT` and `[::]:PORT` -> `[::1]:PORT` before storing the address. The daemon is listening on every interface anyway, so loopback is always reachable. No production code changes.

Verified with:

- `cargo test -p rsigma --no-default-features --features daemon-tls --test cli_daemon_tls -- --test-threads=1` (12 passed)
- `cargo clippy -p rsigma --all-features --all-targets --target x86_64-pc-windows-gnu -- -D warnings`
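The wildcard rewrite described above is small enough to sketch in full. The helper name comes from the commit message; the `SocketAddr`-in, `SocketAddr`-out signature is an assumption, since the real helper may operate on strings instead.

```rust
use std::net::{IpAddr, Ipv4Addr, Ipv6Addr, SocketAddr};

/// Map an unspecified ("wildcard") bind address to the loopback
/// address of the same family, keeping the port. Any other address
/// is returned unchanged.
fn rewrite_wildcard_to_loopback(addr: SocketAddr) -> SocketAddr {
    let ip = match addr.ip() {
        // 0.0.0.0 -> 127.0.0.1
        IpAddr::V4(v4) if v4.is_unspecified() => IpAddr::V4(Ipv4Addr::LOCALHOST),
        // [::] -> [::1]
        IpAddr::V6(v6) if v6.is_unspecified() => IpAddr::V6(Ipv6Addr::LOCALHOST),
        other => other,
    };
    SocketAddr::new(ip, addr.port())
}
```

Because the daemon is bound to every interface, connecting to the loopback address of the same family reaches the same listener on all three platforms.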
Closes #128. Adds in-process server-side TLS termination for the `engine daemon` API listener.

Surface

All three protocols that already share `--api-addr` (Axum HTTP REST, OTLP/HTTP, OTLP/gRPC) now negotiate TLS on one socket via `tokio-rustls` with the `aws-lc-rs` provider. ALPN advertises both `h2` and `http/1.1`, so REST and gRPC keep sharing the listener.

Six new flags, all gated behind the optional `daemon-tls` Cargo feature:

- `--tls-cert` / `--tls-key`: terminate TLS in-process.
- `--tls-client-ca`: enable mTLS verification of inbound clients.
- `--tls-min-version` (`1.2` / `1.3`, default `1.3`).
- `--tls-key-password` / `RSIGMA_TLS_KEY_PASSWORD`: reserved for a future release; currently rejected with an `openssl rsa` hint for offline decryption.
- `--allow-plaintext`: opt in to plaintext on a non-loopback `--api-addr`. Loopback always allows plaintext.

When `daemon-tls` is built in, the daemon refuses to start on any non-loopback address unless TLS or `--allow-plaintext` is supplied.

Cross-platform cert hot-reload
The reload path funnels through the daemon's central debounced reload task. `POST /api/v1/reload` (works on every platform, including Windows), `SIGHUP` (Unix), and YAML file-watcher events all atomically swap the active `rustls::ServerConfig` via `Arc<ArcSwap<…>>` without dropping inflight connections. Failed reloads keep the previous cert active and bump `rsigma_reloads_failed_total`.

Observability
Two new Prometheus gauges, both registered as soon as TLS is configured:

- `rsigma_tls_certificate_expiry_seconds` (signed; negative once expired)
- `rsigma_tls_active_connections`

A single WARN is logged at startup and after every reload when the active cert expires within 30 days.
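The arithmetic behind the expiry gauge and the 30-day warning can be sketched as follows. The function names are hypothetical, timestamps are assumed to be Unix seconds, and the choice to warn for already-expired certificates and to treat exactly 30 days as inside the window is an assumption rather than something the PR states.

```rust
/// 30 days expressed in seconds.
const WARN_WINDOW_SECS: i64 = 30 * 24 * 60 * 60;

/// Signed seconds until the certificate's `not_after`; negative once expired.
fn expiry_seconds(not_after_unix: i64, now_unix: i64) -> i64 {
    not_after_unix - now_unix
}

/// Whether the startup/reload WARN should fire for this gauge value.
fn should_warn(expiry_secs: i64) -> bool {
    expiry_secs <= WARN_WINDOW_SECS
}
```

Keeping the gauge signed lets an alert distinguish "expires soon" (small positive) from "already expired" (negative) without a second metric.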
Out of scope
ACME / Let's Encrypt automation and encrypted private keys. Operators rotate cert files (cert-manager, certbot, Vault PKI, …) and trigger a reload.
Tests (12, all in `cli_daemon_tls.rs`)

- Plaintext refusal on `0.0.0.0`, loopback bypass, `--allow-plaintext` opt-in.
- HTTPS happy path: `/healthz` and `POST /api/v1/events`.
- `/metrics` exposes the two new gauges once TLS is configured.
- A missing cert file and `--tls-key-password` produce clear startup errors.
- `POST /api/v1/reload` flips `rsigma_tls_certificate_expiry_seconds` from ~30 to ~365 days.

All certs minted at test time with `rcgen`.

Verification
- `cargo test -p rsigma --no-default-features --features daemon-tls --test cli_daemon_tls -- --test-threads=1` (12 passed)
- `cargo clippy -p rsigma --all-features --all-targets -- -D warnings` (macOS + `x86_64-pc-windows-gnu` cross-target)
- `cargo fmt --all -- --check`
- `mkdocs build --strict`
- `cargo audit` confirms the swap from `rustls-pemfile` (RUSTSEC-2025-0134, unmaintained) to `rustls-pki-types` 1.9 `PemObject`

Docs
- `docs/cli/engine/daemon.md`, `docs/reference/security.md`: TLS flag table + cross-platform hot-reload section.
- `docs/guide/otlp-integration.md`: Alloy, Vector, Fluent Bit, OpenTelemetry Collector mTLS recipes.
- `docs/guide/streaming-detection.md`, `docs/getting-started/quick-start.md`: the plaintext refusal startup check.
- `docs/reference/feature-flags.md`: new `daemon-tls` row.
- `docs/reference/metrics.md`: documents the two TLS metrics + two recommended alerts.
- `README.md` and `crates/rsigma-cli/README.md`: TLS feature bullet, recipes, and metric rows.
- `CHANGELOG.md`: Unreleased entry.