Conversation
PR B item 1. Lets a late corroborator finalize a signal group on its own path by re-resolving the playbook-specific override at recompute time. Eliminates the v0.16.0 dependency on 'another primary event must fire within the window' for late corroborators to take effect. Schema: - migrations/012_signal_groups_playbook.sql adds nullable signal_groups.playbook_name. No SQL backfill (match_json doesn't carry a playbook reference); the daemon backfills on the next primary event for each group via a COALESCE update. Code: - SignalGroup gains pub playbook_name: Option<String>; serde-default None for forward/backward compatibility. - handle_ban writes playbook_name = matching_playbook.map(|p| p.name) on group create AND backfills it on existing groups when the column is NULL. Update path uses COALESCE to never wipe a previously-set value. - recompute_group_aggregates re-resolves the playbook by name from the live state.playbooks snapshot, applies the override via check_corroboration_with_primary, and is now allowed to flip corroboration_met=true. Logs a structured 'awaiting next primary' info event when the flip happens. - Conservative fallback preserved: NULL playbook_name OR resolved playbook missing from the live config => keep previous corroboration_met (no flip from corroborator path). Matches v0.16.0 behavior on stale state. - Mitigation actuation deliberately stays single-sourced through handle_ban; corroborator-side recompute updates state but does not fire the mitigation. The next primary event picks up the flag and triggers normally. Repository changes: - All 7 SignalGroup-returning queries (insert, get, find, list, find_open, find_expired, find_open_by_dimensions) updated to project playbook_name. - update_signal_group uses COALESCE($7, playbook_name). - SignalGroupRow + From impl carry the column. Tests: - test_late_corroborator_finalizes_with_playbook_override: primary event below override threshold + corroborator → corroboration_met flips true. - test_late_corroborator_skips_when_playbook_name_is_stale: stored playbook removed from config → conservative no-flip behavior. cargo fmt + cargo clippy --all-targets -- -D warnings clean. 222 unit + 123 integration + 16 postgres tests green.
PR B items 2 + 5. Restores attribution on the expired counter and adds
a new gauge for runaway-cache alerting.
Repository:
- CorroboratorSweepStats.unattached_expired is now
HashMap<String, u64> (per-source) instead of a flat u64. Helper
CorroboratorSweepStats::unattached_total() preserves the old summing
use case for tests/log lines.
- Postgres delete_expired_corroborating_signals uses a single
DELETE-RETURNING with GROUP BY source, so attribution is collected
in the same query that performs the delete (no read-then-delete
race window).
- Mock implementation aggregates by source while retain()-ing.
- New trait method count_cached_corroborators_by_source(now) returns
Vec<(source, count)> for the gauge refresh; Postgres GROUPs in SQL,
Mock buckets in-memory.
Metrics:
- CORROBORATOR_EXPIRED_TOTAL regains its &['source'] label set.
Operator note: this is a label change in CHANGELOG; existing
PromQL queries against the v0.16.0 unlabelled counter must add a
sum() if they want the previous shape.
- New gauge prefixd_corroborator_cache_size{source} registered and
Lazy::force-d in init_metrics.
Scheduler:
- ReconciliationLoop carries last_cache_sources (tokio Mutex<HashSet>)
to zero-out gauges for sources that drained between ticks; otherwise
Prometheus would keep stale non-zero values forever.
- After each sweep, sets per-source CORROBORATOR_CACHE_SIZE from
count_cached_corroborators_by_source(now) and explicitly clears
sources that fell out of the live set.
Tests:
- test_expired_sweep_attributes_per_source: 2 expired-unattached from
router-cpu, 1 from pop-utilization → unattached_expired map keyed
correctly.
- test_count_cached_corroborators_by_source: covers the four
exclusion cases (attached, expired, unattached-unexpired, two
sources) and asserts the live counts.
- Existing test_expired_sweep_splits_attached_vs_unattached updated
to use the HashMap shape via .unattached_total() / .get().
cargo fmt + cargo clippy --all-targets clean. 222 unit + 125
integration + 16 postgres tests green.
…anel
PR B items 3 + 4. Surfaces the corroborator cache so L1 ops can debug
mode=corroborating sources that ingest but never attach.
Backend:
- list_cached_corroborators trait method gains an optional
source filter (Postgres pushes it down with a NULL-aware
predicate; Mock filters in-memory). Used to build a per-source
drilldown without overfetching.
- New admin-only handler GET /v1/signals/corroborator/cache
returns { now, total, by_source[], signals[] }. Listing limits
to unattached + unexpired and clamps the page size to [1, 1000].
Auth flows through require_role(Admin); both unauthenticated
and non-admin paths surface as 401 to match adjacent admin
endpoints.
- OpenAPI schemas added: CachedCorroboratorsResponse,
CachedCorroboratorBySource. Route registered in api_routes().
Frontend:
- New CachedCorroboratorsResponse / CachedCorroboratorSignal
types in lib/api.ts; getCachedCorroborators() helper.
- useCachedCorroborators hook with 30s SWR refresh.
- New CacheTab component on the Correlation page (added between
Groups and Config). Renders a 'cache size by source' badge
row plus a dense table of cached signals with relative
ingested/expires timestamps and dimension chips.
Tests:
- test_cache_listing_endpoint_returns_unattached_signals: posts
one unattached + one attached row, asserts the attached row
is filtered out and by_source counts match.
- All 87 frontend Vitest cases still green; 'bun run build'
compiles the new tab cleanly.
cargo fmt + cargo clippy --all-targets clean. 222 unit + 126
integration + 16 postgres tests green.
PR B item 4 + documentation pass for the full PR B set.
API breaking change (v0.17.0):
- CorroboratorResponse no longer carries a 'cached: bool' field; the
always-true v0.16.0 boolean added no information given that
status ∈ {attached, cached} is the canonical discriminator.
- prefixdctl's CorroboratorResponse mirror struct also drops 'cached'.
- frontend/lib/api.ts CorroboratorResponse type drops 'cached'.
- Doc comment on CorroboratorResponse.status now points at the
removal explicitly.
Docs:
- CHANGELOG: new 'Unreleased' entry covering all five PR B items as
a coordinated set, with operator-facing notes for each (label
change on _expired_total, new cache_size gauge, breaking field
removal).
- ROADMAP: 'Corroborating signals v2 (PR B)' block flipped to [x] with
per-item descriptions of what landed.
- ADR 021: 'Known limits / deferred to PR B' replaced with a 'PR B
addenda (v0.17.0)' section that documents the playbook-name
resolution path, the conservative fallback for stale playbooks, the
per-source attribution implementation, the gauge-zeroing trick the
scheduler uses, the new cache endpoint, and the field removal.
- docs/api.md: example response no longer shows 'cached'; new
callout explaining the breaking change. New '/v1/signals/corroborator/cache'
section documents query params, response shape, and pairing with the
cache_size gauge.
- docs/detectors/corroborating-signals.md: example response updated;
'Known limits' section reframed as residual operator notes since the
PR B items are no longer pending.
- AGENTS.md: endpoint catalog gains the cache listing line.
Verification: cargo fmt + cargo clippy --all-targets -- -D warnings
clean. 222 unit + 126 integration + 16 postgres + 87 frontend tests
green. 'bun run build' compiles.
CI's Security Audit job was failing on three new RustSec advisories that landed against our dependency tree since v0.16.0. All are fixable via lock or trivial dev-dependency bumps; no production consumer code changes required. Fixed advisories: - RUSTSEC-2026-0104 (rustls-webpki 0.103.12 reachable panic in CRL parsing): `cargo update -p rustls-webpki` -> 0.103.13. Lock-only. - RUSTSEC-2026-0066, RUSTSEC-2026-0112, RUSTSEC-2026-0113 (astral-tokio-tar 0.5.6 PAX header desync, validation gaps, unpack_in symlink chmod): bump testcontainers 0.26 -> 0.27 and testcontainers-modules 0.14 -> 0.15 (dev-deps only). Pulls in astral-tokio-tar 0.6.x via the new release. No test changes. Also rolled in safe patch/minor lock-only bumps for actively maintained primary deps: - axum 0.8.8 -> 0.8.9 (patch) - axum-macros 0.5.0 -> 0.5.1 (patch) - tokio 1.51.0 -> 1.52.1 (minor) - uuid 1.22.0 -> 1.23.1 (minor) - tokio-tungstenite/tungstenite 0.28 -> 0.29 (transitive) Skipped on this PR (saved for follow-up where they need real review): - password-hash 0.5 -> 0.6 (touches crypto code) - sha2 0.10 -> 0.11 (transitive break risk) - vite 7 -> 8 / vitejs-plugin-react 5 -> 6 (frontend major) - typescript 5 -> 6 (frontend major) - jsdom 28 -> 29 (frontend major) - lucide-react 0.575 -> 1.x (frontend major; icon API changes) CI ignore list pruned: removed RUSTSEC-2025-0111 (tokio-tar) and RUSTSEC-2026-0066 (astral-tokio-tar) since both are now resolved upstream. Kept RUSTSEC-2023-0071 (rsa via unused sqlx-mysql), RUSTSEC-2025-0134 (rustls-pemfile unmaintained, no fix yet), and RUSTSEC-2026-0097 (rand custom-logger unsoundness, no fix; we don't use a custom logger). Verification: cargo audit clean modulo the three documented ignores; cargo fmt + cargo clippy --all-targets -- -D warnings clean; 222 unit + 126 integration + 16 postgres tests green.
There was a problem hiding this comment.
Pull request overview
This PR ships the “Corroborating signals v2 (PR B)” follow-ups from ADR 021: playbook-aware corroborator finalization, per-source attribution/metrics for cache expiry and cache size, an admin cache listing endpoint with a new dashboard tab, and removal of the redundant CorroboratorResponse.cached field.
Changes:
- Persist
signal_groups.playbook_nameand re-resolve playbook overrides during corroborator-side recompute to allow late corroborators to flipcorroboration_met(without triggering mitigation actuation). - Restore per-source attribution on
prefixd_corroborator_expired_total{source}and addprefixd_corroborator_cache_size{source}gauge updated each reconcile tick (with stale-label zeroing). - Add admin-only
GET /v1/signals/corroborator/cacheendpoint + frontend “Cache” tab; removeCorroboratorResponse.cachedacross backend/CLI/frontend/docs.
Reviewed changes
Copilot reviewed 25 out of 26 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/integration.rs | Adds/updates integration tests for per-source expiry attribution, cache counts, cache listing endpoint, and playbook-aware late finalization. |
| src/scheduler/reconcile.rs | Updates corroborator cache sweep to increment expired counter per-source and refresh per-source cache-size gauge (including zeroing stale labels). |
| src/observability/metrics.rs | Adds {source} label to expired counter and introduces CORROBORATOR_CACHE_SIZE gauge vec. |
| src/db/traits.rs | Changes sweep stats to per-source map; extends cached-corroborator listing signature; adds per-source cache count method. |
| src/db/repository.rs | Plumbs playbook_name through signal_groups queries; implements per-source expiry delete attribution and per-source cache counts; adds optional source filter to cache listing query. |
| src/db/mod.rs | Registers migration 012 for signal_groups.playbook_name. |
| src/db/mock.rs | Updates mock repo to track per-source sweep stats, support source-filtered cache listings, and provide per-source cache counts. |
| src/correlation/engine.rs | Extends SignalGroup with playbook_name for later playbook override re-resolution. |
| src/bin/prefixdctl.rs | Removes CorroboratorResponse.cached from CLI response model. |
| src/api/routes.rs | Adds route for GET /v1/signals/corroborator/cache. |
| src/api/openapi.rs | Exposes new cache-listing handler and schemas in OpenAPI. |
| src/api/handlers.rs | Persists playbook name on groups; adds cache listing handler; updates corroborator response schema; makes recompute playbook-aware. |
| migrations/012_signal_groups_playbook.sql | Adds nullable signal_groups.playbook_name column and an index. |
| frontend/lib/api.ts | Removes cached from CorroboratorResponse; adds typed client for cached corroborators endpoint. |
| frontend/hooks/use-api.ts | Adds SWR hook for cached corroborators with refresh interval. |
| frontend/components/dashboard/correlation/cache-tab.tsx | New “Cache” tab UI showing per-source counts and cached signal table. |
| frontend/app/(dashboard)/correlation/page.tsx | Adds Cache tab to Correlation page navigation. |
| docs/detectors/corroborating-signals.md | Updates response docs to remove cached boolean; updates known-limits/operator notes for v0.17.0 behavior. |
| docs/api.md | Documents cached field removal and adds documentation for cache listing endpoint. |
| docs/adr/021-corroborating-signals.md | Marks PR B items as shipped and documents the v0.17.0 addenda behavior. |
| ROADMAP.md | Checks off PR B items as shipped in v0.17.0. |
| Cargo.toml | Bumps dev-deps testcontainers / testcontainers-modules. |
| Cargo.lock | Locks updated dependency graph for the dev-dep bumps. |
| CHANGELOG.md | Adds Unreleased entry describing PR B v2 feature set and breaking change. |
| AGENTS.md | Documents new admin cache endpoint and cache-size gauge. |
| .github/workflows/ci.yml | Updates RustSec ignore list (removing previously-ignored advisories now addressed by dep bumps). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+5489
to
+5496
| if let Err(status) = require_role(&state, &auth_session, auth_header, OperatorRole::Admin) { | ||
| let msg = if status == StatusCode::UNAUTHORIZED { | ||
| "authentication required" | ||
| } else { | ||
| "admin role required" | ||
| }; | ||
| return Err(AppError(PrefixdError::Unauthorized(msg.into()))); | ||
| } |
Comment on lines
+5498
to
+5513
| let now = chrono::Utc::now(); | ||
| let signals = state | ||
| .repo | ||
| .list_cached_corroborators(now, limit, query.source.as_deref()) | ||
| .await | ||
| .map_err(AppError)?; | ||
| let by_source_rows = state | ||
| .repo | ||
| .count_cached_corroborators_by_source(now) | ||
| .await | ||
| .map_err(AppError)?; | ||
| let total = by_source_rows.iter().map(|(_, n)| *n).sum(); | ||
| let by_source = by_source_rows | ||
| .into_iter() | ||
| .map(|(source, count)| CachedCorroboratorBySource { source, count }) | ||
| .collect(); |
Comment on lines
+14
to
+43
| -- Backfill: best-effort, copy the playbook name from any mitigation that | ||
| -- was triggered by this group. If no mitigation exists yet (e.g. the | ||
| -- group is below threshold and only has a single primary event), the | ||
| -- column stays NULL and the corroborator recompute falls back to the | ||
| -- v0.16.0 conservative behavior. The next primary event for the same | ||
| -- group will fill it in. | ||
|
|
||
| ALTER TABLE signal_groups | ||
| ADD COLUMN IF NOT EXISTS playbook_name TEXT; | ||
|
|
||
| WITH agg AS ( | ||
| SELECT m.signal_group_id AS group_id, | ||
| -- All mitigations from one signal group should share a playbook | ||
| -- (groups are keyed by vector and playbooks fan out by vector), | ||
| -- so MIN() is safe and deterministic for the rare race where a | ||
| -- vector matched two playbooks. | ||
| MIN(m.match_json) AS sample_match_json | ||
| FROM mitigations m | ||
| WHERE m.signal_group_id IS NOT NULL | ||
| GROUP BY m.signal_group_id | ||
| ) | ||
| UPDATE signal_groups sg | ||
| SET playbook_name = agg.group_id::text -- placeholder, replaced below | ||
| FROM agg | ||
| WHERE 1 = 0; | ||
| -- The actual backfill happens at runtime: the daemon will resolve and | ||
| -- populate `playbook_name` on the next primary event for any group that | ||
| -- still has it NULL. Doing this in SQL is brittle because match_json | ||
| -- doesn't carry a playbook reference; we'd be re-running the matcher. | ||
|
|
Comment on lines
+5867
to
+5871
| let group = repo.get_signal_group(group_id).await.unwrap().unwrap(); | ||
| assert!( | ||
| !group.corroboration_met, | ||
| "stale playbook_name should not allow promotion" | ||
| ); |
Four valid review comments from copilot-pull-request-reviewer. 1. handlers.rs: list_cached_corroborators_handler returned 401 even on the role-mismatch (FORBIDDEN) path because both StatusCode values were funneled through PrefixdError::Unauthorized which always serializes as 401. Refactored to return Result<Json, StatusCode> directly (matching the existing list_operators / update_password admin-handler shape) so 403 actually reaches the wire. 2. handlers.rs: when ?source= was passed, only the signals[] array was filtered; total + by_source[] were still computed from the unfiltered global counts, so a client requesting source=foo would get rows for foo but a 'total' covering everything. Now the per-source aggregation rows are filtered to match the same source parameter; total is summed from the filtered set. Added regression test test_cache_listing_endpoint_source_filter_scopes_aggregates. 3. migrations/012: dropped the leftover WITH agg AS (...) UPDATE … WHERE 1 = 0 no-op block. The header originally implied a SQL backfill, but the actual approach is runtime COALESCE backfill in handle_ban; the dead CTE was just confusing. Header rewritten to describe the runtime-backfill approach explicitly and point at the call site. 4. integration.rs: test_late_corroborator_skips_when_playbook_name_is_stale previously inserted records directly into the mock repo and never exercised recompute_group_aggregates, so the assertion would have passed even with the stale-playbook guard removed. Rewritten to drive the flow through POST /v1/signals/corroborator with two distinct corroborating sources, asserting that aggregates do advance (source_count → 2) but corroboration_met stays false because the stored playbook name doesn't resolve. Verification: cargo fmt + cargo clippy --all-targets -- -D warnings clean. 222 unit + 127 integration + 16 postgres tests green.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Ships all five PR B follow-ups from the ADR 021 review as a coordinated v0.17.0-bound set. Branch is 4 commits, each independently green on cargo fmt + cargo clippy --all-targets -- -D warnings + 222 unit + 126 integration + 16 postgres + 87 frontend tests.
Commits
d3d9431feat(correlation): playbook-aware corroborator finalization — migration 012 adds nullablesignal_groups.playbook_name. The corroborator-side aggregate recompute re-resolves the stored playbook from live state and is now allowed to flipcorroboration_met=trueusing the override min_sources/threshold. Conservative fallback when the stored playbook is missing from live config (NULL or admin-removed). Mitigation actuation deliberately stays single-sourced throughhandle_ban.dc5cf65refactor(metrics): per-source corroborator expiry + cache_size gauge —prefixd_corroborator_expired_totalregains its{source}label set via a singleDELETE … RETURNING source GROUP BY source. Newprefixd_corroborator_cache_size{source}gauge updated by the reconcile loop after each sweep, with explicit zeroing of stale labels when a source's cache drains between ticks.b1ac873feat(correlation): cached corroborator listing endpoint + dashboard panel — admin-onlyGET /v1/signals/corroborator/cachereturns{ now, total, by_source[], signals[] }filtered to unattached + unexpired rows; optional?source=and?limit=(clamped to 1..1000). New 'Cache' tab on the Correlation page with per-source counts and a dense signal table.294f1ddrefactor(api)!: drop CorroboratorResponse.cached + PR B docs — removes the always-truecachedfield;status ∈ {attached, cached}is the discriminator. Full doc pass: CHANGELOG, ADR 021 'PR B addenda' section, ROADMAP flipped to checked, api.md callouts, detector quickstart, AGENTS.md.Operator notes
prefixd_corroborator_expired_totalmust addsum()to recover the previous shape.CorroboratorResponse.cachedis gone. Branch onstatus === 'cached'instead.012_signal_groups_playbook.sqlruns after011. No SQL backfill — the daemon backfillsplaybook_nameon the next primary event for each pre-upgrade group viaCOALESCE.Closes
The five PR B items previously documented in CHANGELOG / ROADMAP / ADR 021 from PR #109 review.