Skip to content

Corroborating signals v2 (PR B): playbook-aware finalization + per-source attribution + cache UI#117

Merged
lance0 merged 6 commits intomainfrom
feature/corroborating-signals-v2
Apr 29, 2026
Merged

Corroborating signals v2 (PR B): playbook-aware finalization + per-source attribution + cache UI#117
lance0 merged 6 commits intomainfrom
feature/corroborating-signals-v2

Conversation

@lance0
Copy link
Copy Markdown
Owner

@lance0 lance0 commented Apr 29, 2026

Ships all five PR B follow-ups from the ADR 021 review as a coordinated v0.17.0-bound set. Branch is 4 commits, each independently green on cargo fmt + cargo clippy --all-targets -- -D warnings + 222 unit + 126 integration + 16 postgres + 87 frontend tests.

Commits

  1. d3d9431 feat(correlation): playbook-aware corroborator finalization — migration 012 adds nullable signal_groups.playbook_name. The corroborator-side aggregate recompute re-resolves the stored playbook from live state and is now allowed to flip corroboration_met=true using the override min_sources/threshold. Conservative fallback when the stored playbook is missing from live config (NULL or admin-removed). Mitigation actuation deliberately stays single-sourced through handle_ban.

  2. dc5cf65 refactor(metrics): per-source corroborator expiry + cache_size gaugeprefixd_corroborator_expired_total regains its {source} label set via a single DELETE … RETURNING source GROUP BY source. New prefixd_corroborator_cache_size{source} gauge updated by the reconcile loop after each sweep, with explicit zeroing of stale labels when a source's cache drains between ticks.

  3. b1ac873 feat(correlation): cached corroborator listing endpoint + dashboard panel — admin-only GET /v1/signals/corroborator/cache returns { now, total, by_source[], signals[] } filtered to unattached + unexpired rows; optional ?source= and ?limit= (clamped to 1..1000). New 'Cache' tab on the Correlation page with per-source counts and a dense signal table.

  4. 294f1dd refactor(api)!: drop CorroboratorResponse.cached + PR B docs — removes the always-true cached field; status ∈ {attached, cached} is the discriminator. Full doc pass: CHANGELOG, ADR 021 'PR B addenda' section, ROADMAP flipped to checked, api.md callouts, detector quickstart, AGENTS.md.

Operator notes

  • Label change on existing counter: PromQL queries written against the v0.16.0 unlabelled prefixd_corroborator_expired_total must add sum() to recover the previous shape.
  • Breaking API change: CorroboratorResponse.cached is gone. Branch on status === 'cached' instead.
  • Migration ordering: 012_signal_groups_playbook.sql runs after 011. No SQL backfill — the daemon backfills playbook_name on the next primary event for each pre-upgrade group via COALESCE.

Closes

The five PR B items previously documented in CHANGELOG / ROADMAP / ADR 021 from PR #109 review.

lance0 added 5 commits April 29, 2026 14:12
PR B item 1. Lets a late corroborator finalize a signal group on its own
path by re-resolving the playbook-specific override at recompute time.
Eliminates the v0.16.0 dependency on 'another primary event must fire
within the window' for late corroborators to take effect.

Schema:
- migrations/012_signal_groups_playbook.sql adds nullable
  signal_groups.playbook_name. No SQL backfill (match_json doesn't carry
  a playbook reference); the daemon backfills on the next primary event
  for each group via a COALESCE update.

Code:
- SignalGroup gains pub playbook_name: Option<String>; serde-default
  None for forward/backward compatibility.
- handle_ban writes playbook_name = matching_playbook.map(|p| p.name)
  on group create AND backfills it on existing groups when the column
  is NULL. Update path uses COALESCE to never wipe a previously-set
  value.
- recompute_group_aggregates re-resolves the playbook by name from the
  live state.playbooks snapshot, applies the override via
  check_corroboration_with_primary, and is now allowed to flip
  corroboration_met=true. Logs a structured 'awaiting next primary'
  info event when the flip happens.
- Conservative fallback preserved: NULL playbook_name OR resolved
  playbook missing from the live config => keep previous
  corroboration_met (no flip from corroborator path). Matches v0.16.0
  behavior on stale state.
- Mitigation actuation deliberately stays single-sourced through
  handle_ban; corroborator-side recompute updates state but does not
  fire the mitigation. The next primary event picks up the flag and
  triggers normally.

Repository changes:
- All 7 SignalGroup-returning queries (insert, get, find, list, find_open,
  find_expired, find_open_by_dimensions) updated to project
  playbook_name.
- update_signal_group uses COALESCE($7, playbook_name).
- SignalGroupRow + From impl carry the column.

Tests:
- test_late_corroborator_finalizes_with_playbook_override: primary
  event below override threshold + corroborator → corroboration_met
  flips true.
- test_late_corroborator_skips_when_playbook_name_is_stale: stored
  playbook removed from config → conservative no-flip behavior.

cargo fmt + cargo clippy --all-targets -- -D warnings clean.
222 unit + 123 integration + 16 postgres tests green.
PR B items 2 + 5. Restores attribution on the expired counter and adds
a new gauge for runaway-cache alerting.

Repository:
- CorroboratorSweepStats.unattached_expired is now
  HashMap<String, u64> (per-source) instead of a flat u64. Helper
  CorroboratorSweepStats::unattached_total() preserves the old summing
  use case for tests/log lines.
- Postgres delete_expired_corroborating_signals uses a single
  DELETE-RETURNING with GROUP BY source, so attribution is collected
  in the same query that performs the delete (no read-then-delete
  race window).
- Mock implementation aggregates by source while retain()-ing.
- New trait method count_cached_corroborators_by_source(now) returns
  Vec<(source, count)> for the gauge refresh; Postgres GROUPs in SQL,
  Mock buckets in-memory.

Metrics:
- CORROBORATOR_EXPIRED_TOTAL regains its &['source'] label set.
  Operator note: this is a label change in CHANGELOG; existing
  PromQL queries against the v0.16.0 unlabelled counter must add a
  sum() if they want the previous shape.
- New gauge prefixd_corroborator_cache_size{source} registered and
  Lazy::force-d in init_metrics.

Scheduler:
- ReconciliationLoop carries last_cache_sources (tokio Mutex<HashSet>)
  to zero-out gauges for sources that drained between ticks; otherwise
  Prometheus would keep stale non-zero values forever.
- After each sweep, sets per-source CORROBORATOR_CACHE_SIZE from
  count_cached_corroborators_by_source(now) and explicitly clears
  sources that fell out of the live set.

Tests:
- test_expired_sweep_attributes_per_source: 2 expired-unattached from
  router-cpu, 1 from pop-utilization → unattached_expired map keyed
  correctly.
- test_count_cached_corroborators_by_source: covers the four
  exclusion cases (attached, expired, unattached-unexpired, two
  sources) and asserts the live counts.
- Existing test_expired_sweep_splits_attached_vs_unattached updated
  to use the HashMap shape via .unattached_total() / .get().

cargo fmt + cargo clippy --all-targets clean. 222 unit + 125
integration + 16 postgres tests green.
…anel

PR B items 3 + 4. Surfaces the corroborator cache so L1 ops can debug
mode=corroborating sources that ingest but never attach.

Backend:
- list_cached_corroborators trait method gains an optional
  source filter (Postgres pushes it down with a NULL-aware
  predicate; Mock filters in-memory). Used to build a per-source
  drilldown without overfetching.
- New admin-only handler GET /v1/signals/corroborator/cache
  returns { now, total, by_source[], signals[] }. Listing limits
  to unattached + unexpired and clamps the page size to [1, 1000].
  Auth flows through require_role(Admin); both unauthenticated
  and non-admin paths surface as 401 to match adjacent admin
  endpoints.
- OpenAPI schemas added: CachedCorroboratorsResponse,
  CachedCorroboratorBySource. Route registered in api_routes().

Frontend:
- New CachedCorroboratorsResponse / CachedCorroboratorSignal
  types in lib/api.ts; getCachedCorroborators() helper.
- useCachedCorroborators hook with 30s SWR refresh.
- New CacheTab component on the Correlation page (added between
  Groups and Config). Renders a 'cache size by source' badge
  row plus a dense table of cached signals with relative
  ingested/expires timestamps and dimension chips.

Tests:
- test_cache_listing_endpoint_returns_unattached_signals: posts
  one unattached + one attached row, asserts the attached row
  is filtered out and by_source counts match.
- All 87 frontend Vitest cases still green; 'bun run build'
  compiles the new tab cleanly.

cargo fmt + cargo clippy --all-targets clean. 222 unit + 126
integration + 16 postgres tests green.
PR B item 4 + documentation pass for the full PR B set.

API breaking change (v0.17.0):
- CorroboratorResponse no longer carries a 'cached: bool' field; the
  always-true v0.16.0 boolean added no information given that
  status ∈ {attached, cached} is the canonical discriminator.
- prefixdctl's CorroboratorResponse mirror struct also drops 'cached'.
- frontend/lib/api.ts CorroboratorResponse type drops 'cached'.
- Doc comment on CorroboratorResponse.status now points at the
  removal explicitly.

Docs:
- CHANGELOG: new 'Unreleased' entry covering all five PR B items as
  a coordinated set, with operator-facing notes for each (label
  change on _expired_total, new cache_size gauge, breaking field
  removal).
- ROADMAP: 'Corroborating signals v2 (PR B)' block flipped to [x] with
  per-item descriptions of what landed.
- ADR 021: 'Known limits / deferred to PR B' replaced with a 'PR B
  addenda (v0.17.0)' section that documents the playbook-name
  resolution path, the conservative fallback for stale playbooks, the
  per-source attribution implementation, the gauge-zeroing trick the
  scheduler uses, the new cache endpoint, and the field removal.
- docs/api.md: example response no longer shows 'cached'; new
  callout explaining the breaking change. New '/v1/signals/corroborator/cache'
  section documents query params, response shape, and pairing with the
  cache_size gauge.
- docs/detectors/corroborating-signals.md: example response updated;
  'Known limits' section reframed as residual operator notes since the
  PR B items are no longer pending.
- AGENTS.md: endpoint catalog gains the cache listing line.

Verification: cargo fmt + cargo clippy --all-targets -- -D warnings
clean. 222 unit + 126 integration + 16 postgres + 87 frontend tests
green. 'bun run build' compiles.
CI's Security Audit job was failing on three new RustSec advisories
that landed against our dependency tree since v0.16.0. All are
fixable via lock or trivial dev-dependency bumps; no production
consumer code changes required.

Fixed advisories:
- RUSTSEC-2026-0104 (rustls-webpki 0.103.12 reachable panic in CRL
  parsing): `cargo update -p rustls-webpki` -> 0.103.13. Lock-only.
- RUSTSEC-2026-0066, RUSTSEC-2026-0112, RUSTSEC-2026-0113
  (astral-tokio-tar 0.5.6 PAX header desync, validation gaps,
  unpack_in symlink chmod): bump testcontainers 0.26 -> 0.27 and
  testcontainers-modules 0.14 -> 0.15 (dev-deps only). Pulls in
  astral-tokio-tar 0.6.x via the new release. No test changes.

Also rolled in safe patch/minor lock-only bumps for actively
maintained primary deps:
- axum 0.8.8 -> 0.8.9 (patch)
- axum-macros 0.5.0 -> 0.5.1 (patch)
- tokio 1.51.0 -> 1.52.1 (minor)
- uuid 1.22.0 -> 1.23.1 (minor)
- tokio-tungstenite/tungstenite 0.28 -> 0.29 (transitive)

Skipped on this PR (saved for follow-up where they need real review):
- password-hash 0.5 -> 0.6 (touches crypto code)
- sha2 0.10 -> 0.11 (transitive break risk)
- vite 7 -> 8 / vitejs-plugin-react 5 -> 6 (frontend major)
- typescript 5 -> 6 (frontend major)
- jsdom 28 -> 29 (frontend major)
- lucide-react 0.575 -> 1.x (frontend major; icon API changes)

CI ignore list pruned: removed RUSTSEC-2025-0111 (tokio-tar) and
RUSTSEC-2026-0066 (astral-tokio-tar) since both are now resolved
upstream. Kept RUSTSEC-2023-0071 (rsa via unused sqlx-mysql),
RUSTSEC-2025-0134 (rustls-pemfile unmaintained, no fix yet), and
RUSTSEC-2026-0097 (rand custom-logger unsoundness, no fix; we don't
use a custom logger).

Verification: cargo audit clean modulo the three documented
ignores; cargo fmt + cargo clippy --all-targets -- -D warnings
clean; 222 unit + 126 integration + 16 postgres tests green.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR ships the “Corroborating signals v2 (PR B)” follow-ups from ADR 021: playbook-aware corroborator finalization, per-source attribution/metrics for cache expiry and cache size, an admin cache listing endpoint with a new dashboard tab, and removal of the redundant CorroboratorResponse.cached field.

Changes:

  • Persist signal_groups.playbook_name and re-resolve playbook overrides during corroborator-side recompute to allow late corroborators to flip corroboration_met (without triggering mitigation actuation).
  • Restore per-source attribution on prefixd_corroborator_expired_total{source} and add prefixd_corroborator_cache_size{source} gauge updated each reconcile tick (with stale-label zeroing).
  • Add admin-only GET /v1/signals/corroborator/cache endpoint + frontend “Cache” tab; remove CorroboratorResponse.cached across backend/CLI/frontend/docs.

Reviewed changes

Copilot reviewed 25 out of 26 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
tests/integration.rs Adds/updates integration tests for per-source expiry attribution, cache counts, cache listing endpoint, and playbook-aware late finalization.
src/scheduler/reconcile.rs Updates corroborator cache sweep to increment expired counter per-source and refresh per-source cache-size gauge (including zeroing stale labels).
src/observability/metrics.rs Adds {source} label to expired counter and introduces CORROBORATOR_CACHE_SIZE gauge vec.
src/db/traits.rs Changes sweep stats to per-source map; extends cached-corroborator listing signature; adds per-source cache count method.
src/db/repository.rs Plumbs playbook_name through signal_groups queries; implements per-source expiry delete attribution and per-source cache counts; adds optional source filter to cache listing query.
src/db/mod.rs Registers migration 012 for signal_groups.playbook_name.
src/db/mock.rs Updates mock repo to track per-source sweep stats, support source-filtered cache listings, and provide per-source cache counts.
src/correlation/engine.rs Extends SignalGroup with playbook_name for later playbook override re-resolution.
src/bin/prefixdctl.rs Removes CorroboratorResponse.cached from CLI response model.
src/api/routes.rs Adds route for GET /v1/signals/corroborator/cache.
src/api/openapi.rs Exposes new cache-listing handler and schemas in OpenAPI.
src/api/handlers.rs Persists playbook name on groups; adds cache listing handler; updates corroborator response schema; makes recompute playbook-aware.
migrations/012_signal_groups_playbook.sql Adds nullable signal_groups.playbook_name column and an index.
frontend/lib/api.ts Removes cached from CorroboratorResponse; adds typed client for cached corroborators endpoint.
frontend/hooks/use-api.ts Adds SWR hook for cached corroborators with refresh interval.
frontend/components/dashboard/correlation/cache-tab.tsx New “Cache” tab UI showing per-source counts and cached signal table.
frontend/app/(dashboard)/correlation/page.tsx Adds Cache tab to Correlation page navigation.
docs/detectors/corroborating-signals.md Updates response docs to remove cached boolean; updates known-limits/operator notes for v0.17.0 behavior.
docs/api.md Documents cached field removal and adds documentation for cache listing endpoint.
docs/adr/021-corroborating-signals.md Marks PR B items as shipped and documents the v0.17.0 addenda behavior.
ROADMAP.md Checks off PR B items as shipped in v0.17.0.
Cargo.toml Bumps dev-deps testcontainers / testcontainers-modules.
Cargo.lock Locks updated dependency graph for the dev-dep bumps.
CHANGELOG.md Adds Unreleased entry describing PR B v2 feature set and breaking change.
AGENTS.md Documents new admin cache endpoint and cache-size gauge.
.github/workflows/ci.yml Updates RustSec ignore list (removing previously-ignored advisories now addressed by dep bumps).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/api/handlers.rs Outdated
Comment on lines +5489 to +5496
if let Err(status) = require_role(&state, &auth_session, auth_header, OperatorRole::Admin) {
let msg = if status == StatusCode::UNAUTHORIZED {
"authentication required"
} else {
"admin role required"
};
return Err(AppError(PrefixdError::Unauthorized(msg.into())));
}
Comment thread src/api/handlers.rs
Comment on lines +5498 to +5513
let now = chrono::Utc::now();
let signals = state
.repo
.list_cached_corroborators(now, limit, query.source.as_deref())
.await
.map_err(AppError)?;
let by_source_rows = state
.repo
.count_cached_corroborators_by_source(now)
.await
.map_err(AppError)?;
let total = by_source_rows.iter().map(|(_, n)| *n).sum();
let by_source = by_source_rows
.into_iter()
.map(|(source, count)| CachedCorroboratorBySource { source, count })
.collect();
Comment on lines +14 to +43
-- Backfill: best-effort, copy the playbook name from any mitigation that
-- was triggered by this group. If no mitigation exists yet (e.g. the
-- group is below threshold and only has a single primary event), the
-- column stays NULL and the corroborator recompute falls back to the
-- v0.16.0 conservative behavior. The next primary event for the same
-- group will fill it in.

ALTER TABLE signal_groups
ADD COLUMN IF NOT EXISTS playbook_name TEXT;

WITH agg AS (
SELECT m.signal_group_id AS group_id,
-- All mitigations from one signal group should share a playbook
-- (groups are keyed by vector and playbooks fan out by vector),
-- so MIN() is safe and deterministic for the rare race where a
-- vector matched two playbooks.
MIN(m.match_json) AS sample_match_json
FROM mitigations m
WHERE m.signal_group_id IS NOT NULL
GROUP BY m.signal_group_id
)
UPDATE signal_groups sg
SET playbook_name = agg.group_id::text -- placeholder, replaced below
FROM agg
WHERE 1 = 0;
-- The actual backfill happens at runtime: the daemon will resolve and
-- populate `playbook_name` on the next primary event for any group that
-- still has it NULL. Doing this in SQL is brittle because match_json
-- doesn't carry a playbook reference; we'd be re-running the matcher.

Comment thread tests/integration.rs
Comment on lines +5867 to +5871
let group = repo.get_signal_group(group_id).await.unwrap().unwrap();
assert!(
!group.corroboration_met,
"stale playbook_name should not allow promotion"
);
Four valid review comments from copilot-pull-request-reviewer.

1. handlers.rs: list_cached_corroborators_handler returned 401 even on
   the role-mismatch (FORBIDDEN) path because both StatusCode values
   were funneled through PrefixdError::Unauthorized which always
   serializes as 401. Refactored to return Result<Json, StatusCode>
   directly (matching the existing list_operators / update_password
   admin-handler shape) so 403 actually reaches the wire.

2. handlers.rs: when ?source= was passed, only the signals[] array
   was filtered; total + by_source[] were still computed from the
   unfiltered global counts, so a client requesting source=foo would
   get rows for foo but a 'total' covering everything. Now the
   per-source aggregation rows are filtered to match the same source
   parameter; total is summed from the filtered set. Added regression
   test test_cache_listing_endpoint_source_filter_scopes_aggregates.

3. migrations/012: dropped the leftover WITH agg AS (...) UPDATE …
   WHERE 1 = 0 no-op block. The header originally implied a SQL
   backfill, but the actual approach is runtime COALESCE backfill in
   handle_ban; the dead CTE was just confusing. Header rewritten to
   describe the runtime-backfill approach explicitly and point at the
   call site.

4. integration.rs: test_late_corroborator_skips_when_playbook_name_is_stale
   previously inserted records directly into the mock repo and never
   exercised recompute_group_aggregates, so the assertion would have
   passed even with the stale-playbook guard removed. Rewritten to
   drive the flow through POST /v1/signals/corroborator with two
   distinct corroborating sources, asserting that aggregates do
   advance (source_count → 2) but corroboration_met stays false
   because the stored playbook name doesn't resolve.

Verification: cargo fmt + cargo clippy --all-targets -- -D warnings
clean. 222 unit + 127 integration + 16 postgres tests green.
@lance0 lance0 merged commit ac848e8 into main Apr 29, 2026
5 checks passed
@lance0 lance0 deleted the feature/corroborating-signals-v2 branch April 29, 2026 18:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants