Internal REST API foundation: events, webhooks, alerting#72

Open
chrisbliss18 wants to merge 29 commits into v2 from v2-chris-events

Conversation

Contributor

chrisbliss18 commented Apr 25, 2026

This PR builds on the v2 integration branch (PR #61) to land the internal REST API foundation that the rest of the v2 site-health-platform vision needs to plug into. It does not implement the v2 vision in full — that's the broader effort tracked on v2 itself — but it ships every supporting concern below the check-taxonomy line.

Goal

Give the v2 branch a complete, testable, internal-only API surface so the remaining v2 work (endpoint hierarchy, multi-layer check taxonomy, Reverse Checks) has a durable place to write events into and a delivery layer to fan them out from.

Three architectural layers are introduced, each independently scaled and independently tested:

  • Event sourcing — jetmon_events (current state per incident) + jetmon_event_transitions (append-only history); internal/eventstore is the single transactional writer for both.
  • Internal REST API — /api/v1/... with per-consumer Bearer auth, three coarse scopes, per-key rate limiting, Stripe-style idempotency keys, and full coverage of sites/events/SLA/webhooks/alert-contacts.
  • Outbound delivery — two parallel workers (internal/webhooks and internal/alerting) that poll jetmon_event_transitions and dispatch to subscribers with a shared retry ladder.

The architectural rationale lives in docs/adr/ (seven ADRs covering the load-bearing decisions). Operational and forward-looking work is in API.md, ROADMAP.md, and CHANGELOG.md.

What's shipped

Event sourcing (migrations 9–11)

  • jetmon_events with a generated dedup_key column and UNIQUE KEY enforcing "one open event per (blog_id, endpoint_id, check_type, discriminator) tuple" without partial indexes.
  • jetmon_event_transitions append-only, one row per state/severity change, with severity_before/after, state_before/after, reason, source, JSON metadata, millisecond timestamps.
  • jetmon_audit_log narrowed to operational events only (per-probe data lives in jetmon_check_history; site state changes flow through events).
  • internal/eventstore is the sole writer; every event mutation and its matching transition row are written in a single transaction, so the projection and history can never disagree.
  • Orchestrator integration: opens Seems Down on first failure, promotes to Down on verifier confirmation, closes with appropriate reason. The v1 site_status projection stays in lockstep so back-compat consumers are unaffected.

See ADR-0001.
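The "one open event per tuple" rule that the dedup_key generated column and UNIQUE KEY enforce in MySQL can be sketched in memory. This is a hedged illustration with made-up type and method names, not the eventstore's real API:

```go
package main

import "fmt"

// In-memory sketch of the dedup_key invariant: at most one open event
// per (blog_id, endpoint_id, check_type, discriminator) tuple. MySQL
// enforces this with a generated column plus a UNIQUE KEY; the map
// below stands in for that index. All names here are illustrative.
type tupleKey struct {
	BlogID, EndpointID       int64
	CheckType, Discriminator string
}

type store struct {
	open   map[tupleKey]int64 // tuple -> open event id
	nextID int64
}

// Open is idempotent per tuple: a second Open while an event is still
// open returns the existing id instead of creating a duplicate.
func (s *store) Open(k tupleKey) (id int64, created bool) {
	if id, ok := s.open[k]; ok {
		return id, false
	}
	s.nextID++
	s.open[k] = s.nextID
	return s.nextID, true
}

// Close frees the tuple so a later failure opens a fresh incident.
func (s *store) Close(k tupleKey) { delete(s.open, k) }

func main() {
	s := &store{open: map[tupleKey]int64{}}
	k := tupleKey{BlogID: 42, CheckType: "http_status"}
	a, createdA := s.Open(k)
	b, createdB := s.Open(k) // duplicate open while incident is live
	fmt.Println(a == b, createdA, createdB) // true true false
	s.Close(k)
	_, createdC := s.Open(k) // after close, a new incident may open
	fmt.Println(createdC)    // true
}
```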

Internal REST API (Phases 1, 2)

API design doc: API.md. Internal-only behind a gateway — see ADR-0002.

  • Auth + key management: jetmon_api_keys (sha256-hashed at rest), ./jetmon2 keys create|list|revoke|rotate CLI, per-key rate limiter with X-RateLimit-* headers, 429 Retry-After on exhaustion.
  • Read surface: /health, /me, /sites, /sites/{id}, /sites/{id}/events, /sites/{id}/events/{event_id}, /sites/{id}/events/{event_id}/transitions, /events/{event_id}, /sites/{id}/uptime, /sites/{id}/response-time, /sites/{id}/timing-breakdown. Cursor pagination, all the documented filters.
  • Write surface: POST/PATCH/DELETE /sites, POST /sites/{id}/{pause,resume,trigger-now}, POST /sites/{id}/events/{event_id}/close. Stripe-style idempotency keys on POST.
  • Audit: every authenticated request logged to jetmon_audit_log under event_type=api_access with consumer_name.
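The three coarse scopes form a strict hierarchy (read < write < admin; the PR's Scope.Includes enforces it). A minimal sketch, assuming an ordered enum; the real representation may differ:

```go
package main

import "fmt"

// Minimal sketch of the read < write < admin hierarchy enforced by
// Scope.Includes in internal/apikeys. The ordered-int representation
// is an assumption for illustration.
type Scope int

const (
	ScopeRead Scope = iota
	ScopeWrite
	ScopeAdmin
)

// Includes reports whether a key with scope s satisfies an endpoint
// that requires scope required: higher scopes subsume lower ones.
func (s Scope) Includes(required Scope) bool { return s >= required }

func main() {
	fmt.Println(ScopeAdmin.Includes(ScopeRead)) // true: admin subsumes read
	fmt.Println(ScopeRead.Includes(ScopeWrite)) // false: insufficient_scope
}
```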

Webhooks (Phase 3)

API surface in API.md "Family 4." Decisions in ADR-0003, ADR-0004, ADR-0005.

  • jetmon_webhooks registry + jetmon_webhook_deliveries per-fire records.
  • Stripe-style HMAC-SHA256 signatures (t=<unix>,v1=<hex> over {ts}.{body}) with v1= reserved for future algorithm rotation.
  • Filter dimensions: events + site_filter + state_filter (AND across, whitelist within, empty=match all).
  • Delivery worker: per-webhook in-flight cap, shared dispatch pool, retry ladder 1m/5m/30m/1h/6h then abandon.
  • Frozen-at-fire-time payload contract.
  • Manual rotate-secret (immediate revocation) and manual retry of abandoned deliveries.
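The Stripe-style scheme above (HMAC-SHA256 over "{ts}.{body}", rendered as t=<unix>,v1=<hex>) is small enough to sketch end to end. Function names here are illustrative; a consumer would verify by recomputing the signature from the received timestamp and raw body:

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// sign produces a Stripe-style signature header: HMAC-SHA256 keyed by
// the webhook secret, over the string "{ts}.{body}".
func sign(secret []byte, ts int64, body []byte) string {
	mac := hmac.New(sha256.New, secret)
	fmt.Fprintf(mac, "%d.%s", ts, body)
	return fmt.Sprintf("t=%d,v1=%s", ts, hex.EncodeToString(mac.Sum(nil)))
}

// verify is the consumer side: parse the timestamp, recompute the
// expected header, and compare in constant time.
func verify(secret []byte, header string, body []byte) bool {
	var ts int64
	if _, err := fmt.Sscanf(header, "t=%d,", &ts); err != nil {
		return false
	}
	return hmac.Equal([]byte(header), []byte(sign(secret, ts, body)))
}

func main() {
	secret := []byte("whsec_example") // illustrative secret value
	body := []byte(`{"type":"event.opened"}`)
	h := sign(secret, 1700000000, body)
	fmt.Println(verify(secret, h, body))              // true
	fmt.Println(verify(secret, h, []byte("tamper"))) // false
}
```

Keeping v1= as a named version slot is what allows a future algorithm rotation without breaking existing consumers.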

Alert contacts (Phase 3.x)

API surface in API.md "Family 5." Boundary with webhooks in ADR-0006.

  • Managed channels for human destinations: email (with wpcom/smtp/stub senders), PagerDuty (Events API v2 with severity mapping + trigger/resolve), Slack (Block Kit), Microsoft Teams (Adaptive Card).
  • Filter shape: site_filter + min_severity (default Down); per-contact max_per_hour rate cap.
  • The send-test endpoint (POST /alert-contacts/{id}/test) exercises the same dispatch path as real alerts (idempotency-aware so retries don't double-page).
  • Plaintext credential storage (same outbound-dispatch rationale as webhook secrets — see ADR-0003).
  • Legacy WPCOM notification flow continues alongside; migration tracked in ROADMAP.md.
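A hedged sketch of the min_severity gate, using the codebase's severity ladder (Up < Warning < Degraded < SeemsDown < Down, uint8 0-4 in internal/eventstore); the contact struct and method names are illustrative:

```go
package main

import "fmt"

// Severity mirrors the eventstore ladder; a contact only fires when a
// transition's severity meets its min_severity. Struct and method
// names below are assumptions for illustration.
type Severity uint8

const (
	SevUp Severity = iota
	SevWarning
	SevDegraded
	SevSeemsDown
	SevDown
)

type contact struct {
	MinSeverity Severity // default Down: page only on confirmed outages
}

func (c contact) matches(s Severity) bool { return s >= c.MinSeverity }

func main() {
	pager := contact{MinSeverity: SevDown}
	fmt.Println(pager.matches(SevSeemsDown)) // false: unconfirmed, no page
	fmt.Println(pager.matches(SevDown))      // true: confirmed outage pages
}
```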

Verifier hardening

  • Server-side: Read/Write/Idle timeouts, graceful Shutdown(ctx) with 30s drain, request body cap, empty-token guard, optional StatsD metrics.
  • Client-side: tuned http.Transport, caller-supplied context deadlines (orchestrator wraps each escalation), RequestID correlation header.
  • Config: empty host/grpc_port are now startup errors with precise messages.
  • PID file: JETMON_PID_FILE defaults to a writable path so ./jetmon2 reload and ./jetmon2 drain work.

Cross-cutting polish (post-implementation)

  • Soft-lock fix for both deliver loops (ADR-0007). Prior behavior was three concurrent dispatches per row, collapsing the documented 7h36m retry window to ~1h.
  • dns_ms / tcp_ms / tls_ms overflow fix in internal/checker/. A connection-refused target was firing the partial-phase MySQL "Out of range" error on every check round.
  • MIME header CRLF stripping in the email transport — defense-in-depth against header injection via untrusted monitor_url.
  • Alert-contact Update validation — empty label / negative max_per_hour now surface as 422 before any DB hit.
  • send-test endpoint idempotency — wrapped in withIdempotency like the other write POSTs.
  • AGENTS.md architecture diagram refreshed to show eventstore + API + delivery workers; ADRs added in docs/adr/; CHANGELOG brought current.
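The CRLF-stripping fix above reduces to removing CR and LF from any untrusted value before it reaches a MIME header; a sketch with an illustrative helper name:

```go
package main

import (
	"fmt"
	"strings"
)

// headerSanitizer strips CR and LF so an attacker-controlled value
// (e.g. monitor_url) cannot terminate the current header and inject
// new ones. The helper name is illustrative, not the PR's symbol.
var headerSanitizer = strings.NewReplacer("\r", "", "\n", "")

func sanitizeHeaderValue(v string) string { return headerSanitizer.Replace(v) }

func main() {
	evil := "https://example.com\r\nBcc: attacker@example.com"
	fmt.Println(sanitizeHeaderValue(evil))
	// The injected "Bcc:" line collapses into the same header value
	// instead of becoming a second header.
}
```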

Verification

  • go build ./... and go vet ./... clean.
  • go test ./... clean (240+ tests).
  • go test -race ./... clean — including new tests covering the soft-lock contract, rate-limit window, MIME header stripping, and send-test idempotency.
  • Docker stack manually exercised end-to-end: migrations apply on a fresh DB; events open/promote/close correctly; site_status projection matches event state at every step; webhook + alert deliveries land at test receivers with correct signatures and rendering; retry path triggers when receivers are down; manual retry works on abandoned rows.

Out of scope (deferred to other PRs in v2)

  • Endpoint hierarchy (jetmon_endpoints registry). The eventstore schema already carries endpoint_id columns, but the registry itself, CRUD, and orchestrator iteration are out of scope for this PR.
  • Layer 1, 2, 3, 5 check coverage per TAXONOMY.md. Today only Layer 4 (HTTP status, redirect, timing breakdown, basic keyword) is implemented.
  • Reverse Checks (agent-reported signals from inside WordPress).
  • Multi-instance row-claim safety (SELECT … FOR UPDATE SKIP LOCKED). The soft-lock fix is single-instance correct; multi-instance is deferred until the deliverer-binary extraction. See ADR-0007.
  • Phase 4 polish: OpenAPI spec generation, bulk endpoints, per-region filters.
  • Verifier hardening punch-list items 7, 9, 10, 14–20 from API.md (TLS, multi-region quorum, vantage tag, capability advertising). The verifier is good enough for trusted-network deployment but not yet for an untrusted geo-distributed fleet.

Merge readiness (PR #72 → v2)

This PR targets the v2 integration branch (PR #61), not master. The bar is "is this work coherent and ready to land on v2," not "is the full v2 vision ready for production."

| Gate | State |
| --- | --- |
| mergeStateStatus | clean (no conflicts with v2) |
| Tests pass on the head SHA | ✅ (verified locally; CI not yet wired) |
| Architectural rationale documented | ✅ (docs/adr/, 7 records) |
| Public API documented | ✅ (API.md) |
| CHANGELOG up to date | ✅ |
| Architecture diagram reflects shipped shape | ✅ (AGENTS.md) |
| Code review | ⏳ Pending reviewer |
| DO NOT MERGE label decision | ⏳ The label was inherited from PR #61's "the full v2 vision isn't ready" gate. This PR is a chunk of that vision, not the whole thing — the label should come off this PR once code review passes; PR #61's label stays on until the broader v2 work is done. |

Items in Out of scope above explicitly do not block this PR — they're tracked separately on v2 and will land in their own PRs.

Chris Jean added 5 commits April 25, 2026 00:47
The orchestrator's view of site state is now event-sourced across two new
tables: jetmon_events (one row per incident, mutable while open, frozen on
close) and jetmon_event_transitions (append-only history of every mutation
to an events row). Together they preserve full incident history including
intra-event severity bumps and state changes that the previous mutable-row
design silently overwrote.

Schema (migrations 9-11):
- jetmon_audit_log narrowed to operational-only (drop http_code, error_code,
  rtt_ms, old_status, new_status; add event_id, metadata JSON; relax blog_id
  to NULL; add idx_event_id, idx_event_type_created)
- jetmon_events with dedup_key generated column + UNIQUE KEY for one-open-
  event-per-tuple idempotency without partial indexes
- jetmon_event_transitions keyed on event_id with severity_before/after,
  state_before/after, reason, source, metadata, changed_at

New internal/eventstore package is the sole writer for both events tables.
Open/UpdateSeverity/UpdateState/Promote/LinkCause/Close all run their event
mutation and the matching transition row in a single transaction. A Tx
wrapper exposes the same surface for callers (the orchestrator) that need
to coordinate event writes with their own SQL — used to project v1
site_status onto jetpack_monitor_sites in the same transaction.

Orchestrator integration:
- handleFailure opens a Seems Down event on the first local failure and
  projects site_status=SITE_DOWN in the same tx
- confirmDown promotes Seems Down → Down with reason=verifier_confirmed
- false-positive branch closes with reason=false_alarm
- handleRecovery closes with reason=verifier_cleared (was Down) or
  probe_cleared (still in Seems Down)
- checkSSLAlerts opens a site-level tls_expiry event with severity laddered
  Warning (≤30/14 days) → Degraded (≤7 days), closes on cert renewal

Audit package refactored to operational-only. EventCheck, EventStatusTransition,
and EventVeriflierResult constants dropped (per-probe data lives in
jetmon_check_history; site-state changes flow through the events tables).
LogTransition removed. The Log signature is replaced with an Entry struct
carrying optional EventID and Metadata fields so audit rows can link to
incidents for operator drill-down.

Documentation: EVENTS.md, AGENTS.md, and TAXONOMY.md describe the two-table
split, the "open on first failure" lifecycle, the dedup_key idempotency
trick, transition reasons vocabulary, and the same-transaction invariant
that holds events, transitions, and the v1 site_status projection in sync.
Server-side hardening:
- Replace bare ListenAndServe with http.Server timeouts (ReadHeaderTimeout 5s,
  ReadTimeout 30s, WriteTimeout 35s, IdleTimeout 120s) so a slow client cannot
  pin a goroutine indefinitely
- Expose Shutdown(ctx) for graceful drain. veriflier2 binary's SIGINT/SIGTERM
  handler now drains in-flight checks for up to 30s before closing the
  listener instead of os.Exit(0)
- Optional StatsD metrics (verifier.checks.received.count,
  verifier.checks.duration.timer, verifier.auth.rejected.count) initialized
  from STATSD_ADDR env var; skipped cleanly when unset

Client-side performance:
- Tuned http.Transport: MaxIdleConns 100, MaxIdleConnsPerHost 20,
  IdleConnTimeout 90s, ForceAttemptHTTP2 true, explicit DialContext timeouts.
  The default MaxIdleConnsPerHost of 2 was forcing reconnects under any
  concurrency and was a latent bottleneck during outage waves
- Drop the hardcoded 30s http.Client.Timeout — caller-supplied ctx deadline
  is now the single source of truth. Orchestrator wraps each escalation with
  context.WithTimeout(NET_COMMS_TIMEOUT + 5s headroom) so a wedged verifier
  no longer hangs for the orchestrator's lifetime

Request correlation:
- Add RequestID field to CheckRequest and CheckResult (16-byte hex,
  crypto/rand backed). Client auto-generates if caller leaves it empty;
  server logs and echoes back. Orchestrator stamps the same id on the
  "escalating to N verifliers" audit row and on each verifier's reply row,
  so the full lifecycle of an escalation can be reconstructed via
  jetmon_audit_log.metadata.request_id without timestamp matching

Tests cover RequestID generation/echo, server graceful drain, and the
existing handler paths.
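Illustratively, the crypto/rand-backed RequestID generation could look like the sketch below; 16 random bytes hex-encoded (32 characters) is an assumption about the exact shape, and the function name is made up:

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// newRequestID sketches the auto-generated correlation id: the client
// fills it in when the caller leaves CheckRequest.RequestID empty, and
// the server logs and echoes it back so jetmon_audit_log rows from one
// escalation can be joined without timestamp matching.
func newRequestID() string {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		panic(err) // crypto/rand failure is unrecoverable here
	}
	return hex.EncodeToString(b)
}

func main() {
	id := newRequestID()
	fmt.Println(len(id)) // 32 hex characters from 16 random bytes
}
```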
Two targeted fixes that surfaced during docker integration testing:

Verifier config validation: an empty grpc_port (typically a typo of "port"
instead of "grpc_port" in config.json) silently parsed to "" and the
orchestrator then dialed "host:" which resolves to port 80, producing a
generic connection-refused error. validate() now rejects any VERIFIERS[]
entry with empty host or grpc_port at startup with a precise message
naming the offending entry.

PID file location: run-jetmon.sh exported JETMON_PID_FILE=/jetmon/jetmon2.pid,
but /jetmon is owned by the jetmon user from the Dockerfile while the
container runs as ${JETMON_UID:-1000} via docker-compose, so the write
failed with permission denied and reload/drain commands could not find
the file. Move the PID file to /jetmon/stats/jetmon2.pid (the stats/
directory is chmod 0777 by the Dockerfile) and surface the env var via
docker-compose so docker compose exec ./jetmon2 reload picks it up too.
API.md is a design proposal for the internal Jetmon REST API. No code yet —
this drives review and alignment before Phase 1 implementation.

Scope and audience: Jetmon does not expose this API to end customers
directly. A separate gateway service handles tenant isolation, public-facing
errors, customer rate limiting, and plan-based feature gating, and calls
Jetmon over this internal interface. Only known internal systems (gateway,
operator dashboard, alerting workers, batch jobs) are direct callers.

Design principles documented: read API as source-of-truth not snapshot;
severity and state both first-class fields (not collapsed); cursor pagination
only; honest 401/403/404 distinction (no info-leak hiding); per-consumer
audit logging via the existing jetmon_audit_log; verbose error messages
for incident response.

Authentication: per-consumer Bearer tokens with three coarse scopes
(read/write/admin), sha256-hashed at rest in jetmon_api_keys. No live/test
key split, no OAuth, no self-service key management — keys are created and
revoked via an ops-only ./jetmon2 keys CLI.

Endpoints described across six families: sites + current state (Family 1),
events and history (Family 2), SLA and statistics (Family 3), webhooks
(Family 4), alert contacts (Family 5), identity and utility (Family 6).
Build order recommended in four phases.

Resolved design questions section captures the rationale behind: raw numeric
IDs everywhere (no type prefix or ULID); 200/page list cap with no
include_inactive flag; Stripe-style versioned HMAC for webhook signing;
synchronous trigger-now with 30s timeout; single metadata field per event
(gateway sanitizes before forwarding to customers).
Implements the read-only foundation described in API.md. The API server runs
on a dedicated port (config: API_PORT, 0 disables) inside the jetmon2
binary. Internal-only — a separate gateway service handles all customer-
facing concerns and calls this surface.

Schema (migration 12):
- jetmon_api_keys with sha256-hashed key_hash, consumer_name, scope enum
  (read|write|admin), rate_limit_per_minute, expires_at, revoked_at,
  last_used_at, created_at, created_by

internal/apikeys package:
- GenerateToken returns a 32-byte crypto/rand token, base32-encoded with
  jm_ prefix
- Lookup resolves a raw token to the Key record, distinguishing
  ErrInvalidToken / ErrKeyRevoked / ErrKeyExpired and touching last_used_at
- Create / List / Revoke / Rotate cover the full management lifecycle
- Scope.Includes enforces the read < write < admin hierarchy

CLI: ./jetmon2 keys create|list|revoke|rotate. The token is shown only
once at creation; the API has no /keys endpoints (key management is ops-
only by design).

internal/api package:
- Server.Listen / Shutdown with the verifier-style timeouts
- requireScope middleware: parses Bearer token, resolves via apikeys.Lookup,
  enforces scope, applies per-key token-bucket rate limiting (in-memory,
  with periodic GC), audits to jetmon_audit_log under event_type=api_access
  with consumer_name as source
- Standard X-RateLimit-{Limit,Remaining,Reset} headers; 429 with
  Retry-After when exceeded
- Honest 401 vs 403 vs 404 (no info-leak hiding) and verbose error
  messages for incident response — gateway sanitizes for customers

Endpoints (all GET, scope=read):
- /api/v1/health (unauthenticated)
- /api/v1/me — token introspection
- /api/v1/sites — cursor pagination, filters: state, severity__gte,
  monitor_active, q (URL substring)
- /api/v1/sites/{id} — single site with active_events array
- /api/v1/sites/{id}/events — incident history with duration_ms and
  transition_count, filters: state, check_type, started_at__gte/lt, active
- /api/v1/sites/{id}/events/{event_id} — event detail with embedded
  transitions (cross-site protection: event must belong to named site)
- /api/v1/sites/{id}/events/{event_id}/transitions — paginated transition list
- /api/v1/events/{event_id} — direct event lookup
- /api/v1/sites/{id}/uptime — uptime % from event durations, with
  per-state seconds, MTTR, MTBF over 1h/24h/7d/30d/90d window or from/to
- /api/v1/sites/{id}/response-time — p50/p95/p99/max/mean from
  jetmon_check_history.rtt_ms, sample cap 100k
- /api/v1/sites/{id}/timing-breakdown — same percentile shape per
  DNS/TCP/TLS/TTFB component

Tests use go-sqlmock with QueryMatcherEqual for precise SQL contract
assertions: 63 tests covering rate limiter behavior, auth middleware
(missing/invalid/revoked/expired/insufficient scope/rate-limited paths),
all read endpoint happy paths, 404s, cross-site protection, filter parsing,
cursor pagination, window math, and percentile correctness.

Audit package gains EventAPIAccess constant; main.go wires the API server
into runServe with graceful Shutdown(ctx) on SIGINT/SIGTERM. Keys CLI
shares the same db handle as the rest of the binary.
Chris Jean added 23 commits April 25, 2026 01:12
Implements the write-side endpoints described in API.md "Family 1 + 2"
for the build order's Phase 2. All endpoints require write scope and route
through an Idempotency-Key middleware so retries with the same key are
safe; PATCH/DELETE skip the middleware because they're inherently
idempotent on this schema.

Idempotency middleware (`internal/api/idempotency.go`):
- In-memory store keyed on (api_key_id, idempotency_key) — scoped by API
  key so two consumers can't collide on the same opaque value
- 24h TTL with hourly GC of expired entries
- On replay with same body: returns cached status/headers/body verbatim
  plus an `Idempotency-Replayed: true` header for debugging
- On same key + different body: 409 idempotency_conflict
- Only caches 2xx and 4xx responses — 5xx are re-attempted by retries
- State is bound to this jetmon2 instance; multi-instance would need
  Redis or a backing table. Adequate for the current single-instance
  internal API; documented as a future migration if scaling demands

Site write endpoints (`handlers_sites_write.go`):
- POST /api/v1/sites — caller-supplied blog_id (canonical from WPCOM),
  validates URL parses as http/https with non-empty host, validates
  redirect_policy ∈ {follow,alert,fail}, rejects duplicates with 409
  site_exists. Returns 201 with the full site record
- PATCH /api/v1/sites/{id} — partial update via dynamic SET clause from
  non-nil body fields. Empty body returns the current state (idempotent
  no-op). Validates inputs before existence check so bad shapes get 400
  even on nonexistent sites
- DELETE /api/v1/sites/{id} — soft delete: sets monitor_active=0 and
  closes any open events with reason=manual_override + metadata noting
  the deletion. Preserves audit trail and historical rows. Returns 204
- POST /api/v1/sites/{id}/pause — closes active events, sets
  monitor_active=0
- POST /api/v1/sites/{id}/resume — sets monitor_active=1. Does not
  reopen previously-closed events; the orchestrator's regular flow
  detects any current failure on the next round

Manual close + trigger-now (`handlers_events_write.go`):
- POST /api/v1/sites/{id}/events/{event_id}/close — closes an event with
  caller-supplied reason (defaults to manual_override) and stamps the
  optional note in metadata. Validates the event belongs to the named
  site (cross-site protection). Already-closed events return 200 with
  the existing record (idempotent close). If no other active events
  remain, projects site_status back to running
- POST /api/v1/sites/{id}/trigger-now — runs a checker.Check directly
  with a 30s context timeout, returns the raw timing result inline. On
  success, closes any open events with reason=probe_cleared (matches
  the orchestrator's no-verifier-on-recovery semantics from EVENTS.md).
  Does NOT open new events on failure — the orchestrator owns
  failure-detection state machine on its regular round

Tests (34 new, 97 total in the api package):
- Idempotency: hash stability, store lookup/store/expiry, middleware
  passthrough/cache-and-replay/conflict/key-isolation
- Site create: happy path, missing/invalid blog_id, bad URL variants
  (empty, malformed, ftp scheme, missing host), bad redirect_policy
  (422), duplicate (409)
- Site update: happy path, empty body returns current, not-found,
  validates URL/redirect_policy before existence check
- Site delete: soft-deletes with monitor_active=0 and closes events
- Pause: closes active events with manual_override + projects state;
  the underlying close transaction is asserted with sqlmock
- Resume: sets monitor_active=1
- Manual close: happy path with read-back, default reason fallback,
  not-found, cross-site rejection (404), already-closed idempotency,
  invalid id parsing
- Trigger-now: site-not-found 404, success path with no active events,
  success path that closes one active event via probe_cleared, invalid
  id 400
- Helpers: validateMonitorURL, encodeCustomHeaders, boolToTinyint,
  buildUpdateSetClause empty/full

Verified in docker against the running stack: create / patch /
pause / resume / trigger-now / 422 validation / 204 delete all returned
expected responses. Idempotency-Replayed header confirmed on a same-key
replay.
Closes items 7 and 10 from the verifier review punch-list.

Body size cap: handleCheck wraps r.Body in http.MaxBytesReader (10MB) before
the JSON decoder runs. An overlong payload now returns 413 Request Entity
Too Large rather than streaming through the decoder until something else
times out. 10MB is generous headroom — a typical 200-site batch is ~50KB.

Empty auth-token guard: veriflier2/cmd/main.go now refuses to start if the
resolved auth token is empty. Previously an empty token created a subtle
auth-bypass edge case where any request with the literal "Bearer " header
(no token after the space) would pass the equality check. Mirrors the same
pattern as the existing empty-port guard.

Item 11 (log.Fatalf on Listen failure) was reviewed and left as-is. Listen
only returns on startup port-binding failures (no in-flight work to drain)
or extremely rare mid-serve listener errors; clean shutdown via SIGINT goes
through srv.Shutdown which makes Listen return ErrServerClosed cleanly. The
current code is correct.
Three test races flagged by `go test -race ./...` predate this branch but
are easy enough to clean up while we're here. CI is now race-clean across
the entire module.

orchestrator: TestEscalateToVerifliersRecordsFalsePositiveWhenQuorumMissed
shared a `call` counter across the verifier-RPC goroutines escalateToVerifliers
spawns. Replace with sync/atomic.Int64.Add — same "first verifier returns
Success=false, subsequent ones return true" semantics, no race.

checker: TestQueueDepth, TestActiveCount, and TestScaleUpWhenQueueDeep all
used a two-Cleanup pattern that, due to LIFO ordering, restored the package-
level poolCheckFunc stub before the worker goroutines had finished reading
it. Consolidate into a single Cleanup that unblocks workers, drains the
pool to completion, and only then restores the stub. Functionally equivalent;
race-free.
The existing handler-level tests invoke handlers directly, bypassing the
requireScope middleware. These tests close that gap by going through
s.routes() so the middleware actually fires:

- TestPhase2WriteEndpointsRejectReadToken: a read-scope key on every
  Phase 2 write endpoint (POST /sites, PATCH /sites/{id}, DELETE,
  pause/resume, trigger-now, manual close) returns 403 insufficient_scope
- TestPhase2WriteEndpointsAcceptWriteToken: a write-scope key reaches
  the handler (asserts NOT 401/403; the handler may then 400/404 due to
  test-scoped DB state, but that's downstream of scope enforcement)
- TestPhase2ReadEndpointsAcceptReadToken: read scope passes on read
  endpoints
- TestPhase2WriteEndpointsRejectMissingToken: no Authorization header →
  401 missing_token across all write endpoints
- TestAdminTokenCanReachAllScopes: admin includes write includes read

Each subtest sets up the auth lookup expectations (key SELECT + last_used_at
UPDATE) via a small expectAuthLookup helper. That way the boilerplate
stays out of the test bodies and the scope assertion is the focus.
Resolved Phase 3 questions land in API.md's webhooks section:
- Detection: pull-based, 1s poll interval on jetmon_event_transitions.
  Long-term answer for the architecture; not a stepping stone toward push.
  Multi-instance via row-claim, no pub/sub layer needed.
- Retry/dead-letter: 6 attempts on the 1m/5m/30m/1h/6h/24h schedule;
  abandoned status in the same jetmon_webhook_deliveries table; manual
  retry endpoint for re-firing after a consumer fixes their endpoint.
- Filter semantics: empty = match all, AND across dimensions, whitelist
  only. Stripe/GitHub/Slack convention.
- Signing/rotation: HMAC-SHA256 over {timestamp}.{body}; immediate
  revocation only in v1.
- event.* webhook types fire 1:1 with jetmon_event_transitions rows.
  site.state_changed deferred.

Deferred items captured in ROADMAP.md:
- site.state_changed webhooks (rollup from events to site-row projection)
- Grace-period secret rotation (server signs both old + new for a window)
- Multi-repo / multi-binary split: orchestrator, API, deliverer, dashboard,
  and a renamed verifier as separately deployable services. Schema is
  already the implicit bus; split would extract each concern into its
  own cmd/ entry and move shared types out of internal/. "veriflier" is
  a long-standing typo and a split is the natural moment to rename
  (candidates: verifier, witness, probe-worker, vantage).

Each deferred entry includes the trigger condition that would prompt
revisiting and the upgrade path that keeps it non-breaking.
API.md webhooks section gains:
- Backpressure (Q6): shared 50-goroutine pool with per-webhook in-flight
  cap of 3, enforced via map[webhook_id]int counter under a mutex.
  Prevents a slow URL from monopolizing the pool and starving other
  webhooks' deliveries.
- Schema (Q7): jetmon_webhooks and jetmon_webhook_deliveries column
  layouts, the (status, next_attempt_at) and (webhook_id, created_at)
  indexes the worker and list-deliveries endpoints need, and the
  frozen-at-fire-time payload contract.
- Signing rationale (Q8): brief catalogue of the alternatives considered
  (GitHub-style, Slack-style, JWT, RFC 9421 HTTP Message Signatures,
  asymmetric Ed25519) with the conditions under which each would
  become attractive. Stripe-style HMAC-SHA256 over {timestamp}.{body}
  is the right call for our internal-API shape; asymmetric is the most
  compelling future migration if/when a public API without a gateway
  becomes a requirement.
- Webhook ownership (Q9): write-scope manages all webhooks today;
  created_by is audit-only. Section explicitly enumerates the
  ramifications if Jetmon ever becomes a public API: per-tenant
  ownership column, filtered queries, possibly a webhooks scope,
  backfill migration. created_by is forward-compatible.

ROADMAP.md gains a "Path to a public API" section under Architectural
roadmap that consolidates every internal-API design decision that would
need to change for direct customer access (auth scopes, error semantics,
error verbosity, webhook ownership and signing, rate limiting model,
idempotency key scoping, site id semantics). The migrations are
individually clean but touch most of the surface — public-API exposure
would be a project, not a flag flip.
Migrations 13-15 add jetmon_webhooks, jetmon_webhook_deliveries, and
jetmon_webhook_dispatch_progress. The webhooks table stores the HMAC
signing secret in plaintext rather than as a hash — unlike inbound API
keys (where a constant-time hash compare is sufficient), outbound HMAC
signing requires the actual key material in memory. The threat model is
documented inline on the migration.

internal/webhooks holds the package: Webhook + filter matching across
event/site/state dimensions (AND, empty=match-all), Delivery enqueue
with INSERT IGNORE on (webhook_id, transition_id), and a Worker that
polls jetmon_event_transitions on a high-water mark, builds + signs
payloads (Stripe-style "t=<unix>,v1=<hex>" over "{ts}.{body}"), and
dispatches them with a per-webhook in-flight cap (default 3) inside a
shared pool (default 50). Retry schedule: 1m, 5m, 30m, 1h, 6h — six
attempts then abandon.

API surface: full CRUD on /api/v1/webhooks, rotate-secret (revokes the
old key immediately and returns the new one once), list deliveries with
status filter + cursor, and manual retry on abandoned rows. Raw secrets
are only ever returned on create/rotate; subsequent reads expose only a
4-char preview.

Verified end-to-end in docker: opened a webhook, fired event.opened →
state_changed → closed transitions, watched signed POSTs land at a test
target, verified the signature consumer-side with Python HMAC, switched
the target to /fail to exercise the retry ladder, and exercised the
manual retry endpoint against an abandoned row.
The original entry positioned the deliverer as a webhook-only worker.
Webhooks, alert contacts, and WPCOM notifications share the same
plumbing — poll a transition source, freeze a payload, dispatch with
per-destination caps, retry, abandon — and only the transport differs.
Splitting them into separate binaries triples the operational surface
for what is fundamentally one job.

Reframes the binary scope, adds a "why one, not three" rationale and
a concrete sketch (Dispatcher interface, per-channel state tables,
shared circuit-breaker registry), and notes the trigger that makes the
split pay off: alert contacts ship as the second transport and WPCOM
notifications fold in as the third.
Replaces the Family 5 stub ("legacy bridge to WPCOM") with the actual
Phase 3.x design: managed channels for human destinations (email,
PagerDuty, Slack, Teams), parallel to the webhooks API rather than a
generalization of it.

Key design decisions captured:
- Boundary with webhooks: alert contacts deliver Jetmon-rendered
  notifications through Jetmon-owned transports; webhooks deliver
  the raw signed event stream for the consumer to render. Customers
  who want generic outbound webhooks register a webhook, not an
  alert contact — explicitly noted with deferral rationale.
- Filter shape: site_filter + min_severity (no event-type filter, no
  state filter). Severity gate is the primary control surface, with
  Critical default to avoid accidental noise.
- Per-contact max_per_hour cap as pager-storm insurance.
- Send-test endpoint (POST /alert-contacts/{id}/test) for verifying
  newly-configured contacts.
- Email delivery via swappable Sender interface with wpcom / smtp /
  stub implementations — wpcom for production, smtp for docker
  integration tests, stub for unit tests.
- Plaintext credential storage in destination JSON, same rationale as
  webhook secrets (outbound dispatch needs the raw value).
- Worker placement intentionally parallel (internal/alerting/) rather
  than extending internal/webhooks/.

ROADMAP additions under "Deferred from Phase 3.x":
- SMS, OpsGenie, quiet hours, alert ack, alert grouping, generic
  webhook transport — each with deferral rationale and revisit trigger.
- WPCOM notification migration: documented as the cleanup that pays
  off the deliverer-binary extraction, gated on alert contacts proving
  out in production.

Multi-binary-split section: explicit revisit point added for unifying
internal/alerting/ and internal/webhooks/ at deliverer-binary extraction
time, with rationale for why the split is right today (no shared
abstraction with only one known concrete user) and what enables the
later unification (third transport, two known shapes).

The first draft used placeholder severity names (Info/Notice/Warning/
Critical). The actual codebase severity ladder is Up < Warning <
Degraded < SeemsDown < Down (uint8 0-4 in internal/eventstore).
Aligned the JSON string names, default min_severity, severity gate
description, PagerDuty mapping table, and schema column type to match
what the code actually uses.

jetmon_alert_contacts mirrors jetmon_webhooks but with a simpler filter
model: site_filter (JSON, empty=all sites) + min_severity (TINYINT
matching internal/eventstore.Severity*, default 4=Down) + max_per_hour
rate cap. destination is JSON because each transport has a different
shape (email address, PagerDuty integration key, Slack/Teams webhook
URL); plaintext storage rationale matches jetmon_webhooks.secret —
outbound dispatch needs the raw value at every send.

jetmon_alert_deliveries is a per-fire record table mirroring
jetmon_webhook_deliveries, with the same fan-in shape (one transition
→ many deliveries, one contact gets at most one delivery per transition
via uk_alert_transition). Adds a severity column so the dispatcher can
render transport-specific severity (PagerDuty critical/warning/info)
without re-deriving from the payload.

jetmon_alert_dispatch_progress is a high-water-mark table for the alert
delivery worker, identical shape to jetmon_webhook_dispatch_progress.

Establishes the package boundary parallel to internal/webhooks: types,
sentinel errors, severity name <-> uint8 helpers, transport identifier
enum, and the Dispatcher interface that every concrete transport
implements. No DB or transport code yet — that's the next two tasks.

AlertContact.Matches encodes the full filter rule:
  site_id ∈ SiteFilter (or empty = match all sites)
  AND (
    new_severity >= MinSeverity            // escalation / sustained
    OR (prev_severity >= MinSeverity       // recovery from a
        AND new_severity == SeverityUp)    //   previously-paging state
  )

Recovery firing requires both prev and new severity because Matches
doesn't see the transition reason. The unit tests exercise both halves
of the gate, the recovery edge cases, the empty-site-filter "match all"
semantic, and the AND across all dimensions.
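The gate above can be sketched directly in Go. This is a pared-down illustration of the rule as stated, assuming the severity ladder from internal/eventstore (Up=0 … Down=4); the struct fields and method shape are illustrative, not the actual AlertContact code.

```go
package main

import "fmt"

// Severity ladder from internal/eventstore: Up < Warning < Degraded <
// SeemsDown < Down (uint8 0-4).
const (
	SeverityUp uint8 = iota
	SeverityWarning
	SeverityDegraded
	SeveritySeemsDown
	SeverityDown
)

// AlertContact is a pared-down sketch; field names are illustrative.
type AlertContact struct {
	SiteFilter  []int64 // empty = match all sites
	MinSeverity uint8
}

// Matches encodes the filter rule: site gate AND (escalation/sustained
// at or above the threshold OR recovery to Up from a previously-paging
// severity). Recovery needs both severities because the method never
// sees the transition reason.
func (c AlertContact) Matches(prevSev, newSev uint8, siteID int64) bool {
	if len(c.SiteFilter) > 0 {
		found := false
		for _, id := range c.SiteFilter {
			if id == siteID {
				found = true
				break
			}
		}
		if !found {
			return false
		}
	}
	if newSev >= c.MinSeverity {
		return true
	}
	return prevSev >= c.MinSeverity && newSev == SeverityUp
}

func main() {
	c := AlertContact{MinSeverity: SeverityDown}
	fmt.Println(c.Matches(SeverityUp, SeverityDown, 42))    // escalation fires
	fmt.Println(c.Matches(SeverityDown, SeverityUp, 42))    // recovery fires
	fmt.Println(c.Matches(SeverityUp, SeverityWarning, 42)) // below the gate
}
```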

Notification is the rendered shape passed to Dispatcher.Send — the
worker builds it once per delivery from the frozen-at-fire-time
payload, transports translate it into channel-specific representations
(PagerDuty events-v2, Slack Block Kit, Teams Adaptive Card, email
subject + body).

Email is unique among alert-contact transports in that there is no
"post to this URL" — it requires a sender. emailDispatcher owns the
rendering (subject + plain + HTML body) and delegates the actual send
to a Sender, so swapping transports is a config change and rendering
logic stays in one place.

Three Sender implementations:
- StubSender — records messages in memory + log line; default for
  EMAIL_TRANSPORT="stub" and the basis of unit tests.
- SMTPSender — net/smtp send; for dev/staging with MailHog.
- WPCOMSender — Bearer-token POST to a WPCOM-owned email endpoint;
  same shape as the existing internal/wpcom client.
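The swappable seam can be sketched as a small interface plus the stub. This is an assumption about the shape, inferred from the description; the Message fields and type names are illustrative, not the actual internal/alerting types.

```go
package main

import "fmt"

// Message is an illustrative rendered-email shape.
type Message struct {
	From, To, Subject, PlainBody, HTMLBody string
}

// Sender is the swappable transport seam: emailDispatcher owns the
// rendering and delegates the actual send, so wpcom/smtp/stub are
// interchangeable via config.
type Sender interface {
	Send(m Message) error
}

// StubSender records messages in memory — the unit-test / default
// implementation. SMTP and WPCOM senders satisfy the same interface.
type StubSender struct {
	Sent []Message
}

func (s *StubSender) Send(m Message) error {
	s.Sent = append(s.Sent, m)
	return nil
}

func main() {
	stub := &StubSender{}
	var sender Sender = stub // swap for SMTPSender / WPCOMSender via config
	_ = sender.Send(Message{To: "ops@example.com", Subject: "[Jetmon] test send"})
	fmt.Println(len(stub.Sent))
}
```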

Rendering covers severity name, site URL, event id/type, state,
reason, and timestamp. Subject prefixes signal recovery and test
sends. HTML output escapes untrusted fields (site URL, reason)
against XSS in HTML email clients.

Config additions: EMAIL_TRANSPORT (default "stub"), EMAIL_FROM,
WPCOM_EMAIL_ENDPOINT/AUTH_TOKEN, SMTP_HOST/PORT/USERNAME/PASSWORD/
USE_TLS. Validation is lazy — happens at Dispatcher construction
time, not at config load — so a misconfigured email block doesn't
prevent the binary from starting if alerting is disabled.

Tests cover subject variants, plain/HTML body rendering, HTML escaping,
recovery and test banners, dispatcher destination parsing (happy path
and three rejection cases), MIME multipart shape, WPCOM HTTP request
shape (Bearer auth, JSON body, status surfacing), and StubSender
record/reset behavior.

Three concrete Dispatcher implementations, each rendering the
Notification into the channel-specific request shape:

PagerDuty (Events API v2): event_action of "trigger" or "resolve"
based on the recovery flag, dedup_key derived from the Jetmon event
id so all transitions of the same incident group under one alert.
Severity mapping: Down/SeemsDown → critical, Degraded → warning,
Warning → info. Test sends use a distinct dedup_key so they never
accidentally resolve a real alert.
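The severity mapping and action selection can be sketched as below — an illustration of the mapping table stated above, assuming the uint8 ladder from internal/eventstore; function names are made up.

```go
package main

import "fmt"

const (
	SeverityUp uint8 = iota
	SeverityWarning
	SeverityDegraded
	SeveritySeemsDown
	SeverityDown
)

// pdSeverity maps the Jetmon ladder onto PagerDuty Events API v2
// severities per the table above: Down/SeemsDown → critical,
// Degraded → warning, Warning → info.
func pdSeverity(sev uint8) string {
	switch {
	case sev >= SeveritySeemsDown:
		return "critical"
	case sev == SeverityDegraded:
		return "warning"
	default:
		return "info"
	}
}

// pdEventAction picks trigger vs resolve from the recovery flag.
func pdEventAction(recovery bool) string {
	if recovery {
		return "resolve"
	}
	return "trigger"
}

func main() {
	fmt.Println(pdSeverity(SeverityDown), pdEventAction(false))
	fmt.Println(pdSeverity(SeverityDegraded), pdEventAction(true))
}
```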

Slack: Block Kit message with severity-keyed emoji header, fallback
text for old clients / mobile previews, and a fields section with
site id, event id, state, reason, time. Recovery and test variants
swap the header.

Teams: Adaptive Card 1.4 wrapped in the message envelope expected by
incoming-webhook URLs. FactSet for the same fields Slack uses.

Shared postJSON helper handles serialization, request building, and
status interpretation: 4xx/5xx returns an error so the worker
schedules a retry. Response bodies are read with a 4 KB cap and
truncated to last_response's column width before being returned.

24 tests cover happy paths, recovery action mapping (PagerDuty
resolve), test-send dedup isolation, severity ordering, bad
destination rejection (empty fields, malformed JSON), and upstream
error surfacing.

internal/alerting/contacts.go provides Create/Get/List/ListActive/
Update/Delete and a separate LoadDestination helper for the worker.
Mirrors the webhooks CRUD shape but with the simpler filter model
(site_filter + min_severity, no event-type/state filter) and a
transport-aware destination validator: each transport requires its
own field shape (email→address, pagerduty→integration_key,
slack/teams→webhook_url).

destination_preview is the last 4 chars of the credential field for
the contact's transport, so operators can identify a contact without
exposing the full credential. The destination itself is loaded
separately by the worker via LoadDestination, kept off the
AlertContact struct so it can't leak through serialization.
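The preview rule can be sketched as a one-liner; the name and the handling of very short credentials here are illustrative, not the actual helper.

```go
package main

import "fmt"

// destinationPreview returns the last 4 characters of the credential
// field so operators can identify a contact without exposing the full
// secret. Short-input behavior is an assumption for the sketch.
func destinationPreview(credential string) string {
	r := []rune(credential)
	if len(r) <= 4 {
		return credential
	}
	return string(r[len(r)-4:])
}

func main() {
	fmt.Println(destinationPreview("R0123456789abcdef")) // "cdef"
}
```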

Update intentionally cannot change the transport — the destination
shape is transport-specific and validating cross-transport changes
is brittle. Operators who want to switch transports delete and
re-create.

API surface in internal/api/handlers_alerts.go: POST/GET/LIST/PATCH/
DELETE on /api/v1/alert-contacts plus POST /alert-contacts/{id}/test.
The send-test endpoint loads the contact, fetches the destination,
and dispatches a synthetic IsTest=true notification through the
configured Dispatcher with a 15s timeout. Returns the transport's
status_code and response_body so operators can verify connectivity;
transport errors surface as 502.

Server gains an alertDispatchers map (alerting.Transport → Dispatcher)
populated via SetAlertDispatchers — this is the same map the worker
will use, so a successful send-test exercises the real production
code path. nil map (alerting disabled) → 503 transport_not_configured.

12 new sqlmock tests cover create happy-path with default min_severity,
transport rejection (sms), missing-field rejection per transport,
invalid severity name (Critical instead of Down), get/update/delete
happy + 404, and send-test happy / transport-error / no-dispatcher
cases.

internal/alerting/deliveries.go provides the per-fire delivery store
mirroring internal/webhooks/deliveries.go: Enqueue (INSERT IGNORE on
uk_alert_transition), ClaimReady, MarkDelivered, ScheduleRetry,
GetDelivery, ListDeliveries, RetryDelivery for the operator manual-
retry path. Adds MarkSuppressed for the rate-cap exit — terminal,
status='abandoned', last_status_code=429, last_response identifies
why so deliveries don't look like normal abandonments in the audit
trail.

internal/alerting/worker.go runs two background goroutines:

  - dispatchTick polls jetmon_event_transitions on a high-water mark
    in jetmon_alert_dispatch_progress, then for each new transition
    matches every active contact via Matches(prev_severity,
    next_severity, site_id) — the recovery edge fires correctly
    because we read both severity columns from the transition row.

  - deliverTick claims pending deliveries respecting the per-contact
    in-flight cap, looks up the contact + destination, builds a
    Notification (frozen-payload-derived; site URL looked up live for
    display only), and dispatches through the configured Dispatcher
    map. Same retry ladder as webhooks (1m/5m/30m/1h/6h then abandon).

Per-contact rate cap enforced via an in-memory sliding window
(rateLimitWindow). When the cap is hit, the delivery is marked
suppressed rather than retried — by the time the cap reopens the
alert is stale, so retrying just delays the inevitable. Single-instance
caveat documented inline; matches the existing multi-instance row-claim
caveat in webhooks.

14 unit tests cover the retry schedule, defaults, in-flight cap (per-
contact and isolation), zero-count cleanup, rate-window admission and
expiry, contact isolation in the rate window, and the transition reason
to alert event type mapping. Race-clean.

GET /api/v1/alert-contacts/{id}/deliveries with status filter and
cursor pagination; POST /api/v1/alert-contacts/{id}/deliveries/
{delivery_id}/retry to reset abandoned rows back to pending. Same
shapes as the webhook delivery endpoints.

cmd/jetmon2/main.go now boots the alerting worker alongside the
webhooks worker (gated on API_PORT > 0). buildAlertDispatchers
constructs the per-transport map from runtime config: the three
HTTP-based transports (PagerDuty, Slack, Teams) always wire up;
email picks wpcom/smtp/stub via EMAIL_TRANSPORT and falls back to
stub if unset. The same dispatcher map is shared with the API
server via SetAlertDispatchers, so a successful send-test exercises
the real production code path.

5 sqlmock tests cover list happy path, bad status rejection, retry
happy path, wrong-contact 404 cross-check, and not-abandoned 409.

Verified end-to-end in docker: registered an email + a slack alert
contact, ran the send-test endpoint (Block Kit message landed at
the test target with the 🔍 test emoji), inserted a failing
test site, watched the worker enqueue + dispatch alert.opened ->
alert.state_changed -> alert.closed transitions through the slack
transport (status 200 round trip), stopped the target to exercise
the retry path (delivery rescheduled with last_response captured
the transport error), force-marked one delivery abandoned and
exercised the manual retry endpoint (200 with status=pending; second
call against pending row returns 409 delivery_not_retryable as
expected). Test data cleaned up.

ClaimReady now follows its SELECT with a per-row UPDATE pushing
next_attempt_at to NOW+60s before the dispatch goroutine even starts.
Without this, the 1-second deliver tick was re-claiming any still-
in-flight row up to the per-contact in-flight cap (default 3),
producing concurrent dispatches that all read d.Attempt at claim
time, computed retry delays from the same stale value, and ran
ScheduleRetry's `attempt = attempt + 1` SQL three times in a row.

The visible effect: the documented 1m / 5m / 30m / 1h / 6h ladder
was collapsing to roughly 1m → 1h → abandoned, ~1h total instead of
the intended 7h36m. A transient consumer outage during a deploy
window could exhaust all retries before the consumer was back.

The dispatch goroutine still owns the real next_attempt_at: it
overwrites the soft lock with NULL on success or the actual retry
time on failure. If a goroutine panics without recovery (or the
process is OOM-killed mid-dispatch), the 60s soft lock expires and
the row becomes claimable again — natural recovery without operator
intervention.

Same fix in both internal/webhooks and internal/alerting (they share
the bug because alerting's worker was modeled on webhooks). Tests
verify the contract: SELECT + N per-row UPDATEs, and an idle tick
with no candidates issues no UPDATE traffic.

Multi-instance row-claim (SELECT ... FOR UPDATE SKIP LOCKED) is
still tracked alongside the deliverer-binary extraction in ROADMAP.

CHANGELOG gets a comprehensive "v2 branch — site health platform"
section ahead of the rewrite section, covering everything that has
landed on this branch: event sourcing, the internal REST API,
Phase 3 webhooks, Phase 3.x alert contacts, verifier hardening, and
the soft-lock worker fix from the previous commit. Marked clearly
as not drop-in with Jetmon 1 (separate from the rewrite branch).

AGENTS.md's Key Files table picks up internal/api, internal/apikeys,
internal/webhooks, internal/alerting, plus references to API.md and
ROADMAP.md so future agents have a route to the design docs.
Architecture diagram intentionally left as-is — it reflects the
drop-in rewrite, and the v2 architecture is still mid-flight pending
the deliverer-binary extraction.

Three small hardenings caught while reviewing the recently-added code:

1) alerting.Update now validates label (must be non-empty) and
   max_per_hour (must be >= 0) at input time, before the DB lookup.
   Previously an empty label PATCH would silently persist and a
   negative max_per_hour would surface as a generic 500 from MySQL's
   INT UNSIGNED constraint instead of a clean 422. Validations that
   don't need the existing row run first so obviously bad bodies
   don't pay for a round-trip.

2) buildMIMEMessage and renderEmailSubject now strip CR/LF from
   anything that becomes a MIME header value (From, To, Subject,
   site URL in the subject). Defense-in-depth: monitor_url is
   operator-controlled but the column doesn't enforce CRLF-free,
   so a malicious or accidental URL with embedded CRLF could have
   added new header lines (Bcc:, X-headers, etc.) in outbound email.
   Body content with newlines is intentionally unaffected.

3) POST /api/v1/alert-contacts/{id}/test now goes through
   withIdempotency like the other write POSTs (and like
   webhooks/{id}/rotate-secret). A retried "click to test" during a
   network blip no longer double-pages the destination.
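The header stripping in (2) can be sketched as below — a minimal illustration of the defense, assuming a plain CR/LF strip; the function name is made up.

```go
package main

import (
	"fmt"
	"strings"
)

// sanitizeHeaderValue strips CR and LF from anything that becomes a MIME
// header value, so an operator-controlled monitor_url with embedded CRLF
// cannot inject extra header lines (Bcc:, X-headers, ...) into outbound
// email. Body content is intentionally left untouched.
func sanitizeHeaderValue(v string) string {
	return strings.NewReplacer("\r", "", "\n", "").Replace(v)
}

func main() {
	evil := "https://example.com\r\nBcc: attacker@example.com"
	// Stays a single header line after sanitizing.
	fmt.Printf("Subject: [Jetmon] %s is down\r\n", sanitizeHeaderValue(evil))
}
```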

Tests: 2 new for the Update validation rejections (no DB hit because
validation fires first), 2 new for the MIME header strip (subject
strip + asserting no new header lines are created when the input
contains CRLF; body CRLF passes through unchanged).

API.md picks up a one-paragraph note on the send-test endpoint
explaining its Idempotency-Key support — same shape as the rest of
the write surface, called out specifically because operators are the
typical caller (a "click to test" UX expectation).

CHANGELOG gets a "Polish" subsection under the v2 branch entry
covering the three hardenings just landed: Update input validation,
MIME header CRLF stripping, and idempotent send-test.

Each httptrace phase has a Start hook and a Done hook. When the
connection fails mid-way (TCP refused, TLS handshake failure,
hostname resolution then disconnect, etc.), the Start hook fires but
the matching Done hook never does — leaving its *End timestamp as
the zero time.Time. The recording code then computed
zero_time.Sub(real_time), which is roughly the negative of the
original timestamp's nanoseconds — a huge negative duration that
overflows the INT NULL columns in jetmon_check_history.

The visible symptom was a flood of repeat log lines on every check
round whenever a monitored site refused TCP:

  orchestrator: record history blog_id=N: Error 1264 (22003):
    Out of range value for column 'dns_ms' at row 1

The fix is one line per phase: only record a duration if BOTH the
Start and Done hooks fired. A failed phase reports zero rather than
a misleading negative value. Zero is the right reporting because
the phase didn't successfully complete — there is no "duration" to
report.

Regression test added to TestCheckConnectionRefused asserting all
three phase durations are non-negative on a connection-refused
target. Without the fix, the TCP duration would be roughly the
negative of the current Unix time in nanoseconds.

The previous diagram described the original drop-in rewrite — three
boxes (orchestrator, check pool, gRPC server) talking to the same
internal channels. The v2 branch has three more independently-scaling
concerns the diagram didn't show: the REST API server, the webhook
delivery worker, and the alerting delivery worker — all consumers
of jetmon_event_transitions via the eventstore.

Updated diagram shows the layered shape:
  - Top tier: orchestrator + check pool + veriflier transport (the
    "monitor what's out there" half)
  - Middle: eventstore as the single writer for events / transitions
  - Bottom tier: API + webhook worker + alerting worker (the
    "tell the world about it" half)

Plus inline component descriptions for each new package and an
explicit forward-looking note about the deliverer-binary split
tracked in ROADMAP.md, so future agents have a route from "I see
two delivery workers" to "yes, intentionally — they unify when the
binary extracts."

WPCOM is shown as the legacy notification path coexisting with
alert contacts, matching the design decision documented in API.md
"Family 5 → Relationship to legacy WPCOM notifications."

Adds docs/adr/ with a README explaining the format and seven ADRs
covering the load-bearing decisions on this branch — the kind of
"why is X like this" question that has been re-explained more than
once in code review, commit messages, and inline design rationale.

Each ADR is short (Status / Context / Decision / Consequences, plus
Alternatives where useful) and cross-links to the others and to the
relevant code paths.

  0001 — Event-sourced state model with dedicated transitions table
  0002 — Internal-only API behind a gateway
  0003 — Plaintext credential storage for outbound dispatch
  0004 — Stripe-style HMAC-SHA256 webhook signatures
  0005 — Pull-only webhook and alerting delivery
  0006 — Separate internal/alerting and internal/webhooks packages
  0007 — Soft-lock claim vs SELECT ... FOR UPDATE SKIP LOCKED

AGENTS.md picks up a Key Files row pointing future agents at
docs/adr/. The README explains the conventions: append-only after
acceptance, one decision per ADR, cross-link generously, don't
backfill speculatively.

No code changes; pure docs.

@chrisbliss18 chrisbliss18 changed the title Event-sourced state model and Phase 1 read AP Internal REST API foundation: events, webhooks, alerting Apr 25, 2026