Skip to content

feat(events): add site-event lifecycle model and orchestrator hooks#70

Open
heydemoura wants to merge 5 commits intov2from
v2-add-site-events
Open

feat(events): add site-event lifecycle model and orchestrator hooks#70
heydemoura wants to merge 5 commits intov2from
v2-add-site-events

Conversation

@heydemoura
Copy link
Copy Markdown
Contributor

Summary

This branch layers a first-class site event lifecycle onto Jetmon 2 and wires it into the monitor flow, while also tightening migration behavior and expanding project docs/API planning relative to master.

Changes vs master

  • Added a dedicated jetmon_site_events table (migration 9) and introduced explicit event lifecycle/severity constants in internal/db/site_events.go:
    • lifecycle/event types: SeemsDown and ConfirmedDown
    • severities: Low and High
  • Integrated event lifecycle transitions into the orchestrator:
    • open on first local failure (openSiteEvent)
    • upgrade on verifier-confirmed downtime (upgradeOpenSiteEvent)
    • close on recovery or false-positive clear (closeOpenSiteEvent)
  • Added DB query helpers for event lifecycle operations (OpenSiteEvent, UpgradeOpenSiteEvent, CloseOpenSiteEvent) with tests.
  • Added/expanded API surface documentation for the planned authenticated /api/v1/* endpoints in ROADMAP.md, including:
    • state/event/history/uptime/read APIs
    • manage/trigger write APIs
    • auth model (Authorization: Bearer <token>, scoped API keys)
    • per-key rate limiting behavior and returned rate-limit headers
  • Included migration resilience improvements around migration application ordering/idempotency and migration 8 (last_checked_at, last_alert_sent_at, supporting index) so schema upgrades are safer on partially migrated environments.
  • Added major documentation updates (ARCHITECTURE.md, PROJECT.md, EVENTS.md, TAXONOMY.md, ROADMAP.md, and README updates), including clearer getting-started/onboarding references from the main docs.

Testing performed

  • go test ./...
  • Added and exercised targeted tests around:
    • orchestrator event open/upgrade/close flow
    • DB event query helpers
    • broader checker/config/dashboard/veriflier/wpcom/orchestrator paths included in this branch

@heydemoura heydemoura changed the base branch from master to v2 April 23, 2026 23:32
@heydemoura heydemoura marked this pull request as ready for review April 23, 2026 23:32
Copy link
Copy Markdown
Contributor

@chrisbliss18 chrisbliss18 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes match what EVENTS.md and TAXONOMY.md describe: open at first failure, upgrade in place when the verifier confirms, close on recovery or false alarm. started_at is preserved on upgrade, the row isn't deleted on close, severity and state live in separate columns, and there's a foreign key with cascade. This all looks good, but I do have some recommended changes.

DBUpdatesEnable only gates half the writes. The toggle exists so a test instance can't clobber prod data. Site-row updates respect it; event writes don't. I think you should either limit event writes the same way or decide events are always on and make updates regardless of that setting.

At first sniff, the severity column in the jetmon_site_events table looked redundant with event_type, but then I realized that this is only because the current scope is small relative to the full scope.

Perhaps adding a comment about this could help clarify this and prevent confusion until more of the code is fleshed out. Such as:

  // Severity is intentionally separate from EventType. Today only two states
  // exist and severity tracks event_type 1:1, but the model is for severity                                                                                                                                                                                                                                     
  // to evolve within a state (e.g. cert-expiry warning at 30d vs 3d).                                                                                                                                                                                                                                           
  // See EVENTS.md "Severity vs. state".                                                                                                                                                                                                                                                                         
  const (                                                                                                                                                                                                                                                                                                        
      EventTypeSeemsDown     EventType = 1                                                                                                                                                                                                                                                                       
      EventTypeConfirmedDown EventType = 2                                                                                                                                                                                                                                                                       
  )     

auditTransition already records Up, SeemsDown, Down, Up. The new event table records the same. If both stay, postmortems have to join them by timestamp, which is the kind of thing that quietly drifts. Either pick a canonical one or store a cross-reference.

Aside: the PR content talks about the master branch, but I think you mean v2 branch. I interpreted everything relative to the v2 branch.

Comment thread internal/orchestrator/orchestrator.go Outdated
Comment on lines 419 to 423
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EVENTS.md says the projection on the site row gets updated in the same transaction as the event. Right now confirmDown writes the event upgrade, then writes site_status only if DBUpdatesEnable is true, and they're separate statements. A crash in between produces exactly the drift the docs say is a bug.

These database transactions should be wrapped in a transaction.

Comment thread internal/orchestrator/orchestrator.go Outdated
Comment on lines +324 to +326
if entry.failCount == 1 {
o.openSiteEvent(site, db.EventTypeSeemsDown, db.EventSeverityLow, entry.firstFailAt)
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TAXONOMY.md lists maintenance-window suppression of event creation as a v1 rule. The notification path checks inMaintenance(site); the event path doesn't. With these changes, a check that fails during a known maintenance window still writes a SeemsDown row.

Comment thread internal/db/migrations.go Outdated
Comment on lines +94 to +109
{9, `CREATE TABLE IF NOT EXISTS jetmon_site_events (
id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
jetpack_monitor_site_id BIGINT UNSIGNED NOT NULL,
event_type TINYINT UNSIGNED NOT NULL,
severity TINYINT UNSIGNED NOT NULL,
started_at DATETIME NOT NULL,
ended_at DATETIME NULL,
created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
INDEX idx_site_event_type (jetpack_monitor_site_id, event_type),
INDEX idx_event_type_started (event_type, started_at),
CONSTRAINT fk_jetmon_site_events_site
FOREIGN KEY (jetpack_monitor_site_id)
REFERENCES jetpack_monitor_sites (jetpack_monitor_site_id)
ON DELETE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4`},
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Columns specified in TAXONOMY.md and ROADMAP.md are missing. Adding the nullable ones now is cheap and saves three follow-up migrations:

  • resolution_reason - false alarm vs. verifier-cleared close look identical in the DB right now
  • endpoint_id (nullable) - the documented site → endpoint → check hierarchy needs it
  • check_type / probe_type - the API filters on it, and DNS/TCP/TLS probes are coming
  • cause_event_id - for causal links between events
  • metadata JSON - for check-specific data

The ON UPDATE CURRENT_TIMESTAMP of updated_at contradicts the immutability invariant. EVENTS.md invariant 5: "Closed events are never mutated." The column will silently bump every time anything touches the row. I think it's best to drop it or replace with an explicit closed_at that's only ever set once.

Comment thread internal/orchestrator/orchestrator.go Outdated
}
}

func (o *Orchestrator) closeOpenSiteEvent(site db.Site, endedAt time.Time) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to add a third parameter here that records the reason for closing the open event, such as handling a recovery or a false alarm.

Comment on lines +3 to +9
const (
EventTypeSeemsDown = 1
EventTypeConfirmedDown = 2

EventSeverityLow = 1
EventSeverityHigh = 2
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should these be split up into separate types? Such as:

  type EventType uint8
  type EventSeverity uint8

  const (
      EventTypeSeemsDown     EventType = 1
      EventTypeConfirmedDown EventType = 2
  )

  const (
      EventSeverityLow  EventSeverity = 1
      EventSeverityHigh EventSeverity = 2
  )

This way EventTypeSeemsDown and EventSeverityLow can't be accidentally swapped but the code still compiles since they are the same number.

Comment thread internal/db/queries.go
Comment on lines +100 to +115
// OpenSiteEvent inserts a new open event for the site if none is currently open.
func OpenSiteEvent(ctx context.Context, siteID int64, eventType, severity uint8, startedAt time.Time) error {
_, err := db.ExecContext(ctx,
`INSERT INTO jetmon_site_events
(jetpack_monitor_site_id, event_type, severity, started_at)
SELECT ?, ?, ?, ?
WHERE NOT EXISTS (
SELECT 1
FROM jetmon_site_events
WHERE jetpack_monitor_site_id = ?
AND ended_at IS NULL
)`,
siteID, eventType, severity, startedAt.UTC(), siteID,
)
return err
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea of only one open event per site breaks documented behavior. EVENTS.md and TAXONOMY.md describe a site that can have multiple open events at once: expiring cert and failing endpoint and stopped wp-cron, all visible at the same time. There should only be one open event of a specific event_type per site, but the current checks don't achieve this.

This function doesn't tell the caller whether it actually inserted. This means that we can't log "open suppressed because one was already open" since that detail isn't communicated to the caller. Checking on the number of rows affected would provide this information.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants