feat(events): add site-event lifecycle model and orchestrator hooks#70
feat(events): add site-event lifecycle model and orchestrator hooks#70heydemoura wants to merge 5 commits intov2from
Conversation
058d16f to
fe5eec5
Compare
There was a problem hiding this comment.
The changes match what EVENTS.md and TAXONOMY.md describe: open at first failure, upgrade in place when the verifier confirms, close on recovery or false alarm. started_at is preserved on upgrade, the row isn't deleted on close, severity and state live in separate columns, and there's a foreign key with cascade. This all looks good, but I do have some recommended changes.
DBUpdatesEnable only gates half the writes. The toggle exists so a test instance can't clobber prod data. Site-row updates respect it; event writes don't. I think you should either limit event writes the same way or decide events are always on and make updates regardless of that setting.
At first sniff, the severity column in the jetmon_site_events table looked redundant with event_type, but then I realized that this is only because the current scope is small relative to the full scope.
Perhaps adding a comment about this could help clarify this and prevent confusion until more of the code is fleshed out. Such as:
// Severity is intentionally separate from EventType. Today only two states
// exist and severity tracks event_type 1:1, but the model is for severity
// to evolve within a state (e.g. cert-expiry warning at 30d vs 3d).
// See EVENTS.md "Severity vs. state".
const (
EventTypeSeemsDown EventType = 1
EventTypeConfirmedDown EventType = 2
)
auditTransition already records Up, SeemsDown, Down, Up. The new event table records the same. If both stay, postmortems have to join them by timestamp, which is the kind of thing that quietly drifts. Either pick a canonical one or store a cross-reference.
Aside: the PR content talks about the master branch, but I think you mean v2 branch. I interpreted everything relative to the v2 branch.
There was a problem hiding this comment.
EVENTS.md says the projection on the site row gets updated in the same transaction as the event. Right now confirmDown writes the event upgrade, then writes site_status only if DBUpdatesEnable is true, and they're separate statements. A crash in between produces exactly the drift the docs say is a bug.
These database transactions should be wrapped in a transaction.
| if entry.failCount == 1 { | ||
| o.openSiteEvent(site, db.EventTypeSeemsDown, db.EventSeverityLow, entry.firstFailAt) | ||
| } |
There was a problem hiding this comment.
TAXONOMY.md lists maintenance-window suppression of event creation as a v1 rule. The notification path checks inMaintenance(site); the event path doesn't. With these changes, a check that fails during a known maintenance window still writes a SeemsDown row.
| {9, `CREATE TABLE IF NOT EXISTS jetmon_site_events ( | ||
| id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY, | ||
| jetpack_monitor_site_id BIGINT UNSIGNED NOT NULL, | ||
| event_type TINYINT UNSIGNED NOT NULL, | ||
| severity TINYINT UNSIGNED NOT NULL, | ||
| started_at DATETIME NOT NULL, | ||
| ended_at DATETIME NULL, | ||
| created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP, | ||
| updated_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP, | ||
| INDEX idx_site_event_type (jetpack_monitor_site_id, event_type), | ||
| INDEX idx_event_type_started (event_type, started_at), | ||
| CONSTRAINT fk_jetmon_site_events_site | ||
| FOREIGN KEY (jetpack_monitor_site_id) | ||
| REFERENCES jetpack_monitor_sites (jetpack_monitor_site_id) | ||
| ON DELETE CASCADE | ||
| ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4`}, |
There was a problem hiding this comment.
Columns specified in TAXONOMY.md and ROADMAP.md are missing. Adding the nullable ones now is cheap and saves three follow-up migrations:
resolution_reason- false alarm vs. verifier-cleared close look identical in the DB right nowendpoint_id(nullable) - the documented site → endpoint → check hierarchy needs itcheck_type/probe_type- the API filters on it, and DNS/TCP/TLS probes are comingcause_event_id- for causal links between eventsmetadata JSON- for check-specific data
The ON UPDATE CURRENT_TIMESTAMP of updated_at contradicts the immutability invariant. EVENTS.md invariant 5: "Closed events are never mutated." The column will silently bump every time anything touches the row. I think it's best to drop it or replace with an explicit closed_at that's only ever set once.
| } | ||
| } | ||
|
|
||
| func (o *Orchestrator) closeOpenSiteEvent(site db.Site, endedAt time.Time) { |
There was a problem hiding this comment.
It would be good to add a third parameter here that records the reason for closing the open event, such as handling a recovery or a false alarm.
| const ( | ||
| EventTypeSeemsDown = 1 | ||
| EventTypeConfirmedDown = 2 | ||
|
|
||
| EventSeverityLow = 1 | ||
| EventSeverityHigh = 2 | ||
| ) |
There was a problem hiding this comment.
Should these be split up into separate types? Such as:
type EventType uint8
type EventSeverity uint8
const (
EventTypeSeemsDown EventType = 1
EventTypeConfirmedDown EventType = 2
)
const (
EventSeverityLow EventSeverity = 1
EventSeverityHigh EventSeverity = 2
)
This way EventTypeSeemsDown and EventSeverityLow can't be accidentally swapped but the code still compiles since they are the same number.
| // OpenSiteEvent inserts a new open event for the site if none is currently open. | ||
| func OpenSiteEvent(ctx context.Context, siteID int64, eventType, severity uint8, startedAt time.Time) error { | ||
| _, err := db.ExecContext(ctx, | ||
| `INSERT INTO jetmon_site_events | ||
| (jetpack_monitor_site_id, event_type, severity, started_at) | ||
| SELECT ?, ?, ?, ? | ||
| WHERE NOT EXISTS ( | ||
| SELECT 1 | ||
| FROM jetmon_site_events | ||
| WHERE jetpack_monitor_site_id = ? | ||
| AND ended_at IS NULL | ||
| )`, | ||
| siteID, eventType, severity, startedAt.UTC(), siteID, | ||
| ) | ||
| return err | ||
| } |
There was a problem hiding this comment.
The idea of only one open event per site breaks documented behavior. EVENTS.md and TAXONOMY.md describe a site that can have multiple open events at once: expiring cert and failing endpoint and stopped wp-cron, all visible at the same time. There should only be one open event of a specific event_type per site, but the current checks don't achieve this.
This function doesn't tell the caller whether it actually inserted. This means that we can't log "open suppressed because one was already open" since that detail isn't communicated to the caller. Checking on the number of rows affected would provide this information.
Summary
This branch layers a first-class site event lifecycle onto Jetmon 2 and wires it into the monitor flow, while also tightening migration behavior and expanding project docs/API planning relative to
master.Changes vs
masterjetmon_site_eventstable (migration 9) and introduced explicit event lifecycle/severity constants ininternal/db/site_events.go:SeemsDownandConfirmedDownLowandHighopenSiteEvent)upgradeOpenSiteEvent)closeOpenSiteEvent)OpenSiteEvent,UpgradeOpenSiteEvent,CloseOpenSiteEvent) with tests./api/v1/*endpoints inROADMAP.md, including:Authorization: Bearer <token>, scoped API keys)last_checked_at,last_alert_sent_at, supporting index) so schema upgrades are safer on partially migrated environments.ARCHITECTURE.md,PROJECT.md,EVENTS.md,TAXONOMY.md,ROADMAP.md, and README updates), including clearer getting-started/onboarding references from the main docs.Testing performed
go test ./...