Jetmon 2 — Site health platform by chrisbliss18 · Pull Request #61 · Automattic/jetmon

chrisbliss18 · 2026-04-22T14:20:25Z

Work in progress. This branch (v2) is the ambitious successor to the Go rewrite started in refactor/jetmon2 (PR #60). It includes everything from that branch and extends it with a new architecture and direction.

What changed from PR #60

PR #60 scoped Jetmon 2 as a drop-in replacement: same interfaces, same schema, same behaviour — just Go instead of Node.js + C++. That work is complete and forms the base of this branch.

This branch pivots to a larger goal: Jetmon 2 as a full site health monitoring platform, not just an uptime tracker. The key additions:

Event-sourced architecture. Site state is derived from an event log, not a mutable status column. The event log is the source of truth; the site row carries a denormalized projection for fast reads. Full design in EVENTS.md.
Five-layer test taxonomy. Reachability → Transport & Security → Infrastructure & Edge → Application Response → Content Integrity, plus Reverse Checks (agent-reported signals from inside WordPress). ~55 v1 items, ~55 v2, ~40 v3. Full taxonomy in TAXONOMY.md.
Site → Endpoint → Check hierarchy. Sites have multiple endpoints; each endpoint has multiple checks of different types. Site state rolls up from endpoint state, which rolls up from check results. Rollup rules are explicit and configurable per site.
Multi-state vocabulary. Up, Warning, Degraded, Seems Down, Down, Paused, Maintenance, Unknown. Unknown is not downtime — monitor-side failures never inflate customer downtime figures.
Competitor-parity public REST API. Five capability groups: status and state, events and history, SLA statistics (uptime %, response time p95/p99, MTTR), monitor management (CRUD, pause, resume, trigger-now), and alert contacts with outbound webhooks. Full design in ROADMAP.md.
Gradual rollout with back-compat. The existing site_status column keeps receiving derived writes so current consumers are not broken. New capabilities are additive; consumers adopt progressively.
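As a rough illustration of the rollup and the Unknown-is-not-downtime rule, the sketch below uses a hypothetical "worst state wins" rollup and counts only Down as downtime. Both choices are assumptions made for the example; the PR says rollup rules are configurable per site and does not spell out the defaults. The names `Span`, `rollUp`, and `uptimePercent` are invented here.

```go
package main

import "fmt"

// State is the multi-state vocabulary from the description.
type State int

const (
	Up State = iota
	Warning
	Degraded
	SeemsDown
	Down
	Paused
	Maintenance
	Unknown
)

// Span is a stretch of time spent in one state (hypothetical type).
type Span struct {
	State   State
	Seconds float64
}

// rollUp applies an assumed "worst wins" rule over
// Up < Warning < Degraded < SeemsDown < Down. Paused, Maintenance and
// Unknown children are ignored; if nothing else remains, the result is Unknown.
func rollUp(children []State) State {
	result, seen := Unknown, false
	for _, c := range children {
		if c > Down {
			continue
		}
		if !seen || c > result {
			result, seen = c, true
		}
	}
	return result
}

// uptimePercent drops Unknown, Paused and Maintenance time from the measured
// window, so monitor-side failures cannot inflate downtime. Treating only
// Down as downtime is an assumption of this sketch.
func uptimePercent(spans []Span) (pct float64, ok bool) {
	var measured, down float64
	for _, s := range spans {
		switch s.State {
		case Unknown, Paused, Maintenance:
			continue // excluded from numerator and denominator
		case Down:
			down += s.Seconds
		}
		measured += s.Seconds
	}
	if measured == 0 {
		return 0, false
	}
	return 100 * (measured - down) / measured, true
}

func main() {
	fmt.Println(rollUp([]State{Up, Degraded, Unknown}) == Degraded)
	pct, _ := uptimePercent([]Span{{Up, 90}, {Down, 10}, {Unknown, 100}})
	fmt.Printf("%.1f%%\n", pct)
}
```

With 90 seconds Up, 10 seconds Down and 100 seconds Unknown, the Unknown time is left out entirely and the result is 90% rather than 45%.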

Architectural decisions are locked in AGENTS.md so they are enforced consistently across all changes.

Why Go

The current architecture uses forked Node.js processes (8–16MB RSS each at startup, 53MB limit before recycling) as workers, plus a compiled C++ addon to escape Node's event loop for blocking network I/O. Go eliminates both constraints:

Goroutines start with only a few kilobytes of stack (the Go runtime's minimum is 2KB) and grow on demand, making 50,000 concurrent checks on a single host practical without the memory overhead of forked processes or libuv thread pools
net/http and crypto/tls are first-class stdlib packages — no native addon, no node-gyp, no compilation step during deployment
net/http/httptrace provides DNS, TCP, TLS, and TTFB timing hooks as separate measurements within each check, for free
Single static binary deployment with no runtime dependencies, no node_modules, and no addon rebuild on Node.js version upgrades
Built-in profiling via pprof, race detector via go test -race, and a mature testing ecosystem
Graceful goroutine lifecycle management replaces the fragile worker spawn/recycle/evaporate lifecycle
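To make the httptrace point concrete, here is a minimal sketch of per-phase timing using `httptrace.ClientTrace`. The `Timings` type and the `timedGet` and `probeLocal` helpers are invented for this example and are not taken from the branch. The probe targets a local `httptest` TLS server, so no DNS lookup happens and the DNS figure stays zero.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/http/httptrace"
	"time"
)

// Timings holds per-phase durations for a single check (hypothetical type).
type Timings struct {
	DNS, Connect, TLS, TTFB time.Duration
}

// timedGet issues one GET and records phase timings via httptrace hooks.
// With dual-stack dialing ConnectStart can fire more than once; a
// production version would need to handle that and guard shared state.
func timedGet(client *http.Client, url string) (Timings, int, error) {
	var t Timings
	var dnsStart, connStart, tlsStart time.Time
	start := time.Now()
	trace := &httptrace.ClientTrace{
		DNSStart:             func(httptrace.DNSStartInfo) { dnsStart = time.Now() },
		DNSDone:              func(httptrace.DNSDoneInfo) { t.DNS = time.Since(dnsStart) },
		ConnectStart:         func(network, addr string) { connStart = time.Now() },
		ConnectDone:          func(network, addr string, err error) { t.Connect = time.Since(connStart) },
		TLSHandshakeStart:    func() { tlsStart = time.Now() },
		TLSHandshakeDone:     func(tls.ConnectionState, error) { t.TLS = time.Since(tlsStart) },
		GotFirstResponseByte: func() { t.TTFB = time.Since(start) },
	}
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return t, 0, err
	}
	req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))
	resp, err := client.Do(req)
	if err != nil {
		return t, 0, err
	}
	defer resp.Body.Close()
	return t, resp.StatusCode, nil
}

// probeLocal runs timedGet against a throwaway local TLS server.
func probeLocal() (Timings, int, error) {
	srv := httptest.NewTLSServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "ok")
	}))
	defer srv.Close()
	return timedGet(srv.Client(), srv.URL)
}

func main() {
	t, status, err := probeLocal()
	if err != nil {
		fmt.Println("probe failed:", err)
		return
	}
	fmt.Printf("status=%d connect=%v tls=%v ttfb=%v\n", status, t.Connect, t.TLS, t.TTFB)
}
```

TTFB is measured from the start of the request here, so it includes the connect and TLS phases; a real check might report the phases separately.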

The Veriflier is rewritten in Go as well, replacing the Qt C++ dependency with a lightweight Go HTTP service. The protocol between Monitor and Verifliers moves from custom HTTPS to gRPC, providing type-safe contracts, built-in retries, and bidirectional streaming for future use.

Benefits of the Rewrite

Memory

The current architecture forks Node.js worker processes that start at 8–16MB RSS and are recycled once they reach 53MB. With a typical deployment of 8–16 workers, the process tree consumes 240–850MB of resident memory just for worker overhead, before any check data is counted.

Jetmon 2 runs as a single process. Goroutines start with only a few kilobytes of stack (the Go runtime's minimum is 2KB) and grow on demand, so a pool of 1,000 concurrent goroutines costs only a few megabytes of stack. Total process RSS for an equivalent workload is estimated at 50–150MB, a 75–90% reduction in memory consumption per host.

Concurrent Checks

Current concurrency is bounded by the number of worker processes. Each worker is a single-threaded Node.js process; practical concurrency per host is in the low hundreds.

Go's goroutine scheduler makes 10,000+ concurrent in-flight checks on a single host practical with no additional configuration. With an average site response time of 200ms, a pool of 1,000 goroutines sustains approximately 5,000 check completions per second (1,000 ÷ 0.2s), assuming few checks run into the conservative 10-second network timeout. That is an estimated 10–50× increase in concurrent checks per host.
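The throughput figure follows from Little's law (steady-state throughput equals concurrency divided by mean time per check). The sketch below reproduces the arithmetic and shows how the estimate shifts if some checks hit the timeout. The 1% timeout fraction is an assumed value chosen only for illustration.

```go
package main

import "fmt"

// completionsPerSecond applies Little's law:
// throughput = concurrent workers / mean seconds per check.
func completionsPerSecond(workers int, meanSeconds float64) float64 {
	return float64(workers) / meanSeconds
}

// meanCheckSeconds blends the normal response time with the timeout for
// the fraction of checks that never answer (timeoutFraction is hypothetical).
func meanCheckSeconds(responseSeconds, timeoutSeconds, timeoutFraction float64) float64 {
	return (1-timeoutFraction)*responseSeconds + timeoutFraction*timeoutSeconds
}

func main() {
	// 1,000 goroutines, 200ms average response, no timeouts.
	fmt.Printf("%.0f checks/s\n", completionsPerSecond(1000, 0.2))

	// Same pool if 1% of checks wait out the full 10s timeout.
	mean := meanCheckSeconds(0.2, 10, 0.01)
	fmt.Printf("%.0f checks/s\n", completionsPerSecond(1000, mean))
}
```

Even a small share of timed-out checks pulls the mean up noticeably (0.298s instead of 0.2s in this example), which is one reason to size the pool with headroom.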

Throughput

The current architecture crosses a process boundary on every unit of work: master dispatches via IPC, worker receives and processes, replies via IPC, master aggregates. Each crossing involves serialisation, a context switch, and V8 event loop scheduling on both ends.

Jetmon 2 replaces all IPC with Go channel sends, which stay in-process and are an order of magnitude cheaper. Estimated throughput improvement: 3–10× more sites checked per second per host under equivalent conditions.
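A minimal sketch of the dispatch, process, aggregate cycle done with channels instead of IPC. The `job` and `result` types and the `runPool` function are invented for this example, and `check` stands in for the real probe.

```go
package main

import (
	"fmt"
	"sync"
)

type job struct{ SiteID int }

type result struct {
	SiteID int
	OK     bool
}

// runPool fans site IDs out to n worker goroutines over a channel and
// collects their results over another, all inside one process.
func runPool(n int, siteIDs []int, check func(siteID int) bool) []result {
	jobs := make(chan job)
	results := make(chan result)

	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := range jobs {
				results <- result{SiteID: j.SiteID, OK: check(j.SiteID)}
			}
		}()
	}

	// Dispatcher: plays the role the master process plays today.
	go func() {
		for _, id := range siteIDs {
			jobs <- job{SiteID: id}
		}
		close(jobs)
	}()

	// Close results once every worker has drained the job channel.
	go func() {
		wg.Wait()
		close(results)
	}()

	var out []result
	for r := range results {
		out = append(out, r)
	}
	return out
}

func main() {
	res := runPool(4, []int{1, 2, 3, 4, 5}, func(id int) bool { return id%2 == 1 })
	fmt.Println(len(res), "results")
}
```

No serialisation happens at either hand-off: the job and result structs are passed by value between goroutines in the same address space.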

Check Scheduling Accuracy

The current system uses setTimeout and setInterval for round scheduling, subject to V8 event loop delay — a busy loop can delay a callback by tens to hundreds of milliseconds, introducing jitter into RTT measurements.

Go's time.Ticker fires with OS-level timer precision. RTT measurements from net/http/httptrace are taken inside the HTTP stack with no event loop between the measurement point and the timer.

Deployment Speed

Current deployment requires npm install, a node-gyp rebuild of the native C++ addon, and a coordinated process restart. A failed addon compilation blocks deployment entirely.

Jetmon 2 deploys as a single static binary. Deployment is: copy binary, systemctl restart jetmon2. Total deployment time drops from several minutes to under 30 seconds.

Mean Time to Recovery

A worker process crash requires the master to detect the exit, spawn a replacement, and wait for initialisation — several seconds, with in-flight checks unresolved.

In Jetmon 2, a panicking goroutine is recovered by a deferred handler, the result counted as an error, and a replacement goroutine immediately spawned — recovery in the low milliseconds. For a full process crash, systemd restarts the binary; with Go's fast startup, the process is accepting work again in under 2 seconds.
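The goroutine-level recovery path can be sketched as follows. This is an illustration under assumptions, not the branch's actual worker code: `worker`, `runChecks`, and `demo` are hypothetical names, and checks are modelled as plain `func()` values.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// worker runs checks from jobs. If a check panics, the deferred handler
// records the panic as an error and starts a replacement worker before
// this goroutine exits.
func worker(jobs <-chan func(), errs chan<- error, wg *sync.WaitGroup) {
	defer wg.Done()
	defer func() {
		if r := recover(); r != nil {
			errs <- fmt.Errorf("check panicked: %v", r)
			wg.Add(1) // registered before this worker's Done runs
			go worker(jobs, errs, wg)
		}
	}()
	for check := range jobs {
		check()
	}
}

// runChecks executes every check on a fixed number of workers and returns
// one error per check that panicked.
func runChecks(checks []func(), workers int) []error {
	jobs := make(chan func())
	// Each check can fail at most once, so sends on errs never block.
	errs := make(chan error, len(checks))
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go worker(jobs, errs, &wg)
	}
	for _, c := range checks {
		jobs <- c
	}
	close(jobs)
	wg.Wait()
	close(errs)
	var out []error
	for err := range errs {
		out = append(out, err)
	}
	return out
}

// demo runs five checks, two of which panic, on two workers.
func demo() (completed int32, failures int) {
	ok := func() { atomic.AddInt32(&completed, 1) }
	bad := func() { panic("nil pointer in parser") }
	errs := runChecks([]func(){ok, bad, ok, bad, ok}, 2)
	return completed, len(errs)
}

func main() {
	completed, failures := demo()
	fmt.Printf("completed=%d failures=%d\n", completed, failures)
}
```

The pool keeps its size across panics because the replacement is registered with the WaitGroup before the failing worker signals completion.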

Operational Complexity

The current system requires managing Node.js version compatibility, native addon compilation, npm dependency trees, and the fragile worker spawn/recycle lifecycle.

Jetmon 2 eliminates all of this. One artifact to manage: the Go binary. No node-gyp, no npm, no Node.js version management.

Build order

1. Schema: jetmon_endpoints, jetmon_events, updated site row projection, back-compat site_status derived write
2. Probe runner: replaces per-site HTTP check loop; iterates endpoints; owns idempotent event dedup
3. Check types: DNS, TLS cert expiry, redirect chain, keyword, TTFB (v1 taxonomy)
4. Public REST API: status/state, events/history, SLA statistics, monitor management, alert contacts/webhooks
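The first two build steps revolve around an event log, a site-row projection, and idempotent event dedup. The toy in-memory model below shows how those pieces relate. It is a deliberate simplification: the `Event` fields, the `Store` type, and the idea of a string idempotency key are assumptions for illustration, and the in-place event updates and resolution reasons described in AGENTS.md are left out.

```go
package main

import "fmt"

// Event is a simplified state-change record (hypothetical shape).
type Event struct {
	ID     string // idempotency key; how it is derived is not specified here
	SiteID int
	State  string
}

// Store keeps the event log as the source of truth plus a denormalized
// per-site projection for fast reads.
type Store struct {
	seen       map[string]bool
	log        []Event
	projection map[int]string
}

func NewStore() *Store {
	return &Store{seen: map[string]bool{}, projection: map[int]string{}}
}

// Append records e unless an event with the same ID was already stored,
// then updates the projection. It reports whether the event was new.
func (s *Store) Append(e Event) bool {
	if s.seen[e.ID] {
		return false // duplicate delivery: ignore
	}
	s.seen[e.ID] = true
	s.log = append(s.log, e)
	s.projection[e.SiteID] = e.State
	return true
}

// Rebuild recomputes the projection from the log alone, which is what
// makes the projection disposable.
func (s *Store) Rebuild() map[int]string {
	out := map[int]string{}
	for _, e := range s.log {
		out[e.SiteID] = e.State
	}
	return out
}

func main() {
	s := NewStore()
	s.Append(Event{ID: "site1-evt1", SiteID: 1, State: "Seems Down"})
	s.Append(Event{ID: "site1-evt1", SiteID: 1, State: "Seems Down"}) // retried write
	s.Append(Event{ID: "site1-evt2", SiteID: 1, State: "Down"})
	fmt.Println(len(s.log), s.projection[1], s.Rebuild()[1])
}
```

A retried write with the same key leaves the log untouched, and the projection can always be thrown away and rebuilt from the log.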

🤖 Generated with Claude Code

…orld ReadMemStats
- refreshVeriflierClients now diffs addr|token fingerprints and skips rebuilding when the verifier list is unchanged, preserving TCP connection pools between rounds
- Remove runtime.ReadMemStats stop-the-world call; it was logging but taking no action, and memory metrics are already covered by EmitMemStats
- Remove unused statusDown constant; the DB transition path goes directly from statusRunning to statusConfirmedDown
- Add comment to per-round ClaimBuckets call explaining the rebalancing intent
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
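The fingerprint comparison in the first item might look roughly like this. The types and method names are hypothetical (only refreshVeriflierClients and the addr|token idea come from the commit message), and sorting the keys so that list order does not matter is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Veriflier is a hypothetical config entry.
type Veriflier struct{ Addr, Token string }

// fingerprint builds an order-independent summary of addr|token pairs.
func fingerprint(vs []Veriflier) string {
	keys := make([]string, len(vs))
	for i, v := range vs {
		keys[i] = v.Addr + "|" + v.Token
	}
	sort.Strings(keys)
	return strings.Join(keys, "\n")
}

// clientSet stands in for the pool of veriflier clients.
type clientSet struct {
	fp       string
	rebuilds int
}

// refresh rebuilds clients only when the fingerprint changed and reports
// whether a rebuild happened.
func (c *clientSet) refresh(vs []Veriflier) bool {
	fp := fingerprint(vs)
	if c.rebuilds > 0 && fp == c.fp {
		return false // unchanged: keep existing clients and their connections
	}
	c.fp = fp
	c.rebuilds++ // real code would construct new clients here
	return true
}

func main() {
	c := &clientSet{}
	a := Veriflier{"10.0.0.1:7803", "t1"}
	b := Veriflier{"10.0.0.2:7803", "t2"}
	fmt.Println(c.refresh([]Veriflier{a, b})) // first round builds clients
	fmt.Println(c.refresh([]Veriflier{b, a})) // same set, different order
	b.Token = "rotated"
	fmt.Println(c.refresh([]Veriflier{a, b})) // token changed
}
```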

Fixes cleanup ordering deadlock in pool tests (LIFO cleanup, close channel before Drain). Adds tests for wpcom circuit breaker, veriflier transport, checker.Check paths, config hot-reload, dashboard SSE, audit helpers, orchestrator memory pressure, retry queue, and pure utility functions.

EVENTS.md: event-sourced architecture (lifecycle, idempotency, resolution reasons, causal links, and site-row projection).
TAXONOMY.md: five-layer test taxonomy (Reachability → Transport → Infrastructure → Application → Content + Reverse checks), site/endpoint/check data model, multi-state vocabulary, event schema, scope matrix, signal processing, and versioned implementation roadmap.
ROADMAP.md: deferred public REST API (query and manage endpoints, auth, pagination, and uptime-bench integration context).
AGENTS.md: architectural decision log covering event sourcing, severity vs. state separation, Seems Down lifecycle, in-place event updates, idempotent event identity, resolution reasons, causal vs. rollup links, and Unknown-is-not-downtime invariant.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Jetmon 2 is no longer scoped as a drop-in replacement. It is a comprehensive site health monitoring platform (event-sourced, multi-layer, multi-endpoint) with a competitor-parity public REST API as a first-class product surface.

ROADMAP.md: reframe public API from internal tooling to user-facing capability on parity with Pingdom, UptimeRobot, and Better Uptime. Expand from two capabilities (query, manage) to five: status and state, events and history, SLA statistics (uptime %, response time p95/p99, MTTR), monitor management (CRUD, pause, resume, trigger-now), and alert contacts with outbound webhooks. Add Unknown/Downtime distinction to uptime calculations, trigger-now async semantics, per-scope rate limiting with a dedicated trigger bucket, and key lifecycle CLI.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Remove ADD COLUMN IF NOT EXISTS from migrations 2 and 7: MySQL 8.0 does not support that MariaDB-only syntax, and the migration tracker already prevents re-applying
- Set JETMON_PID_FILE=/jetmon/jetmon2.pid in the entrypoint so the non-root jetmon user can write the PID file
- Drop logs/stats host-volume mounts from docker-compose to avoid ownership conflicts with the container's jetmon user

Fresh dev databases don't have the production table, so migration 3 (ALTER TABLE) was failing. Migration 2 now creates the base table with IF NOT EXISTS so production deployments skip it safely. Renumbered subsequent migrations 3–7 to 4–8.

Docker looks for .dockerignore in the build context root (repo root), not in the Dockerfile's subdirectory. The misplaced docker/.dockerignore was never being applied, so the pre-built jetmon2 binary was leaking into the build context and could mask a fresh compile.

The database column last_status_change is DATETIME NULL, but the Site struct had it as time.Time (non-pointer). Go's database/sql cannot scan a NULL value into a non-pointer time.Time field and returns a Scan error instead, causing GetSitesForBucket to return an error on every round and silently skip all checks.
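One common way to hold a nullable DATETIME is `sql.NullTime` (a `*time.Time` field is the other usual option). The commit message does not say which one the fix uses, so the sketch below is only an illustration. The `Site` struct is reduced to the one field under discussion, and `scanLastStatusChange` and `scanFixed` are hypothetical helpers that feed a raw driver value through the `Scanner` interface without needing a database.

```go
package main

import (
	"database/sql"
	"fmt"
	"time"
)

// Site carries only the field discussed here; the real struct has more.
type Site struct {
	LastStatusChange sql.NullTime // maps to DATETIME NULL
}

// scanLastStatusChange passes a raw driver value (nil for SQL NULL) through
// the Scan method that rows.Scan would call on a sql.NullTime destination.
func scanLastStatusChange(raw any) (Site, error) {
	var s Site
	err := s.LastStatusChange.Scan(raw)
	return s, err
}

// scanFixed scans a known, non-NULL timestamp.
func scanFixed() (Site, error) {
	return scanLastStatusChange(time.Date(2026, 4, 22, 14, 20, 25, 0, time.UTC))
}

func main() {
	null, err := scanLastStatusChange(nil)
	fmt.Println(null.LastStatusChange.Valid, err)

	set, err := scanFixed()
	fmt.Println(set.LastStatusChange.Valid, set.LastStatusChange.Time.Year(), err)
}
```

Callers then branch on `Valid` instead of comparing against the zero time.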

When the mounted config/ directory is owned by a different UID than the container's jetmon user, writing config.json there fails. Check if config/ is writable first; if not, generate config.json into /jetmon/ (container-owned) and set JETMON_CONFIG accordingly.

- Add JETMON_UID/JETMON_GID to .env-sample (default 1000) and wire them into docker-compose user: so the container runs as the host user
- Revert Dockerfile to system user; chmod 777 internal dirs (logs, stats, certs) so they are writable under any UID
- Fall back to /tmp/config.json when config/ is not writable, since /tmp is always world-writable

Chris Jean and others added 15 commits April 19, 2026 16:14

Introducing Jetmon 2.

77c1568

Fix Jetmon v2 scheduling and drain semantics

4cf0a53

Add veriflier reuse tests and worker memory pressure drain

9b2fbe2

Add unit tests for config validation, retry queue, checker result, and orchestrator logic

da21a8f

Add coverage for runtime orchestration paths

e805309

Add tests for wpcom circuit breaker, veriflier transport, checker.Check, and orchestrator paths

e78a9cd

Ignore Go coverage output files

18c3544

Harden flaky and stateful test coverage

416f687

Remove compiled jetmon2 binary from tracking and add to .gitignore

03490e3

Add architecture overview document

7a2c388

Clean up architecture overview document

fb5f3bb

chrisbliss18 added enhancement DO NOT MERGE labels Apr 22, 2026

Chris Jean and others added 10 commits April 22, 2026 17:21

Add jetmon-pre-ship skill for pre-PR checklist automation

6329a0b

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Fix veriflier Docker build output path collision

e659c1c

Output binary as veriflier2-bin in the builder stage to avoid colliding with the veriflier2/ source directory, then copy it into the final image as veriflier2 so the entrypoint script works unchanged.

Fix .dockerignore excluding veriflier2 source directory

30143bb

The entry 'veriflier2' matched the source directory after the binary was renamed to veriflier2-bin. Update to veriflier2-bin so the source tree is included in the build context while the local binary is not.

Set jetmon container user to UID/GID 1000

e58732c

Matches the typical first user on Linux hosts, ensuring the container can read and write to host-mounted volumes without permission errors.

chrisbliss18 mentioned this pull request Apr 25, 2026

Internal REST API foundation: events, webhooks, alerting #72

Open

chrisbliss18 wants to merge 25 commits into master from v2
