Jetmon 2 — Site health platform#61

Open
chrisbliss18 wants to merge 25 commits into master from v2

@chrisbliss18
Contributor

Work in progress. This branch (v2) is the ambitious successor to the Go rewrite started in refactor/jetmon2 (PR #60). It includes everything from that branch and extends it with a new architecture and direction.


What changed from PR #60

PR #60 scoped Jetmon 2 as a drop-in replacement: same interfaces, same schema, same behaviour — just Go instead of Node.js + C++. That work is complete and forms the base of this branch.

This branch pivots to a larger goal: Jetmon 2 as a full site health monitoring platform, not just an uptime tracker. The key additions:

  • Event-sourced architecture. Site state is derived from an event log, not a mutable status column. The event log is the source of truth; the site row carries a denormalized projection for fast reads. Full design in EVENTS.md.
  • Five-layer test taxonomy. Reachability → Transport & Security → Infrastructure & Edge → Application Response → Content Integrity, plus Reverse Checks (agent-reported signals from inside WordPress). ~55 v1 items, ~55 v2, ~40 v3. Full taxonomy in TAXONOMY.md.
  • Site → Endpoint → Check hierarchy. Sites have multiple endpoints; each endpoint has multiple checks of different types. Site state rolls up from endpoint state, which rolls up from check results. Rollup rules are explicit and configurable per site.
  • Multi-state vocabulary. Up, Warning, Degraded, Seems Down, Down, Paused, Maintenance, Unknown. Unknown is not downtime — monitor-side failures never inflate customer downtime figures.
  • Competitor-parity public REST API. Five capability groups: status and state, events and history, SLA statistics (uptime %, response time p95/p99, MTTR), monitor management (CRUD, pause, resume, trigger-now), and alert contacts with outbound webhooks. Full design in ROADMAP.md.
  • Gradual rollout with back-compat. The existing site_status column keeps receiving derived writes so current consumers are not broken. New capabilities are additive; consumers adopt progressively.
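
The check → endpoint → site rollup and the Unknown-is-not-downtime rule above can be sketched in Go. This is a minimal illustration under stated assumptions, not the branch's implementation: the State type, the worst-of Rollup rule, and UptimePercent are hypothetical names, and counting only Down as downtime while leaving Unknown, Paused, and Maintenance out of the observed window is one possible reading of the invariant. The real rollup rules are described as configurable per site.

```go
package main

import "fmt"

// State is a hypothetical encoding of the multi-state vocabulary.
type State int

const (
	Up State = iota
	Warning
	Degraded
	SeemsDown
	Down
	Paused
	Maintenance
	Unknown
)

var stateNames = [...]string{
	"Up", "Warning", "Degraded", "Seems Down", "Down",
	"Paused", "Maintenance", "Unknown",
}

func (s State) String() string { return stateNames[s] }

// severity ranks the states that carry a health signal (Up..Down).
// Paused, Maintenance, and Unknown return -1: they say nothing about health.
func severity(s State) int {
	if s >= Up && s <= Down {
		return int(s)
	}
	return -1
}

// Rollup returns the worst health state among the children (assumed rule).
// If no child carries a health signal, the parent is Unknown.
func Rollup(children []State) State {
	worst, rank := Unknown, -1
	for _, c := range children {
		if r := severity(c); r > rank {
			worst, rank = c, r
		}
	}
	return worst
}

// UptimePercent takes seconds spent in each state. Only time in health
// states counts as observed, so Unknown time can never add to downtime.
func UptimePercent(seconds map[State]float64) float64 {
	var observed float64
	for s, d := range seconds {
		if severity(s) >= 0 {
			observed += d
		}
	}
	if observed == 0 {
		return 100 // nothing observed, so no downtime is attributed (assumption)
	}
	return 100 * (observed - seconds[Down]) / observed
}

func main() {
	endpoint := Rollup([]State{Up, Warning, Unknown}) // checks -> endpoint
	site := Rollup([]State{endpoint, Up})             // endpoints -> site
	fmt.Println(endpoint, site)
	fmt.Printf("%.1f%%\n", UptimePercent(map[State]float64{Up: 90, Down: 10, Unknown: 100}))
}
```

In this sketch a monitor-side outage that leaves a site Unknown for 100 seconds does not change the reported uptime, because that time never enters the observed window.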

Architectural decisions are locked in AGENTS.md so they are enforced consistently across all changes.


Why Go

The current architecture uses forked Node.js processes (8–16MB RSS each at startup, 53MB limit before recycling) as workers, plus a compiled C++ addon to escape Node's event loop for blocking network I/O. Go eliminates both constraints:

  • Goroutines start at ~4KB of stack and grow on demand, making 50,000 concurrent checks on a single host practical without the memory overhead of forked processes or libuv thread pools
  • net/http and crypto/tls are first-class stdlib packages — no native addon, no node-gyp, no compilation step during deployment
  • net/http/httptrace provides DNS, TCP, TLS, and TTFB timing hooks as separate measurements within each check, for free
  • Single static binary deployment with no runtime dependencies, no node_modules, and no addon rebuild on Node.js version upgrades
  • Built-in profiling via pprof, race detector via go test -race, and a mature testing ecosystem
  • Graceful goroutine lifecycle management replaces the fragile worker spawn/recycle/evaporate lifecycle

The Veriflier is rewritten in Go as well, replacing the Qt C++ dependency with a lightweight Go HTTP service. The protocol between Monitor and Verifliers moves from custom HTTPS to gRPC, providing type-safe contracts, built-in retries, and bidirectional streaming for future use.

Benefits of the Rewrite

Memory

The current architecture forks Node.js worker processes that start at 8–16MB RSS and are recycled once they reach 53MB. With a typical deployment of 8–16 workers, the process tree consumes 240–850MB of resident memory just for worker overhead, before any check data is counted.

Jetmon 2 runs as a single process. Go goroutines start at 4KB of stack and grow on demand. A pool of 1,000 concurrent goroutines costs roughly 4MB of stack. Total process RSS for an equivalent workload is estimated at 50–150MB — a 75–90% reduction in memory consumption per host.

Concurrent Checks

Current concurrency is bounded by the number of worker processes. Each worker is a single-threaded Node.js process; practical concurrency per host is in the low hundreds.

Go's goroutine scheduler makes 10,000+ concurrent in-flight checks on a single host practical with no additional configuration. At a conservative network timeout of 10 seconds and average site response time of 200ms, a pool of 1,000 goroutines sustains approximately 5,000 check completions per second — an estimated 10–50× increase in concurrent checks per host.

Throughput

The current architecture crosses a process boundary on every unit of work: master dispatches via IPC, worker receives and processes, replies via IPC, master aggregates. Each crossing involves serialisation, a context switch, and V8 event loop scheduling on both ends.

Jetmon 2 replaces all IPC with Go channel sends, which are in-process and order-of-magnitude cheaper. Estimated throughput improvement: 3–10× more sites checked per second per host under equivalent conditions.
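
A minimal sketch of what dispatching over channels instead of IPC can look like, assuming a fixed-size pool. The Job and Result types and the runPool function are hypothetical and are not the branch's orchestrator; the point is only that dispatch, processing, and aggregation all happen inside one process with no serialisation step.

```go
package main

import (
	"fmt"
	"sync"
)

// Job and Result are hypothetical stand-ins for a site check and its outcome.
type Job struct{ SiteID int }

type Result struct {
	SiteID int
	Up     bool
}

// runPool fans jobs out to a fixed number of goroutines over a channel and
// collects every result over a second channel.
func runPool(workers int, jobs []Job, check func(Job) Result) []Result {
	in := make(chan Job)
	out := make(chan Result)

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := range in { // replaces "worker receives via IPC"
				out <- check(j) // replaces "worker replies via IPC"
			}
		}()
	}

	// Dispatcher: feed all jobs, then signal that no more are coming.
	go func() {
		for _, j := range jobs {
			in <- j
		}
		close(in)
	}()

	// Close the results channel once every worker has exited.
	go func() {
		wg.Wait()
		close(out)
	}()

	results := make([]Result, 0, len(jobs))
	for r := range out {
		results = append(results, r)
	}
	return results
}

func main() {
	jobs := make([]Job, 100)
	for i := range jobs {
		jobs[i] = Job{SiteID: i}
	}
	// Fake check: even site IDs are "up". A real check would do network I/O.
	results := runPool(8, jobs, func(j Job) Result {
		return Result{SiteID: j.SiteID, Up: j.SiteID%2 == 0}
	})
	fmt.Println("completed:", len(results))
}
```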

Check Scheduling Accuracy

The current system uses setTimeout and setInterval for round scheduling, subject to V8 event loop delay — a busy loop can delay a callback by tens to hundreds of milliseconds, introducing jitter into RTT measurements.

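Go's time.Ticker is serviced by the Go runtime's timers rather than by an application-level event loop, so tick delivery does not queue behind unrelated callbacks. RTT measurements from net/http/httptrace are taken inside the HTTP stack, with no event loop between the measurement point and the timer.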

Deployment Speed

Current deployment requires npm install, a node-gyp rebuild of the native C++ addon, and a coordinated process restart. A failed addon compilation blocks deployment entirely.

Jetmon 2 deploys as a single static binary. Deployment is two steps: copy the binary, then run systemctl restart jetmon2. Total deployment time drops from several minutes to under 30 seconds.

Mean Time to Recovery

A worker process crash requires the master to detect the exit, spawn a replacement, and wait for initialisation — several seconds, with in-flight checks unresolved.

In Jetmon 2, a panicking goroutine is recovered by a deferred handler, the result counted as an error, and a replacement goroutine immediately spawned — recovery in the low milliseconds. For a full process crash, systemd restarts the binary; with Go's fast startup, the process is accepting work again in under 2 seconds.
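
The deferred-handler recovery described above can be sketched as follows. safeCheck is a hypothetical wrapper, not code from the branch, and it takes a slightly different route than the text: it recovers around each individual check, so the worker goroutine survives and no replacement needs to be spawned.

```go
package main

import "fmt"

// safeCheck runs one check and converts a panic into an ordinary error,
// so a bug in a single check is counted as a failed result instead of
// crashing the process.
func safeCheck(check func() error) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("check panicked: %v", r)
		}
	}()
	return check()
}

func main() {
	// A healthy check returns its own result untouched.
	fmt.Println(safeCheck(func() error { return nil }))

	// A panicking check is reported as an error; the caller keeps running.
	fmt.Println(safeCheck(func() error { panic("nil pointer in parser") }))
}
```

A worker loop that calls safeCheck for every job keeps draining its queue after a panic, which is what makes recovery take milliseconds rather than a process respawn.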

Operational Complexity

The current system requires managing Node.js version compatibility, native addon compilation, npm dependency trees, and the fragile worker spawn/recycle lifecycle.

Jetmon 2 eliminates all of this. One artifact to manage: the Go binary. No node-gyp, no npm, no Node.js version management.


Build order

  1. Schema: jetmon_endpoints, jetmon_events, updated site row projection, back-compat site_status derived write
  2. Probe runner — replaces per-site HTTP check loop; iterates endpoints; owns idempotent event dedup
  3. Check types — DNS, TLS cert expiry, redirect chain, keyword, TTFB (v1 taxonomy)
  4. Public REST API — status/state, events/history, SLA statistics, monitor management, alert contacts/webhooks

🤖 Generated with Claude Code

Chris Jean and others added 15 commits April 19, 2026 16:14
…orld ReadMemStats

- refreshVeriflierClients now diffs addr|token fingerprints and skips
  rebuilding when the verifier list is unchanged, preserving TCP
  connection pools between rounds
- Remove runtime.ReadMemStats stop-the-world call — it was logging but
  taking no action; memory metrics are already covered by EmitMemStats
- Remove unused statusDown constant; the DB transition path goes directly
  from statusRunning to statusConfirmedDown
- Add comment to per-round ClaimBuckets call explaining the rebalancing intent

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Fixes cleanup ordering deadlock in pool tests (LIFO cleanup, close channel
before Drain). Adds tests for wpcom circuit breaker, veriflier transport,
checker.Check paths, config hot-reload, dashboard SSE, audit helpers,
orchestrator memory pressure, retry queue, and pure utility functions.

EVENTS.md: event-sourced architecture, covering lifecycle, idempotency,
resolution reasons, causal links, and site-row projection.

TAXONOMY.md: five-layer test taxonomy (Reachability → Transport →
Infrastructure → Application → Content + Reverse checks), site/endpoint/
check data model, multi-state vocabulary, event schema, scope matrix,
signal processing, and versioned implementation roadmap.

ROADMAP.md: deferred public REST API — query and manage endpoints,
auth, pagination, and uptime-bench integration context.

AGENTS.md: architectural decision log covering event sourcing, severity
vs. state separation, Seems Down lifecycle, in-place event updates,
idempotent event identity, resolution reasons, causal vs. rollup links,
and Unknown-is-not-downtime invariant.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Jetmon 2 is no longer scoped as a drop-in replacement. It is a
comprehensive site health monitoring platform — event-sourced,
multi-layer, multi-endpoint — with a competitor-parity public REST API
as a first-class product surface.

ROADMAP.md: reframe public API from internal tooling to user-facing
capability on parity with Pingdom, UptimeRobot, and Better Uptime.
Expand from two capabilities (query, manage) to five: status and state,
events and history, SLA statistics (uptime %, response time p95/p99,
MTTR), monitor management (CRUD, pause, resume, trigger-now), and
alert contacts with outbound webhooks. Add Unknown/Downtime distinction
to uptime calculations, trigger-now async semantics, per-scope rate
limiting with a dedicated trigger bucket, and key lifecycle CLI.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Chris Jean and others added 10 commits April 22, 2026 17:21
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Output binary as veriflier2-bin in the builder stage to avoid colliding
with the veriflier2/ source directory, then copy it into the final image
as veriflier2 so the entrypoint script works unchanged.

- Remove ADD COLUMN IF NOT EXISTS from migrations 2 and 7: MySQL 8.0
  does not support that MariaDB-only syntax; the migration tracker already
  prevents re-applying
- Set JETMON_PID_FILE=/jetmon/jetmon2.pid in the entrypoint so the
  non-root jetmon user can write the PID file
- Drop logs/stats host-volume mounts from docker-compose to avoid
  ownership conflicts with the container's jetmon user

Fresh dev databases don't have the production table, so migration 3
(ALTER TABLE) was failing. Migration 2 now creates the base table with
IF NOT EXISTS so production deployments skip it safely. Renumbered
subsequent migrations 3–7 to 4–8.

Docker looks for .dockerignore in the build context root (repo root),
not in the Dockerfile's subdirectory. The misplaced docker/.dockerignore
was never being applied, so the pre-built jetmon2 binary was leaking
into the build context and could mask a fresh compile.

The database column last_status_change is DATETIME NULL, but the Site
struct had it as time.Time (non-pointer). Go's database/sql returns
"converting NULL to time.Time is unsupported" when scanning a NULL
value into a non-pointer struct field, causing GetSitesForBucket to
return an error on every round and silently skip all checks.

The entry 'veriflier2' matched the source directory after the binary
was renamed to veriflier2-bin. Update to veriflier2-bin so the source
tree is included in the build context while the local binary is not.

When the mounted config/ directory is owned by a different UID than
the container's jetmon user, writing config.json there fails. Check
if config/ is writable first; if not, generate config.json into
/jetmon/ (container-owned) and set JETMON_CONFIG accordingly.

Matches the typical first user on Linux hosts, ensuring the container
can read and write to host-mounted volumes without permission errors.

- Add JETMON_UID/JETMON_GID to .env-sample (default 1000) and wire
  them into docker-compose user: so the container runs as the host user
- Revert Dockerfile to system user; chmod 777 internal dirs (logs,
  stats, certs) so they are writable under any UID
- Fall back to /tmp/config.json when config/ is not writable, since
  /tmp is always world-writable