A benchmark suite for evaluating uptime monitoring services — by being deliberately and reproducibly mean to a fleet of webservers, then grading the watchers on whether they noticed.
uptime-bench runs controlled failure scenarios against target endpoints and measures how each monitoring service detects, classifies, and reports each failure. It produces structured, apples-to-apples comparisons across services that otherwise expose very different dashboards, alerting semantics, and terminology.
Every monitoring vendor publishes their own uptime numbers. Their detection intervals vary, their definitions of "down" disagree in subtle ways, their behavior under flapping conditions or partial outages is rarely documented, and their dashboards are designed to demo well rather than to support a head-to-head comparison.
So we built a controlled environment where the failures are scripted, the timestamps are ground truth, and the same scenario can run against five services on the same afternoon.
- Runs a fleet of controllable target servers that simulate specific failure modes — hard downtime, slow responses, intermittent 5xxs, content tampering, TLS errors, DNS anomalies, and more
- Points monitoring services at those targets and waits for detection
- Records what each service detected, when it detected it, and how it classified it
- Produces structured comparison data: detection latency, accuracy, false-positive rate, and classification fidelity
A handful of the eleven shipped scenarios involve serving a fully-rendered ransomware demand, a hacktivist defacement, or hidden SEO spam — all with a 200 OK status. A monitor that only watches status codes is going to have a rough time. The full menu lives in SCENARIOS.md; the per-scenario file format in SCHEMA.md.
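To give a feel for the scenario files, here is a hypothetical sketch of a content-tampering scenario in TOML. Every field name here is an assumption for illustration only; the real format is defined in SCHEMA.md.

```toml
# Hypothetical scenario sketch — actual field names live in SCHEMA.md.
id          = "content-ransomware-200"
kind        = "content"
description = "Fully rendered ransomware demand served with 200 OK"

[injection]
seed        = 1337  # recorded with the run for reproducibility
status_code = 200   # deliberately healthy; only content monitors should fire
```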
| Service | Type | Adapter |
|---|---|---|
| Jetmon 1 | Agent-based (WordPress/Jetpack) | Implemented (jetmon-v1, via jetmon-bridge) |
| Jetmon 2 | Agent-based (WordPress/Jetpack) | Stub — blocked on Jetmon 2's public REST API |
| UptimeRobot | Probe-based | Implemented (uptimerobot) — live-tested against the public API |
| Pingdom | Probe-based | Implemented (pingdom) — live-tested against the public API |
| Datadog Synthetics | Probe-based | Implemented (datadog-synthetics) — live-tested against the public API |
| Better Uptime | Probe-based | Implemented (better-uptime) — live-tested against the public API |
Adding a new adapter is a small, well-defined exercise — implement the `adapter.Adapter` interface (five methods), drop a normalization table next to it, and register the type in `cmd/harness/main.go`. See ADAPTER.md for the contract.
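A sketch of what that contract might look like. Only the method count (five) and the Provision/Retrieve/Deprovision names come from this README; `Name`, `Normalize`, and every signature below are assumptions — the authoritative definition is in ADAPTER.md.

```go
package main

import (
	"context"
	"fmt"
)

// Adapter is an illustrative sketch of the five-method contract.
// Provision/Retrieve/Deprovision are named in this README; the other
// two methods and all signatures are guesses for illustration.
type Adapter interface {
	Name() string
	Provision(ctx context.Context, targetURL string) (checkID string, err error)
	Retrieve(ctx context.Context, checkID string) (rawLabel string, err error)
	Normalize(rawLabel string) string // service-specific label -> shared taxonomy
	Deprovision(ctx context.Context, checkID string) error
}

// stubAdapter shows the shape of a minimal implementation, including the
// per-adapter raw-label mapping table mentioned below.
type stubAdapter struct{}

func (stubAdapter) Name() string { return "stub" }
func (stubAdapter) Provision(ctx context.Context, targetURL string) (string, error) {
	return "check-1", nil
}
func (stubAdapter) Retrieve(ctx context.Context, checkID string) (string, error) {
	return "DOWN", nil
}
func (stubAdapter) Normalize(raw string) string {
	table := map[string]string{"DOWN": "hard_down", "UP": "healthy"}
	if v, ok := table[raw]; ok {
		return v
	}
	return "unknown"
}
func (stubAdapter) Deprovision(ctx context.Context, checkID string) error { return nil }

func main() {
	var a Adapter = stubAdapter{}
	fmt.Println(a.Name(), a.Normalize("DOWN")) // stub hard_down
}
```

Keeping the mapping table inside each adapter is what lets the harness stay service-agnostic.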
The end-to-end pipeline runs: target server, DNS server, control plane, harness, runner, MySQL event log, and the Jetmon 1 adapter. CI (go vet, gofmt, go test -race) is green on every push. Eleven scenarios across HTTP, TCP, DNS, and content failures are defined and runnable; the TLS scenarios are schema-defined but waiting on an HTTPS listener in the target binary (see ROADMAP.md).
Notable design choices, all enforced by the code or the tests:
- The event log is the canonical record. Detection metrics are computed from the raw log in a separate pass; nothing is ever patched in place.
- Adapters absorb service-specific complexity. The harness never branches on which service is under evaluation. Every adapter owns its own raw-label mapping table.
- `Unknown` is not a missed detection. When an adapter can't reach its service's API, the result is `RetrieveUnknown` — not a false negative. It's recorded separately and excluded from accuracy metrics.
- Reproducibility is non-negotiable. Every randomized injection is seeded, every seed is recorded with the run, and every adapter version is pinned.
ARCHITECTURE.md is the full design doc. The other docs in this directory each cover one piece:
- SCENARIOS.md — the failure scenario library
- SCHEMA.md — scenario file format (TOML)
- ADAPTER.md — monitor adapter interface
- EVENTS.md — ground-truth event log and output schema
- OPERATIONS.md — fleet provisioning, deployment, and operations
- TESTING.md — local POC quick-start
- ROADMAP.md — deferred features and unfinished work
Requires Go 1.26+ and Docker.
```sh
cp .env.example .env                    # configure local credentials
make dev                                # start MySQL + Adminer
cp fleet.example.toml fleet.toml        # configure fleet
cp services.example.toml services.toml  # configure monitoring services
make build                              # build all binaries
```

Adminer (database UI) is available at http://localhost:8081 after `make dev`.
For the full local POC including Jetmon and the bridge, see TESTING.md.
```sh
go test -race ./...
```

Over a hundred and fifty test cases across fifteen packages, including a corpus check that asserts every shipped scenario file parses cleanly and every documented config example loads without error. The DNS server and control plane are tested as units (the partial-read, latency-parallelism, and timing-attack regression tests are doing real work). Live build-tagged smoke tests under each adapter (`internal/adapter/<name>/live_test.go`) exercise the full Provision/Retrieve/Deprovision contract against the real APIs; CI compiles them but never runs them, since they require credentials.
Fleet servers run Ubuntu Server 24.04. Provisioning is split into two phases: setup once, deploy whenever the binary changes.
```sh
make provision-target TARGET_HOST=203.0.113.20 HARNESS_IP=203.0.113.5
make deploy-target TARGET_HOST=203.0.113.20
```

The provisioning script is idempotent — re-running it after a config or systemd-unit change is the supported upgrade path. Configuration files are auto-created from skeletons with correct ownership; the operator just edits values. See OPERATIONS.md for the full fleet bring-up sequence.
In scope:
- Detection latency (time from failure start to first alert)
- Detection accuracy (true positives, false positives, missed incidents)
- Incident classification fidelity
- Behavior under ambiguous conditions: slow responses, intermittent failures, DNS anomalies, content tampering with a healthy status code
Out of scope:
- Dashboard or UI evaluation
- Pricing or plan comparisons
- Load-testing the monitoring services themselves
- Alerting channel evaluation (PagerDuty, Slack integrations)
If you find `uptime-bench-canary` somewhere it shouldn't be, you've discovered our marker string in a healthy response body. It exists so content-inspecting monitors have something to anchor on, and it's how we tell a tampered page apart from a normal one in the test fleet.
GPL v2.0. See LICENSE for details.