chore(observability): PR #83 follow-ups (webhook hygiene, alertmanager watchdog, cold-start gating + 6 nits) by 0xfandom · Pull Request #116 · Pablosinyores/aether

0xfandom · 2026-04-29T08:20:23Z

Summary

#1 Slack webhook leak: escape $$SLACK_WEBHOOK_URL in the alertmanager entrypoint and swap the sed delimiter to # so the resolved URL no longer ends up in docker inspect ARGV. The entrypoint also fails fast with a loud message when the env var is unset.
#2 Alertmanager-down silence: Prometheus now scrapes alertmanager:9093, and a new AlertmanagerDown rule fires after 2m of up{job="alertmanager"} == 0. Closes the gap where a config-validation crashloop silently took Slack delivery offline.
#3 Cold-start false page: AetherNoOpportunities is now gated behind a 30m warm-up via process_start_time_seconds, so a fresh boot or restart no longer pages operators while the rate window is still building.
#4 Disabled-builder confusion: NewSubmitter skips PreRegisterBuilderLabels for builders configured with Enabled: false, and the AetherBuilderDown annotation now documents the intentional silence.
#5 system_state encoding drift: added SYNC SOURCE comments at every site that depends on the integer mapping (stateToInt, gauge Help text, internal/risk/state.go constants, AetherHalted rule) so a future state renumber surfaces in code review.
#6 Pinned compose tags: prom/prometheus:v2.54.1, prom/alertmanager:v0.27.0, grafana/grafana:10.4.7 instead of :latest.
#7 Inhibit rule rewrite: the old equal: [alertname] was a no-op (each alert has a unique alertname); the new rule suppresses every warning/info alert while AetherHalted is firing, so operators only see the halt page until manual reset.
#8 Test isolation: TestRecordBuilderResult_ScrapeLabels switched from real builder names (flashbots, titan) to scrape_alpha/scrape_beta to avoid global-registry pollution that would have broken later aggregate-rate assertions.
#9 Quieter precision loss: addBigIntCounter no longer emits a per-bundle log.Printf once cumulative wei crosses 2^53; the event is exposed instead as the new aether_metrics_precision_loss_total counter, so it stays dashboardable without drowning executor logs.

Files Changed

File	Item	Purpose
`deploy/docker/docker-compose.yml`	#1, #6	Escape `$$SLACK_WEBHOOK_URL`, swap sed delimiter, fail-fast on empty env, pin prom/alertmanager/grafana tags
`deploy/docker/prometheus.yml`	#2	Add `alertmanager` scrape job
`deploy/docker/prometheus/alerts.yml`	#2, #3, #4, #5	New `AlertmanagerDown` rule, warm-up gate on `AetherNoOpportunities`, doc note on `AetherBuilderDown`, sync-source comment on `AetherHalted`
`deploy/docker/alertmanager.yml`	#7	Replace no-op inhibit rule with `AetherHalted` → suppress non-critical
`cmd/executor/submitter.go`	#4	Skip pre-register for disabled builders
`cmd/executor/main.go`	#5	Sync-source comment on `stateToInt`
`cmd/executor/metrics.go`	#5, #9	Sync-source comment on `systemStateGauge`; replace precision-loss `log.Printf` with `aether_metrics_precision_loss_total` counter
`cmd/executor/metrics_test.go`	#8	Unique `scrape_alpha` / `scrape_beta` prefixes
`internal/risk/state.go`	#5	Sync-source comment on `SystemState` constants
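
The `cmd/executor/submitter.go` row (item #4) is easiest to read with an example in hand. The sketch below shows one way the skip could look; `BuilderConfig` and `preRegisterLabels` are hypothetical names, and only the `Enabled` field and the skip-when-disabled behaviour come from this PR description.

```go
package main

import "fmt"

// BuilderConfig is a hypothetical config shape. Only the Enabled field is
// named in the PR; Name is added for illustration.
type BuilderConfig struct {
	Name    string
	Enabled bool
}

// preRegisterLabels returns the builder names whose metric label series would
// be created up front. Disabled builders are skipped so they never show up as
// a zero-valued series that looks like an outage.
func preRegisterLabels(builders []BuilderConfig) []string {
	names := make([]string, 0, len(builders))
	for _, b := range builders {
		if !b.Enabled {
			continue
		}
		names = append(names, b.Name)
	}
	return names
}

func main() {
	builders := []BuilderConfig{
		{Name: "flashbots", Enabled: true},
		{Name: "titan", Enabled: false}, // illustrative: disabled on purpose
	}
	fmt.Println(preRegisterLabels(builders))
}
```

With no pre-registered series for a disabled builder, a rule such as AetherBuilderDown has nothing to evaluate for it, which is presumably why the annotation now documents that silence as intentional.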

Acceptance Criteria

Slack webhook no longer appears in docker inspect aether-alertmanager ARGV (verified live: docker inspect ... .Args shows literal $SLACK_WEBHOOK_URL, not the resolved value)
AlertmanagerDown fires when alertmanager scrape target is down (verified live: Prometheus targets API reports alertmanager job UP after compose-up)
Bot restart no longer triggers AetherNoOpportunities during the first 30m (gate expression: unless on() ((time() - min(process_start_time_seconds{job="aether-rust"})) < 1800))
Disabled builders skipped in PreRegisterBuilderLabels (b.Enabled check), AetherBuilderDown annotation documents the intentional silence
Cross-reference comments added at all 4 system_state sites (stateToInt, systemStateGauge, internal/risk/state.go, AetherHalted)
All observability-stack images pinned to explicit versions (v2.54.1, v0.27.0, 10.4.7)
Inhibit rule rewritten with operational intent (AetherHalted source → all warning/info targets)
Metrics test uses unique builder-name prefixes (scrape_alpha, scrape_beta)
Precision-loss log no longer spams during normal operation (replaced by counter)

Test plan

Closes #99

Webhook hygiene, alertmanager watchdog, cold-start gating, image pinning, inhibit rule fix, system_state cross-refs, test isolation, quieter precision log. See PR body for per-item detail.

vercel · 2026-04-29T08:20:28Z


Project	Deployment	Actions	Updated (UTC)
aether	Ready	Preview, Comment	Apr 29, 2026 8:20am
aether-63xv	Ready	Preview, Comment	Apr 29, 2026 8:20am

chore(observability): PR #83 follow-ups

19bb50f

vercel Bot deployed to Preview – aether-63xv April 29, 2026 08:20 View deployment

0xfandom merged commit 2ca9285 into main May 5, 2026
7 checks passed

0xfandom merged 1 commit into main from chore/obs-pr83-followups
