chore(observability): PR #83 follow-ups (webhook hygiene, alertmanager watchdog, cold-start gating + 6 nits) #116

Merged
0xfandom merged 1 commit into main from chore/obs-pr83-followups on May 5, 2026

@0xfandom

Collaborator

Summary

  • #1 Slack webhook leak: escape $$SLACK_WEBHOOK_URL in the alertmanager entrypoint and swap the sed delimiter to # so the resolved URL no longer ends up in docker inspect ARGV. The entrypoint also fails fast with a loud message when the variable is unset.
  • #2 Alertmanager-down silence: Prometheus now scrapes alertmanager:9093, and a new AlertmanagerDown rule fires after 2m of up{job="alertmanager"} == 0. Closes the gap where a config-validation crashloop silently took Slack delivery offline.
  • #3 Cold-start false page: AetherNoOpportunities is now gated behind a 30m warm-up via process_start_time_seconds, so a fresh boot or restart no longer pages operators while the rate window is still building.
  • #4 Disabled-builder confusion: NewSubmitter skips PreRegisterBuilderLabels for builders configured with Enabled: false, and the AetherBuilderDown annotation now documents the intentional silence.
  • #5 system_state encoding drift: added SYNC SOURCE comments at every site that depends on the integer mapping (stateToInt, gauge Help text, internal/risk/state.go constants, AetherHalted rule) so a future state renumber surfaces in code review.
  • #6 Pinned compose tags: prom/prometheus:v2.54.1, prom/alertmanager:v0.27.0, grafana/grafana:10.4.7 instead of :latest.
  • #7 Inhibit rule rewrite: the old equal: [alertname] was a no-op (each alert has a unique alertname); the new rule suppresses every warning/info alert while AetherHalted is firing, so operators only see the halt page until manual reset.
  • #8 Test isolation: TestRecordBuilderResult_ScrapeLabels switched from real builder names (flashbots, titan) to scrape_alpha/scrape_beta to avoid global-registry pollution that would have broken later aggregate-rate assertions.
  • #9 Quieter precision loss: addBigIntCounter no longer emits a per-bundle log.Printf once cumulative wei crosses 2^53; the event is exposed instead as the new aether_metrics_precision_loss_total counter, so it stays dashboardable without drowning executor logs.

Files Changed

| File | Item | Purpose |
| --- | --- | --- |
| deploy/docker/docker-compose.yml | #1, #6 | Escape $$SLACK_WEBHOOK_URL, swap sed delimiter, fail-fast on empty env, pin prom/alertmanager/grafana tags |
| deploy/docker/prometheus.yml | #2 | Add alertmanager scrape job |
| deploy/docker/prometheus/alerts.yml | #2, #3, #4, #5 | New AlertmanagerDown rule, warm-up gate on AetherNoOpportunities, doc note on AetherBuilderDown, sync-source comment on AetherHalted |
| deploy/docker/alertmanager.yml | #7 | Replace no-op inhibit rule with AetherHalted → suppress non-critical |
| cmd/executor/submitter.go | #4 | Skip pre-register for disabled builders |
| cmd/executor/main.go | #5 | Sync-source comment on stateToInt |
| cmd/executor/metrics.go | #5, #9 | Sync-source comment on systemStateGauge; replace precision-loss log.Printf with aether_metrics_precision_loss_total counter |
| cmd/executor/metrics_test.go | #8 | Unique scrape_alpha / scrape_beta prefixes |
| internal/risk/state.go | #5 | Sync-source comment on SystemState constants |
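
The submitter.go change for item #4 amounts to a filter over the builder list. The sketch below is an assumption about its general shape, not the actual code: BuilderConfig, its Name field, and labelsToPreRegister are hypothetical names, and only the Enabled flag comes from the PR text.

```go
package main

import "fmt"

// BuilderConfig is a hypothetical config shape. Only the Enabled field is
// taken from the PR description.
type BuilderConfig struct {
	Name    string
	Enabled bool
}

// labelsToPreRegister returns the builder names whose metric label series
// would be created up front. Disabled builders are skipped, so no series
// exists for them.
func labelsToPreRegister(builders []BuilderConfig) []string {
	names := make([]string, 0, len(builders))
	for _, b := range builders {
		if !b.Enabled {
			continue // intentionally silent: no series for a disabled builder
		}
		names = append(names, b.Name)
	}
	return names
}

func main() {
	builders := []BuilderConfig{
		{Name: "builder_a", Enabled: true},
		{Name: "builder_b", Enabled: false},
	}
	fmt.Println(labelsToPreRegister(builders))
}
```

Presumably the point is that a builder with no pre-registered series gives a per-builder alert such as AetherBuilderDown nothing to evaluate, which is the intentional silence the annotation documents.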

Acceptance Criteria

  • Slack webhook no longer appears in docker inspect aether-alertmanager ARGV (verified live: docker inspect ... .Args shows literal $SLACK_WEBHOOK_URL, not the resolved value)
  • AlertmanagerDown fires when alertmanager scrape target is down (verified live: Prometheus targets API reports alertmanager job UP after compose-up)
  • Bot restart no longer triggers AetherNoOpportunities during the first 30m (gate expression: unless on() ((time() - min(process_start_time_seconds{job="aether-rust"})) < 1800))
  • Disabled builders skipped in PreRegisterBuilderLabels (b.Enabled check), AetherBuilderDown annotation documents the intentional silence
  • Cross-reference comments added at all 4 system_state sites (stateToInt, systemStateGauge, internal/risk/state.go, AetherHalted)
  • All observability-stack images pinned to explicit version tags (v2.54.1, v0.27.0, 10.4.7)
  • Inhibit rule rewritten with operational intent (AetherHalted source → all warning/info targets)
  • Metrics test uses unique builder-name prefixes (scrape_alpha, scrape_beta)
  • Precision-loss log no longer spams during normal operation (replaced by counter)

Test plan

  • cargo build --workspace — green
  • cargo build --release --bin aether-rust — green
  • cargo clippy --workspace --all-targets -- -D warnings — clean
  • cargo test --workspace --release — 415 passed, 0 failed
  • go build ./... — green
  • go vet ./... — clean
  • go test ./... -race -count=1 — green
  • promtool check rules deploy/docker/prometheus/alerts.yml — 8 rules found
  • promtool check config deploy/docker/prometheus.yml — valid syntax
  • amtool check-config deploy/docker/alertmanager.yml — 1 inhibit rule, 1 receiver
  • docker compose up -d prometheus alertmanager — both healthy, alertmanager loads patched config, prometheus scrapes new alertmanager target as UP
  • docker inspect aether-alertmanager — webhook URL absent from .Args
  • Go executor + monitor binaries boot and emit expected JSON logs

Closes #99

Webhook hygiene, alertmanager watchdog, cold-start gating, image
pinning, inhibit rule fix, system_state cross-refs, test isolation,
quieter precision log. See PR body for per-item detail.
@vercel

vercel Bot commented Apr 29, 2026


The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| aether | Ready | Preview, Comment | Apr 29, 2026 8:20am |
| aether-63xv | Ready | Preview, Comment | Apr 29, 2026 8:20am |

0xfandom merged commit 2ca9285 into main on May 5, 2026
7 checks passed
Development

Successfully merging this pull request may close these issues.

[E2/WS-7] Observability PR #83 follow-up: alertmanager watchdog, webhook hygiene, cold-start alert gating
