chore(observability): PR #83 follow-ups (webhook hygiene, alertmanager watchdog, cold-start gating + 6 nits) #116
Merged
Webhook hygiene, alertmanager watchdog, cold-start gating, image pinning, inhibit rule fix, system_state cross-refs, test isolation, quieter precision log. See PR body for per-item detail.
Summary
- Use `$$SLACK_WEBHOOK_URL` in the alertmanager entrypoint and swap the sed delimiter to `#` so the resolved URL no longer ends up in `docker inspect` ARGV. The entrypoint also fails fast with a loud message when the env is unset.
- Prometheus now scrapes `alertmanager:9093`, and a new `AlertmanagerDown` rule fires after 2m of `up{job="alertmanager"} == 0`. This closes the gap where a config-validation crashloop silently took Slack delivery offline.
- `AetherNoOpportunities` is now gated behind a 30m warm-up via `process_start_time_seconds`, so a fresh boot or restart no longer pages operators while the rate window is still building.
- `NewSubmitter` skips `PreRegisterBuilderLabels` for builders configured with `Enabled: false`, and the `AetherBuilderDown` annotation now documents the intentional silence.
- Add `SYNC SOURCE` comments at every site that depends on the integer mapping (`stateToInt`, gauge Help text, `internal/risk/state.go` constants, `AetherHalted` rule) so a future state renumber surfaces in code review.
- Pin images to `prom/prometheus:v2.54.1`, `prom/alertmanager:v0.27.0`, `grafana/grafana:10.4.7` instead of `:latest`.
- The previous inhibit rule's `equal: [alertname]` was a no-op (each alert has a unique alertname); the new rule suppresses every warning/info alert while `AetherHalted` is firing, so operators only see the halt page until manual reset.
- `TestRecordBuilderResult_ScrapeLabels` switched from real builder names (`flashbots`, `titan`) to `scrape_alpha`/`scrape_beta` to avoid global-registry pollution that would have broken later aggregate-rate assertions.
- `addBigIntCounter` no longer emits a per-bundle `log.Printf` once cumulative wei crosses 2^53; the event is exposed instead as the new `aether_metrics_precision_loss_total` counter, so it stays dashboardable without drowning executor logs.

Files Changed
| File | Change |
| --- | --- |
| `deploy/docker/docker-compose.yml` | `$$SLACK_WEBHOOK_URL`, swap sed delimiter, fail-fast on empty env, pin prom/alertmanager/grafana tags |
| `deploy/docker/prometheus.yml` | `alertmanager` scrape job |
| `deploy/docker/prometheus/alerts.yml` | `AlertmanagerDown` rule, warm-up gate on `AetherNoOpportunities`, doc note on `AetherBuilderDown`, sync-source comment on `AetherHalted` |
| `deploy/docker/alertmanager.yml` | `AetherHalted` → suppress non-critical |
| `cmd/executor/submitter.go` | |
| `cmd/executor/main.go` | `stateToInt` |
| `cmd/executor/metrics.go` | `systemStateGauge`; replace precision-loss `log.Printf` with `aether_metrics_precision_loss_total` counter |
| `cmd/executor/metrics_test.go` | `scrape_alpha`/`scrape_beta` prefixes |
| `internal/risk/state.go` | `SystemState` constants |

Acceptance Criteria
- Webhook URL absent from `docker inspect aether-alertmanager` ARGV (verified live: `docker inspect ... .Args` shows literal `$SLACK_WEBHOOK_URL`, not the resolved value)
- `AlertmanagerDown` fires when the alertmanager scrape target is down (verified live: Prometheus targets API reports `alertmanager` job UP after compose-up)
- No `AetherNoOpportunities` during the first 30m (gate expression: `unless on() ((time() - min(process_start_time_seconds{job="aether-rust"})) < 1800)`)
- Disabled builders skip `PreRegisterBuilderLabels` (`b.Enabled` check), `AetherBuilderDown` annotation documents the intentional silence
- `SYNC SOURCE` comments at every mapping site (`stateToInt`, `systemStateGauge`, `internal/risk/state.go`, `AetherHalted`)
- Images pinned (`v2.54.1`, `v0.27.0`, `10.4.7`)
- Inhibit rule fixed (`AetherHalted` source → all warning/info targets)
- Test isolation (`scrape_alpha`, `scrape_beta`)

Test plan
- `cargo build --workspace`: green
- `cargo build --release --bin aether-rust`: green
- `cargo clippy --workspace --all-targets -- -D warnings`: clean
- `cargo test --workspace --release`: 415 passed, 0 failed
- `go build ./...`: green
- `go vet ./...`: clean
- `go test ./... -race -count=1`: green
- `promtool check rules deploy/docker/prometheus/alerts.yml`: 8 rules found
- `promtool check config deploy/docker/prometheus.yml`: valid syntax
- `amtool check-config deploy/docker/alertmanager.yml`: 1 inhibit rule, 1 receiver
- `docker compose up -d prometheus alertmanager`: both healthy, alertmanager loads patched config, prometheus scrapes new alertmanager target as UP
- `docker inspect aether-alertmanager`: webhook URL absent from `.Args`

Closes #99