
Fix monitoring alert rules and log sinks #606

Open: mulatta wants to merge 8 commits into main from monitoring-stack-reliability
mulatta commented Jul 3, 2026

Collaborator

Summary

  • fix Prometheus rules to use current host_* Vector metric names
  • replace ineffective scrape-target-only node alert with per-host remote-write freshness alerts
  • add Prometheus alert for failing Gatus endpoints
  • fix rho/tau Vector local Loki sinks to use the Loki listen address instead of 127.0.0.1
  • exclude unreadable systemd credential mounts from Vector filesystem metrics
  • add a Gatus self-check for status.sjanglab.org
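
The alerting changes in the summary above could look roughly like the following NixOS fragment. This is a minimal sketch, not the rules from this PR: the alert names, thresholds, `for` durations, and the `host` label are assumptions, and the metric names (`host_filesystem_used_ratio`, `host_memory_available_bytes`, `host_memory_total_bytes`, `host_uptime`, `gatus_results_endpoint_success`) are the ones Vector's `host_metrics` source and Gatus normally export, which may differ from what this repository uses.

```nix
# Sketch only: names, thresholds, and labels below are illustrative assumptions.
{ ... }:
let
  # Hosts that push metrics via remote write (named in the commit message).
  remoteWriteHosts = [ "eta" "psi" "tau" ];

  # One freshness alert per host: fires when no sample has arrived recently.
  # A scrape-target `up` check cannot cover these hosts, because they are
  # never scraped.
  freshnessRule = host: {
    alert = "HostMetricsStale";
    expr = ''absent_over_time(host_uptime{host="${host}"}[5m])'';
    for = "5m";
    labels.severity = "critical";
    annotations.summary = "No remote-write samples from ${host} for 5 minutes";
  };
in
{
  # services.prometheus.rules takes a list of rule-file contents as strings;
  # JSON is accepted because it is a subset of YAML.
  services.prometheus.rules = [
    (builtins.toJSON {
      groups = [
        {
          name = "hosts";
          rules = [
            {
              alert = "DiskAlmostFull";
              expr = "host_filesystem_used_ratio > 0.9";
              for = "10m";
              labels.severity = "warning";
            }
            {
              alert = "MemoryLow";
              expr = "host_memory_available_bytes / host_memory_total_bytes < 0.1";
              for = "10m";
              labels.severity = "warning";
            }
          ] ++ map freshnessRule remoteWriteHosts;
        }
        {
          name = "gatus";
          rules = [
            {
              alert = "GatusEndpointFailing";
              expr = "gatus_results_endpoint_success == 0";
              for = "5m";
              labels.severity = "warning";
            }
          ];
        }
      ];
    })
  ];
}
```

Generating the freshness rules from a host list keeps the per-host alerts identical in shape, so adding a remote-write host is a one-line change.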

Validation

  • nix fmt
  • nix eval --impure --json --expr 'let f = builtins.getFlake (toString ./.); in builtins.mapAttrs (_: c: c.config.boot.zfs.forceImportRoot) f.nixosConfigurations'
  • nix run nixpkgs#nix-fast-build -- --flake .#checks --systems x86_64-linux --impure --no-link --fail-fast --no-nom --select 'checks: { inherit (checks.x86_64-linux) nixos-rho nixos-tau nixos-eta; }'

mulatta added 8 commits July 3, 2026 16:27
Prometheus alerts referenced Vector metric names that no longer exist, so disk, memory, and CPU alerts could not fire. Remote-write hosts also do not appear as scrape targets, which made the existing NodeDown rule ineffective for eta, psi, and tau.

Use the current host metric names, add per-host freshness alerts for remote-write data, and alert on failing Gatus endpoints. Point local Vector Loki sinks at the address Loki actually listens on, exclude unreadable systemd credential mounts from filesystem metrics, and add a Gatus self-check so the status page path is visible in Gatus.
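
As an illustration of the Loki sink fix described above, the sink endpoint can be derived from Loki's own configuration instead of hard-coding 127.0.0.1. This is a hedged sketch: the sink name `loki_local` and the input name `journald` are hypothetical, and it assumes `http_listen_address` and `http_listen_port` are set explicitly under `services.loki.configuration.server` on the host.

```nix
# Sketch only: sink and input names are hypothetical.
{ config, ... }:
let
  # Assumes both values are set in the host's Loki configuration.
  lokiServer = config.services.loki.configuration.server;
in
{
  services.vector.settings.sinks.loki_local = {
    type = "loki";
    inputs = [ "journald" ]; # hypothetical source name
    # Follow the address Loki actually binds to rather than 127.0.0.1.
    endpoint = "http://${lokiServer.http_listen_address}:${toString lokiServer.http_listen_port}";
    encoding.codec = "json";
    labels.host = "{{ host }}";
  };
}
```

Reading the address from the Loki configuration means the sink keeps working if the listen address changes later.
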
Keep psi system builds focused on CUDA services instead of pulling i686 graphics compatibility outputs needed only for 32-bit desktop applications.
Avoid noisy host metric collection errors from Docker network namespace bind mounts that Vector cannot stat safely.
Avoid noisy host metric collection errors from Docker overlay mountpoints that Vector cannot stat safely.
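
The mount exclusions from the commits above might be expressed through the `filesystem.mountpoints.excludes` option of Vector's `host_metrics` source. The source name `host_metrics` and the exact glob patterns are assumptions based on the usual locations of systemd credential mounts and Docker's network namespace and overlay mounts; the patterns in this PR may differ.

```nix
# Sketch only: source name and glob patterns are illustrative assumptions.
{ ... }:
{
  services.vector.settings.sources.host_metrics = {
    type = "host_metrics";
    filesystem.mountpoints.excludes = [
      "/run/credentials/*"                # systemd credential mounts (unreadable)
      "/run/docker/netns/*"               # Docker network namespace bind mounts
      "/var/lib/docker/overlay2/*/merged" # Docker overlay mountpoints
    ];
  };
}
```
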
Make Grafana useful immediately after deployment instead of requiring ad-hoc Explore queries for basic host, endpoint, and SSH visibility.
Avoid breaking startup on existing Grafana databases by keeping datasource provisioning compatible with the already-created Prometheus and Loki entries.
Avoid anonymous API and UI failures on fresh deployments where the custom Public organization does not exist.
ProxyJump hides the real client address from internal hosts because they only see eta as the source. Emit a bastion-side audit event that ties the authenticated client IP to each forwarded target so Loki can answer who reached which host.