Fix monitoring alert rules and log sinks #606
Open
mulatta wants to merge 8 commits into
Prometheus alerts referenced Vector metric names that no longer exist, so disk, memory, and CPU alerts could not fire. Remote-write hosts also do not appear as scrape targets, which made the existing NodeDown rule ineffective for eta, psi, and tau. Use the current host metric names, add per-host freshness alerts for remote-write data, and alert on failing Gatus endpoints. Point local Vector Loki sinks at the address Loki actually listens on, exclude unreadable systemd credential mounts from filesystem metrics, and add a Gatus self-check so the status page path is visible in Gatus.
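A per-host freshness alert for remote-write data could look roughly like the sketch below. It is not the rule set from this PR: the metric name `host_uptime`, the `host` label, the 5m window, and the alert and group names are all assumptions chosen for illustration. Because remote-write hosts never appear as scrape targets, the rule checks for missing samples instead of relying on `up`.

```nix
# Sketch only: one "no data received" alert per remote-write host.
# Assumed (not taken from the PR): metric `host_uptime`, label `host`,
# the 5m window, and the alert/group names.
let
  remoteWriteHosts = [ "eta" "psi" "tau" ];

  mkFreshnessRule = host: {
    alert = "RemoteWriteStale";
    # absent_over_time() yields a value only when no sample for the selector
    # arrived within the window; the `host` label is carried over from the
    # equality matcher, so the alert identifies the silent host.
    expr = ''absent_over_time(host_uptime{host="${host}"}[5m])'';
    for = "5m";
    labels.severity = "critical";
    annotations.summary = "No host metrics received from ${host}";
  };
in
{
  # services.prometheus.rules takes rule files as strings; JSON is valid YAML.
  services.prometheus.rules = [
    (builtins.toJSON {
      groups = [
        {
          name = "remote-write-freshness";
          rules = map mkFreshnessRule remoteWriteHosts;
        }
      ];
    })
  ];
}
```

Generating one rule per host keeps the alert firing even when a host has never reported, which a plain threshold on the metric's age would miss once the series goes stale.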
Keep psi system builds focused on CUDA services instead of pulling i686 graphics compatibility outputs needed only for 32-bit desktop applications.
Avoid noisy host metric collection errors from Docker network namespace bind mounts that Vector cannot stat safely.
Avoid noisy host metric collection errors from Docker overlay mountpoints that Vector cannot stat safely.
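The mountpoint exclusions described in these commits might be expressed along the following lines. This is a sketch under assumptions: the source name `host_metrics` and the exact glob patterns are illustrative, based on the usual locations of systemd credential mounts, Docker network namespace bind mounts, and Docker overlay mountpoints, and may differ from what the PR uses.

```nix
# Sketch only: skip mountpoints that Vector's filesystem collector cannot
# stat safely. Source name and glob patterns are assumptions.
{
  services.vector.settings.sources.host_metrics = {
    type = "host_metrics";
    filesystem.mountpoints.excludes = [
      "/run/credentials/*"               # systemd credential mounts
      "/run/docker/netns/*"              # Docker network namespace bind mounts
      "/var/lib/docker/overlay2/*/merged" # Docker overlay mountpoints
    ];
  };
}
```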
Make Grafana useful immediately after deployment instead of requiring ad-hoc Explore queries for basic host, endpoint, and SSH visibility.
Avoid breaking startup on existing Grafana databases by keeping datasource provisioning compatible with the already-created Prometheus and Loki entries.
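One way to keep provisioning compatible with existing entries is to reuse the names Grafana already has and avoid assigning a different `uid`. The sketch below assumes that approach; the datasource names come from the commit message, while the URLs (default local Prometheus and Loki ports) are assumptions and not taken from the PR.

```nix
# Sketch only: provision datasources under the names that already exist in
# the Grafana database, without forcing a new uid. URLs are assumed defaults.
{
  services.grafana.provision.datasources.settings = {
    apiVersion = 1;
    datasources = [
      {
        name = "Prometheus";
        type = "prometheus";
        access = "proxy";
        url = "http://127.0.0.1:9090";
        isDefault = true;
      }
      {
        name = "Loki";
        type = "loki";
        access = "proxy";
        url = "http://127.0.0.1:3100";
      }
    ];
  };
}
```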
Avoid anonymous API and UI failures on fresh deployments where the custom Public organization does not exist.
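A minimal sketch of anonymous access that does not depend on a custom organization is shown below. It assumes the fix is to point anonymous users at Grafana's built-in default organization, which is named `Main Org.` on a fresh install; the PR may solve this differently, and the `Viewer` role is likewise an assumption.

```nix
# Sketch only: anonymous read-only access bound to the default organization,
# which always exists on a fresh Grafana database.
{
  services.grafana.settings."auth.anonymous" = {
    enabled = true;
    org_name = "Main Org."; # Grafana's default organization name
    org_role = "Viewer";    # assumed role; read-only
  };
}
```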
ProxyJump hides the real client address from internal hosts because they only see eta as the source. Emit a bastion-side audit event that ties the authenticated client IP to each forwarded target so Loki can answer who reached which host.
Summary
host_* Vector metric names
127.0.0.1
status.sjanglab.org
Validation
nix fmt
nix eval --impure --json --expr 'let f = builtins.getFlake (toString ./.); in builtins.mapAttrs (_: c: c.config.boot.zfs.forceImportRoot) f.nixosConfigurations'
nix run nixpkgs#nix-fast-build -- --flake .#checks --systems x86_64-linux --impure --no-link --fail-fast --no-nom --select 'checks: { inherit (checks.x86_64-linux) nixos-rho nixos-tau nixos-eta; }'