feat(scripts): discovery_snapshot.py — daily Discovery-Tracking cron (P3.1)#56
Merged
Discovery-Tracking P3.1 per SPEC docs/specs/2026-05-21_discovery-tracking-baseline-SPEC.md §3.5 + §5.2.
Self-contained daily cron script. Captures 5 sources into the discovery_snapshots table (migration in PR #55):
- self_probes: GET 4 Discovery surfaces (sitemap.xml URL-count, llms.txt MoltGuard-block, /guard/openapi.json path-count, /extendedAgentCard MoltGuard-extensions)
- bot_hits: parse /var/log/nginx/access.log* (last 7d), bot-UA × endpoint-class. moltstack is in the `adm` group, so cron reads the logs without sudo. Privacy §3.7: no IPs persisted, only UA-counts.
- github: GH_TOKEN-authenticated repo + traffic API, 6 MoltyCel repos. Graceful "pat-not-configured" if GH_TOKEN is absent.
- gsc: manual-pending (V0 per §9.1).
- errors: non-fatal failures collected; source_run_status ok/partial/failed computed accordingly.
Idempotency: UPSERT ON CONFLICT (snapshot_at) DO UPDATE. Repeated same-day runs refresh the row and never create a second one. The DB literal is dollar-quoted ($disco$), so it is injection-safe without escaping.
Alerts: Telegram on partial/failed status (TELEGRAM_BOT_TOKEN/CHAT_ID from ~/.moltrust_secrets).
Flags:
- --dry-run: assemble + print, no DB write
- --date YYYY-MM-DD: override snapshot_at (backfill / throwaway test)
Test-Run verified 2026-05-21 against throwaway date 2099-12-31: 4/4 probes, 16 bots / 1664 hits, 6/6 GitHub repos, upsert ok, throwaway row deleted, baseline 2026-05-21 untouched.
Crontab entry (server-side, NOT repo-managed per CLAUDE.md §Geltungsbereich; applied manually post-merge with audit note):
    30 0 * * * set -a && source /home/moltstack/.moltrust_secrets && set +a \
      && cd /home/moltstack/moltstack \
      && /home/moltstack/moltstack/venv/bin/python scripts/discovery_snapshot.py \
      >> logs/discovery_snapshot.log 2>&1
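Summary
Discovery-Tracking P3.1: CRON. Self-contained daily script scripts/discovery_snapshot.py that writes one snapshot per day into the discovery_snapshots table (migration from PR #55). Per SPEC docs/specs/2026-05-21_discovery-tracking-baseline-SPEC.md §3.5 + §5.2.
What the script does (5 sources)
- self_probes: GET 4 Discovery surfaces (sitemap.xml URL-count, llms.txt MoltGuard-block, /guard/openapi.json path-count, /extendedAgentCard MoltGuard-extensions)
- bot_hits: /var/log/nginx/access.log* (last 7d), bot-UA × endpoint-class; moltstack is in the adm group, so no sudo is needed
- github: repo + traffic API for 6 MoltyCel repos, GH_TOKEN from ~/.moltrust_secrets; graceful pat-not-configured if absent
- gsc: manual-pending (V0 per §9.1)
- errors: non-fatal failures collected; source_run_status ok/partial/failed
Idempotency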
INSERT … ON CONFLICT (snapshot_at) DO UPDATE: a second call on the same day updates the existing row and never creates a second one. The DB literal is dollar-quoted ($disco$), so it is injection-safe without escaping.
Privacy (§3.7)
The nginx parser aggregates only User-Agent × endpoint-class. No IPs go into the payload. moltstack reads the logs through adm group membership instead of sudo, which keeps privileges minimal.
Alerts
Telegram on partial/failed (TELEGRAM_BOT_TOKEN/CHAT_ID from secrets).
Flags
- --dry-run: assemble + print, no DB write
- --date YYYY-MM-DD: snapshot_at override (backfill + throwaway test)
Test-Run (verified 2026-05-21)
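Against throwaway date 2099-12-31 (full path incl. DB upsert): 4/4 probes, 16 bots / 1664 hits, 6/6 GitHub repos, upsert ok, throwaway row deleted, baseline 2026-05-21 untouched.
Exactly the "test run, then delete; no second snapshot today" from the sprint assignment.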
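The two flags exercised in this test run could be parsed roughly as follows. This is a minimal sketch, not the script's actual code: the function names `parse_args` and `resolve_snapshot_date` are hypothetical, and defaulting to today's UTC date when `--date` is absent is an assumption.

```python
import argparse
from datetime import date, datetime, timezone


def parse_args(argv=None):
    """Parse the two flags named in the PR; helper name is hypothetical."""
    parser = argparse.ArgumentParser(description="Daily discovery snapshot (sketch)")
    parser.add_argument(
        "--dry-run",
        action="store_true",
        help="assemble and print the snapshot, skip the DB write",
    )
    parser.add_argument(
        "--date",
        type=date.fromisoformat,  # rejects anything that is not YYYY-MM-DD
        default=None,
        help="override snapshot_at (backfill or throwaway test)",
    )
    return parser.parse_args(argv)


def resolve_snapshot_date(args):
    """Use --date when given; otherwise today's UTC date (an assumption)."""
    if args.date is not None:
        return args.date
    return datetime.now(timezone.utc).date()


if __name__ == "__main__":
    demo = parse_args(["--dry-run", "--date", "2099-12-31"])
    print(resolve_snapshot_date(demo).isoformat(), demo.dry_run)
```

Parsing the date at the argparse boundary means a malformed `--date` fails before any probe or DB work starts, which suits a throwaway-date test like the one above.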
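Crontab entry (server-side, NOT repo-managed)
Per CLAUDE.md §Geltungsbereich, cron is server infrastructure and not repo-managed. Added manually after merge, with an audit entry:
    30 0 * * * set -a && source /home/moltstack/.moltrust_secrets && set +a \
      && cd /home/moltstack/moltstack \
      && /home/moltstack/moltstack/venv/bin/python scripts/discovery_snapshot.py \
      >> logs/discovery_snapshot.log 2>&1
Daily at 00:30 UTC (avoids the 03:00 backup window). First real cron fire: 2026-05-22 (baseline 2026-05-21 stays frozen).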
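Because cron (or a manual re-run) may fire the script more than once on the same day, the idempotent upsert matters here. A minimal sketch of how such a statement could be assembled is shown below. The `payload` column name, the `jsonb` cast, and the function name `build_upsert_sql` are assumptions for illustration; only the table name, the conflict target `snapshot_at`, and the `$disco$` tag come from this PR. The delimiter check is an extra guard added in this sketch, since a dollar-quoted literal is only safe while the body does not contain its own delimiter.

```python
import json
from datetime import date


def build_upsert_sql(snapshot_at: date, payload: dict, tag: str = "disco") -> str:
    """Build an idempotent upsert with a dollar-quoted JSON literal (sketch)."""
    body = json.dumps(payload, sort_keys=True)
    delimiter = f"${tag}$"
    # Guard added in this sketch: the literal would end early if the body
    # contained the delimiter itself.
    if delimiter in body:
        raise ValueError(f"payload contains the quote delimiter {delimiter}")
    return (
        # Column name `payload` and the jsonb cast are hypothetical.
        "INSERT INTO discovery_snapshots (snapshot_at, payload)\n"
        f"VALUES ('{snapshot_at.isoformat()}', {delimiter}{body}{delimiter}::jsonb)\n"
        "ON CONFLICT (snapshot_at) DO UPDATE SET payload = EXCLUDED.payload;"
    )


if __name__ == "__main__":
    # Single quotes in the payload need no escaping inside a dollar-quoted literal.
    print(build_upsert_sql(date(2099, 12, 31), {"note": "it's a test"}))
```

Running the generated statement twice for the same `snapshot_at` leaves one row holding the most recent payload, which matches the same-day refresh behaviour described under Idempotency.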
Pre-Commit-Diff (§8)
Exactly 1 new file, following the scripts/ convention (like endpoint_probe.py, daily_stats.sh), no out-of-scope changes.
§2.3 Cross-Review
Skip: read-only tracking, no auth or credential path changed. Reads GH_TOKEN (read-only API) and nginx logs (adm group), and writes aggregated metrics. Re-evaluate if GSC OAuth is added later (P4).
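The skip decision leans on the log reader persisting only aggregated counts. A minimal sketch of that kind of aggregation follows, assuming nginx's default combined log format. The bot marker list, the endpoint class names, and all function names are hypothetical and not taken from the script.

```python
import re
from collections import Counter

# Combined log format:
# ip - user [time] "METHOD path proto" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "\S+ (?P<path>\S+)[^"]*" \d{3} \S+ '
    r'"[^"]*" "(?P<ua>[^"]*)"'
)

# Hypothetical marker list; the real script's bot list is not shown in this PR.
BOT_MARKERS = ("GPTBot", "ClaudeBot", "Googlebot")

# Hypothetical class names for the Discovery surfaces named in this PR.
ENDPOINT_CLASSES = (
    ("/sitemap.xml", "sitemap"),
    ("/llms.txt", "llms"),
    ("/guard/openapi.json", "openapi"),
    ("/extendedAgentCard", "agent_card"),
)


def classify_endpoint(path):
    """Map a request path to a coarse endpoint class."""
    for prefix, name in ENDPOINT_CLASSES:
        if path.startswith(prefix):
            return name
    return "other"


def match_bot(user_agent):
    """Return the first bot marker found in the UA string, or None."""
    lowered = user_agent.lower()
    for marker in BOT_MARKERS:
        if marker.lower() in lowered:
            return marker
    return None


def aggregate_bot_hits(lines):
    """Count hits per (bot, endpoint class); the client IP is never captured."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that are not in combined format
        bot = match_bot(match.group("ua"))
        if bot is not None:
            counts[(bot, classify_endpoint(match.group("path")))] += 1
    return counts
```

The regex has no capture group for the address field, so an IP cannot leak into the result even by accident; the output keys contain only a bot name and an endpoint class.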
Branch-Hygiene (§11.4)
Branch from fresh origin/main (2298618, 0 behind), worktree ~/moltrust-api-J.
Test plan
- git pull (script lands via the repo)