
feat(uptime): hourly rollup + 30d raw retention for uptime snapshots#35

Draft
ankitgoswami wants to merge 4 commits into `main` from `ankitg/uptime-hourly-rollup`

Conversation

@ankitgoswami
Contributor

Summary

  • Add miner_state_snapshots_hourly — a TimescaleDB continuous aggregate on the per-device snapshot hypertable merged in #17 (feat(uptime): 3-state uptime chart), keyed by (hour, org, device) with last(state, time). One row per device per hour.
  • Shrink raw miner_state_snapshots retention from 1 y → 30 d. Anything older is served from the rollup.
  • New sqlc query GetMinerStateSnapshotsHourly with the same aggregation shape as the raw query (DISTINCT ON → SUM-by-state), and a router in the uptime read path that picks raw vs. rollup based on startTime.
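A minimal sketch of the continuous aggregate described above, assuming the view shape from the summary bullet (the org/device column names here are guesses from the rest of the PR text, and this is not the literal 000034 migration):

```sql
-- Sketch only: one row per device per hour, carrying the last observed
-- state in that hour. Column names organization_id / device_identifier
-- are assumptions; device_identifier appears in the compression settings.
CREATE MATERIALIZED VIEW miner_state_snapshots_hourly
WITH (timescaledb.continuous) AS
SELECT
    time_bucket('1 hour', time) AS hour,
    organization_id,
    device_identifier,
    last(state, time) AS state
FROM miner_state_snapshots
GROUP BY hour, organization_id, device_identifier
WITH NO DATA;
```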

Rollup

Why

Follow-up to #17's code-review concern that per-device minute snapshots grow unsustainably on large fleets. Post-merge math:

| Fleet | 1 y raw | 30 d raw (this PR) | Hourly rollup, 1 y |
| --- | --- | --- | --- |
| 1 k | ~37 GB | ~3 GB | ~600 MB |
| 10 k | ~370 GB | ~30 GB | ~6 GB |
| 100 k | ~3.7 TB | ~300 GB | ~60 GB |

(Raw numbers; cold chunks compress another ~10× via segmentby=device_identifier.) The cap shifts from "how much hot storage can we keep per miner" to "how much cheap-compressed hourly history can we keep per miner" — a much better trade.

Design notes

  • Continuous aggregate, not a cron. The prior uptime chart couldn't use a CAGG because its classifier joined non-hypertable sources (device_pairing, device_status, errors). After #17 the source is a hypertable, so CAGGs are back on the table — and they're strictly better than a cron-driven rollup: incremental materialization, refresh policy, compression, and retention are all managed by TimescaleDB.
  • last(state, time) per hour. Keeps one state value per device per hour. The read query's DISTINCT ON + SUM-by-state already handles further aggregation into larger bar intervals (daily / weekly), so the rollup doesn't need to pre-compute per-bar counts — it just compresses the time axis.
  • Routing by startTime, not bucket size. Raw is available for the last 30 d regardless of bucket size; older windows must use the rollup (the raw rows are gone). One hour of slack on the cutoff ensures boundary queries don't race the retention policy.
  • Unknown state (4) still excluded. Both queries SUM only states 0..3; the CAGG carries state 4 through but it drops out of every bucket count — same semantics as CountMinersByState.

Critical files

Test plan

  • go build ./..., go vet ./..., golangci-lint run clean on server/internal/... and server/cmd/....
  • go test ./internal/domain/telemetry/... + ./internal/handlers/telemetry/... green.
  • Pure-Go router test TestUseHourlyRollup covers 1d / 29d / 31d / 1y windows.
  • Integration test for the rollup against a seeded fleet (needs local stack with a few hours of snapshots).
  • Manual on local stack:
    • just db-migrate applies 000034, CAGG materializes.
    • Dashboard 24h / 7d views still hit raw (unchanged behavior).
    • 1y view renders without erroring. Bars at hourly resolution for history older than ~30 d.
    • SELECT COUNT(*) FROM miner_state_snapshots stays bounded (confirm retention policy kicks in after 30d mark; the new policy takes effect immediately for out-of-window rows).

Compatibility / deploy notes

  • The migration only adds a CAGG and updates the retention policy on the existing raw table. No schema change to miner_state_snapshots itself, no code path changes on the write side. Safe to deploy incrementally.
  • On first migration, the CAGG is created WITH NO DATA; the refresh policy backfills incrementally (mirroring 000016_recreate_metrics_aggregates). Queries against long windows will return empty results until the refresh job has covered the backfill window — same trade-off the existing metric CAGGs accept.

🤖 Generated with Claude Code

Adds miner_state_snapshots_hourly, a continuous aggregate on the per-device
snapshot table keyed by (hour, org, device) with last(state, time) — one row
per device per hour. Chart queries whose window predates the 30-day raw
retention now read the rollup instead of the raw hypertable, which is orders
of magnitude cheaper at scale and unblocks shrinking raw retention.

Schema
- CREATE MATERIALIZED VIEW ... WITH (timescaledb.continuous)
- Refresh policy matching the existing hourly CAGGs (30m schedule, 1h offset)
- Compression segmentby=device_identifier, retention 1 year
- Raw miner_state_snapshots retention dropped from 1y to 30d
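The policy list above could be expressed roughly as follows. The function names are standard TimescaleDB policy calls; the exact offsets beyond the stated "30m schedule, 1h offset" are assumptions, and this is not the literal migration SQL:

```sql
-- Sketch of the policies listed above (offsets partly assumed).
SELECT add_continuous_aggregate_policy('miner_state_snapshots_hourly',
    start_offset      => INTERVAL '3 hours',
    end_offset        => INTERVAL '1 hour',
    schedule_interval => INTERVAL '30 minutes');

ALTER MATERIALIZED VIEW miner_state_snapshots_hourly
    SET (timescaledb.compress,
         timescaledb.compress_segmentby = 'device_identifier');

SELECT add_retention_policy('miner_state_snapshots_hourly', INTERVAL '1 year');

-- Raw table: shrink retention from 1 year to 30 days.
SELECT remove_retention_policy('miner_state_snapshots');
SELECT add_retention_policy('miner_state_snapshots', INTERVAL '30 days');
```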

Read path
- uptimeSnapshotRawRetention const with 1h slack
- useHourlyRollup(startTime) routes <=30d to raw, older to rollup
- queryUptimeRaw / queryUptimeHourly share the result-shape translation

Storage rough numbers (compressed, 10x): 1k miners -> ~4 GB raw + rollup;
10k -> ~40 GB; 100k -> ~400 GB. Raw retention cut to 30d saves 12x hot
storage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Split uptimeSnapshotRawRetentionPolicy (the migration contract) from
  uptimeSnapshotRawRetention (router cutoff with slack) so a future retention
  bump is a single const change and the coupling to the migration is called
  out explicitly.
- Replace the magic sql.NullString{"1", true} sentinel used to activate
  narg filters with a named package-level var.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

github-actions Bot commented Apr 22, 2026

🔐 Codex Security Review

Note: This is an automated security-focused code review generated by Codex.
It should be used as a supplementary check alongside human review.
False positives are possible - use your judgment.

Scope summary

  • Reviewed pull request diff only (a2f3e8ee1a521b70e65906d7ed9308096a4a1d1e...cffd5a8d9a3eaf3a3d33d86d96ee93b2e140714a, exact PR three-dot diff)
  • Model: gpt-5.4



Review Summary

Overall Risk: HIGH

Findings

[HIGH] Migration can permanently discard existing uptime history before the new rollups are populated

  • Category: Reliability
  • Location: server/migrations/000034_miner_state_snapshots_rollups.up.sql:9
  • Description: The migration creates both continuous aggregates WITH NO DATA, only schedules refreshes over the last 1 day / 7 days, and then immediately shrinks raw miner_state_snapshots retention from 1 year to 30 days. On an environment upgraded on April 22, 2026 that already has January-March snapshot history, buckets older than those refresh lookbacks are never materialized into the new views before raw chunks older than March 23, 2026 become eligible for deletion.
  • Impact: Existing 30d/90d/1y uptime history can be silently truncated or lost after the retention job runs. The application path just returns empty uptime counts, so this will look like missing chart data rather than a failed migration.
  • Recommendation: Explicitly backfill both aggregates across the full existing raw retention window with refresh_continuous_aggregate before reducing raw retention, or create them with data if the migration window can tolerate it. If needed, split the retention reduction into a later migration after backfill completion is verified.
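The recommended backfill could look like the fragment below; `refresh_continuous_aggregate` is the standard TimescaleDB procedure, and the view names follow this PR, but the exact window bounds are illustrative:

```sql
-- Reviewer-suggested fix: materialize the full existing history into both
-- aggregates (NULL start = from the beginning of the data) BEFORE the raw
-- retention shrinks, so old chunks can be purged without losing history.
CALL refresh_continuous_aggregate('miner_state_snapshots_hourly', NULL, now() - INTERVAL '1 hour');
CALL refresh_continuous_aggregate('miner_state_snapshots_daily',  NULL, now() - INTERVAL '1 day');
```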

[MEDIUM] New uptime routing chooses the source table by window width only, so historical short-range queries can hit already-purged data

  • Category: Reliability
  • Location: server/internal/infrastructure/timescaledb/telemetry_store.go:1295
  • Description: selectUptimeDataSource looks only at end-start. After this PR, raw snapshots retain 30 days and hourly rollups retain 3 months. As of April 22, 2026, a 12-hour query for February 20, 2026 still routes to raw, and a 5-day query for January 1-5, 2026 routes to hourly, even though those sources are already outside retention. miner_state_snapshots_daily still retains 3 years, but this code never falls back to it.
  • Impact: Historical uptime charts for short windows can come back empty even when the backend still has enough daily data to answer at a coarser resolution.
  • Recommendation: Make source selection retention-aware: choose the highest-resolution source whose retention fully covers the requested absolute time range, or cascade raw -> hourly -> daily when the preferred source is out of retention. Add tests around the 30-day and 90-day cutoffs.

Notes

  • Review scope was limited to .git/codex-review.diff.
  • The diff does not touch auth/JWT, command execution, network discovery, plugin boundaries, frontend rendering, or pool/wallet configuration. I did not find any cryptostealing or pool-hijack changes in scope.
  • The new test coverage only checks duration thresholds. It does not exercise migration backfill, retention cutover, or historical-range routing.
  • I could not run the targeted Go tests in this sandbox because the filesystem is read-only and Go could not create its module/cache directories.
  • Finding 1 assumes this migration can land on environments that already contain miner_state_snapshots older than the policy lookback windows; if this feature has never been deployed with live history, the impact is lower.

Generated by Codex Security Review |
Triggered by: @ankitgoswami |
Review workflow run

ankitgoswami and others added 2 commits April 22, 2026 11:12
Follow the existing CAGG routing idiom (selectDataSource + String + switch)
for the raw-vs-hourly decision on miner_state_snapshots. Makes the two
data-source routers in this file visually parallel and extends cleanly if
a future daily rollup is added.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extends the miner_state_snapshots rollup story to match the two-tier CAGG
pattern used by device_metrics and device_status:

- New miner_state_snapshots_daily CAGG sourced from raw (no CAGG-on-CAGG;
  matches the codebase convention). Policies mirror device_metrics_daily:
  start_offset=7d, end_offset=1d, schedule=6h, compression=7d, retention=3y.
- Hourly CAGG retention adjusted to 3 months to match device_metrics_hourly.
- Router rewritten to duration-based selectUptimeDataSource, mirroring
  selectDataSource (<=24h raw, <=10d hourly, >10d daily). Drops the
  start-time-age consts I had; the new shape is symmetric with the other
  router a few lines above in the same file.
- New GetMinerStateSnapshotsDaily sqlc query + sqlc.yaml column overrides.
- queryUptimeDaily parallels queryUptimeHourly; switch-dispatch clamps
  bucket duration per source (1 min / 1 h / 1 d).

Migration file renamed to 000034_miner_state_snapshots_rollups.{up,down}.sql
since it now owns both CAGGs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>