
feat(uptime): hourly rollup + 30d raw retention for uptime snapshots#35

Draft
ankitgoswami wants to merge 4 commits into `main` from `ankitg/uptime-hourly-rollup`

Conversation

@ankitgoswami
Contributor

Summary

  • Add miner_state_snapshots_hourly — a TimescaleDB continuous aggregate on the per-device snapshot hypertable merged in #17 (feat(uptime): 3-state uptime chart), keyed by (hour, org, device) with last(state, time). One row per device per hour.
  • Shrink raw miner_state_snapshots retention from 1 y → 30 d. Anything older is served from the rollup.
  • New sqlc query GetMinerStateSnapshotsHourly with the same aggregation shape as the raw query (DISTINCT ON → SUM-by-state), and a router in the uptime read path that picks raw vs. rollup based on startTime.
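A minimal sketch of the continuous aggregate described above, assuming the view shape from the summary bullet (the org/device column names here are guesses from the rest of the PR text, and this is not the literal 000034 migration):

```sql
-- Sketch only: one row per device per hour, carrying the last observed
-- state in that hour. Column names organization_id / device_identifier
-- are assumptions; device_identifier appears in the compression settings.
CREATE MATERIALIZED VIEW miner_state_snapshots_hourly
WITH (timescaledb.continuous) AS
SELECT
    time_bucket('1 hour', time) AS hour,
    organization_id,
    device_identifier,
    last(state, time) AS state
FROM miner_state_snapshots
GROUP BY hour, organization_id, device_identifier
WITH NO DATA;
```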

Rollup

Why

Follow-up to #17's code-review concern that per-device minute snapshots grow unsustainably on large fleets. Post-merge math:

| Fleet | 1 y raw | 30 d raw (this PR) | Hourly rollup, 1 y |
| --- | --- | --- | --- |
| 1 k | ~37 GB | ~3 GB | ~600 MB |
| 10 k | ~370 GB | ~30 GB | ~6 GB |
| 100 k | ~3.7 TB | ~300 GB | ~60 GB |

(Raw numbers; cold chunks compress another ~10× via segmentby=device_identifier.) The cap shifts from "how much hot storage can we keep per miner" to "how much cheap-compressed hourly history can we keep per miner" — a much better trade.

Design notes

  • Continuous aggregate, not a cron. The prior uptime chart couldn't use a CAGG because its classifier joined non-hypertable sources (device_pairing, device_status, errors). After #17 the source is a hypertable, so CAGGs are back on the table — and they're strictly better than a cron-driven rollup: incremental materialization, refresh policy, compression, and retention are all managed by TimescaleDB.
  • last(state, time) per hour. Keeps one state value per device per hour. The read query's DISTINCT ON + SUM-by-state already handles further aggregation into larger bar intervals (daily / weekly), so the rollup doesn't need to pre-compute per-bar counts — it just compresses the time axis.
  • Routing by startTime, not bucket size. Raw is available for the last 30 d regardless of bucket size; older windows must use the rollup (the raw rows are gone). One hour of slack on the cutoff ensures boundary queries don't race the retention policy.
  • Unknown state (4) still excluded. Both queries SUM only states 0..3; the CAGG carries state 4 through but it drops out of every bucket count — same semantics as CountMinersByState.

Critical files

Test plan

  • go build ./..., go vet ./..., golangci-lint run clean on server/internal/... and server/cmd/....
  • go test ./internal/domain/telemetry/... + ./internal/handlers/telemetry/... green.
  • Pure-Go router test TestUseHourlyRollup covers 1d / 29d / 31d / 1y windows.
  • Integration test for the rollup against a seeded fleet (needs local stack with a few hours of snapshots).
  • Manual on local stack:
    • just db-migrate applies 000034, CAGG materializes.
    • Dashboard 24h / 7d views still hit raw (unchanged behavior).
    • 1y view renders without erroring. Bars at hourly resolution for history older than ~30 d.
    • SELECT COUNT(*) FROM miner_state_snapshots stays bounded (confirm retention policy kicks in after 30d mark; the new policy takes effect immediately for out-of-window rows).

Compatibility / deploy notes

  • The migration only adds a CAGG and updates the retention policy on the existing raw table. No schema change to miner_state_snapshots itself, no code path changes on the write side. Safe to deploy incrementally.
  • On first migration, the CAGG is created WITH NO DATA; the refresh policy backfills incrementally (mirroring 000016_recreate_metrics_aggregates). Queries against long windows will return empty results until the refresh job has covered the backfill window — same trade-off the existing metric CAGGs accept.

🤖 Generated with Claude Code

Adds miner_state_snapshots_hourly, a continuous aggregate on the per-device
snapshot table keyed by (hour, org, device) with last(state, time) — one row
per device per hour. Chart queries whose window predates the 30-day raw
retention now read the rollup instead of the raw hypertable, which is orders
of magnitude cheaper at scale and unblocks shrinking raw retention.

Schema
- CREATE MATERIALIZED VIEW ... WITH (timescaledb.continuous)
- Refresh policy matching the existing hourly CAGGs (30m schedule, 1h offset)
- Compression segmentby=device_identifier, retention 1 year
- Raw miner_state_snapshots retention dropped from 1y to 30d
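The policy list above could be expressed roughly as follows. The function names are standard TimescaleDB policy calls; the exact offsets beyond the stated "30m schedule, 1h offset" are assumptions, and this is not the literal migration SQL:

```sql
-- Sketch of the policies listed above (offsets partly assumed).
SELECT add_continuous_aggregate_policy('miner_state_snapshots_hourly',
    start_offset      => INTERVAL '3 hours',
    end_offset        => INTERVAL '1 hour',
    schedule_interval => INTERVAL '30 minutes');

ALTER MATERIALIZED VIEW miner_state_snapshots_hourly
    SET (timescaledb.compress,
         timescaledb.compress_segmentby = 'device_identifier');

SELECT add_retention_policy('miner_state_snapshots_hourly', INTERVAL '1 year');

-- Raw table: shrink retention from 1 year to 30 days.
SELECT remove_retention_policy('miner_state_snapshots');
SELECT add_retention_policy('miner_state_snapshots', INTERVAL '30 days');
```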

Read path
- uptimeSnapshotRawRetention const with 1h slack
- useHourlyRollup(startTime) routes <=30d to raw, older to rollup
- queryUptimeRaw / queryUptimeHourly share the result-shape translation

Storage rough numbers (compressed, 10x): 1k miners -> ~4 GB raw + rollup;
10k -> ~40 GB; 100k -> ~400 GB. Raw retention cut to 30d saves 12x hot
storage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Split uptimeSnapshotRawRetentionPolicy (the migration contract) from
  uptimeSnapshotRawRetention (router cutoff with slack) so a future retention
  bump is a single const change and the coupling to the migration is called
  out explicitly.
- Replace the magic sql.NullString{"1", true} sentinel used to activate
  narg filters with a named package-level var.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

github-actions Bot commented Apr 22, 2026

🔐 Codex Security Review

Note: This is an automated security-focused code review generated by Codex.
It should be used as a supplementary check alongside human review.
False positives are possible - use your judgment.

Scope summary

  • Reviewed pull request diff only (a2f3e8ee1a521b70e65906d7ed9308096a4a1d1e...cffd5a8d9a3eaf3a3d33d86d96ee93b2e140714a, exact PR three-dot diff)
  • Model: gpt-5.4



Review Summary

Overall Risk: HIGH

Findings

[HIGH] Migration can permanently discard existing uptime history before the new rollups are populated

  • Category: Reliability
  • Location: server/migrations/000034_miner_state_snapshots_rollups.up.sql:9
  • Description: The migration creates both continuous aggregates WITH NO DATA, only schedules refreshes over the last 1 day / 7 days, and then immediately shrinks raw miner_state_snapshots retention from 1 year to 30 days. On an environment upgraded on April 22, 2026 that already has January-March snapshot history, buckets older than those refresh lookbacks are never materialized into the new views before raw chunks older than March 23, 2026 become eligible for deletion.
  • Impact: Existing 30d/90d/1y uptime history can be silently truncated or lost after the retention job runs. The application path just returns empty uptime counts, so this will look like missing chart data rather than a failed migration.
  • Recommendation: Explicitly backfill both aggregates across the full existing raw retention window with refresh_continuous_aggregate before reducing raw retention, or create them with data if the migration window can tolerate it. If needed, split the retention reduction into a later migration after backfill completion is verified.
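The recommended backfill could look like the fragment below; `refresh_continuous_aggregate` is the standard TimescaleDB procedure, and the view names follow this PR, but the exact window bounds are illustrative:

```sql
-- Reviewer-suggested fix: materialize the full existing history into both
-- aggregates (NULL start = from the beginning of the data) BEFORE the raw
-- retention shrinks, so old chunks can be purged without losing history.
CALL refresh_continuous_aggregate('miner_state_snapshots_hourly', NULL, now() - INTERVAL '1 hour');
CALL refresh_continuous_aggregate('miner_state_snapshots_daily',  NULL, now() - INTERVAL '1 day');
```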

[MEDIUM] New uptime routing chooses the source table by window width only, so historical short-range queries can hit already-purged data

  • Category: Reliability
  • Location: server/internal/infrastructure/timescaledb/telemetry_store.go:1295
  • Description: selectUptimeDataSource looks only at end-start. After this PR, raw snapshots retain 30 days and hourly rollups retain 3 months. As of April 22, 2026, a 12-hour query for February 20, 2026 still routes to raw, and a 5-day query for January 1-5, 2026 routes to hourly, even though those sources are already outside retention. miner_state_snapshots_daily still retains 3 years, but this code never falls back to it.
  • Impact: Historical uptime charts for short windows can come back empty even when the backend still has enough daily data to answer at a coarser resolution.
  • Recommendation: Make source selection retention-aware: choose the highest-resolution source whose retention fully covers the requested absolute time range, or cascade raw -> hourly -> daily when the preferred source is out of retention. Add tests around the 30-day and 90-day cutoffs.

Notes

  • Review scope was limited to .git/codex-review.diff.
  • The diff does not touch auth/JWT, command execution, network discovery, plugin boundaries, frontend rendering, or pool/wallet configuration. I did not find any cryptostealing or pool-hijack changes in scope.
  • The new test coverage only checks duration thresholds. It does not exercise migration backfill, retention cutover, or historical-range routing.
  • I could not run the targeted Go tests in this sandbox because the filesystem is read-only and Go could not create its module/cache directories.
  • Finding 1 assumes this migration can land on environments that already contain miner_state_snapshots older than the policy lookback windows; if this feature has never been deployed with live history, the impact is lower.

Generated by Codex Security Review |
Triggered by: @ankitgoswami |
Review workflow run

ankitgoswami and others added 2 commits April 22, 2026 11:12
Follow the existing CAGG routing idiom (selectDataSource + String + switch)
for the raw-vs-hourly decision on miner_state_snapshots. Makes the two
data-source routers in this file visually parallel and extends cleanly if
a future daily rollup is added.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extends the miner_state_snapshots rollup story to match the two-tier CAGG
pattern used by device_metrics and device_status:

- New miner_state_snapshots_daily CAGG sourced from raw (no CAGG-on-CAGG;
  matches the codebase convention). Policies mirror device_metrics_daily:
  start_offset=7d, end_offset=1d, schedule=6h, compression=7d, retention=3y.
- Hourly CAGG retention adjusted to 3 months to match device_metrics_hourly.
- Router rewritten to duration-based selectUptimeDataSource, mirroring
  selectDataSource (<=24h raw, <=10d hourly, >10d daily). Drops the
  start-time-age consts I had; the new shape is symmetric with the other
  router a few lines above in the same file.
- New GetMinerStateSnapshotsDaily sqlc query + sqlc.yaml column overrides.
- queryUptimeDaily parallels queryUptimeHourly; switch-dispatch clamps
  bucket duration per source (1 min / 1 h / 1 d).

Migration file renamed to 000034_miner_state_snapshots_rollups.{up,down}.sql
since it now owns both CAGGs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>