Skip to content

feat(file/cache): emit OTEL metrics#90

Open
jpugliesi wants to merge 3 commits into
mainfrom
john/file-cache-otel-metrics
Open

feat(file/cache): emit OTEL metrics#90
jpugliesi wants to merge 3 commits into
mainfrom
john/file-cache-otel-metrics

Conversation

@jpugliesi

@jpugliesi jpugliesi commented May 18, 2026

Copy link
Copy Markdown
Contributor

Summary

Adds optional OTEL metrics to file.Cache, primarily to monitor the FIRE-2419 fix (#55) rollout. Without these, the file cache had no OTEL signal — only slog lines on failure.

All values flow through Cache.Stat() — single source of truth, no direct atomic reads in the OTEL callback. storage.CacheStat grows five new fields (InFlight, EvictUntilFitsCount, EvictForSpaceCount, WriteErrorsENOSPC, WriteErrorsOther); the in-memory cache leaves them zero.

Wire-up:

file.NewCache(dir, size, file.WithMeterProvider(provider))

Metric names mirror storage/cache_metrics.go so the file cache merges under one OTEL scope, differentiated by the existing storage.cache.kind attribute (objects/infos/pages for in-memory, file for this).

Metrics

All emit with storage.cache.id (per-instance) and storage.cache.kind="file".

Metric Type Notes
storage.cache.size.bytes gauge Committed LRU size
storage.cache.limit.bytes gauge Configured limit
storage.cache.entries gauge LRU entry count
storage.cache.inflight.bytes gauge In-progress write bytes
storage.cache.hits / .misses / .evictions counters
storage.cache.evict_until_fits.invocations counter Proactive path from #55 — non-zero = back-pressure active
storage.cache.evict_for_space.invocations counter Reactive (post-ENOSPC) path — should trend to zero
storage.cache.write_errors{error.type=enospc|other} counter End-goal signal — enospc should drop to ~0

Test plan

  • go test -race ./storage/file/... — includes new TestCacheMetricsExposeFireFix (20 concurrent gets, asserts evict_until_fits.invocations > 0)
  • TestCacheWithoutMeterProvider — counters tick (via Stat()) even with no MeterProvider attached
  • Full ./... clean except pre-existing flake in storage/gs/gsclient

🤖 Generated with Claude Code

jpugliesi added 3 commits May 18, 2026 11:56
…19 rollout signals

The file cache had no OTEL telemetry — only slog log lines on failure. That
makes it hard to monitor the rollout of the FIRE-2419 ENOSPC fix (#55) on
running services: "did the proactive evict_until_fits path actually run"
and "did the reactive evict_for_space path become unnecessary" were not
directly observable.

Add an optional `WithMeterProvider` constructor option (modelled on
`storage.CacheMeterProvider`) and register observable gauges + counters:

- `storage.file_cache.size.bytes` (gauge)              — committed LRU size
- `storage.file_cache.limit.bytes` (gauge)             — configured limit
- `storage.file_cache.entries` (gauge)                 — LRU entry count
- `storage.file_cache.inflight.bytes` (gauge)          — in-flight write bytes
- `storage.file_cache.hits` (counter)
- `storage.file_cache.misses` (counter)
- `storage.file_cache.evictions` (counter)
- `storage.file_cache.evict_until_fits.invocations` (counter)
    Proactive path from PR #55. Should be non-zero post-rollout under
    concurrent-write load.
- `storage.file_cache.evict_for_space.invocations` (counter)
    Reactive path that pre-dates PR #55. Should trend toward zero as the
    proactive path takes over.
- `storage.file_cache.write_errors{kind=enospc|other}` (counter)
    End-goal user-visible signal: enospc count should drop to ~0.

Internal atomic counters on Cache tick whether or not a MeterProvider is
attached, so the same data is also reachable via `Cache.Stat()` for tests.

Also populate `CacheStat.Entries` in `Cache.Stat()` — the field already
existed in the type but the file cache wasn't filling it in.

NewCache is now `NewCache(tmpdir, size, opts ...Option)`; existing call
sites (no options) continue to compile unchanged.
- Use the same instrumentation name ("github.com/firetiger-oss/tigerblock/storage")
  so the file-cache metrics merge under one OTEL scope with the in-memory
  cache and dashboards can query the union.
- Use the existing `storage.cache.*` metric names with `storage.cache.kind=file`
  to differentiate from the in-memory kinds (`objects`, `infos`, `pages`).
  Shared metrics (size.bytes, limit.bytes, entries, hits, misses, evictions)
  reuse the same instrument; file-only metrics (inflight.bytes,
  evict_until_fits.invocations, evict_for_space.invocations, write_errors)
  live in the same `storage.cache.*` namespace and only emit with kind=file.
- Rename the write_errors discriminator from `kind=enospc|other` to
  `error.type=enospc|other` (OTEL semconv) so it doesn't shadow the existing
  `storage.cache.kind` attribute.
…ce of truth

The OTEL callback was reading the basic LRU counters from Cache.Stat but
the FIRE-2419-specific counters (inflight, evict_until_fits, evict_for_space,
write_errors) directly from atomic fields on the Cache struct. That meant
two ways for an observer to see cache state, which would diverge as the
struct grew.

Push everything through Cache.Stat:
- Extend storage.CacheStat with InFlight, EvictUntilFitsCount,
  EvictForSpaceCount, WriteErrorsENOSPC, WriteErrorsOther. The in-memory
  cache leaves these zero so the type stays shared.
- file.Cache.Stat populates all five.
- The OTEL callback reads only from Stat — no direct atomic loads.
- storage.Cache.Stat switches from a type conversion to an explicit
  cacheStatFromLRU helper now that storage.CacheStat and cache.Stat have
  diverging shapes.
- Test asserts EvictUntilFitsCount via Stat instead of the unexported
  atomic field.
@jpugliesi jpugliesi requested a review from achille-roussel May 18, 2026 21:38
@jpugliesi jpugliesi changed the title feat(file/cache): emit OTEL metrics, including FIRE-2419 rollout signals feat(file/cache): emit OTEL metrics May 18, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant