feat(file/cache): emit OTEL metrics#90
Open
jpugliesi wants to merge 3 commits into
Open
Conversation
…19 rollout signals The file cache had no OTEL telemetry — only slog log lines on failure. That makes it hard to monitor the rollout of the FIRE-2419 ENOSPC fix (#55) on running services: "did the proactive evict_until_fits path actually run" and "did the reactive evict_for_space path become unnecessary" were not directly observable. Add an optional `WithMeterProvider` constructor option (modelled on `storage.CacheMeterProvider`) and register observable gauges + counters: - `storage.file_cache.size.bytes` (gauge) — committed LRU size - `storage.file_cache.limit.bytes` (gauge) — configured limit - `storage.file_cache.entries` (gauge) — LRU entry count - `storage.file_cache.inflight.bytes` (gauge) — in-flight write bytes - `storage.file_cache.hits` (counter) - `storage.file_cache.misses` (counter) - `storage.file_cache.evictions` (counter) - `storage.file_cache.evict_until_fits.invocations` (counter) Proactive path from PR #55. Should be non-zero post-rollout under concurrent-write load. - `storage.file_cache.evict_for_space.invocations` (counter) Reactive path that pre-dates PR #55. Should trend toward zero as the proactive path takes over. - `storage.file_cache.write_errors{kind=enospc|other}` (counter) End-goal user-visible signal: enospc count should drop to ~0. Internal atomic counters on Cache tick whether or not a MeterProvider is attached, so the same data is also reachable via `Cache.Stat()` for tests. Also populate `CacheStat.Entries` in `Cache.Stat()` — the field already existed in the type but the file cache wasn't filling it in. NewCache is now `NewCache(tmpdir, size, opts ...Option)`; existing call sites (no options) continue to compile unchanged.
- Use the same instrumentation name ("github.com/firetiger-oss/tigerblock/storage")
so the file-cache metrics merge under one OTEL scope with the in-memory
cache and dashboards can query the union.
- Use the existing `storage.cache.*` metric names with `storage.cache.kind=file`
to differentiate from the in-memory kinds (`objects`, `infos`, `pages`).
Shared metrics (size.bytes, limit.bytes, entries, hits, misses, evictions)
reuse the same instrument; file-only metrics (inflight.bytes,
evict_until_fits.invocations, evict_for_space.invocations, write_errors)
live in the same `storage.cache.*` namespace and only emit with kind=file.
- Rename the write_errors discriminator from `kind=enospc|other` to
`error.type=enospc|other` (OTEL semconv) so it doesn't shadow the existing
`storage.cache.kind` attribute.
…ce of truth The OTEL callback was reading the basic LRU counters from Cache.Stat but the FIRE-2419-specific counters (inflight, evict_until_fits, evict_for_space, write_errors) directly from atomic fields on the Cache struct. That meant two ways for an observer to see cache state, which would diverge as the struct grew. Push everything through Cache.Stat: - Extend storage.CacheStat with InFlight, EvictUntilFitsCount, EvictForSpaceCount, WriteErrorsENOSPC, WriteErrorsOther. The in-memory cache leaves these zero so the type stays shared. - file.Cache.Stat populates all five. - The OTEL callback reads only from Stat — no direct atomic loads. - storage.Cache.Stat switches from a type conversion to an explicit cacheStatFromLRU helper now that storage.CacheStat and cache.Stat have diverging shapes. - Test asserts EvictUntilFitsCount via Stat instead of the unexported atomic field.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds optional OTEL metrics to
file.Cache, primarily to monitor the FIRE-2419 fix (#55) rollout. Without these, the file cache had no OTEL signal — only slog lines on failure.All values flow through
Cache.Stat()— single source of truth, no direct atomic reads in the OTEL callback.storage.CacheStatgrows five new fields (InFlight,EvictUntilFitsCount,EvictForSpaceCount,WriteErrorsENOSPC,WriteErrorsOther); the in-memory cache leaves them zero.Wire-up:
Metric names mirror
storage/cache_metrics.goso the file cache merges under one OTEL scope, differentiated by the existingstorage.cache.kindattribute (objects/infos/pagesfor in-memory,filefor this).Metrics
All emit with
storage.cache.id(per-instance) andstorage.cache.kind="file".storage.cache.size.bytesstorage.cache.limit.bytesstorage.cache.entriesstorage.cache.inflight.bytesstorage.cache.hits/.misses/.evictionsstorage.cache.evict_until_fits.invocationsstorage.cache.evict_for_space.invocationsstorage.cache.write_errors{error.type=enospc|other}enospcshould drop to ~0Test plan
go test -race ./storage/file/...— includes newTestCacheMetricsExposeFireFix(20 concurrent gets, assertsevict_until_fits.invocations > 0)TestCacheWithoutMeterProvider— counters tick (viaStat()) even with no MeterProvider attached./...clean except pre-existing flake instorage/gs/gsclient🤖 Generated with Claude Code