Skip to content

ABN-382/feat: Redis sidecar for cache, page_cache, and session#11

Open
dgjlindsay wants to merge 1 commit into
brtkwr:mainfrom
dgjlindsay:doug/ABN-382-redis-cache-backend
Open

ABN-382/feat: Redis sidecar for cache, page_cache, and session#11
dgjlindsay wants to merge 1 commit into
brtkwr:mainfrom
dgjlindsay:doug/ABN-382-redis-cache-backend

Conversation

@dgjlindsay
Copy link
Copy Markdown

Why

Today on magento.staging.achterafbetalen.co/admin we hit:

Fatal error: Uncaught Exception: Warning: SessionHandler::write(): Write failed: No space left on device (28)

/var/www/html/var PVC (9.8G) was 100% full. Of which var/page_cache/ alone was 9.0G of file-backed Magento full-page-cache entries.

Root cause: this chart never sets cache.frontend.default, cache.frontend.page_cache, or session backends in app/etc/env.php. Magento defaults all three to filesystem, and the file backend has no eviction policy — entries accumulate until the PVC fills, sessions start failing, admin returns 500s. Affects every release of the chart (both magento and magento-abn deployments hit this).

What

  • Redis as a native sidecar in the magento pod — matches the existing DB / OpenSearch sidecar pattern (restartPolicy: Always init container, K8s 1.28+). Localhost-only, no Service, no persistence. Sized small: 256MB maxmemory with allkeys-lru eviction.
  • init-setup.sh wiring via bin/magento setup:config:set. Idempotent, runs on both upgrade and fresh-install paths, before the setup:upgrade that bootstraps the cache layer.
  • Three Redis DBs split between cache.default / cache.page_cache / session — flushing one doesn't blow the others away.
  • Bash /dev/tcp readiness probe rather than redis-cli (not in the magento image).
  • Gated by redis.enabled (default true). When false, chart renders the prior filesystem behaviour — escape hatch for anyone deploying outside K8s 1.28+ or wanting the old shape.

Verification

  • helm lint charts/magento — clean
  • helm template ... --set redis.enabled=true — sidecar + setup:config:set lines present
  • helm template ... --set redis.enabled=false — no Redis container, no configure_redis_backends calls

Once merged, first pod restart re-initialises env.php with Redis backends; var/page_cache stays empty and the PVC stops filling.

Context

  • Linear: ABN-382
  • Workaround applied today: find /var/www/html/var/page_cache -mindepth 1 -delete on the live pod (PVC back to 8% used, admin reachable). This recurs without the permanent fix in this PR.

Out of scope

  • Production Redis topology (managed Memorystore, sentinel, persistence)
  • Backfilling existing PVCs

🤖 Generated with Claude Code

Without this the chart leaves env.php cache backends unset, so Magento
defaults to filesystem with no eviction policy. var/page_cache then
fills the PVC over time and SessionHandler::write() starts failing —
admin 500s, recently observed on staging.

Wires Redis as a native sidecar (matching the DB / OpenSearch pattern,
K8s 1.28+ restartPolicy: Always init container) and configures env.php
in init-setup.sh via setup:config:set, on both the upgrade and the
fresh-install paths.

Localhost-only, no Service, ephemeral. Cache hot-set capped at 256MB
with allkeys-lru eviction. Three Redis DBs split between cache.default,
cache.page_cache, and session so flushing one doesn't blow the others.

Gated by redis.enabled (default true) — chart still renders the prior
filesystem-backed behaviour when disabled.

ABN-382

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Owner

@brtkwr brtkwr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

review notes — not blockers, mostly polish + one operational gotcha worth knowing about

# Wire Redis backends into the freshly-written env.php before any
# subsequent mage command (setup:upgrade, di:compile, config:set)
# bootstraps the cache layer.
configure_redis_backends
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

inconsistent error handling between paths — the upgrade branch above (line 169) wraps this in || { restore env.php.bak; exit 1 }, but the fresh-install path here is a bare call. if setup:config:set ever errors mid-install (malformed values, redis genuinely unreachable past the wait loop), you get a half-configured env.php with no fallback. either both should fail-fast cleanly or both should restore

# Wire Redis backends into the freshly-written env.php before any
# subsequent mage command (setup:upgrade, di:compile, config:set)
# bootstraps the cache layer.
configure_redis_backends
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: on the fresh-install path setup:install runs before this with filesystem-backed cache (since env.php was just written without redis backends), then we reconfigure to redis. anything setup:install populates in cache is stranded on the unused filesystem path. probably zero impact in practice but a cache:flush after configure_redis_backends would be cleaner. better still: write env.php with redis backends from the start so install itself uses redis

--session-save-redis-host={{ .Values.redis.host }} \
--session-save-redis-port={{ .Values.redis.port }} \
--session-save-redis-db={{ .Values.redis.databases.session }} \
--session-save-redis-log-level=3
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hardcoded — worth lifting to redis.session.logLevel in values for consistency with the rest of the redis config block

# Backs cache.default, cache.page_cache, and session — wired in
# init-setup.sh via setup:config:set. No persistence, no Service:
# cache is ephemeral and only reachable from this pod's localhost.
- name: redis
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sidecar ordering — native sidecars start in declaration order, and each subsequent regular initContainer waits for prior sidecars to be ready. redis is declared after opensearch (which is also a sidecar), so redis startup is gated behind opensearch becoming ready. they're independent — moving redis above opensearch (or at least not after it) avoids redis being delayed when opensearch is slow

enabled: true
image: redis
tag: "7.4-alpine"
host: 127.0.0.1
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

having host as a value implies external redis is supported, but there's no password or tls knob — so anyone pointing this at managed redis (memorystore, elasticache) has no way to authenticate. either lock the chart to localhost-only (drop host, hardcode 127.0.0.1 in the template) or add redis.password / redis.tls.enabled for honesty. happy with localhost-only as the design — just make the constraint explicit

port: 6379
# Eviction. allkeys-lru drops the least-recently-used keys when memory is
# exhausted. Avoids a wedged pod if the cache hot set ever exceeds maxmemory.
maxmemory: "256mb"
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

256mb is fine for dev/staging but will churn through LRU evictions on a real storefront — FPC entries pile up fast (one per url × customer-group × currency × store-view). worth either bumping the default or noting in the comment that prod workloads should size up. small deployments can always override down


# Redis sidecar — backs cache.default, cache.page_cache, and session.
# Without this, Magento defaults to filesystem backends with no eviction
# policy, and var/page_cache fills the PVC over time → SessionHandler
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

operational gotcha worth flagging in the PR description (or as a one-time init step): existing deployments switching redis.enabled from false → true keep all the old var/page_cache/, var/cache/, var/session/ files lying around forever — magento just stops writing there. the PVC pollution that motivated this PR persists until someone manually cleans up. could be a one-shot rm -rf var/page_cache/* var/cache/* var/session/* in init-setup gated on first redis-enabled boot, or just a documented runbook step

- {{ .Values.redis.maxmemoryPolicy | quote }}
- --save
- ""
- --appendonly
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

worth making persistence configurable? agreed that ephemeral is the right default — adding a PVC by default would just reintroduce the unbounded-growth failure mode on a different volume, and cache.default / cache.page_cache are regenerable so losing them on pod restart is a non-event.

the one case that hurts is session — every pod restart logs everyone out, including admins mid-task. not a regression vs the pre-PR filesystem behaviour (sessions were already on the same PVC that died), but if you ever wanted session continuity across restarts, a redis.persistence.enabled knob (defaulting false) would let users opt in with a small PVC sized just for AOF. they'd still own the sizing risk, but at least the option exists without forking the chart.

fine to defer to a follow-up PR — just flagging that the current shape leaves no escape hatch short of pointing redis.host at external redis (which itself isn't fully wired, see the password/tls comment above)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants