Add Valkey memory limits and analysis tooling #701
Open
majamassarini wants to merge 1 commit into packit:main
Problem: Valkey PVC filled up (1Gi -> 2Gi -> 4Gi) due to orphaned Celery pidbox reply queues accumulating without TTL. When disk filled, Packit stack became stuck with "No space left on device" errors.

Root cause analysis:
- 1,693 *.reply.celery.pidbox keys with no expiry (TTL = -1)
- These are worker control queues that should be temporary
- Orphaned when workers crash/restart improperly
- No maxmemory limits, so memory/disk could grow unbounded

Changes:

1. Configure Valkey with memory limits (configmap-redis_like_config.yml):
   - maxmemory: 3670mb (~87.5% of 4Gi pod limit)
   - maxmemory-policy: volatile-lru (safest - only evicts keys with TTL)
   - Prevents unbounded memory/disk growth

2. Add Valkey analysis script (scripts/analyze_valkey.sh):
   - Comprehensive data analysis tool
   - Identifies orphaned keys, disk usage, memory stats
   - Scans for Celery patterns and TTL distribution
   - Provides actionable recommendations
   - Safe to run on production (read-only operations)

Additional fix (separate PR in packit-service):
- Celery beat task to set 24-hour TTL on orphaned pidbox keys
- Prometheus metric to track total Redis keys over time

Assisted-By: Claude Sonnet 4.5 <noreply@anthropic.com>
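The maxmemory figure above can be sanity-checked with a few lines of arithmetic. The sketch below is only an illustration, and the helper names are hypothetical. It assumes the Redis/Valkey config convention that `1mb` means 1024*1024 bytes and the Kubernetes convention that `1Gi` means 2^30 bytes. Under those assumptions, 87.5% of 4Gi works out to 3584mb, while 3670mb is closer to 89.6% of the limit, so it may be worth confirming which figure was intended.

```python
# Unit conventions assumed here: Redis/Valkey "mb" and Kubernetes "Mi"/"Gi" are binary units.
MIB = 1024 * 1024
GIB = 1024 * MIB


def maxmemory_mb(pod_limit_bytes: int, fraction: float) -> int:
    """Return a maxmemory value in 'mb' units for a given fraction of the pod limit."""
    return int(pod_limit_bytes * fraction) // MIB


def fraction_of_limit(maxmemory_mb_value: int, pod_limit_bytes: int) -> float:
    """Return which fraction of the pod limit a given maxmemory value uses."""
    return maxmemory_mb_value * MIB / pod_limit_bytes


pod_limit = 4 * GIB  # the 4Gi pod limit mentioned in the description
print(maxmemory_mb(pod_limit, 0.875))                # 3584
print(round(fraction_of_limit(3670, pod_limit), 3))  # 0.896
```

Either value leaves headroom below the pod limit. Note also that volatile-lru only considers keys that have a TTL, so the orphaned pidbox keys (TTL = -1) are not eviction candidates until something sets an expiry on them.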
52b8e14 to 6d652f8
Build succeeded.
✔️ pre-commit SUCCESS in 1m 34s
majamassarini added a commit to majamassarini/packit-service that referenced this pull request Apr 1, 2026

Problem: Celery workers create pidbox (control) reply queues for worker management commands (inspect, ping, stats, etc.). These queues accumulate when workers crash or restart improperly, leading to:
- 1,693+ orphaned *.reply.celery.pidbox keys in production
- Keys with no TTL (TTL = -1) that persist indefinitely

Root cause: Celery's Redis transport does not provide a native way to set TTL on pidbox reply queues when they're created. These are internal implementation details of Celery's broadcast/control mechanism, and there's no configuration option to automatically expire them.

Solution: Heartbeat cleanup task

Since we cannot tell Celery to natively set TTL on pidbox messages, we implement a periodic heartbeat task that:
- Runs nightly at 12:30 AM via Celery beat
- Scans for *.reply.celery.pidbox keys without TTL
- Sets 1-hour expiration on orphaned queues
- Tracks total Redis keys via Prometheus for monitoring

Related to: packit/deployment#701
Should fix: packit#2983

Assisted-By: Claude Sonnet 4.5 <noreply@anthropic.com>
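The scan-and-expire step described in this commit message could look roughly like the sketch below. It is not the code from the packit-service change. The function name, the `dry_run` flag and the returned counters are hypothetical, and the client is assumed to expose redis-py style `scan_iter`, `ttl` and `expire` methods. The Celery beat scheduling and the Prometheus metric are left out.

```python
from typing import Any, Dict

# Key pattern taken from the commit message.
PIDBOX_PATTERN = "*.reply.celery.pidbox"


def expire_orphaned_pidbox_keys(
    client: Any, ttl_seconds: int = 3600, dry_run: bool = False
) -> Dict[str, int]:
    """Set an expiry on pidbox reply keys that currently have none.

    `client` is assumed to behave like a redis-py client. With
    dry_run=True nothing is written, so the call only reports counts.
    """
    scanned = 0
    without_ttl = 0
    # SCAN iterates incrementally instead of blocking the server like KEYS would.
    for key in client.scan_iter(match=PIDBOX_PATTERN, count=500):
        scanned += 1
        # TTL returns -1 for a key that exists but has no expiry set.
        if client.ttl(key) == -1:
            without_ttl += 1
            if not dry_run:
                client.expire(key, ttl_seconds)
    return {"scanned": scanned, "without_ttl": without_ttl}
```

With redis-py installed, a client created via `redis.Redis.from_url(...)` could be passed in directly. Running with `dry_run=True` first gives a read-only count of affected keys before any expiry is applied.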
majamassarini added a commit to majamassarini/packit-service that referenced this pull request Apr 1, 2026

Problem: Celery workers create pidbox (control) reply queues for worker management commands (inspect, ping, stats, etc.). These queues accumulate when workers crash or restart improperly, leading to:
- 1,693+ orphaned *.reply.celery.pidbox keys in production
- Keys with no TTL (TTL = -1) that persist indefinitely

Root cause: Celery's Redis transport does not provide a native way to set TTL on pidbox reply queues when they're created. These are internal implementation details of Celery's broadcast/control mechanism, and there's no configuration option to automatically expire them.

Solution: Heartbeat cleanup task

Since we cannot tell Celery to natively set TTL on pidbox messages, we implement a periodic heartbeat task that:
- Runs nightly at 12:30 AM via Celery beat
- Scans for *.reply.celery.pidbox keys without TTL
- Sets 1-hour expiration on orphaned queues
- Tracks total Redis keys via Prometheus for monitoring

Related to: packit/deployment#701
Should fix: packit#2983

Assisted-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Assisted-By: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Fix packit/packit-service#2983