Add Valkey memory limits and analysis tooling #701

Open
majamassarini wants to merge 1 commit into packit:main from majamassarini:prevent-valkey-filling-up
@majamassarini
Member

Fix packit/packit-service#2983

Problem:
The Valkey PVC kept filling up (1Gi -> 2Gi -> 4Gi) because orphaned Celery
pidbox reply queues accumulated without a TTL. When the disk filled, the
Packit stack became stuck with "No space left on device" errors.

Root cause analysis:
- 1,693 *.reply.celery.pidbox keys with no expiry (TTL = -1)
- These are worker control queues that should be temporary
- Orphaned when workers crash/restart improperly
- No maxmemory limits, so memory/disk could grow unbounded
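
A read-only check for the keys described above could look like the following minimal sketch. It is not the code from this PR. `InMemoryClient` and the key names are hypothetical stand-ins so the example runs without a server. It assumes a client exposing `scan_iter(match=...)` and `ttl(name)` as redis-py does, where `ttl` returns -1 for a key with no expiry.

```python
from fnmatch import fnmatchcase
from typing import Dict, List

PIDBOX_PATTERN = "*.reply.celery.pidbox"


class InMemoryClient:
    """Hypothetical stand-in exposing the two read-only calls used below."""

    def __init__(self, ttls: Dict[str, int]) -> None:
        self._ttls = ttls  # key name -> TTL in seconds (-1 means no expiry)

    def scan_iter(self, match: str = "*") -> List[str]:
        # Glob-style matching, similar in spirit to SCAN ... MATCH.
        return [key for key in self._ttls if fnmatchcase(key, match)]

    def ttl(self, name: str) -> int:
        return self._ttls.get(name, -2)  # -2 mirrors "key does not exist"


def keys_without_ttl(client, pattern: str = PIDBOX_PATTERN) -> List[str]:
    """Return keys matching `pattern` that have no expiry (TTL == -1)."""
    return [key for key in client.scan_iter(match=pattern) if client.ttl(key) == -1]


# Made-up sample data: one orphaned reply queue, one expiring, one unrelated key.
client = InMemoryClient({
    "1a2b.reply.celery.pidbox": -1,
    "3c4d.reply.celery.pidbox": 3600,
    "celery": -1,
})
orphaned = keys_without_ttl(client)
print(len(orphaned), orphaned)
```

Against a real instance the same function should accept a redis-py client created with `decode_responses=True`; iterating with SCAN rather than KEYS avoids blocking the server on a large keyspace.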

Changes:
1. Configure Valkey with memory limits (configmap-redis_like_config.yml):
   - maxmemory: 3670mb (~90% of 4Gi pod limit)
   - maxmemory-policy: volatile-lru (safest - only evicts keys with TTL)
   - Prevents unbounded memory/disk growth
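
The headroom arithmetic can be checked with a short sketch. It assumes the `mb` suffix in the Valkey config means 1024*1024 bytes and that the 4Gi pod limit is 4096 MiB; the helper names are illustrative only.

```python
POD_LIMIT_MIB = 4 * 1024  # 4Gi expressed in MiB


def share_of_limit(maxmemory_mib: int, limit_mib: int = POD_LIMIT_MIB) -> float:
    """Fraction of the pod limit taken by a given maxmemory value."""
    return maxmemory_mib / limit_mib


def maxmemory_for(fraction: float, limit_mib: int = POD_LIMIT_MIB) -> int:
    """maxmemory in MiB that corresponds to a fraction of the pod limit."""
    return int(limit_mib * fraction)


print(f"3670mb is {share_of_limit(3670):.1%} of the limit")
print(f"87.5% of the limit is {maxmemory_for(0.875)}mb")
```

Under those assumptions 3670mb is closer to 90% of the limit, while an exact 87.5% would be 3584mb. Either value leaves headroom below the pod limit. Note also that `volatile-lru` only evicts keys that have a TTL, so keys with TTL = -1 still count toward `maxmemory` but are never evicted.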

2. Add Valkey analysis script (scripts/analyze_valkey.sh):
   - Comprehensive data analysis tool
   - Identifies orphaned keys, disk usage, memory stats
   - Scans for Celery patterns and TTL distribution
   - Provides actionable recommendations
   - Safe to run on production (read-only operations)
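
The TTL distribution step can be illustrated in Python as follows. This is not the shell script from this PR; the bucket boundaries and function names are assumptions chosen for the example, and the sample TTL values are made up.

```python
from collections import Counter
from typing import Iterable


def ttl_bucket(ttl_seconds: int) -> str:
    """Map a TTL reply (in seconds) to a coarse bucket label."""
    if ttl_seconds == -2:
        return "missing"      # key does not exist
    if ttl_seconds == -1:
        return "no expiry"    # key exists but has no TTL
    if ttl_seconds < 3600:
        return "< 1h"
    if ttl_seconds < 86400:
        return "< 24h"
    return ">= 24h"


def ttl_distribution(ttls: Iterable[int]) -> Counter:
    """Count how many keys fall into each TTL bucket."""
    return Counter(ttl_bucket(ttl) for ttl in ttls)


sample = [-1, -1, 30, 7200, 90000]
for label, count in ttl_distribution(sample).most_common():
    print(f"{label:>10}: {count}")
```

A large "no expiry" bucket is the signal that matters here, since those keys are exempt from `volatile-lru` eviction.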

Additional fix (separate PR in packit-service):
- Celery beat task to set 24-hour TTL on orphaned pidbox keys
- Prometheus metric to track total Redis keys over time

Assisted-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@majamassarini force-pushed the prevent-valkey-filling-up branch from 52b8e14 to 6d652f8 on April 1, 2026 06:56
@centosinfra-prod-github-app
Contributor

majamassarini added a commit to majamassarini/packit-service that referenced this pull request Apr 1, 2026
Problem:
Celery workers create pidbox (control) reply queues for worker management
commands (inspect, ping, stats, etc.). These queues accumulate when workers
crash or restart improperly, leading to:
- 1,693+ orphaned *.reply.celery.pidbox keys in production
- Keys with no TTL (TTL = -1) that persist indefinitely

Root cause:
Celery's Redis transport does not provide a native way to set TTL on pidbox
reply queues when they're created. These are internal implementation details
of Celery's broadcast/control mechanism, and there's no configuration option
to automatically expire them.

Solution: Heartbeat cleanup task
Since we cannot tell Celery to natively set TTL on pidbox messages, we
implement a periodic heartbeat task that:
- Runs nightly at 12:30 AM via Celery beat
- Scans for *.reply.celery.pidbox keys without TTL
- Sets 1-hour expiration on orphaned queues
- Tracks total Redis keys via Prometheus for monitoring
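
The cleanup steps listed above can be sketched as follows. This is not the task from the packit-service commit. `InMemoryClient` and the key names are hypothetical stand-ins so the example runs without a server. It assumes a client offering `scan_iter`, `ttl`, `expire` and `dbsize` as redis-py does. The Celery beat wiring and the Prometheus gauge are left out.

```python
from fnmatch import fnmatchcase
from typing import Dict, List, Tuple

PIDBOX_PATTERN = "*.reply.celery.pidbox"
ONE_HOUR = 3600


class InMemoryClient:
    """Hypothetical stand-in for the client calls used by the cleanup."""

    def __init__(self, ttls: Dict[str, int]) -> None:
        self._ttls = ttls  # key name -> TTL in seconds (-1 means no expiry)

    def scan_iter(self, match: str = "*") -> List[str]:
        return [key for key in self._ttls if fnmatchcase(key, match)]

    def ttl(self, name: str) -> int:
        return self._ttls.get(name, -2)

    def expire(self, name: str, time: int) -> bool:
        if name not in self._ttls:
            return False
        self._ttls[name] = time
        return True

    def dbsize(self) -> int:
        return len(self._ttls)


def expire_orphaned_keys(client, pattern: str = PIDBOX_PATTERN,
                         ttl_seconds: int = ONE_HOUR) -> Tuple[int, int]:
    """Set a TTL on matching keys that have none.

    Returns (keys_updated, total_keys); the total is the value a
    monitoring gauge could export.
    """
    updated = 0
    for key in client.scan_iter(match=pattern):
        if client.ttl(key) == -1:  # leave keys that already expire untouched
            client.expire(key, ttl_seconds)
            updated += 1
    return updated, client.dbsize()


# In a Celery app this function would be called from a task scheduled with
# celery.schedules.crontab(hour=0, minute=30) to run nightly at 12:30 AM.
client = InMemoryClient({
    "1a2b.reply.celery.pidbox": -1,
    "3c4d.reply.celery.pidbox": -1,
    "5e6f.reply.celery.pidbox": 120,
    "celery": -1,
})
print(expire_orphaned_keys(client))
```

Setting an expiry instead of deleting the keys gives a worker that is still waiting on a reply time to read it before the queue disappears.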

Related to: packit/deployment#701
Should fix: packit#2983

Assisted-By: Claude Sonnet 4.5 <noreply@anthropic.com>
majamassarini added a commit to majamassarini/packit-service that referenced this pull request Apr 1, 2026

Assisted-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Assisted-By: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Development

Successfully merging this pull request may close these issues.

valkey-pvc requires periodic increases

2 participants