
[SAP] Implement graceful shutdown for cinder services #314

Open: hemna wants to merge 1 commit into stable/2023.1-m3 from graceful-shutdown

[SAP] Implement graceful shutdown for cinder services#314
hemna wants to merge 1 commit into
stable/2023.1-m3from
graceful-shutdown

Conversation

hemna commented Feb 20, 2026

Graceful Shutdown for Cinder Volume & Backup Services

Implements graceful shutdown that allows in-flight volume and backup operations to complete before the pod exits during Kubernetes rolling updates. Covers both cinder-volume and cinder-backup services.

How It Works

Phase 1 — Skip consumer cancel:

  • We intentionally do NOT send Basic.Cancel or call conn.stop_consuming(). Doing so causes eventlet socket races ("simultaneous read on fileno") that disrupt outbound HTTP/RPC connections used by in-flight operations (e.g., Swift reads during backup restore).
  • Instead, we rely solely on pool.waitall() in Phase 2.
  • The _runner greenthread remains blocked in drain_events() at 0% CPU — harmless.
  • The scheduler stops routing new work within service_down_time.

Phase 2 — Wait for in-flight operations:

  • GreenPool.waitall() blocks until all RPC handler greenthreads finish
  • Worker entry heartbeat keeps entries fresh (prevents new pod cleanup interference)
  • Heartbeats continue (service stays "up" in DB)

Phase 3 — Clean exit:

  • Skip rpcserver.stop()/rpcserver.wait() (hangs on dead AMQP socket)
  • Process exits cleanly after stop() returns (see the sketch below)
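
Taken together, the three phases amount to a stop() of roughly the following shape. This is a minimal sketch of the flow described above, not the actual cinder/service.py code; attribute names such as _stop_semaphore, _draining, and _pool are assumptions.

```python
import eventlet
from eventlet import semaphore


class Service:
    """Sketch only: names are illustrative, not real cinder internals."""

    def __init__(self):
        self._stop_semaphore = semaphore.Semaphore()
        self._draining = False
        self._pool = eventlet.GreenPool()

    def stop(self):
        with self._stop_semaphore:   # semaphore guard: one stop() at a time
            if self._draining:
                return
            self._draining = True

        # Phase 1: deliberately no Basic.Cancel / conn.stop_consuming();
        # cancelling consumers races with eventlet sockets used by
        # in-flight operations. The _runner greenthread stays parked in
        # drain_events() until the process exits.

        # Phase 2: block until every in-flight RPC handler greenthread
        # in the GreenPool has finished its operation.
        self._pool.waitall()

        # Phase 3: skip rpcserver.stop()/rpcserver.wait(), which would
        # hang on the dead AMQP socket; returning here lets the process
        # exit cleanly.
```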

Additional Mechanisms

  • Worker entry heartbeat (cinder/objects/cleanable.py): set_workers decorator spawns a greenthread that touches worker DB entries every 10s during operations. Prevents new pod's init_host_do_cleanup from resetting in-flight volumes to 'error' (sketched below).
  • do_cleanup freshness check (cinder/manager.py): Skips worker entries updated within service_down_time (60s). Only cleans up truly stale/crashed entries.
  • Backup restore heartbeat (cinder/backup/manager.py): Touches backup.updated_at every 10s during restore. Prevents new backup pod's init_host from resetting the backup status and triggering BackupRestoreCancel.
  • Backup _cleanup_one_backup freshness check: Skips backups in creating/restoring state if updated_at is recent.
  • Backup _detach_device no-reraise: If detach fails during shutdown (RPC timeout), log error but continue finalization. Data integrity preserved; dangling export cleaned up on next startup.
  • reject_if_draining decorator: Rejects new RPC calls during shutdown so the scheduler routes new work to healthy backends (sketched below).
  • Semaphore guard: Prevents concurrent stop() calls on same Service instance.
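
The heartbeat-plus-freshness pairing and the reject_if_draining decorator can be sketched as follows. This is an illustration under assumed names (_touch_periodically, _is_stale, and the placeholder exception are hypothetical); the real code in cinder/objects/cleanable.py, cinder/manager.py, and the backup manager differs in detail, and the same pattern applies analogously to backup.updated_at during restore.

```python
import functools

import eventlet
from oslo_utils import timeutils

HEARTBEAT_INTERVAL = 10   # seconds, per the description above
SERVICE_DOWN_TIME = 60    # seconds


def _touch_periodically(worker, done):
    """Hypothetical heartbeat: refresh worker.updated_at while the
    operation is in flight (done is an eventlet.event.Event)."""
    while not done.ready():
        worker.updated_at = timeutils.utcnow()
        worker.save()                # assumes an OVO-style save()
        eventlet.sleep(HEARTBEAT_INTERVAL)


def _is_stale(worker):
    """Freshness check: do_cleanup only touches entries whose
    heartbeat has lapsed for longer than service_down_time."""
    age = timeutils.utcnow() - worker.updated_at
    return age.total_seconds() > SERVICE_DOWN_TIME


def reject_if_draining(func):
    """Refuse new RPC work once drain has begun, so the scheduler
    retries on a healthy backend. The exception type here is a
    placeholder; cinder raises its own exception."""
    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        if getattr(self, '_draining', False):
            raise RuntimeError('service is draining')
        return func(self, *args, **kwargs)
    return wrapper
```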

Requirements (separate changes)

  • dumb-init --single-child on cinder-volume AND cinder-backup container commands — ensures ProcessLauncher parent waits for all children before exit
  • terminationGracePeriodSeconds: 900 on pod spec

Test Results (qa-de-1, 2026-05-14 to 2026-05-15)

All tests use artificial delays in the FCD driver (not committed) to keep fast operations in-flight long enough for the pod kill to catch them.

| # | Test | Operation | Kill Target | Result |
|---|------|-----------|-------------|--------|
| 1 | test_idle_shutdown | Clean exit | Volume pod | ✅ <1s |
| 2 | test_inflight_volume_create | Volume from image (16GB) | Volume pod | ✅ (41s to 8min) |
| 3 | test_inflight_backup | Backup to Swift | Backup pod | |
| 4 | test_scheduler_reroutes | New work during drain | Volume pod | |
| 5 | test_inflight_volume_delete | Delete (with 30s delay) | Volume pod | |
| 6 | test_inflight_volume_clone | Clone volume | Volume pod | |
| 7 | test_inflight_snapshot_create | Snapshot create | Volume pod | |
| 8 | test_inflight_snapshot_delete | Snapshot delete | Volume pod | |
| 9 | test_inflight_backup_kill_volume_pod | Backup (kill volume pod) | Volume pod | |
| 10 | test_inflight_volume_extend | Extend 16→32GB | Volume pod | |
| 11 | test_inflight_multiple_operations | 4 concurrent ops | Volume pod | |
| 12 | test_inflight_restore_kill_backup_pod | Backup restore | Backup pod | |
| 13 | test_inflight_migrate_same_vc | Migration (same vCenter) | Volume pod | |
| 14 | test_inflight_migrate_cross_vc | Migration (cross-datastore 16GB) | Volume pod | |

Key Finding: Cross-Datastore Migration

The cross-datastore migration test confirms that even operations initiated during drain (after SIGTERM) complete successfully:

  • Pod killed while in 30s pre-relocate delay
  • FCD RelocateVStorageObject_Task issued to vCenter 13s AFTER kill
  • 16GB copied between NFS datastores in ~14s (~1.1 GB/s, NetApp server-side copy)
  • Migration completed 29s after SIGTERM

Files Changed

| File | Purpose |
|------|---------|
| cinder/service.py | Three-phase shutdown, skip consumer cancel, pool.waitall |
| cinder/manager.py | do_cleanup freshness check for worker entries |
| cinder/objects/cleanable.py | Worker heartbeat greenthread in set_workers decorator |
| cinder/volume/manager.py | Direct flow execution, GS-DEBUG logging |
| cinder/volume/flows/manager/create_volume.py | CreateVolumeOnFinishTask unconditional write |
| cinder/backup/manager.py | Backup restore heartbeat + freshness check + no-reraise |
| cinder/opts.py | graceful_shutdown_timeout config option |
| cinder/tests/unit/test_manager.py | Unit tests |
| cinder/tests/unit/test_service.py | Unit tests |
| doc/source/admin/graceful-shutdown-race-condition.rst | Race condition documentation |
| sap-doc/graceful-shutdown-test-results.md | Test results |

No oslo.messaging source changes required

All changes are self-contained in cinder. The shutdown mechanism relies on pool.waitall() — no manipulation of oslo.messaging internals needed.
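
For reference, the primitive everything hinges on is plain eventlet: GreenPool.waitall() returns only once every spawned greenthread has finished, which is exactly the drain semantics Phase 2 needs. A standalone demo:

```python
import eventlet

pool = eventlet.GreenPool(size=4)
for i in range(4):
    # eventlet.sleep stands in for long-running RPC handler greenthreads
    pool.spawn(eventlet.sleep, i * 0.1)
pool.waitall()   # returns only after all in-flight "operations" complete
print("drained")
```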

Debugging Findings

Full debugging notes: https://github.wdf.sap.corp/gist/09ac921da78820047bdea06651b32205

hemna force-pushed the graceful-shutdown branch 3 times, most recently from 5a72074 to 1f69e00 on February 23, 2026 14:24
hemna force-pushed the graceful-shutdown branch 3 times, most recently from 4c183ec to d331087 on April 30, 2026 12:37
hemna force-pushed the graceful-shutdown branch 2 times, most recently from 1dd055a to d2fddd7 on May 14, 2026 22:25
hemna changed the title from "[SAP] Try graceful shutdown" to "[SAP] Implement graceful shutdown for cinder services" on May 14, 2026
hemna force-pushed the graceful-shutdown branch 4 times, most recently from a235a58 to b77438d on May 14, 2026 22:41
Scsabiii previously approved these changes May 15, 2026

Three-phase graceful shutdown that allows in-flight volume and backup
operations to complete before the pod exits during Kubernetes rolling
updates. Covers both cinder-volume and cinder-backup services.

Phase 1: Skip consumer cancel (relying on pool.waitall in Phase 2).
         Previous approach (Basic.Cancel) caused eventlet socket races
         that disrupted outbound HTTP/RPC connections during drain.
         The scheduler stops routing new work within service_down_time.

Phase 2: Block in pool.waitall() until all in-flight RPC handler
         greenthreads in the GreenPool complete their operations.

Phase 3: Skip rpcserver.stop()/wait() (hangs on dead AMQP socket).
         Process exits cleanly after stop() returns.

Additional mechanisms:
- Worker entry heartbeat in set_workers decorator: touches worker DB
  entries every 10s during operations, preventing new pod init_host
  _do_cleanup from resetting in-flight volumes to error.
- do_cleanup freshness check: skips worker entries updated within
  service_down_time (60s), only cleans up truly stale/crashed entries.
- Backup restore heartbeat: touches backup.updated_at every 10s during
  restore, preventing new backup pod init_host from resetting the
  backup status and triggering BackupRestoreCancel.
- Backup _cleanup_one_backup freshness check: skips backups in
  creating/restoring state if updated_at is recent.
- Backup _detach_device no-reraise: if detach fails during shutdown
  (RPC timeout), log error but continue finalization. Data integrity
  is preserved; dangling export cleaned up on next startup.
- Semaphore guard: prevents concurrent stop() calls on same Service.
- Heartbeat continues during drain: service stays up in DB.
- reject_if_draining decorator: rejects new RPC calls during shutdown
  so scheduler routes to healthy backends.

Requires:
- dumb-init --single-child (Helm chart change in separate commit)
- terminationGracePeriodSeconds: 900 on pod spec

Tested operations surviving pod termination (qa-de-1):
- Volume create from image (41s to 8min drains)
- Volume delete (with driver delay)
- Volume extend (16->32GB with driver delay)
- Volume clone
- Snapshot create, snapshot delete
- Multiple concurrent operations (4 ops on same pod)
- Backup create (kill backup pod during Swift upload)
- Backup restore (kill backup pod during data transfer)
- Backup (kill volume pod during snapshot prep)
- Migration same-vCenter (vc-a-0 -> vc-a-1, metadata re-home)
- Migration cross-datastore (16GB FCD relocate between NFS datastores)
- Scheduler rerouting during drain
- Idle shutdown (clean exit <1s)

Change-Id: Icdd28affc73fd34491b656a68410dce8e46264d4
hemna force-pushed the graceful-shutdown branch from 81fe034 to 399ad35 on May 15, 2026 22:23