fix(drbd): create metadata during toggle-disk from diskless to diskful#474
fix(drbd): create metadata during toggle-disk from diskless to diskful#474kvaps wants to merge 1 commit intoLINBIT:masterfrom
Conversation
When a diskless resource is being converted to diskful via toggle-disk, several conditions prevented DRBD metadata from being created: 1. processChild() didn't process child layers for diskless resources during disk addition 2. detachVolumesIfNecessary() returned empty checkMetaData list for diskless resources 3. Metadata creation block was skipped for diskless resources 4. hasMetaData() assumed metadata exists when hasDisk() returns true, even during retry when storage exists but no metadata This caused toggle-disk to fail with "No valid meta data found" error from drbdadm adjust, especially for LUKS encrypted resources. Add DISK_ADD_REQUESTED/DISK_ADDING flag checks to: - Process child layers during toggle-disk - Include volumes in metadata check list - Enter metadata creation block - Force metadata verification during retry Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
## What this PR does Build piraeus-server (linstor-server) from source with custom patches: - **adjust-on-resfile-change.diff** — Use actual device path in res file during toggle-disk; fix LUKS data offset - Upstream: [#473](LINBIT/linstor-server#473), [#472](LINBIT/linstor-server#472) - **allow-toggle-disk-retry.diff** — Allow retry and cancellation of failed toggle-disk operations - Upstream: [#475](LINBIT/linstor-server#475) - **force-metadata-check-on-disk-add.diff** — Create metadata during toggle-disk from diskless to diskful - Upstream: [#474](LINBIT/linstor-server#474) - **skip-adjust-when-device-inaccessible.diff** — Skip DRBD adjust/res file regeneration when child layer device is inaccessible - Upstream: [#471](LINBIT/linstor-server#471) Also updates plunger-satellite script and values.yaml for the new build. ### Release note ```release-note [linstor] Build linstor-server with custom patches for improved disk handling ``` <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Added automatic DRBD stall detection and recovery, improving storage resync resilience without manual intervention. * Introduced configurable container image references via Helm values for streamlined deployment. * **Bug Fixes** * Enhanced disk toggle operations with retry and cancellation support for better error handling. * Improved metadata creation during disk state transitions. * Added device accessibility checks to prevent errors when underlying storage devices are unavailable. * Fixed LUKS encryption header sizing for consistent deployment across nodes. <sub>✏️ Tip: You can customize this high-level summary in your review settings.</sub> <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## What this PR does Build piraeus-server (linstor-server) from source with custom patches: - **adjust-on-resfile-change.diff** — Use actual device path in res file during toggle-disk; fix LUKS data offset - Upstream: [#473](LINBIT/linstor-server#473), [#472](LINBIT/linstor-server#472) - **allow-toggle-disk-retry.diff** — Allow retry and cancellation of failed toggle-disk operations - Upstream: [#475](LINBIT/linstor-server#475) - **force-metadata-check-on-disk-add.diff** — Create metadata during toggle-disk from diskless to diskful - Upstream: [#474](LINBIT/linstor-server#474) - **skip-adjust-when-device-inaccessible.diff** — Skip DRBD adjust/res file regeneration when child layer device is inaccessible - Upstream: [#471](LINBIT/linstor-server#471) Also updates plunger-satellite script and values.yaml for the new build. ### Release note ```release-note [linstor] Build linstor-server with custom patches for improved disk handling ``` <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Added automatic DRBD stall detection and recovery, improving storage resync resilience without manual intervention. * Introduced configurable container image references via Helm values for streamlined deployment. * **Bug Fixes** * Enhanced disk toggle operations with retry and cancellation support for better error handling. * Improved metadata creation during disk state transitions. * Added device accessibility checks to prevent errors when underlying storage devices are unavailable. * Fixed LUKS encryption header sizing for consistent deployment across nodes. <sub>✏️ Tip: You can customize this high-level summary in your review settings.</sub> <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## What this PR does Build piraeus-server (linstor-server) from source with custom patches: - **adjust-on-resfile-change.diff** — Use actual device path in res file during toggle-disk; fix LUKS data offset - Upstream: [#473](LINBIT/linstor-server#473), [#472](LINBIT/linstor-server#472) - **allow-toggle-disk-retry.diff** — Allow retry and cancellation of failed toggle-disk operations - Upstream: [#475](LINBIT/linstor-server#475) - **force-metadata-check-on-disk-add.diff** — Create metadata during toggle-disk from diskless to diskful - Upstream: [#474](LINBIT/linstor-server#474) - **skip-adjust-when-device-inaccessible.diff** — Skip DRBD adjust/res file regeneration when child layer device is inaccessible - Upstream: [#471](LINBIT/linstor-server#471) Also updates plunger-satellite script and values.yaml for the new build. ### Release note ```release-note [linstor] Build linstor-server with custom patches for improved disk handling ``` <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Added automatic DRBD stall detection and recovery, improving storage resync resilience without manual intervention. * Introduced configurable container image references via Helm values for streamlined deployment. * **Bug Fixes** * Enhanced disk toggle operations with retry and cancellation support for better error handling. * Improved metadata creation during disk state transitions. * Added device accessibility checks to prevent errors when underlying storage devices are unavailable. * Fixed LUKS encryption header sizing for consistent deployment across nodes. <sub>✏️ Tip: You can customize this high-level summary in your review settings.</sub> <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## What this PR does Build piraeus-server (linstor-server) from source with custom patches: - **adjust-on-resfile-change.diff** — Use actual device path in res file during toggle-disk; fix LUKS data offset - Upstream: [#473](LINBIT/linstor-server#473), [#472](LINBIT/linstor-server#472) - **allow-toggle-disk-retry.diff** — Allow retry and cancellation of failed toggle-disk operations - Upstream: [#475](LINBIT/linstor-server#475) - **force-metadata-check-on-disk-add.diff** — Create metadata during toggle-disk from diskless to diskful - Upstream: [#474](LINBIT/linstor-server#474) - **skip-adjust-when-device-inaccessible.diff** — Skip DRBD adjust/res file regeneration when child layer device is inaccessible - Upstream: [#471](LINBIT/linstor-server#471) Also updates plunger-satellite script and values.yaml for the new build. ### Release note ```release-note [linstor] Build linstor-server with custom patches for improved disk handling ``` <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Added automatic DRBD stall detection and recovery, improving storage resync resilience without manual intervention. * Introduced configurable container image references via Helm values for streamlined deployment. * **Bug Fixes** * Enhanced disk toggle operations with retry and cancellation support for better error handling. * Improved metadata creation during disk state transitions. * Added device accessibility checks to prevent errors when underlying storage devices are unavailable. * Fixed LUKS encryption header sizing for consistent deployment across nodes. <sub>✏️ Tip: You can customize this high-level summary in your review settings.</sub> <!-- end of auto-generated comment: release notes by coderabbit.ai -->
|
Hi there! For this PR my questions / failed reproducer attempt is basically the same as in my comment in PR 473. I admit it might look a bit confusing why the code checks for |
|
Hi Gabor! I believe there's a misunderstanding. Let me show the issue with code references. Important context: This is about RETRY scenariosBoth PR #473 and #474 are related to PR #472 (LUKS data offset fix). The problem appears in retry scenarios:
In your test, you're testing the happy path where everything succeeds on the first attempt. That works because the two devMgrRuns happen quickly in sequence. But in retry scenarios, the state is different. DRBD_DISKLESS flag is NOT removed during toggle-diskFrom * <li>DISKLESS, DISK_ADD_REQUESTED - the transition has been requested but not yet started</li>
* <li>DISKLESS, DISK_ADD_REQUESTED, DISK_ADDING - the peers should prepare for resource to gain disks</li>And So during satellite processing: The actual problem: checkMetaData is emptyLook at if (!rsc.isDrbdDiskless(workerCtx) || // FALSE during toggle-disk!
rsc.getStateFlags().isSet(workerCtx, Resource.Flags.DISK_REMOVING) // FALSE
)
{
// volumes added to checkMetaData
}
return checkMetaData; // EMPTY LIST!During toggle-disk: Then at line 703: for (DrbdVlmData<Resource> drbdVlmData : checkMetaData) // empty list!
{
if (!hasMetaData(drbdVlmData) ...) { createMetaData.add(...); }
}Loop doesn't execute → metadata is never created! Why your test works (but retry fails)Your test succeeds because:
But in retry scenario:
How to reproduce
Our patches ensure metadata is created even during retry when storage already exists but the DRBD_DISKLESS flag is still set. Best regards |
|
Decided not to use these changes for now, closing. Will reopen if the issues arise again. |
Summary
Fix DRBD metadata creation during toggle-disk operation from diskless to diskful resource.
Problem
When converting a diskless resource to diskful via
toggle-disk -s <storage-pool>, several conditions prevented DRBD metadata from being created:processChild()didn't process child layers for diskless resources during disk additiondetachVolumesIfNecessary()returned emptycheckMetaDatalist for diskless resourceshasMetaData()assumed metadata exists whenhasDisk()returns true, even during retry when storage exists but no metadataThis caused toggle-disk to fail with "No valid meta data found" error from
drbdadm adjust, especially for LUKS encrypted resources.Solution
Add
DISK_ADD_REQUESTED/DISK_ADDINGflag checks to:Test plan
toggle-disk -s <storage-pool>to convert to diskful