fix(drbd): use actual device path in res file during toggle-disk#473
fix(drbd): use actual device path in res file during toggle-disk#473kvaps wants to merge 1 commit intoLINBIT:masterfrom
Conversation
When a diskless resource is being converted to diskful via toggle-disk, the res file generator was still outputting "disk none" because it only checked isDrbdDiskless() which returns true until the operation completes. This caused drbdadm adjust to fail because the res file didn't reflect the actual storage device that was already created by the storage layer. Add check for DISK_ADD_REQUESTED/DISK_ADDING flags to use the actual device path when toggle-disk operation is in progress. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
## What this PR does Build piraeus-server (linstor-server) from source with custom patches: - **adjust-on-resfile-change.diff** — Use actual device path in res file during toggle-disk; fix LUKS data offset - Upstream: [#473](LINBIT/linstor-server#473), [#472](LINBIT/linstor-server#472) - **allow-toggle-disk-retry.diff** — Allow retry and cancellation of failed toggle-disk operations - Upstream: [#475](LINBIT/linstor-server#475) - **force-metadata-check-on-disk-add.diff** — Create metadata during toggle-disk from diskless to diskful - Upstream: [#474](LINBIT/linstor-server#474) - **skip-adjust-when-device-inaccessible.diff** — Skip DRBD adjust/res file regeneration when child layer device is inaccessible - Upstream: [#471](LINBIT/linstor-server#471) Also updates plunger-satellite script and values.yaml for the new build. ### Release note ```release-note [linstor] Build linstor-server with custom patches for improved disk handling ``` <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Added automatic DRBD stall detection and recovery, improving storage resync resilience without manual intervention. * Introduced configurable container image references via Helm values for streamlined deployment. * **Bug Fixes** * Enhanced disk toggle operations with retry and cancellation support for better error handling. * Improved metadata creation during disk state transitions. * Added device accessibility checks to prevent errors when underlying storage devices are unavailable. * Fixed LUKS encryption header sizing for consistent deployment across nodes. <sub>✏️ Tip: You can customize this high-level summary in your review settings.</sub> <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## What this PR does Build piraeus-server (linstor-server) from source with custom patches: - **adjust-on-resfile-change.diff** — Use actual device path in res file during toggle-disk; fix LUKS data offset - Upstream: [#473](LINBIT/linstor-server#473), [#472](LINBIT/linstor-server#472) - **allow-toggle-disk-retry.diff** — Allow retry and cancellation of failed toggle-disk operations - Upstream: [#475](LINBIT/linstor-server#475) - **force-metadata-check-on-disk-add.diff** — Create metadata during toggle-disk from diskless to diskful - Upstream: [#474](LINBIT/linstor-server#474) - **skip-adjust-when-device-inaccessible.diff** — Skip DRBD adjust/res file regeneration when child layer device is inaccessible - Upstream: [#471](LINBIT/linstor-server#471) Also updates plunger-satellite script and values.yaml for the new build. ### Release note ```release-note [linstor] Build linstor-server with custom patches for improved disk handling ``` <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Added automatic DRBD stall detection and recovery, improving storage resync resilience without manual intervention. * Introduced configurable container image references via Helm values for streamlined deployment. * **Bug Fixes** * Enhanced disk toggle operations with retry and cancellation support for better error handling. * Improved metadata creation during disk state transitions. * Added device accessibility checks to prevent errors when underlying storage devices are unavailable. * Fixed LUKS encryption header sizing for consistent deployment across nodes. <sub>✏️ Tip: You can customize this high-level summary in your review settings.</sub> <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## What this PR does Build piraeus-server (linstor-server) from source with custom patches: - **adjust-on-resfile-change.diff** — Use actual device path in res file during toggle-disk; fix LUKS data offset - Upstream: [#473](LINBIT/linstor-server#473), [#472](LINBIT/linstor-server#472) - **allow-toggle-disk-retry.diff** — Allow retry and cancellation of failed toggle-disk operations - Upstream: [#475](LINBIT/linstor-server#475) - **force-metadata-check-on-disk-add.diff** — Create metadata during toggle-disk from diskless to diskful - Upstream: [#474](LINBIT/linstor-server#474) - **skip-adjust-when-device-inaccessible.diff** — Skip DRBD adjust/res file regeneration when child layer device is inaccessible - Upstream: [#471](LINBIT/linstor-server#471) Also updates plunger-satellite script and values.yaml for the new build. ### Release note ```release-note [linstor] Build linstor-server with custom patches for improved disk handling ``` <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Added automatic DRBD stall detection and recovery, improving storage resync resilience without manual intervention. * Introduced configurable container image references via Helm values for streamlined deployment. * **Bug Fixes** * Enhanced disk toggle operations with retry and cancellation support for better error handling. * Improved metadata creation during disk state transitions. * Added device accessibility checks to prevent errors when underlying storage devices are unavailable. * Fixed LUKS encryption header sizing for consistent deployment across nodes. <sub>✏️ Tip: You can customize this high-level summary in your review settings.</sub> <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## What this PR does Build piraeus-server (linstor-server) from source with custom patches: - **adjust-on-resfile-change.diff** — Use actual device path in res file during toggle-disk; fix LUKS data offset - Upstream: [#473](LINBIT/linstor-server#473), [#472](LINBIT/linstor-server#472) - **allow-toggle-disk-retry.diff** — Allow retry and cancellation of failed toggle-disk operations - Upstream: [#475](LINBIT/linstor-server#475) - **force-metadata-check-on-disk-add.diff** — Create metadata during toggle-disk from diskless to diskful - Upstream: [#474](LINBIT/linstor-server#474) - **skip-adjust-when-device-inaccessible.diff** — Skip DRBD adjust/res file regeneration when child layer device is inaccessible - Upstream: [#471](LINBIT/linstor-server#471) Also updates plunger-satellite script and values.yaml for the new build. ### Release note ```release-note [linstor] Build linstor-server with custom patches for improved disk handling ``` <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Added automatic DRBD stall detection and recovery, improving storage resync resilience without manual intervention. * Introduced configurable container image references via Helm values for streamlined deployment. * **Bug Fixes** * Enhanced disk toggle operations with retry and cancellation support for better error handling. * Improved metadata creation during disk state transitions. * Added device accessibility checks to prevent errors when underlying storage devices are unavailable. * Fixed LUKS encryption header sizing for consistent deployment across nodes. <sub>✏️ Tip: You can customize this high-level summary in your review settings.</sub> <!-- end of auto-generated comment: release notes by coderabbit.ai -->
|
Hello! I am trying to reproduce this issue but I can't. From your description it sounds like this scenario should always fail, but for me it just works fine: # linstor --no-utf8 --no-color r l -a
+----------------------------------------------------------------------------------------------+
| ResourceName | Node | Layers | Usage | Conns | State | CreatedOn |
|==============================================================================================|
| rsc | bravo | DRBD,LUKS,STORAGE | Unused | Ok | UpToDate | 2026-01-14 07:36:57 |
| rsc | charlie | DRBD,LUKS,STORAGE | Unused | Ok | UpToDate | 2026-01-14 07:36:57 |
| rsc | delta | DRBD,STORAGE | Unused | Ok | Diskless | 2026-01-14 07:36:59 |
+----------------------------------------------------------------------------------------------+
# linstor r td delta rsc -s lvmpool
# linstor --no-utf8 --no-color r l -a
+----------------------------------------------------------------------------------------------+
| ResourceName | Node | Layers | Usage | Conns | State | CreatedOn |
|==============================================================================================|
| rsc | bravo | DRBD,LUKS,STORAGE | Unused | Ok | UpToDate | 2026-01-14 07:36:57 |
| rsc | charlie | DRBD,LUKS,STORAGE | Unused | Ok | UpToDate | 2026-01-14 07:36:57 |
| rsc | delta | DRBD,LUKS,STORAGE | Unused | Ok | UpToDate | 2026-01-14 07:36:59 |
+----------------------------------------------------------------------------------------------+Snippet of the resulting .res file from root@delta:~# lsblk -s /dev/drbd1000
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
drbd1000 147:1000 0 12M 0 disk
└─Linstor-Crypt-rsc_00000 253:1 0 12M 0 crypt
└─scratch-rsc_00000 253:0 0 28M 0 lvm
└─vdb 252:16 0 10G 0 disk I do not really understand why this would be an issue. Without your patch the local DRBD will still be configured as diskless while the storage and in your case the LUKS device is already up and running but ignored for the current device-manager-run. In the next devMgrRun the toggle-disk flags should be removed along with the diskless flag, which will then finally make the ConfFileBuilder to generate the device path. So your patch seems only to change the behavior that LINSTOR tells DRBD one step earlier to use the LUKS device. Can you elaborate why this is necessary? I do not see the issue with the current behavior right now, but I can of course be missing something. Do you have an ErrorReport you could share or reproducer for us to understand the issue better? |
|
Hi Gabor! Thank you for testing this. The context I missed in the PR description is that this fix is for retry scenarios, not the happy path. Important context: RETRY scenarios after PR #472 issuesBoth PR #473 and #474 are related to PR #472 (LUKS data offset fix). The problem appears when:
Why your test worksIn your happy path test:
Why retry fails without this patchIn retry scenario:
The fixPR #473 checks for
How to reproduce
SummaryYou're right that the happy path works without this patch. The value is in:
If you prefer, we can mark this as low-priority improvement. The critical fixes are PR #474, #476, and #477. Best regards |
|
Decided not to use these changes for now, closing. Will reopen if the issues arise again. |
Summary
Problem
When converting a diskless resource to diskful via
toggle-disk -s <storage-pool>, the res file generator outputsdisk nonebecauseisDrbdDiskless()returns true until the operation completes. This causesdrbdadm adjustto fail as the res file doesn't reflect the actual storage device already created by the storage layer.Solution
Check for
DISK_ADD_REQUESTEDorDISK_ADDINGflags and use the actual device path when toggle-disk operation is in progress.Test plan
toggle-disk -s <storage-pool>to convert to diskful