Skip to content

fix(drbd): use actual device path in res file during toggle-disk#473

Closed
kvaps wants to merge 1 commit intoLINBIT:masterfrom
kvaps:fix/toggle-disk-resfile
Closed

fix(drbd): use actual device path in res file during toggle-disk#473
kvaps wants to merge 1 commit intoLINBIT:masterfrom
kvaps:fix/toggle-disk-resfile

Conversation

@kvaps
Copy link
Copy Markdown

@kvaps kvaps commented Jan 6, 2026

Summary

  • Fix res file generation during toggle-disk operation from diskless to diskful
  • Add check for DISK_ADD_REQUESTED/DISK_ADDING flags to use actual device path instead of "none"

Problem

When converting a diskless resource to diskful via toggle-disk -s <storage-pool>, the res file generator outputs disk none because isDrbdDiskless() returns true until the operation completes. This causes drbdadm adjust to fail as the res file doesn't reflect the actual storage device already created by the storage layer.

Solution

Check for DISK_ADD_REQUESTED or DISK_ADDING flags and use the actual device path when toggle-disk operation is in progress.

Test plan

  • Create diskless DRBD resource with LUKS encryption
  • Run toggle-disk -s <storage-pool> to convert to diskful
  • Verify res file contains actual device path, not "none"
  • Verify drbdadm adjust succeeds

When a diskless resource is being converted to diskful via toggle-disk,
the res file generator was still outputting "disk none" because it only
checked isDrbdDiskless() which returns true until the operation completes.

This caused drbdadm adjust to fail because the res file didn't reflect
the actual storage device that was already created by the storage layer.

Add check for DISK_ADD_REQUESTED/DISK_ADDING flags to use the actual
device path when toggle-disk operation is in progress.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
@kvaps kvaps marked this pull request as ready for review January 6, 2026 09:13
kvaps added a commit to cozystack/cozystack that referenced this pull request Jan 6, 2026
## What this PR does

Build piraeus-server (linstor-server) from source with custom patches:

- **adjust-on-resfile-change.diff** — Use actual device path in res file
during toggle-disk; fix LUKS data offset
- Upstream: [#473](LINBIT/linstor-server#473),
[#472](LINBIT/linstor-server#472)
- **allow-toggle-disk-retry.diff** — Allow retry and cancellation of
failed toggle-disk operations
  - Upstream: [#475](LINBIT/linstor-server#475)
- **force-metadata-check-on-disk-add.diff** — Create metadata during
toggle-disk from diskless to diskful
  - Upstream: [#474](LINBIT/linstor-server#474)
- **skip-adjust-when-device-inaccessible.diff** — Skip DRBD adjust/res
file regeneration when child layer device is inaccessible
  - Upstream: [#471](LINBIT/linstor-server#471)

Also updates plunger-satellite script and values.yaml for the new build.

### Release note

```release-note
[linstor] Build linstor-server with custom patches for improved disk handling
```

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **New Features**
* Added automatic DRBD stall detection and recovery, improving storage
resync resilience without manual intervention.
* Introduced configurable container image references via Helm values for
streamlined deployment.

* **Bug Fixes**
* Enhanced disk toggle operations with retry and cancellation support
for better error handling.
  * Improved metadata creation during disk state transitions.
* Added device accessibility checks to prevent errors when underlying
storage devices are unavailable.
* Fixed LUKS encryption header sizing for consistent deployment across
nodes.

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
kvaps added a commit to cozystack/cozystack that referenced this pull request Jan 8, 2026
## What this PR does

Build piraeus-server (linstor-server) from source with custom patches:

- **adjust-on-resfile-change.diff** — Use actual device path in res file
during toggle-disk; fix LUKS data offset
- Upstream: [#473](LINBIT/linstor-server#473),
[#472](LINBIT/linstor-server#472)
- **allow-toggle-disk-retry.diff** — Allow retry and cancellation of
failed toggle-disk operations
  - Upstream: [#475](LINBIT/linstor-server#475)
- **force-metadata-check-on-disk-add.diff** — Create metadata during
toggle-disk from diskless to diskful
  - Upstream: [#474](LINBIT/linstor-server#474)
- **skip-adjust-when-device-inaccessible.diff** — Skip DRBD adjust/res
file regeneration when child layer device is inaccessible
  - Upstream: [#471](LINBIT/linstor-server#471)

Also updates plunger-satellite script and values.yaml for the new build.

### Release note

```release-note
[linstor] Build linstor-server with custom patches for improved disk handling
```

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **New Features**
* Added automatic DRBD stall detection and recovery, improving storage
resync resilience without manual intervention.
* Introduced configurable container image references via Helm values for
streamlined deployment.

* **Bug Fixes**
* Enhanced disk toggle operations with retry and cancellation support
for better error handling.
  * Improved metadata creation during disk state transitions.
* Added device accessibility checks to prevent errors when underlying
storage devices are unavailable.
* Fixed LUKS encryption header sizing for consistent deployment across
nodes.

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
kvaps added a commit to cozystack/cozystack that referenced this pull request Jan 8, 2026
## What this PR does

Build piraeus-server (linstor-server) from source with custom patches:

- **adjust-on-resfile-change.diff** — Use actual device path in res file
during toggle-disk; fix LUKS data offset
- Upstream: [#473](LINBIT/linstor-server#473),
[#472](LINBIT/linstor-server#472)
- **allow-toggle-disk-retry.diff** — Allow retry and cancellation of
failed toggle-disk operations
  - Upstream: [#475](LINBIT/linstor-server#475)
- **force-metadata-check-on-disk-add.diff** — Create metadata during
toggle-disk from diskless to diskful
  - Upstream: [#474](LINBIT/linstor-server#474)
- **skip-adjust-when-device-inaccessible.diff** — Skip DRBD adjust/res
file regeneration when child layer device is inaccessible
  - Upstream: [#471](LINBIT/linstor-server#471)

Also updates plunger-satellite script and values.yaml for the new build.

### Release note

```release-note
[linstor] Build linstor-server with custom patches for improved disk handling
```

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **New Features**
* Added automatic DRBD stall detection and recovery, improving storage
resync resilience without manual intervention.
* Introduced configurable container image references via Helm values for
streamlined deployment.

* **Bug Fixes**
* Enhanced disk toggle operations with retry and cancellation support
for better error handling.
  * Improved metadata creation during disk state transitions.
* Added device accessibility checks to prevent errors when underlying
storage devices are unavailable.
* Fixed LUKS encryption header sizing for consistent deployment across
nodes.

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
kvaps added a commit to cozystack/cozystack that referenced this pull request Jan 9, 2026
## What this PR does

Build piraeus-server (linstor-server) from source with custom patches:

- **adjust-on-resfile-change.diff** — Use actual device path in res file
during toggle-disk; fix LUKS data offset
- Upstream: [#473](LINBIT/linstor-server#473),
[#472](LINBIT/linstor-server#472)
- **allow-toggle-disk-retry.diff** — Allow retry and cancellation of
failed toggle-disk operations
  - Upstream: [#475](LINBIT/linstor-server#475)
- **force-metadata-check-on-disk-add.diff** — Create metadata during
toggle-disk from diskless to diskful
  - Upstream: [#474](LINBIT/linstor-server#474)
- **skip-adjust-when-device-inaccessible.diff** — Skip DRBD adjust/res
file regeneration when child layer device is inaccessible
  - Upstream: [#471](LINBIT/linstor-server#471)

Also updates plunger-satellite script and values.yaml for the new build.

### Release note

```release-note
[linstor] Build linstor-server with custom patches for improved disk handling
```

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **New Features**
* Added automatic DRBD stall detection and recovery, improving storage
resync resilience without manual intervention.
* Introduced configurable container image references via Helm values for
streamlined deployment.

* **Bug Fixes**
* Enhanced disk toggle operations with retry and cancellation support
for better error handling.
  * Improved metadata creation during disk state transitions.
* Added device accessibility checks to prevent errors when underlying
storage devices are unavailable.
* Fixed LUKS encryption header sizing for consistent deployment across
nodes.

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
@ghernadi
Copy link
Copy Markdown
Contributor

Hello!

I am trying to reproduce this issue but I can't. From your description it sounds like this scenario should always fail, but for me it just works fine:

# linstor --no-utf8 --no-color r l -a
+----------------------------------------------------------------------------------------------+
| ResourceName | Node    | Layers            | Usage  | Conns |    State | CreatedOn           |
|==============================================================================================|
| rsc          | bravo   | DRBD,LUKS,STORAGE | Unused | Ok    | UpToDate | 2026-01-14 07:36:57 |
| rsc          | charlie | DRBD,LUKS,STORAGE | Unused | Ok    | UpToDate | 2026-01-14 07:36:57 |
| rsc          | delta   | DRBD,STORAGE      | Unused | Ok    | Diskless | 2026-01-14 07:36:59 |
+----------------------------------------------------------------------------------------------+
# linstor r td delta rsc -s lvmpool
# linstor --no-utf8 --no-color r l -a
+----------------------------------------------------------------------------------------------+
| ResourceName | Node    | Layers            | Usage  | Conns |    State | CreatedOn           |
|==============================================================================================|
| rsc          | bravo   | DRBD,LUKS,STORAGE | Unused | Ok    | UpToDate | 2026-01-14 07:36:57 |
| rsc          | charlie | DRBD,LUKS,STORAGE | Unused | Ok    | UpToDate | 2026-01-14 07:36:57 |
| rsc          | delta   | DRBD,LUKS,STORAGE | Unused | Ok    | UpToDate | 2026-01-14 07:36:59 |
+----------------------------------------------------------------------------------------------+

Snippet of the resulting .res file from delta:

    on "delta"
    {
        volume 0
        {
            disk        /dev/mapper/Linstor-Crypt-rsc_00000;
root@delta:~# lsblk -s /dev/drbd1000 
NAME                      MAJ:MIN  RM SIZE RO TYPE  MOUNTPOINTS
drbd1000                  147:1000  0  12M  0 disk  
└─Linstor-Crypt-rsc_00000 253:1     0  12M  0 crypt 
  └─scratch-rsc_00000     253:0     0  28M  0 lvm   
    └─vdb                 252:16    0  10G  0 disk  

I do not really understand why this would be an issue. Without your patch the local DRBD will still be configured as diskless while the storage and in your case the LUKS device is already up and running but ignored for the current device-manager-run. In the next devMgrRun the toggle-disk flags should be removed along with the diskless flag, which will then finally make the ConfFileBuilder to generate the device path. So your patch seems only to change the behavior that LINSTOR tells DRBD one step earlier to use the LUKS device. Can you elaborate why this is necessary? I do not see the issue with the current behavior right now, but I can of course be missing something.

Do you have an ErrorReport you could share or reproducer for us to understand the issue better?
Does my test miss something from your context?

@kvaps
Copy link
Copy Markdown
Author

kvaps commented Jan 14, 2026

Hi Gabor!

Thank you for testing this. The context I missed in the PR description is that this fix is for retry scenarios, not the happy path.

Important context: RETRY scenarios after PR #472 issues

Both PR #473 and #474 are related to PR #472 (LUKS data offset fix). The problem appears when:

  1. First toggle-disk attempt fails (e.g., due to LUKS offset mismatch between nodes - PR fix(luks): use fixed data offset for consistent LUKS header size #472)
  2. Storage layer and LUKS device are already created on the local node
  3. User retries toggle-disk (or PR fix(toggle-disk): allow retry and cancellation of failed operations #475 retry mechanism kicks in)
  4. On retry: LUKS device exists and is open, but res file still says disk none

Why your test works

In your happy path test:

  1. First devMgrRun: res file has disk none, DRBD configured as diskless
  2. Controller receives success, calls markDiskAdded() → removes DRBD_DISKLESS
  3. Second devMgrRun: isDrbdDiskless=false → res file gets correct device path
  4. Everything happens quickly → success

Why retry fails without this patch

In retry scenario:

  1. First attempt: Storage/LUKS created, res file generated with disk none, then operation fails
  2. DRBD_DISKLESS flag is NOT removed (operation didn't complete successfully)
  3. Retry: LUKS device /dev/mapper/Linstor-Crypt-... already exists and is open
  4. But isDrbdDiskless() still returns true → res file still says disk none
  5. drbdadm adjust sees disk none but actual LUKS device is already there
  6. State mismatch → potential issues with metadata creation (PR fix(drbd): create metadata during toggle-disk from diskless to diskful #474)

The fix

PR #473 checks for DISK_ADD_REQUESTED/DISK_ADDING flags in ConfFileBuilder.java:

  • If these flags are set AND dataDevice is available → use actual device path
  • This ensures res file reflects reality during retry scenarios

How to reproduce

  1. Create a diskless DRBD+LUKS resource with replicas on nodes with different LUKS default offsets
  2. Run toggle-disk → fails due to "Low.dev. smaller than requested DRBD-dev. size" (PR fix(luks): use fixed data offset for consistent LUKS header size #472 issue)
  3. Apply PR fix(luks): use fixed data offset for consistent LUKS header size #472 fix (or ensure same LUKS offset)
  4. Retry toggle-disk
  5. Without PR fix(drbd): use actual device path in res file during toggle-disk #473: res file still has disk none even though LUKS device exists

Summary

You're right that the happy path works without this patch. The value is in:

  1. Retry scenarios: res file reflects actual device state immediately
  2. Consistency: no mismatch between res file and actual LUKS device
  3. Works with PR fix(drbd): create metadata during toggle-disk from diskless to diskful #474: metadata creation needs correct device path in res file

If you prefer, we can mark this as low-priority improvement. The critical fixes are PR #474, #476, and #477.

Best regards

@kvaps
Copy link
Copy Markdown
Author

kvaps commented Apr 3, 2026

Decided not to use these changes for now, closing. Will reopen if the issues arise again.

@kvaps kvaps closed this Apr 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants