
fix(drbd): create metadata during toggle-disk from diskless to diskful#474

Closed
kvaps wants to merge 1 commit into LINBIT:master from kvaps:fix/toggle-disk-metadata

Conversation


kvaps commented Jan 6, 2026

Summary

Fix DRBD metadata creation during toggle-disk operation from diskless to diskful resource.

Problem

When converting a diskless resource to diskful via `toggle-disk -s <storage-pool>`, several conditions prevented DRBD metadata from being created:

  1. `processChild()` didn't process child layers for diskless resources during disk addition
  2. `detachVolumesIfNecessary()` returned an empty `checkMetaData` list for diskless resources
  3. The metadata creation block was skipped for diskless resources
  4. `hasMetaData()` assumed metadata exists whenever `hasDisk()` returns true, even during a retry when the storage exists but the metadata does not

This caused toggle-disk to fail with a "No valid meta data found" error from `drbdadm adjust`, especially for LUKS-encrypted resources.
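The fourth condition can be modeled in isolation. The sketch below is a hypothetical, self-contained simplification: `hasMetaDataOld`/`hasMetaDataNew` and their parameters are invented for illustration and are not the real linstor-server signatures.

```java
// Hypothetical, standalone model of the hasMetaData() shortcut described
// above; method names and parameters are invented for illustration only.
public class MetaDataCheck {
    // Old behavior: a backing disk implies metadata. That is wrong on a
    // retry, where storage was created but metadata creation never ran.
    static boolean hasMetaDataOld(boolean hasDisk, boolean metaDataOnDisk) {
        return hasDisk;
    }

    // Patched behavior: while a disk is being added, verify the metadata
    // for real instead of trusting the presence of the backing disk.
    static boolean hasMetaDataNew(boolean hasDisk, boolean metaDataOnDisk, boolean diskAdding) {
        return diskAdding ? metaDataOnDisk : hasDisk;
    }

    public static void main(String[] args) {
        // Retry scenario: disk exists, metadata does not, DISK_ADDING set.
        System.out.println("old: " + hasMetaDataOld(true, false));        // reports metadata as present
        System.out.println("new: " + hasMetaDataNew(true, false, true));  // detects missing metadata
    }
}
```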

Solution

Add DISK_ADD_REQUESTED/DISK_ADDING flag checks to:

  • Process child layers during toggle-disk
  • Include volumes in metadata check list
  • Enter metadata creation block
  • Force metadata verification during retry
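The effect of these flag checks can be sketched as a standalone truth-table model. This is a hypothetical simplification: the flag names mirror linstor-server's, but the class below only reproduces the boolean logic and is not the actual implementation.

```java
// Hypothetical, standalone model of the gating condition described above;
// the real code operates on Resource state flags, this class only
// reproduces the boolean logic for illustration.
public class ToggleDiskGate {
    // Unpatched: volumes enter the checkMetaData list only when the
    // resource is diskful, or when a disk is being removed.
    static boolean includeInMetaDataCheckOld(boolean diskless, boolean diskRemoving) {
        return !diskless || diskRemoving;
    }

    // Patched: a resource still flagged DISKLESS but carrying
    // DISK_ADD_REQUESTED/DISK_ADDING is also included.
    static boolean includeInMetaDataCheckNew(boolean diskless, boolean diskRemoving, boolean diskAdding) {
        return !diskless || diskRemoving || diskAdding;
    }

    public static void main(String[] args) {
        // During toggle-disk, DISKLESS is still set while the disk is added.
        boolean diskless = true, diskRemoving = false, diskAdding = true;
        System.out.println("old: " + includeInMetaDataCheckOld(diskless, diskRemoving));             // volume skipped
        System.out.println("new: " + includeInMetaDataCheckNew(diskless, diskRemoving, diskAdding)); // volume included
    }
}
```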

Test plan

  • Create diskless DRBD resource with LUKS encryption
  • Run `toggle-disk -s <storage-pool>` to convert it to diskful
  • Verify DRBD metadata is created before drbdadm adjust
  • Verify toggle-disk completes successfully
  • Test retry scenario: if first attempt fails, cleanup storage and retry

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
@kvaps kvaps marked this pull request as ready for review January 6, 2026 09:14
kvaps added a commit to cozystack/cozystack that referenced this pull request Jan 6, 2026
## What this PR does

Build piraeus-server (linstor-server) from source with custom patches:

- **adjust-on-resfile-change.diff** — Use actual device path in res file during toggle-disk; fix LUKS data offset
  - Upstream: [#473](LINBIT/linstor-server#473), [#472](LINBIT/linstor-server#472)
- **allow-toggle-disk-retry.diff** — Allow retry and cancellation of failed toggle-disk operations
  - Upstream: [#475](LINBIT/linstor-server#475)
- **force-metadata-check-on-disk-add.diff** — Create metadata during toggle-disk from diskless to diskful
  - Upstream: [#474](LINBIT/linstor-server#474)
- **skip-adjust-when-device-inaccessible.diff** — Skip DRBD adjust/res file regeneration when child layer device is inaccessible
  - Upstream: [#471](LINBIT/linstor-server#471)

Also updates plunger-satellite script and values.yaml for the new build.

### Release note

```release-note
[linstor] Build linstor-server with custom patches for improved disk handling
```

## Summary by CodeRabbit

* **New Features**
  * Added automatic DRBD stall detection and recovery, improving storage resync resilience without manual intervention.
  * Introduced configurable container image references via Helm values for streamlined deployment.

* **Bug Fixes**
  * Enhanced disk toggle operations with retry and cancellation support for better error handling.
  * Improved metadata creation during disk state transitions.
  * Added device accessibility checks to prevent errors when underlying storage devices are unavailable.
  * Fixed LUKS encryption header sizing for consistent deployment across nodes.
kvaps added a commit to cozystack/cozystack that referenced this pull request Jan 8, 2026
kvaps added a commit to cozystack/cozystack that referenced this pull request Jan 8, 2026
kvaps added a commit to cozystack/cozystack that referenced this pull request Jan 9, 2026
@ghernadi (Contributor)

Hi there!

For this PR, my questions and my failed reproducer attempt are basically the same as in my comment on PR 473.

I admit it might look a bit confusing why the code checks `boolean processChildren = !isDiskless || isDiskRemoving;` but not `|| isDiskAdding`. The point here is simply that during the "disk adding" state, the resource is not marked diskless. That is also the reason why I do not really see how your patch changes anything. Can you elaborate here?

@kvaps (Author)

kvaps commented Jan 14, 2026

Hi Gabor!

I believe there's a misunderstanding. Let me show the issue with code references.

Important context: This is about RETRY scenarios

Both PR #473 and #474 are related to PR #472 (LUKS data offset fix). The problem appears in retry scenarios:

  1. First toggle-disk attempt fails (e.g., due to the LUKS offset mismatch addressed in PR #472)
  2. Storage layer and LUKS device are already created
  3. User retries toggle-disk
  4. On retry: storage exists, but metadata doesn't exist yet
  5. Without our patches: metadata is never created → "No valid meta data found"

In your test, you're testing the happy path where everything succeeds on the first attempt. That works because the two devMgrRuns happen quickly in sequence. But in retry scenarios, the state is different.

DRBD_DISKLESS flag is NOT removed during toggle-disk

From CtrlRscToggleDiskApiCallHandler.java documentation (lines 99-100):

* <li>DISKLESS, DISK_ADD_REQUESTED - the transition has been requested but not yet started</li>
* <li>DISKLESS, DISK_ADD_REQUESTED, DISK_ADDING - the peers should prepare for resource to gain disks</li>

And markDiskAdded() (lines 1453-1458), which removes DRBD_DISKLESS, is called AFTER the satellite completes; see line 1068 in finishOperationInTransaction().

So during satellite processing: isDrbdDiskless() == true

The actual problem: checkMetaData is empty

Look at detachVolumesIfNecessary() (DrbdLayer.java:1117):

```java
if (!rsc.isDrbdDiskless(workerCtx) ||          // FALSE during toggle-disk!
    rsc.getStateFlags().isSet(workerCtx, Resource.Flags.DISK_REMOVING)  // FALSE
)
{
    // volumes added to checkMetaData
}
return checkMetaData;  // EMPTY LIST!
```

During toggle-disk: !true || false = false → block skipped → checkMetaData is empty!

Then at line 703:

```java
for (DrbdVlmData<Resource> drbdVlmData : checkMetaData)  // empty list!
{
    if (!hasMetaData(drbdVlmData) ...) { createMetaData.add(...); }
}
```

Loop doesn't execute → metadata is never created!

Why your test works (but retry fails)

Your test succeeds because:

  1. First devMgrRun: Storage/LUKS created, checkMetaData empty → no metadata
  2. Controller receives "success", calls markDiskAdded() → removes DRBD_DISKLESS
  3. Second devMgrRun: isDrbdDiskless=false → checkMetaData populated → metadata created
  4. Everything happens quickly → success

But in retry scenario:

  1. First attempt: Storage/LUKS created, then fails (e.g., LUKS offset issue)
  2. DRBD_DISKLESS flag is NOT removed (operation didn't complete)
  3. Retry: Storage already exists, but isDrbdDiskless() still true
  4. checkMetaData is empty → metadata never created → "No valid meta data found"

How to reproduce

  1. Create a diskless DRBD+LUKS resource
  2. Start toggle-disk, but make it fail somehow (e.g., kill the satellite mid-operation, or hit the PR #472 issue with different LUKS offsets between nodes)
  3. Retry toggle-disk
  4. Observe "No valid meta data found" error

Our patches ensure metadata is created even during retry when storage already exists but the DRBD_DISKLESS flag is still set.

Best regards

@kvaps (Author)

kvaps commented Apr 3, 2026

Decided not to use these changes for now, closing. Will reopen if the issues arise again.

@kvaps kvaps closed this Apr 3, 2026