
fix(drbd): create metadata during toggle-disk from diskless to diskful#474

Closed
kvaps wants to merge 1 commit into LINBIT:master from kvaps:fix/toggle-disk-metadata

Conversation


kvaps commented Jan 6, 2026

Summary

Fix DRBD metadata creation during toggle-disk operation from diskless to diskful resource.

Problem

When converting a diskless resource to diskful via `toggle-disk -s <storage-pool>`, several conditions prevented DRBD metadata from being created:

  1. `processChild()` didn't process child layers for diskless resources during disk addition
  2. `detachVolumesIfNecessary()` returned an empty `checkMetaData` list for diskless resources
  3. The metadata creation block was skipped for diskless resources
  4. `hasMetaData()` assumed metadata exists whenever `hasDisk()` returns true, even during a retry when the storage exists but the metadata does not

This caused toggle-disk to fail with a "No valid meta data found" error from `drbdadm adjust`, especially for LUKS-encrypted resources.
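The fourth condition can be modeled in isolation. The sketch below is a hypothetical, self-contained simplification: `hasMetaDataOld`/`hasMetaDataNew` and their parameters are invented for illustration and are not the real linstor-server signatures.

```java
// Hypothetical, standalone model of the hasMetaData() shortcut described
// above; method names and parameters are invented for illustration only.
public class MetaDataCheck {
    // Old behavior: a backing disk implies metadata. That is wrong on a
    // retry, where storage was created but metadata creation never ran.
    static boolean hasMetaDataOld(boolean hasDisk, boolean metaDataOnDisk) {
        return hasDisk;
    }

    // Patched behavior: while a disk is being added, verify the metadata
    // for real instead of trusting the presence of the backing disk.
    static boolean hasMetaDataNew(boolean hasDisk, boolean metaDataOnDisk, boolean diskAdding) {
        return diskAdding ? metaDataOnDisk : hasDisk;
    }

    public static void main(String[] args) {
        // Retry scenario: disk exists, metadata does not, DISK_ADDING set.
        System.out.println("old: " + hasMetaDataOld(true, false));        // reports metadata as present
        System.out.println("new: " + hasMetaDataNew(true, false, true));  // detects missing metadata
    }
}
```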

Solution

Add DISK_ADD_REQUESTED/DISK_ADDING flag checks to:

  • Process child layers during toggle-disk
  • Include volumes in metadata check list
  • Enter metadata creation block
  • Force metadata verification during retry
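The effect of these flag checks can be sketched as a standalone truth-table model. This is a hypothetical simplification: the flag names mirror linstor-server's, but the class below only reproduces the boolean logic and is not the actual implementation.

```java
// Hypothetical, standalone model of the gating condition described above;
// the real code operates on Resource state flags, this class only
// reproduces the boolean logic for illustration.
public class ToggleDiskGate {
    // Unpatched: volumes enter the checkMetaData list only when the
    // resource is diskful, or when a disk is being removed.
    static boolean includeInMetaDataCheckOld(boolean diskless, boolean diskRemoving) {
        return !diskless || diskRemoving;
    }

    // Patched: a resource still flagged DISKLESS but carrying
    // DISK_ADD_REQUESTED/DISK_ADDING is also included.
    static boolean includeInMetaDataCheckNew(boolean diskless, boolean diskRemoving, boolean diskAdding) {
        return !diskless || diskRemoving || diskAdding;
    }

    public static void main(String[] args) {
        // During toggle-disk, DISKLESS is still set while the disk is added.
        boolean diskless = true, diskRemoving = false, diskAdding = true;
        System.out.println("old: " + includeInMetaDataCheckOld(diskless, diskRemoving));             // volume skipped
        System.out.println("new: " + includeInMetaDataCheckNew(diskless, diskRemoving, diskAdding)); // volume included
    }
}
```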

Test plan

  • Create diskless DRBD resource with LUKS encryption
  • Run `toggle-disk -s <storage-pool>` to convert it to diskful
  • Verify DRBD metadata is created before drbdadm adjust
  • Verify toggle-disk completes successfully
  • Test retry scenario: if first attempt fails, cleanup storage and retry

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
@kvaps kvaps marked this pull request as ready for review January 6, 2026 09:14
kvaps added a commit to cozystack/cozystack that referenced this pull request Jan 6, 2026
## What this PR does

Build piraeus-server (linstor-server) from source with custom patches:

- **adjust-on-resfile-change.diff** — Use actual device path in res file during toggle-disk; fix LUKS data offset
  - Upstream: [#473](LINBIT/linstor-server#473), [#472](LINBIT/linstor-server#472)
- **allow-toggle-disk-retry.diff** — Allow retry and cancellation of failed toggle-disk operations
  - Upstream: [#475](LINBIT/linstor-server#475)
- **force-metadata-check-on-disk-add.diff** — Create metadata during toggle-disk from diskless to diskful
  - Upstream: [#474](LINBIT/linstor-server#474)
- **skip-adjust-when-device-inaccessible.diff** — Skip DRBD adjust/res file regeneration when child layer device is inaccessible
  - Upstream: [#471](LINBIT/linstor-server#471)

Also updates plunger-satellite script and values.yaml for the new build.

### Release note

```release-note
[linstor] Build linstor-server with custom patches for improved disk handling
```

## Summary by CodeRabbit

* **New Features**
  * Added automatic DRBD stall detection and recovery, improving storage resync resilience without manual intervention.
  * Introduced configurable container image references via Helm values for streamlined deployment.

* **Bug Fixes**
  * Enhanced disk toggle operations with retry and cancellation support for better error handling.
  * Improved metadata creation during disk state transitions.
  * Added device accessibility checks to prevent errors when underlying storage devices are unavailable.
  * Fixed LUKS encryption header sizing for consistent deployment across nodes.
kvaps added a commit to cozystack/cozystack that referenced this pull request Jan 8, 2026
kvaps added a commit to cozystack/cozystack that referenced this pull request Jan 8, 2026
kvaps added a commit to cozystack/cozystack that referenced this pull request Jan 9, 2026
@ghernadi (Contributor)

Hi there!

For this PR, my questions and my failed reproducer attempt are basically the same as in my comment on PR 473.

I admit it might look a bit confusing why the code checks `boolean processChildren = !isDiskless || isDiskRemoving;` but not `|| isDiskAdding`. The point here is simply that during the "disk adding" state, the resource is not marked diskless. That is also the reason why I do not really see how your patch changes anything. Can you elaborate here?

@kvaps (Author)

kvaps commented Jan 14, 2026

Hi Gabor!

I believe there's a misunderstanding. Let me show the issue with code references.

Important context: This is about RETRY scenarios

Both PR #473 and #474 are related to PR #472 (LUKS data offset fix). The problem appears in retry scenarios:

  1. First toggle-disk attempt fails (e.g., due to the LUKS offset mismatch addressed in PR #472)
  2. Storage layer and LUKS device are already created
  3. User retries toggle-disk
  4. On retry: storage exists, but metadata doesn't exist yet
  5. Without our patches: metadata is never created → "No valid meta data found"

In your test, you're testing the happy path where everything succeeds on the first attempt. That works because the two devMgrRuns happen quickly in sequence. But in retry scenarios, the state is different.

DRBD_DISKLESS flag is NOT removed during toggle-disk

From CtrlRscToggleDiskApiCallHandler.java documentation (lines 99-100):

* <li>DISKLESS, DISK_ADD_REQUESTED - the transition has been requested but not yet started</li>
* <li>DISKLESS, DISK_ADD_REQUESTED, DISK_ADDING - the peers should prepare for resource to gain disks</li>

And markDiskAdded() (lines 1453-1458), which removes DRBD_DISKLESS, is called AFTER the satellite completes; see line 1068 in finishOperationInTransaction().

So during satellite processing: isDrbdDiskless() == true

The actual problem: checkMetaData is empty

Look at detachVolumesIfNecessary() (DrbdLayer.java:1117):

```java
if (!rsc.isDrbdDiskless(workerCtx) ||          // FALSE during toggle-disk!
    rsc.getStateFlags().isSet(workerCtx, Resource.Flags.DISK_REMOVING)  // FALSE
)
{
    // volumes added to checkMetaData
}
return checkMetaData;  // EMPTY LIST!
```

During toggle-disk: !true || false = false → block skipped → checkMetaData is empty!

Then at line 703:

```java
for (DrbdVlmData<Resource> drbdVlmData : checkMetaData)  // empty list!
{
    if (!hasMetaData(drbdVlmData) ...) { createMetaData.add(...); }
}
```

Loop doesn't execute → metadata is never created!

Why your test works (but retry fails)

Your test succeeds because:

  1. First devMgrRun: Storage/LUKS created, checkMetaData empty → no metadata
  2. Controller receives "success", calls markDiskAdded() → removes DRBD_DISKLESS
  3. Second devMgrRun: isDrbdDiskless=false → checkMetaData populated → metadata created
  4. Everything happens quickly → success

But in retry scenario:

  1. First attempt: Storage/LUKS created, then fails (e.g., LUKS offset issue)
  2. DRBD_DISKLESS flag is NOT removed (operation didn't complete)
  3. Retry: Storage already exists, but isDrbdDiskless() still true
  4. checkMetaData is empty → metadata never created → "No valid meta data found"

How to reproduce

  1. Create a diskless DRBD+LUKS resource
  2. Start toggle-disk, but make it fail somehow (e.g., kill the satellite mid-operation, or hit the PR #472 issue with different LUKS offsets between nodes)
  3. Retry toggle-disk
  4. Observe "No valid meta data found" error

Our patches ensure metadata is created even during retry when storage already exists but the DRBD_DISKLESS flag is still set.

Best regards

@kvaps (Author)

kvaps commented Apr 3, 2026

Decided not to use these changes for now, closing. Will reopen if the issues arise again.

@kvaps kvaps closed this Apr 3, 2026