Skip to content

feat(agent): add LVM on file-backed loop devices (spec.fileDevices)#302

Open
duckhawk wants to merge 32 commits into
mainfrom
feat/lvm-file-backed-devices
Open

feat(agent): add LVM on file-backed loop devices (spec.fileDevices)#302
duckhawk wants to merge 32 commits into
mainfrom
feat/lvm-file-backed-devices

Conversation

@duckhawk

@duckhawk duckhawk commented Jun 23, 2026

Copy link
Copy Markdown
Member

Description

Add support for creating LVM Volume Groups on top of file-backed loop devices via a new spec.fileDevices field in LVMVolumeGroup CRD. The agent creates preallocated files (fallocate), attaches them as loop devices (losetup), and uses them as LVM Physical Volumes. File devices can be used standalone or combined with blockDeviceSelector in a single VG.

Key changes:

  • Extended LVMVolumeGroup and LVMVolumeGroupSet CRDs with spec.fileDevices and status.nodes[].fileDevices. blockDeviceSelector is now optional; a CEL rule requires at least one of blockDeviceSelector/fileDevices, and existing fileDevices entries are immutable.
  • The LVMVolumeGroupSet controller now propagates spec.lvmVolumeGroupTemplate.fileDevices into the child LVMVolumeGroup resources it creates and updates, so a file-only (or mixed) Set template actually provisions file-backed VGs.
  • Removed loop from the agent's own LVM --config global_filter (LVMGlobalFilter) and from ForeignDeviceBasePrefixes/FilterForeignPVs, so the agent can see and manage its file-backed loop PVs. Ownership is established by the backing-file owner pattern (IsManagedFileDevicePath) gated on the VG's storage.deckhouse.io/lvmVolumeGroupName tag.
  • Added 7 new agent host commands: CreateFileDevice, SetupLoopDevice, DetachLoopDevice, FindLoopDeviceByFile, GetLoopBackingFile, RemoveFileDevice, EnsureFileDeviceDirectory.
  • Implemented file device provisioning in the create flow (with scoped, detached-context rollback), incremental growth in the update flow (extendFileDevicesIfNeeded), and cleanup in the delete flow.
  • Added startup recovery: the agent re-attaches loop devices before VG activation after a node reboot.
  • Extended the discoverer to reflect file-backed PVs in status.nodes[].fileDevices, including resolving alias-reported (/dev/disk/by-id, /dev/block/MAJ:MIN) loop PVs to their canonical /dev/loopN. The update/extend path applies the same alias resolution before deciding whether a managed loop is already a PV of the VG, so it never re-runs pvcreate on an existing PV.
  • Updated NodeGroupConfiguration: the host-wide LVM global_filter still rejects /dev/loop (host lvm/pvs continue to ignore loop devices) and the multipath loop blacklist is retained. Only the agent's own per-command --config global_filter allows loop, so managed file-backed PVs are visible to the agent but not to host-wide LVM.

Thin-pool fixes (apply to block-device and file-backed VGs alike):

  • A thin pool sized 100% is now accepted at create time. During create validation the VG size is the raw (non-extent-aligned) sum of devices; percentage sizes are aligned down for the capacity check so a full-VG thin pool fits, while absolute sizes still round up so a genuinely oversized pool is rejected.
  • A no-op lvextend -l 100%VG (when a 100% thin pool already fills the VG, which it always does up to thin-pool metadata) is treated as success. The benign LVM message is filtered for both the old wording (No size change) and newer LVM 2.03.x wording (New size (N extents) matches existing size (M extents)), so a healthy pool no longer flaps VGConfigurationApplied=False.

Hardening from code review:

  • Unmanaged, purely loop-backed VGs (e.g. nested LVM inside a guest VM's file-backed disk attached via losetup) are dropped from the cache by utils.FilterForeignLoopPVs. Removing loop from the foreign-PV filter otherwise let such a VG into the cache, where a name collision with a managed VG made findDuplicateVGNames take the healthy managed LVMVolumeGroup offline. Managed loop VGs (tagged storage.deckhouse.io/enabled=true) and block-backed VGs are kept.
  • createVGComplex now skips pvcreate for any device that is already an LVM PV (with alias resolution for loop PVs), mirroring the guard in extendFileDevicesIfNeeded, so a retry after a partial create no longer wedges VGConfigurationApplied with "already a PV".
  • The fileDevices[].size docs and the agent's validation message warn that decimal units (1G = 10⁹ bytes) are below the 1Gi minimum, since an existing entry's size is immutable.

Data-integrity fixes (found via manual cluster testing + a from-scratch re-review):

  • Create-rollback corruption. On real clusters a file-backed VG could come up with its single backing file attached to two loop devices (one shown (deleted)), both registered as separate PVs, doubling the VG and placing a thin pool partly on a deleted-file loop. Root cause: CreatePV treated the benign File descriptor … leaked on lvm invocation exit (lvm.static under nsenter) as a failure even though pvcreate had already written the PV label; the create rollback then used the broad cleanupFileDevices and deleted that live PV's backing file, and the next reconcile re-provisioned a second loop. Fixed at two layers: (a) CreatePV now routes stderr through filterStdErr and errors only on a non-benign line (mirroring ExtendLV), removing the false failure; (b) the create/extend paths use a scoped rollbackProvisionedFileDevices that tears down only what the current call provisioned and never a loop that is already an LVM PV (verified against a fresh lvm pvs, not the stale cache; if the listing fails it tears nothing down — a leaked loop/file is recoverable, corrupting a live VG is not).
  • Symlinked directory double-counting. validateLVGForUpdateFunc matched existing file devices by full path, but status.nodes[].fileDevices[].FilePath comes from losetup, which canonicalizes symlink components of the directory. With a symlinked directory the already-provisioned device was counted as new on every reconcile, inflating the VG size used for thin-pool validation. Now matched by basename (stable across the symlink canonicalization).
  • Extend-path pvcreate guard. extendFileDevicesIfNeeded trusted the PV snapshot captured at the top of the reconcile; a loop that became a canonical /dev/loopN PV afterwards was re-pvcreated and failed "already a PV", wedging the condition. It now takes the union of that snapshot and a fresh cache read before deciding (mirroring the create path).

Verified on a fresh cluster (deckhouse pr302 build): the exact scenario that previously doubled the thin-on-file VG now produces a correct single-PV VG (one loop, one PV), with no pvcreate error, no rollback and no second loop in the agent log. All four combinations thin/thick × block/file reach Ready and provision real LVMLogicalVolumes.

Why do we need it, and what problem does it solve?

When dedicated block devices are not available on a node, there is no way to create an LVM Volume Group using the current sds-node-configurator. This feature allows users to allocate part of an existing filesystem for LVM by specifying a directory and size in spec.fileDevices, enabling LVM-based storage provisioning without dedicated disks.

What is the expected result?

  • Creating an LVMVolumeGroup with spec.fileDevices provisions preallocated files, attaches them as loop devices, creates PVs and a VG
  • status.nodes[].fileDevices reflects the file path, canonical loop device, PV size, and PV UUID for each file device
  • After a node reboot, the agent automatically re-attaches loop devices before activating VGs
  • Deleting an LVMVolumeGroup with file devices detaches loop devices and removes backing files
  • fileDevices and blockDeviceSelector can be used together in a single VG; thin-pool sizing accounts for file-backed capacity on both the create and update paths
  • An LVMVolumeGroupSet whose template specifies fileDevices provisions file-backed child LVMVolumeGroup resources on the selected nodes
  • A thin pool sized 100% (e.g. on top of a single file device or block device) reaches VGConfigurationApplied=True/Ready instead of being rejected at validation or flapping on a no-op resize
  • Loop devices that are NOT managed by the agent (backing file does not match the owner pattern within a tagged managed VG) are safely ignored
  • File device size semantics — minimum 1Gi and per-VG backing-file uniqueness — are validated by the agent and surfaced on the VGConfigurationApplied condition (consistent with how the other size fields in these CRDs are validated). The CRD size pattern only constrains the format.

Checklist

  • The code is covered by unit tests.
  • e2e tests passed.
  • Documentation updated according to the changes.
  • Changes were tested in the Kubernetes cluster manually (thin/thick × block/file VGs and LVMLogicalVolumes; a create-rollback corruption found and fixed).

duckhawk added 7 commits June 23, 2026 18:30
Agents: Claude Opus 4.6 (ведущий, авто-режим)
…er marker

provisionFileDevices used to allocate a fresh loop minor on every retry
because `losetup --find --show` does not deduplicate, leaking loop
devices until the system-wide limit. It also left half-attached files
on the disk whenever any later step (CreatePV, vgcreate) failed because
there was no compensating cleanup.

cleanupFileDevices walked status.nodes[].fileDevices only, so a crash
between fallocate and the next discoverer pass would leak the backing
files forever. It also swallowed losetup errors, allowing the finalizer
to be removed while a busy loop device was still alive.

The basename was tied to the slice index (`sds-<lvg>-<i>.img`), so any
reorder/insert/delete in `spec.fileDevices` silently retargeted an
existing entry to a different file.

This commit rewrites both routines on top of two helpers
(BuildFileDevicePath / IsManagedFileDevicePath) that derive the basename
from a SHA-256 prefix of (directory, size). The pattern doubles as the
owner marker: the cleanup refuses to rm any path whose basename does
not match, which protects unrelated host files (libvirt qcow2, snap
loops, …) from being clobbered if a foreign loop PV ever slips into
status.

Concrete changes:

- provisionFileDevices now queries FindLoopDeviceByFile first and reuses
  the existing loop device if there is one; otherwise it tracks every
  file it creates and every loop it attaches, and rolls them back in
  LIFO order when any step in the loop fails.
- cleanupFileDevices walks the union of spec.fileDevices and
  status.nodes[].fileDevices, returns an error when any losetup -d
  fails (so the finalizer stays in place and the reconcile retries),
  and refuses to act on paths that do not match the managed pattern.
- validateFileDevice rejects relative directories and `..` segments —
  defense in depth against runaway fallocate in arbitrary locations.
- reconcileLVGDeleteFunc now propagates the cleanup error, sets
  TypeVGConfigurationApplied=False/Terminating, and keeps the
  finalizer until the next pass succeeds.

Agents: Claude Opus 4.7 (ведущий, авто-режим)
…n on failure

ReattachFileDevices used to be fire-and-forget: a missing backing file
or an EIO on losetup --find --show was logged at Warning level and the
function returned cleanly. ActivateAllManagedVGs then proceeded to
bring up VGs whose PVs were still missing, which left the affected
volume groups in a degraded state with no user-visible signal — the
first symptom was usually a confusing "device busy" or "no such file"
when a PVC tried to mount much later.

The function now collects every per-device failure, joins them with
errors.Join, and returns a non-nil error if anything failed. The
startup runnable in cmd/main.go skips ActivateAllManagedVGs for this
pass when the reattach failed and logs the error so it is visible in
pod logs and pickable by alerting on agent error rate.

Agents: Claude Opus 4.7 (ведущий, авто-режим)
The discoverer used to run a private `nsenter -t 1 -m -- /bin/cat
/sys/block/<loop>/loop/backing_file` per loop PV per discovery cycle.
This duplicated the nsenter argv that already lives in
nsentrerExpendedArgs, bypassed exec.CommandContext (no timeout, the
reconciler would hang on a stuck cat), and ran an external process
from the discoverer's domain layer — outside the Commands interface
that all other host commands flow through and that tests mock.

This commit folds the read into a new Commands method,
GetLoopBackingFile, implemented on top of `losetup --noheadings
--output BACK-FILE <loopDev>`. The discoverer now calls it with a
10s context.WithTimeout, the gomock recorder is regenerated, and
fileDevice discovery is wired through the owner marker added in the
previous commit: any loop PV whose basename does not match the
managed `sds-<lvg>-<hash>.img` pattern (or whose VG carries no
storage.deckhouse.io/lvmVolumeGroupName tag) is skipped with a
warning instead of being written into status.

While here, drop the redundant `Nodes: nil, FileDeviceNodes: nil`
literal that was assigned and then immediately overwritten, and the
now-unused bytes/os/exec imports in discoverer.go.

Agents: Claude Opus 4.7 (ведущий, авто-режим)
CRD validation now rejects in-place edits of fileDevices[].directory
and fileDevices[].size via `x-kubernetes-validations: self == oldSelf`.
docs/FAQ.md already promised "No resize: file device size cannot be
changed after creation", but nothing actually enforced it: increasing
size would fallocate-extend the file while the loop device kept its
captured (old) size, and shrinking would leave a file larger than the
spec entry asked for. With the basename derived from (directory, size)
this would also silently retarget the entry to a different file.

The Go side gets a small bit of housekeeping: LVMVolumeGroupTemplate
in lvm_volume_group_set.go now uses `json:"blockDeviceSelector,omitempty"`,
matching the change applied to LVMVolumeGroupSpec in this PR. Without
it, marshalling a template with no selector would emit
`"blockDeviceSelector":null` while marshalling the spec form would
omit the field entirely.

The struct field alignment of both files is re-`gofmt`-ed for the
columns to line up after the json-tag change.

Agents: Claude Opus 4.7 (ведущий, авто-режим)
…ming

Adds unit tests for the safety properties added in the previous
commits, all driven through gomock against the Commands interface:

- BuildFileDevicePath / IsManagedFileDevicePath: determinism, that
  different (directory, size) tuples produce different basenames, and
  that foreign paths (libvirt qcow2, snap squashfs, a path with a too
  short or non-hex hash) are correctly rejected even when an empty
  lvgName is passed.

- ReattachFileDevices: already-attached files are skipped (no fresh
  losetup --find --show, which would have leaked loops); a per-device
  losetup failure does not abort the loop and the joined error
  surfaces every failing file path; FindLoopDeviceByFile errors are
  propagated.

- provisionFileDevices: reuses an existing loop attachment, and on
  partial failure rolls back both the file and the loop in LIFO
  order so a retry sees a clean slate.

- cleanupFileDevices: walks the union of spec.fileDevices and
  status.nodes[].fileDevices (so files created but never reflected
  in status are still removed), returns an error when losetup -d
  fails (so the finalizer cannot be removed while a busy loop is
  still alive), and refuses to act on paths whose basename does not
  match the managed pattern.

- validateFileDevice: rejects relative directories and `..` segments.

Agents: Claude Opus 4.7 (ведущий, авто-режим)
POSIX text files end with a newline; YAML tools and shell editors
otherwise show "No newline at end of file" on every diff.

Agents: Claude Opus 4.7 (ведущий, авто-режим)
@duckhawk duckhawk force-pushed the feat/lvm-file-backed-devices branch from 7ae49b3 to ec49aec Compare June 23, 2026 15:31
duckhawk and others added 22 commits June 23, 2026 18:38
Agents: Claude Opus 4.7 (ведущий, авто-режим)
Add context and command timeouts for loop/file-device operations,
validate duplicate spec entries and directory writability, restore host
LVM loop global_filter, fix CRD fileDevices list map semantics, and
update tests/mocks.
Signed-off-by: v.oleynikov <vasily.oleynikov@flant.com>
Transition rule перенесена с полей элементов map-list на spec/lvmVolumeGroupTemplate:
oldSelf на item-level внутри fileDevices отклонялся apiserver как uncorrelatable
и превышал CEL cost budget. Добавлены maxItems и maxLength для снижения стоимости правил.

Agents: Composer 2.5 (ведущий, авто-режим)
reattachFileDevicesAtStartup использует time.Duration; без импорта
падает сборка agent-golang-artifact.

Agents: Composer 2.5 (ведущий, авто-режим)
Signed-off-by: v.oleynikov <vasily.oleynikov@flant.com>
Review fixes for spec.fileDevices (PR #302):

- update path: provision and vgextend new fileDevices entries.
  provisionFileDevices was only reachable from createVGComplex, so the
  documented "add a new fileDevices entry to grow the VG" flow passed
  validation but silently did nothing. Add extendFileDevicesIfNeeded to
  reconcileLVGUpdateFunc (idempotent: reuses attached loops, only
  vgextends loops not already PVs of the VG).

- startup reattach no longer suppresses ActivateAllManagedVGs. A single
  file-device reattach failure used to skip activation of every VG on the
  node (including pure block-device ones) for the pod lifetime. It is now
  best-effort; the LVG reconciler retries reattach idempotently via
  provisionFileDevices, so transient boot losetup failures self-heal.

- cleanupFileDevices: for spec-derived targets (no recorded loop device,
  e.g. crash mid-provision) resolve the loop via losetup -j and detach it
  before rm, so the backing file is never removed while a loop is still
  bound to it.

- provisionFileDevices rollback runs on a detached context. The trigger
  is frequently the reconcile ctx being cancelled (SIGTERM/deadline),
  under which exec.CommandContext refuses to start, leaving the
  just-created loop+file dangling.

- FindLoopDeviceByFile parses only the first `losetup -j` line instead of
  splitting the whole multi-line buffer.

- gofmt: fix stray tabs after := in commands.go (RunWithTimeout rename
  artifact) and a misaligned struct in activation.go.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- discoverer: hasStatusNodesDiff now diffs status.nodes[].fileDevices, so a
  loop minor / pvUUID / size change (e.g. a new minor after reboot) actually
  refreshes status instead of being stuck on a stale value.
- discoverer: recognize managed loop PVs that LVM reports under
  /dev/block/MAJ:MIN or /dev/disk/by-id aliases (udev unavailable), not only
  /dev/loopN; the alias probe is quiet so unregistered BlockDevices do not
  spam warnings.
- reconciler/cleanupFileDevices: always re-resolve the loop from the backing
  file before detaching instead of trusting the loopDevice recorded in
  status, which can be stale after a reboot or reused by a foreign device;
  loop-only targets are confirmed managed via the backing file first.
- reconciler/createVGComplex: roll back provisioned file devices (detach
  loops, remove files) when a later pvcreate/vgcreate fails, so a failed
  create no longer leaks loop devices and preallocated files.
- activation: fix ReattachFileDevices doc to match the actual non-fatal
  startup contract (caller continues to ActivateAllManagedVGs; reconciler
  retries idempotently).
- mock_utils: restore the Apache license headers corrupted/stripped by the
  mock regeneration (block_device.go, syscall.go).

Tests updated/added for the stale-loop re-resolution path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A file-only LVMVolumeGroup (only spec.fileDevices) carries no
blockDeviceSelector now that the field is optional. The reconciler panicked
on it:

- validateSpecBlockDevices dereferenced lvg.Spec.BlockDeviceSelector
  (nil pointer panic at Reconcile), and
- GetAPIBlockDevices(nil) maps an empty selector to "select all", so a
  file-only group would also have pulled in EVERY BlockDevice on the node.

Reconcile now skips block-device discovery and validation entirely when
there is no selector and reconciles only the file devices; the create/update
paths already tolerate an empty block-device set. validateSpecBlockDevices
also guards the nil selector before iterating MatchExpressions as defense in
depth.

Adds a regression test for the file-only (nil selector) validation path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Previously the agent required spec.fileDevices[].directory to already exist
and be writable on the node, rejecting the LVMVolumeGroup otherwise (test -d
-a -w). Requiring a manual mkdir on every node is poor UX for a
cluster-managed feature.

Now the agent creates the directory on demand during provisioning:
- replace the CheckFileDeviceDirectory command (test -d -a -w) with
  EnsureFileDeviceDirectory (mkdir -p), called before fallocate;
- validation keeps only structural checks (absolute path, no '..', min size);
  an unusable path (read-only FS, non-directory in the way) now surfaces as a
  provisioning error on the VGConfigurationApplied condition instead of a
  hard validation failure.

Mocks, tests and CRD/FAQ docs updated accordingly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…PVs on extend

Two blocking issues found in review of the file-backed loop devices feature:

1. LVMVolumeGroupSet dropped spec.fileDevices. configureLVGBySet and
   updateLVMVolumeGroupByConfiguredFromSet copied every template field
   except FileDevices, so a file-only Set template (no blockDeviceSelector)
   was admitted by the relaxed Set CEL rule but produced child LVGs with
   neither selector nor fileDevices — rejected by the child LVMVolumeGroup's
   own CEL rule. Copy FileDevices on both the create and update paths.

2. extendFileDevicesIfNeeded re-added an already-attached loop PV. The
   "already a PV" check matched provisionFileDevices' canonical /dev/loopN
   against pvs[].PVName literally, but lvm.static (no udev integration)
   frequently reports a managed loop PV under a /dev/disk/by-id or
   /dev/block/MAJ:MIN alias — the same aliasing the discoverer already
   resolves. The literal miss handed the existing loop to pvcreate/vgextend
   again, failing pvcreate ("already a PV") and wedging the
   VGConfigurationApplied condition on every update reconcile. Resolve
   alias-form PV names via a canonical-path resolver (injectable for tests)
   before deciding a loop is new.

Adds unit tests for both fixes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…cal loop

Two correctness fixes for the file-backed loop device feature:

1. Thin-pool sizing on the create path used countVGSizeByBlockDevices, which
   sums only block devices. A file-only VG therefore looked zero-sized:
   percentage thin pools collapsed to 0 and absolute-sized thin pools fell
   into the full-VG-space branch, taking the whole VG instead of the
   requested size. Add a countVGSizeByFileDevices helper and include
   spec.fileDevices capacity, mirroring validateLVGForCreateFunc.

2. The discoverer's alias-probe path recorded the alias (/dev/disk/by-id or
   /dev/block/MAJ:MIN) as status.loopDevice. On the next reconcile where lvm
   reported the same PV as /dev/loopN, hasStatusNodesDiff saw a change and
   rewrote status, flip-flopping forever. Re-resolve the canonical /dev/loopN
   from the backing file before writing status.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Address review findings on the spec.fileDevices feature:

- extendFileDevicesIfNeeded now rolls back the loop device + backing file
  it just provisioned if a later pvcreate/vgextend fails, instead of
  leaking them on the node (createVGComplex already had this). The
  rollback is scoped to only the devices this call created, so existing
  healthy file devices of the same VG are never touched.
  provisionFileDevices returns the newly-created artifacts to support it.

- reattachFileDevicesAtStartup also derives backing-file paths from
  spec.fileDevices (deterministic BuildFileDevicePath), not just status.
  A node that rebooted before the discoverer wrote status.nodes[].fileDevices
  can now recover its loop mappings before ActivateVGs. ReattachFileDevices
  never creates files, so a not-yet-provisioned entry is a harmless miss.

- Correct stale docs/log that still claimed loop devices are filtered
  out: removing loop from LVMGlobalFilter/ForeignDeviceBasePrefixes is
  intentional, and ownership is enforced downstream via the backing-file
  owner pattern + VG tag. Fixed the scanner drop log, FilterForeignPVs /
  IsForeignDeviceBase docstrings.

- convertLVMVGNodes: fast path (single pass, no node-name set) for the
  common no-file-devices case on the discover hot path.

- Document that status.fileDevices[].size is the PV size (LVM metadata
  overhead), mirroring status.nodes[].devices[].pvSize; align the CRD
  description and add cross-references between the two loop-PV alias
  resolution helpers (reconciler readlink vs discoverer losetup).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…den loop alias resolution

Two correctness fixes in the file-backed loop device flow:

- validateLVGForUpdateFunc ignored the capacity of newly-added
  spec.fileDevices entries when validating thin-pool sizes. The create
  path was already fixed via countVGSizeByFileDevices, but the update
  path computed newTotalVGSize as vg.VGSize + additionBlockDeviceSpace
  only. A combined "append a fileDevices entry + grow thin-pools" edit
  was therefore validated against the pre-extend VG size and wrongly
  rejected (or forced into the full-VG-space branch). Add
  additionFileDeviceSpace, mirroring additionBlockDeviceSpace, counting
  only spec entries not yet reflected in status.nodes[].fileDevices so
  already-provisioned devices are not double-counted.

- loopAlreadyRegisteredAsPV treated a resolver failure on an alias PV as
  "not a match", contradicting its own comment. That handed a possibly
  already-registered loop to pvcreate, which fails ("already a PV") and
  wedges the VGConfigurationApplied condition until the next reconcile.
  Be conservative: on resolver failure, report the loop as already
  registered so the extend is skipped this round and retried.

Add regression tests for both paths.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…vices

The three file-device rollback sites (provisionFileDevices,
extendFileDevicesIfNeeded, createVGComplex) each open-coded the same
detached-context-with-deadline setup. Extract it into
Reconciler.newRollbackContext so the "cleanup must survive a cancelled
reconcile ctx" invariant lives in one place and the blocks cannot drift.

No behavior change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Address review findings on the spec.fileDevices feature:

- SetupLoopDevice: pass `losetup --nooverlap` so a race between startup
  reattach and the reconciler (or an alias-only existing attachment) reuses
  the existing loop instead of binding a second minor that nothing reaps.
- GetLoopBackingFile: strip the " (deleted)" marker losetup appends when the
  backing file was unlinked while still attached. Without it
  IsManagedFileDevicePath misses the basename and cleanup refuses to detach
  the loop, stranding the minor forever.
- hasStatusNodesDiff: match file devices by FilePath instead of slice
  position. The candidate slice is built in raw `lvm pvs` report order
  (unsorted within a VG) and a loop PV can be reported under /dev/loopN or an
  alias across scans, so a positional compare churned status every reconcile.
- extendFileDevicesIfNeeded: when every new loop is skipped only because an
  alias PV could not be resolved, surface a retrying condition and return an
  error so the reconcile requeues, instead of silently reporting the
  configuration applied while the file device never joined the VG.

Add unit tests for each fix.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…size

A thin pool sized 100% can never reach the full VG size because LVM
reserves space for thin-pool metadata, so the reconciler perpetually sees
actualSize < requestedSize and re-runs `lvextend -l 100%VG`. That command
exits non-zero on a no-op resize; filterStdErr is meant to swallow the
benign message, but it only matched the older LVM wording "No size change".

Newer LVM (2.03.x, e.g. Ubuntu) instead prints
"New size (N extents) matches existing size (M extents).", which slipped
through the filter and surfaced as ThinPoolReconcileFailed /
VGConfigurationApplied=False on a perfectly healthy pool. Add a regex for
the new wording. Affects any 100%-sized thin pool, not only file-backed VGs.

Adds a filterStdErr unit test covering both wordings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A thin pool sized "100%" was rejected at create with "Required space for
thin-pools N is more than VG size M". The percentage is computed from the
raw block-device/file sum, which is not extent-aligned, and the validator
rounded it UP to the next extent — landing one extent past the VG size and
failing the capacity check, even though the pool is created with %FREE and
fits the actual (extent-aligned) VG.

Round percentage sizes DOWN for the capacity check instead (a percentage of
the VG can never legitimately exceed it; this also makes split percentages
like "50%"+"50%" sum to at most the VG). Absolute sizes keep rounding up so
a genuinely oversized absolute pool is still rejected.

Adds tests for a single "100%" pool (passes) and "100%" + another pool
(fails).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Address issues found while reviewing the spec.fileDevices feature:

- Drop unmanaged, purely loop-backed VGs from the cache
  (utils.FilterForeignLoopPVs). Removing loop from the foreign-PV
  filter let a guest VM's nested-LVM loop VG into the cache, where a
  name collision with a managed VG made findDuplicateVGNames take the
  healthy managed LVMVolumeGroup offline. Managed loop VGs (tagged
  storage.deckhouse.io/enabled=true) and block-backed VGs are kept.

- Skip pvcreate for devices that are already an LVM PV in
  createVGComplex (with alias resolution for loop PVs), mirroring the
  guard in extendFileDevicesIfNeeded. Prevents a retry after a partial
  create from wedging VGConfigurationApplied with "already a PV".

- Warn about decimal size units in the fileDevices size docs and the
  agent validation message: "1G" (10^9) is below the 1Gi minimum and,
  being immutable, can otherwise only be fixed by recreating the LVG.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Cluster testing of file-backed VG creation surfaced a data-integrity bug:
under concurrent LVG creation (e.g. several LVGs applied at once), a
file-backed VG could come up with its single backing file attached to two
loop devices — one of them shown "(deleted)" — both registered as separate
PVs, doubling the VG and placing a thin pool partly on a deleted-file loop.
Reproduced on two fresh clusters.

Root cause: createVGComplex rolled back a failed create via the broad
cleanupFileDevices (the delete-path cleanup), which walks spec+status and
removes every backing file of the LVG. When a concurrent reconcile — or a
pvcreate/vgcreate that materially succeeded but returned a non-zero status
("File descriptor leaked on lvm invocation", exit 5) — had already turned
the loop into a live PV of the VG, the rollback deleted that PV's backing
file. The next reconcile then re-provisioned a second loop and vgextended
it, doubling the VG.

Fix: replace the broad rollback on the create and extend paths with a
scoped rollbackProvisionedFileDevices that tears down ONLY the file devices
the current call provisioned, and:
  - lists PVs authoritatively (fresh `lvm pvs`, not the stale cache) and
    SKIPS any provisioned loop that is already an LVM PV (canonical or via
    a /dev/disk//dev/block alias), so an in-use loop is never torn down;
  - leaves everything in place if the PV listing fails (a leaked loop/file
    is recoverable and reused on the next reconcile; corrupting a live VG
    is not);
  - never removes a backing file whose loop it could not detach.

With this, a create whose pvcreate spuriously reports failure keeps the
loop+file, and the retry reuses the same loop instead of allocating a
second one.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Three issues found in a from-scratch re-review:

1. validateLVGForUpdateFunc counted an already-provisioned file device as
   new whenever its directory is a symlink. status FilePath comes from
   `losetup --output BACK-FILE`, which canonicalizes symlink components,
   so it never matched the literal BuildFileDevicePath. Match by basename
   (stable: the hash is derived from the literal spec directory and losetup
   leaves the basename untouched) instead of full path, so thin-pool sizing
   validation no longer inflates the VG size and wrongly accepts a pool that
   then fails to create.

2. extendFileDevicesIfNeeded trusted the PV snapshot captured at the top of
   reconcileLVGUpdateFunc; a loop that became a canonical /dev/loopN PV after
   that snapshot (and is not an alias-form PV, the only form
   loopAlreadyRegisteredAsPV inspects) was handed to pvcreate again and failed
   "already a PV", wedging the condition. Take the union of the snapshot and a
   fresh cache read before deciding, mirroring createVGComplex.

3. CreatePV treated the benign "File descriptor N leaked on lvm invocation"
   exit (lvm.static under nsenter) as a hard failure, tripping the
   create/extend rollback against a device that pvcreate had in fact already
   turned into a healthy PV — the real trigger behind the doubled-VG corruption
   the rollback safety net contains. Route stderr through filterStdErr and
   error only when a non-benign line remains, mirroring ExtendLV.

Added regression tests for 1 and 2.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Addresses two review findings on the file-backed LVM feature:

- Refuse to allocate a backing file larger than the free space of the
  node's filesystem. provisionFileDevices now queries available bytes
  (stat -f in PID 1's mount namespace, via the new GetAvailableBytes
  command) right after mkdir and before fallocate, and reports a clear
  condition instead of letting `fallocate -l` fill the node's root
  filesystem and trip kubelet DiskPressure eviction. The check uses the
  non-superuser available-block count (%a), so the filesystem reserve is
  left as built-in headroom; it is best-effort and falls through to
  fallocate (which still fails cleanly on ENOSPC) when free space cannot
  be determined.

- Attach loop devices with `losetup --direct-io=on` so backing-file I/O
  bypasses the node filesystem's page cache, avoiding the double page
  cache (volume FS + backing file) that doubles RAM use and throttles
  throughput. The kernel silently falls back to buffered I/O on
  misalignment, so requesting it is always safe.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
duckhawk and others added 3 commits June 30, 2026 11:52
… dir

spec.fileDevices[].directory previously accepted any absolute host path, so
a typo could place a backing file under /, /etc or /var/lib/kubelet and fill
the node's filesystem. The agent now confines backing files to a base
directory, defaulting to /opt/deckhouse/sds/file-devices and overridable via
the new fileDevicesDirectory module setting (e.g. to point at a dedicated
data disk). Each directory must be that base or a subdirectory of it;
anything outside the subtree is rejected on the VGConfigurationApplied
condition.

- openapi: new config-values fileDevicesDirectory (+ru doc)
- daemonset: plumb it to the agent as FILE_DEVICES_DIRECTORY
- agent config: read the env with the /opt/deckhouse/sds/file-devices default
- lvg reconciler: validateFileDevice enforces the base-dir allowlist
- docs/CRD: document the constraint and update examples

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…laim

#6: extendFileDevicesIfNeeded previously requeued forever under the generic
"Updating" reason whenever alias-form PV names could not be resolved, so a
persistently broken resolver (missing nsenter binary, a dangling alias) was
indistinguishable from an ordinary in-flight update and could never be
alerted on. The reconciler now tracks consecutive resolver-only no-progress
rounds per LVG and, after a threshold, switches the VGConfigurationApplied
condition to a dedicated ReasonAliasResolutionFailed and logs at Error level.
The streak resets as soon as a round makes progress, has nothing to do, or
the LVG is deleted.

#4: document space reclamation for file-backed devices in the FAQ — the
discard/TRIM passdown chain (fstrim or mount -o discard punches holes in the
preallocated backing file via the default discards=passdown thin pool), plus
its caveats (chunk-size granularity, snapshots pinning chunks, shrink only
after write+delete+trim).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds a "LVMVolumeGroup with file-backed devices" Ginkgo Context (label
e2e-tests) plus helpers. Unlike the other scenarios these need no dedicated
disk or nested VM: the VG is created with spec.fileDevices on any node that
runs the agent, and node-side state is checked over SSH. Backing files go in
the default base dir /opt/deckhouse/sds/file-devices.

Scenarios:
- create a file-only VG → Ready; assert status.fileDevices (filePath under
  the base dir, loopDevice, pvUUID) and that the backing file, loop device
  and VG exist on the node; on delete, assert the file is removed and the
  loop detached;
- thin-pool on a file-backed device → pool Ready + data LV present on node;
- extend a file-backed VG by appending a fileDevices entry → VG grows;
- reattach after the agent pod restarts → VG stays Ready, loop reattached;
- negative: a directory outside the allowed base dir is rejected with
  VGConfigurationApplied=False / ValidationFailed and nothing is provisioned.

Documents the scenarios and their prerequisites (branch build, node free
space) in e2e/README.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant