feat(agent): add LVM on file-backed loop devices (spec.fileDevices)#302
Open
duckhawk wants to merge 32 commits into
Open
feat(agent): add LVM on file-backed loop devices (spec.fileDevices)#302duckhawk wants to merge 32 commits into
duckhawk wants to merge 32 commits into
Conversation
Agents: Claude Opus 4.6 (ведущий, авто-режим)
…er marker provisionFileDevices used to allocate a fresh loop minor on every retry because `losetup --find --show` does not deduplicate, leaking loop devices until the system-wide limit. It also left half-attached files on the disk whenever any later step (CreatePV, vgcreate) failed because there was no compensating cleanup. cleanupFileDevices walked status.nodes[].fileDevices only, so a crash between fallocate and the next discoverer pass would leak the backing files forever. It also swallowed losetup errors, allowing the finalizer to be removed while a busy loop device was still alive. The basename was tied to the slice index (`sds-<lvg>-<i>.img`), so any reorder/insert/delete in `spec.fileDevices` silently retargeted an existing entry to a different file. This commit rewrites both routines on top of two helpers (BuildFileDevicePath / IsManagedFileDevicePath) that derive the basename from a SHA-256 prefix of (directory, size). The pattern doubles as the owner marker: the cleanup refuses to rm any path whose basename does not match, which protects unrelated host files (libvirt qcow2, snap loops, …) from being clobbered if a foreign loop PV ever slips into status. Concrete changes: - provisionFileDevices now queries FindLoopDeviceByFile first and reuses the existing loop device if there is one; otherwise it tracks every file it creates and every loop it attaches, and rolls them back in LIFO order when any step in the loop fails. - cleanupFileDevices walks the union of spec.fileDevices and status.nodes[].fileDevices, returns an error when any losetup -d fails (so the finalizer stays in place and the reconcile retries), and refuses to act on paths that do not match the managed pattern. - validateFileDevice rejects relative directories and `..` segments — defense in depth against runaway fallocate in arbitrary locations. - reconcileLVGDeleteFunc now propagates the cleanup error, sets TypeVGConfigurationApplied=False/Terminating, and keeps the finalizer until the next pass succeeds. Agents: Claude Opus 4.7 (ведущий, авто-режим)
…n on failure ReattachFileDevices used to be fire-and-forget: a missing backing file or an EIO on losetup --find --show was logged at Warning level and the function returned cleanly. ActivateAllManagedVGs then proceeded to bring up VGs whose PVs were still missing, which left the affected volume groups in a degraded state with no user-visible signal — the first symptom was usually a confusing "device busy" or "no such file" when a PVC tried to mount much later. The function now collects every per-device failure, joins them with errors.Join, and returns a non-nil error if anything failed. The startup runnable in cmd/main.go skips ActivateAllManagedVGs for this pass when the reattach failed and logs the error so it is visible in pod logs and pickable by alerting on agent error rate. Agents: Claude Opus 4.7 (ведущий, авто-режим)
The discoverer used to run a private `nsenter -t 1 -m -- /bin/cat /sys/block/<loop>/loop/backing_file` per loop PV per discovery cycle. This duplicated the nsenter argv that already lives in nsentrerExpendedArgs, bypassed exec.CommandContext (no timeout, the reconciler would hang on a stuck cat), and ran an external process from the discoverer's domain layer — outside the Commands interface that all other host commands flow through and that tests mock. This commit folds the read into a new Commands method, GetLoopBackingFile, implemented on top of `losetup --noheadings --output BACK-FILE <loopDev>`. The discoverer now calls it with a 10s context.WithTimeout, the gomock recorder is regenerated, and fileDevice discovery is wired through the owner marker added in the previous commit: any loop PV whose basename does not match the managed `sds-<lvg>-<hash>.img` pattern (or whose VG carries no storage.deckhouse.io/lvmVolumeGroupName tag) is skipped with a warning instead of being written into status. While here, drop the redundant `Nodes: nil, FileDeviceNodes: nil` literal that was assigned and then immediately overwritten, and the now-unused bytes/os/exec imports in discoverer.go. Agents: Claude Opus 4.7 (ведущий, авто-режим)
CRD validation now rejects in-place edits of fileDevices[].directory and fileDevices[].size via `x-kubernetes-validations: self == oldSelf`. docs/FAQ.md already promised "No resize: file device size cannot be changed after creation", but nothing actually enforced it: increasing size would fallocate-extend the file while the loop device kept its captured (old) size, and shrinking would leave a file larger than the spec entry asked for. With the basename derived from (directory, size) this would also silently retarget the entry to a different file. The Go side gets a small bit of housekeeping: LVMVolumeGroupTemplate in lvm_volume_group_set.go now uses `json:"blockDeviceSelector,omitempty"`, matching the change applied to LVMVolumeGroupSpec in this PR. Without it, marshalling a template with no selector would emit `"blockDeviceSelector":null` while marshalling the spec form would omit the field entirely. The struct field alignment of both files is re-`gofmt`-ed for the columns to line up after the json-tag change. Agents: Claude Opus 4.7 (ведущий, авто-режим)
…ming Adds unit tests for the safety properties added in the previous commits, all driven through gomock against the Commands interface: - BuildFileDevicePath / IsManagedFileDevicePath: determinism, that different (directory, size) tuples produce different basenames, and that foreign paths (libvirt qcow2, snap squashfs, a path with a too short or non-hex hash) are correctly rejected even when an empty lvgName is passed. - ReattachFileDevices: already-attached files are skipped (no fresh losetup --find --show, which would have leaked loops); a per-device losetup failure does not abort the loop and the joined error surfaces every failing file path; FindLoopDeviceByFile errors are propagated. - provisionFileDevices: reuses an existing loop attachment, and on partial failure rolls back both the file and the loop in LIFO order so a retry sees a clean slate. - cleanupFileDevices: walks the union of spec.fileDevices and status.nodes[].fileDevices (so files created but never reflected in status are still removed), returns an error when losetup -d fails (so the finalizer cannot be removed while a busy loop is still alive), and refuses to act on paths whose basename does not match the managed pattern. - validateFileDevice: rejects relative directories and `..` segments. Agents: Claude Opus 4.7 (ведущий, авто-режим)
POSIX text files end with a newline; YAML tools and shell editors otherwise show "No newline at end of file" on every diff. Agents: Claude Opus 4.7 (ведущий, авто-режим)
7ae49b3 to
ec49aec
Compare
Agents: Claude Opus 4.7 (ведущий, авто-режим)
Add context and command timeouts for loop/file-device operations, validate duplicate spec entries and directory writability, restore host LVM loop global_filter, fix CRD fileDevices list map semantics, and update tests/mocks.
Signed-off-by: v.oleynikov <vasily.oleynikov@flant.com>
Transition rule перенесена с полей элементов map-list на spec/lvmVolumeGroupTemplate: oldSelf на item-level внутри fileDevices отклонялся apiserver как uncorrelatable и превышал CEL cost budget. Добавлены maxItems и maxLength для снижения стоимости правил. Agents: Composer 2.5 (ведущий, авто-режим)
reattachFileDevicesAtStartup использует time.Duration; без импорта падает сборка agent-golang-artifact. Agents: Composer 2.5 (ведущий, авто-режим)
Signed-off-by: v.oleynikov <vasily.oleynikov@flant.com>
Review fixes for spec.fileDevices (PR #302): - update path: provision and vgextend new fileDevices entries. provisionFileDevices was only reachable from createVGComplex, so the documented "add a new fileDevices entry to grow the VG" flow passed validation but silently did nothing. Add extendFileDevicesIfNeeded to reconcileLVGUpdateFunc (idempotent: reuses attached loops, only vgextends loops not already PVs of the VG). - startup reattach no longer suppresses ActivateAllManagedVGs. A single file-device reattach failure used to skip activation of every VG on the node (including pure block-device ones) for the pod lifetime. It is now best-effort; the LVG reconciler retries reattach idempotently via provisionFileDevices, so transient boot losetup failures self-heal. - cleanupFileDevices: for spec-derived targets (no recorded loop device, e.g. crash mid-provision) resolve the loop via losetup -j and detach it before rm, so the backing file is never removed while a loop is still bound to it. - provisionFileDevices rollback runs on a detached context. The trigger is frequently the reconcile ctx being cancelled (SIGTERM/deadline), under which exec.CommandContext refuses to start, leaving the just-created loop+file dangling. - FindLoopDeviceByFile parses only the first `losetup -j` line instead of splitting the whole multi-line buffer. - gofmt: fix stray tabs after := in commands.go (RunWithTimeout rename artifact) and a misaligned struct in activation.go. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- discoverer: hasStatusNodesDiff now diffs status.nodes[].fileDevices, so a loop minor / pvUUID / size change (e.g. a new minor after reboot) actually refreshes status instead of being stuck on a stale value. - discoverer: recognize managed loop PVs that LVM reports under /dev/block/MAJ:MIN or /dev/disk/by-id aliases (udev unavailable), not only /dev/loopN; the alias probe is quiet so unregistered BlockDevices do not spam warnings. - reconciler/cleanupFileDevices: always re-resolve the loop from the backing file before detaching instead of trusting the loopDevice recorded in status, which can be stale after a reboot or reused by a foreign device; loop-only targets are confirmed managed via the backing file first. - reconciler/createVGComplex: roll back provisioned file devices (detach loops, remove files) when a later pvcreate/vgcreate fails, so a failed create no longer leaks loop devices and preallocated files. - activation: fix ReattachFileDevices doc to match the actual non-fatal startup contract (caller continues to ActivateAllManagedVGs; reconciler retries idempotently). - mock_utils: restore the Apache license headers corrupted/stripped by the mock regeneration (block_device.go, syscall.go). Tests updated/added for the stale-loop re-resolution path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A file-only LVMVolumeGroup (only spec.fileDevices) carries no blockDeviceSelector now that the field is optional. The reconciler panicked on it: - validateSpecBlockDevices dereferenced lvg.Spec.BlockDeviceSelector (nil pointer panic at Reconcile), and - GetAPIBlockDevices(nil) maps an empty selector to "select all", so a file-only group would also have pulled in EVERY BlockDevice on the node. Reconcile now skips block-device discovery and validation entirely when there is no selector and reconciles only the file devices; the create/update paths already tolerate an empty block-device set. validateSpecBlockDevices also guards the nil selector before iterating MatchExpressions as defense in depth. Adds a regression test for the file-only (nil selector) validation path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Previously the agent required spec.fileDevices[].directory to already exist and be writable on the node, rejecting the LVMVolumeGroup otherwise (test -d -a -w). Requiring a manual mkdir on every node is poor UX for a cluster-managed feature. Now the agent creates the directory on demand during provisioning: - replace the CheckFileDeviceDirectory command (test -d -a -w) with EnsureFileDeviceDirectory (mkdir -p), called before fallocate; - validation keeps only structural checks (absolute path, no '..', min size); an unusable path (read-only FS, non-directory in the way) now surfaces as a provisioning error on the VGConfigurationApplied condition instead of a hard validation failure. Mocks, tests and CRD/FAQ docs updated accordingly. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…PVs on extend
Two blocking issues found in review of the file-backed loop devices feature:
1. LVMVolumeGroupSet dropped spec.fileDevices. configureLVGBySet and
updateLVMVolumeGroupByConfiguredFromSet copied every template field
except FileDevices, so a file-only Set template (no blockDeviceSelector)
was admitted by the relaxed Set CEL rule but produced child LVGs with
neither selector nor fileDevices — rejected by the child LVMVolumeGroup's
own CEL rule. Copy FileDevices on both the create and update paths.
2. extendFileDevicesIfNeeded re-added an already-attached loop PV. The
"already a PV" check matched provisionFileDevices' canonical /dev/loopN
against pvs[].PVName literally, but lvm.static (no udev integration)
frequently reports a managed loop PV under a /dev/disk/by-id or
/dev/block/MAJ:MIN alias — the same aliasing the discoverer already
resolves. The literal miss handed the existing loop to pvcreate/vgextend
again, failing pvcreate ("already a PV") and wedging the
VGConfigurationApplied condition on every update reconcile. Resolve
alias-form PV names via a canonical-path resolver (injectable for tests)
before deciding a loop is new.
Adds unit tests for both fixes.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…cal loop Two correctness fixes for the file-backed loop device feature: 1. Thin-pool sizing on the create path used countVGSizeByBlockDevices, which sums only block devices. A file-only VG therefore looked zero-sized: percentage thin pools collapsed to 0 and absolute-sized thin pools fell into the full-VG-space branch, taking the whole VG instead of the requested size. Add a countVGSizeByFileDevices helper and include spec.fileDevices capacity, mirroring validateLVGForCreateFunc. 2. The discoverer's alias-probe path recorded the alias (/dev/disk/by-id or /dev/block/MAJ:MIN) as status.loopDevice. On the next reconcile where lvm reported the same PV as /dev/loopN, hasStatusNodesDiff saw a change and rewrote status, flip-flopping forever. Re-resolve the canonical /dev/loopN from the backing file before writing status. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Address review findings on the spec.fileDevices feature: - extendFileDevicesIfNeeded now rolls back the loop device + backing file it just provisioned if a later pvcreate/vgextend fails, instead of leaking them on the node (createVGComplex already had this). The rollback is scoped to only the devices this call created, so existing healthy file devices of the same VG are never touched. provisionFileDevices returns the newly-created artifacts to support it. - reattachFileDevicesAtStartup also derives backing-file paths from spec.fileDevices (deterministic BuildFileDevicePath), not just status. A node that rebooted before the discoverer wrote status.nodes[].fileDevices can now recover its loop mappings before ActivateVGs. ReattachFileDevices never creates files, so a not-yet-provisioned entry is a harmless miss. - Correct stale docs/log that still claimed loop devices are filtered out: removing loop from LVMGlobalFilter/ForeignDeviceBasePrefixes is intentional, and ownership is enforced downstream via the backing-file owner pattern + VG tag. Fixed the scanner drop log, FilterForeignPVs / IsForeignDeviceBase docstrings. - convertLVMVGNodes: fast path (single pass, no node-name set) for the common no-file-devices case on the discover hot path. - Document that status.fileDevices[].size is the PV size (LVM metadata overhead), mirroring status.nodes[].devices[].pvSize; align the CRD description and add cross-references between the two loop-PV alias resolution helpers (reconciler readlink vs discoverer losetup). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…den loop alias resolution
Two correctness fixes in the file-backed loop device flow:
- validateLVGForUpdateFunc ignored the capacity of newly-added
spec.fileDevices entries when validating thin-pool sizes. The create
path was already fixed via countVGSizeByFileDevices, but the update
path computed newTotalVGSize as vg.VGSize + additionBlockDeviceSpace
only. A combined "append a fileDevices entry + grow thin-pools" edit
was therefore validated against the pre-extend VG size and wrongly
rejected (or forced into the full-VG-space branch). Add
additionFileDeviceSpace, mirroring additionBlockDeviceSpace, counting
only spec entries not yet reflected in status.nodes[].fileDevices so
already-provisioned devices are not double-counted.
- loopAlreadyRegisteredAsPV treated a resolver failure on an alias PV as
"not a match", contradicting its own comment. That handed a possibly
already-registered loop to pvcreate, which fails ("already a PV") and
wedges the VGConfigurationApplied condition until the next reconcile.
Be conservative: on resolver failure, report the loop as already
registered so the extend is skipped this round and retried.
Add regression tests for both paths.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…vices The three file-device rollback sites (provisionFileDevices, extendFileDevicesIfNeeded, createVGComplex) each open-coded the same detached-context-with-deadline setup. Extract it into Reconciler.newRollbackContext so the "cleanup must survive a cancelled reconcile ctx" invariant lives in one place and the blocks cannot drift. No behavior change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Address review findings on the spec.fileDevices feature: - SetupLoopDevice: pass `losetup --nooverlap` so a race between startup reattach and the reconciler (or an alias-only existing attachment) reuses the existing loop instead of binding a second minor that nothing reaps. - GetLoopBackingFile: strip the " (deleted)" marker losetup appends when the backing file was unlinked while still attached. Without it IsManagedFileDevicePath misses the basename and cleanup refuses to detach the loop, stranding the minor forever. - hasStatusNodesDiff: match file devices by FilePath instead of slice position. The candidate slice is built in raw `lvm pvs` report order (unsorted within a VG) and a loop PV can be reported under /dev/loopN or an alias across scans, so a positional compare churned status every reconcile. - extendFileDevicesIfNeeded: when every new loop is skipped only because an alias PV could not be resolved, surface a retrying condition and return an error so the reconcile requeues, instead of silently reporting the configuration applied while the file device never joined the VG. Add unit tests for each fix. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…size A thin pool sized 100% can never reach the full VG size because LVM reserves space for thin-pool metadata, so the reconciler perpetually sees actualSize < requestedSize and re-runs `lvextend -l 100%VG`. That command exits non-zero on a no-op resize; filterStdErr is meant to swallow the benign message, but it only matched the older LVM wording "No size change". Newer LVM (2.03.x, e.g. Ubuntu) instead prints "New size (N extents) matches existing size (M extents).", which slipped through the filter and surfaced as ThinPoolReconcileFailed / VGConfigurationApplied=False on a perfectly healthy pool. Add a regex for the new wording. Affects any 100%-sized thin pool, not only file-backed VGs. Adds a filterStdErr unit test covering both wordings. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A thin pool sized "100%" was rejected at create with "Required space for thin-pools N is more than VG size M". The percentage is computed from the raw block-device/file sum, which is not extent-aligned, and the validator rounded it UP to the next extent — landing one extent past the VG size and failing the capacity check, even though the pool is created with %FREE and fits the actual (extent-aligned) VG. Round percentage sizes DOWN for the capacity check instead (a percentage of the VG can never legitimately exceed it; this also makes split percentages like "50%"+"50%" sum to at most the VG). Absolute sizes keep rounding up so a genuinely oversized absolute pool is still rejected. Adds tests for a single "100%" pool (passes) and "100%" + another pool (fails). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Address issues found while reviewing the spec.fileDevices feature: - Drop unmanaged, purely loop-backed VGs from the cache (utils.FilterForeignLoopPVs). Removing loop from the foreign-PV filter let a guest VM's nested-LVM loop VG into the cache, where a name collision with a managed VG made findDuplicateVGNames take the healthy managed LVMVolumeGroup offline. Managed loop VGs (tagged storage.deckhouse.io/enabled=true) and block-backed VGs are kept. - Skip pvcreate for devices that are already an LVM PV in createVGComplex (with alias resolution for loop PVs), mirroring the guard in extendFileDevicesIfNeeded. Prevents a retry after a partial create from wedging VGConfigurationApplied with "already a PV". - Warn about decimal size units in the fileDevices size docs and the agent validation message: "1G" (10^9) is below the 1Gi minimum and, being immutable, can otherwise only be fixed by recreating the LVG. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Cluster testing of file-backed VG creation surfaced a data-integrity bug:
under concurrent LVG creation (e.g. several LVGs applied at once), a
file-backed VG could come up with its single backing file attached to two
loop devices — one of them shown "(deleted)" — both registered as separate
PVs, doubling the VG and placing a thin pool partly on a deleted-file loop.
Reproduced on two fresh clusters.
Root cause: createVGComplex rolled back a failed create via the broad
cleanupFileDevices (the delete-path cleanup), which walks spec+status and
removes every backing file of the LVG. When a concurrent reconcile — or a
pvcreate/vgcreate that materially succeeded but returned a non-zero status
("File descriptor leaked on lvm invocation", exit 5) — had already turned
the loop into a live PV of the VG, the rollback deleted that PV's backing
file. The next reconcile then re-provisioned a second loop and vgextended
it, doubling the VG.
Fix: replace the broad rollback on the create and extend paths with a
scoped rollbackProvisionedFileDevices that tears down ONLY the file devices
the current call provisioned, and:
- lists PVs authoritatively (fresh `lvm pvs`, not the stale cache) and
SKIPS any provisioned loop that is already an LVM PV (canonical or via
a /dev/disk//dev/block alias), so an in-use loop is never torn down;
- leaves everything in place if the PV listing fails (a leaked loop/file
is recoverable and reused on the next reconcile; corrupting a live VG
is not);
- never removes a backing file whose loop it could not detach.
With this, a create whose pvcreate spuriously reports failure keeps the
loop+file, and the retry reuses the same loop instead of allocating a
second one.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Three issues found in a from-scratch re-review: 1. validateLVGForUpdateFunc counted an already-provisioned file device as new whenever its directory is a symlink. status FilePath comes from `losetup --output BACK-FILE`, which canonicalizes symlink components, so it never matched the literal BuildFileDevicePath. Match by basename (stable: the hash is derived from the literal spec directory and losetup leaves the basename untouched) instead of full path, so thin-pool sizing validation no longer inflates the VG size and wrongly accepts a pool that then fails to create. 2. extendFileDevicesIfNeeded trusted the PV snapshot captured at the top of reconcileLVGUpdateFunc; a loop that became a canonical /dev/loopN PV after that snapshot (and is not an alias-form PV, the only form loopAlreadyRegisteredAsPV inspects) was handed to pvcreate again and failed "already a PV", wedging the condition. Take the union of the snapshot and a fresh cache read before deciding, mirroring createVGComplex. 3. CreatePV treated the benign "File descriptor N leaked on lvm invocation" exit (lvm.static under nsenter) as a hard failure, tripping the create/extend rollback against a device that pvcreate had in fact already turned into a healthy PV — the real trigger behind the doubled-VG corruption the rollback safety net contains. Route stderr through filterStdErr and error only when a non-benign line remains, mirroring ExtendLV. Added regression tests for 1 and 2. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Addresses two review findings on the file-backed LVM feature: - Refuse to allocate a backing file larger than the free space of the node's filesystem. provisionFileDevices now queries available bytes (stat -f in PID 1's mount namespace, via the new GetAvailableBytes command) right after mkdir and before fallocate, and reports a clear condition instead of letting `fallocate -l` fill the node's root filesystem and trip kubelet DiskPressure eviction. The check uses the non-superuser available-block count (%a), so the filesystem reserve is left as built-in headroom; it is best-effort and falls through to fallocate (which still fails cleanly on ENOSPC) when free space cannot be determined. - Attach loop devices with `losetup --direct-io=on` so backing-file I/O bypasses the node filesystem's page cache, avoiding the double page cache (volume FS + backing file) that doubles RAM use and throttles throughput. The kernel silently falls back to buffered I/O on misalignment, so requesting it is always safe. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… dir spec.fileDevices[].directory previously accepted any absolute host path, so a typo could place a backing file under /, /etc or /var/lib/kubelet and fill the node's filesystem. The agent now confines backing files to a base directory, defaulting to /opt/deckhouse/sds/file-devices and overridable via the new fileDevicesDirectory module setting (e.g. to point at a dedicated data disk). Each directory must be that base or a subdirectory of it; anything outside the subtree is rejected on the VGConfigurationApplied condition. - openapi: new config-values fileDevicesDirectory (+ru doc) - daemonset: plumb it to the agent as FILE_DEVICES_DIRECTORY - agent config: read the env with the /opt/deckhouse/sds/file-devices default - lvg reconciler: validateFileDevice enforces the base-dir allowlist - docs/CRD: document the constraint and update examples Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…laim #6: extendFileDevicesIfNeeded previously requeued forever under the generic "Updating" reason whenever alias-form PV names could not be resolved, so a persistently broken resolver (missing nsenter binary, a dangling alias) was indistinguishable from an ordinary in-flight update and could never be alerted on. The reconciler now tracks consecutive resolver-only no-progress rounds per LVG and, after a threshold, switches the VGConfigurationApplied condition to a dedicated ReasonAliasResolutionFailed and logs at Error level. The streak resets as soon as a round makes progress, has nothing to do, or the LVG is deleted. #4: document space reclamation for file-backed devices in the FAQ — the discard/TRIM passdown chain (fstrim or mount -o discard punches holes in the preallocated backing file via the default discards=passdown thin pool), plus its caveats (chunk-size granularity, snapshots pinning chunks, shrink only after write+delete+trim). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds a "LVMVolumeGroup with file-backed devices" Ginkgo Context (label e2e-tests) plus helpers. Unlike the other scenarios these need no dedicated disk or nested VM: the VG is created with spec.fileDevices on any node that runs the agent, and node-side state is checked over SSH. Backing files go in the default base dir /opt/deckhouse/sds/file-devices. Scenarios: - create a file-only VG → Ready; assert status.fileDevices (filePath under the base dir, loopDevice, pvUUID) and that the backing file, loop device and VG exist on the node; on delete, assert the file is removed and the loop detached; - thin-pool on a file-backed device → pool Ready + data LV present on node; - extend a file-backed VG by appending a fileDevices entry → VG grows; - reattach after the agent pod restarts → VG stays Ready, loop reattached; - negative: a directory outside the allowed base dir is rejected with VGConfigurationApplied=False / ValidationFailed and nothing is provisioned. Documents the scenarios and their prerequisites (branch build, node free space) in e2e/README.md. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
Add support for creating LVM Volume Groups on top of file-backed loop devices via a new
spec.fileDevicesfield inLVMVolumeGroupCRD. The agent creates preallocated files (fallocate), attaches them as loop devices (losetup), and uses them as LVM Physical Volumes. File devices can be used standalone or combined withblockDeviceSelectorin a single VG.Key changes:
LVMVolumeGroupandLVMVolumeGroupSetCRDs withspec.fileDevicesandstatus.nodes[].fileDevices.blockDeviceSelectoris now optional; a CEL rule requires at least one ofblockDeviceSelector/fileDevices, and existingfileDevicesentries are immutable.LVMVolumeGroupSetcontroller now propagatesspec.lvmVolumeGroupTemplate.fileDevicesinto the childLVMVolumeGroupresources it creates and updates, so a file-only (or mixed) Set template actually provisions file-backed VGs.loopfrom the agent's own LVM--configglobal_filter(LVMGlobalFilter) and fromForeignDeviceBasePrefixes/FilterForeignPVs, so the agent can see and manage its file-backed loop PVs. Ownership is established by the backing-file owner pattern (IsManagedFileDevicePath) gated on the VG'sstorage.deckhouse.io/lvmVolumeGroupNametag.CreateFileDevice,SetupLoopDevice,DetachLoopDevice,FindLoopDeviceByFile,GetLoopBackingFile,RemoveFileDevice,EnsureFileDeviceDirectory.extendFileDevicesIfNeeded), and cleanup in the delete flow.status.nodes[].fileDevices, including resolving alias-reported (/dev/disk/by-id,/dev/block/MAJ:MIN) loop PVs to their canonical/dev/loopN. The update/extend path applies the same alias resolution before deciding whether a managed loop is already a PV of the VG, so it never re-runspvcreateon an existing PV.NodeGroupConfiguration: the host-wide LVMglobal_filterstill rejects/dev/loop(hostlvm/pvscontinue to ignore loop devices) and the multipath loop blacklist is retained. Only the agent's own per-command--config global_filterallows loop, so managed file-backed PVs are visible to the agent but not to host-wide LVM.Thin-pool fixes (apply to block-device and file-backed VGs alike):
100%is now accepted at create time. During create validation the VG size is the raw (non-extent-aligned) sum of devices; percentage sizes are aligned down for the capacity check so a full-VG thin pool fits, while absolute sizes still round up so a genuinely oversized pool is rejected.lvextend -l 100%VG(when a100%thin pool already fills the VG, which it always does up to thin-pool metadata) is treated as success. The benign LVM message is filtered for both the old wording (No size change) and newer LVM 2.03.x wording (New size (N extents) matches existing size (M extents)), so a healthy pool no longer flapsVGConfigurationApplied=False.Hardening from code review:
losetup) are dropped from the cache byutils.FilterForeignLoopPVs. Removingloopfrom the foreign-PV filter otherwise let such a VG into the cache, where a name collision with a managed VG madefindDuplicateVGNamestake the healthy managedLVMVolumeGroupoffline. Managed loop VGs (taggedstorage.deckhouse.io/enabled=true) and block-backed VGs are kept.createVGComplexnow skipspvcreatefor any device that is already an LVM PV (with alias resolution for loop PVs), mirroring the guard inextendFileDevicesIfNeeded, so a retry after a partial create no longer wedgesVGConfigurationAppliedwith "already a PV".fileDevices[].sizedocs and the agent's validation message warn that decimal units (1G= 10⁹ bytes) are below the1Giminimum, since an existing entry'ssizeis immutable.Data-integrity fixes (found via manual cluster testing + a from-scratch re-review):
(deleted)), both registered as separate PVs, doubling the VG and placing a thin pool partly on a deleted-file loop. Root cause:CreatePVtreated the benignFile descriptor … leaked on lvm invocationexit (lvm.static under nsenter) as a failure even thoughpvcreatehad already written the PV label; the create rollback then used the broadcleanupFileDevicesand deleted that live PV's backing file, and the next reconcile re-provisioned a second loop. Fixed at two layers: (a)CreatePVnow routes stderr throughfilterStdErrand errors only on a non-benign line (mirroringExtendLV), removing the false failure; (b) the create/extend paths use a scopedrollbackProvisionedFileDevicesthat tears down only what the current call provisioned and never a loop that is already an LVM PV (verified against a freshlvm pvs, not the stale cache; if the listing fails it tears nothing down — a leaked loop/file is recoverable, corrupting a live VG is not).validateLVGForUpdateFuncmatched existing file devices by full path, butstatus.nodes[].fileDevices[].FilePathcomes fromlosetup, which canonicalizes symlink components of the directory. With a symlinkeddirectorythe already-provisioned device was counted as new on every reconcile, inflating the VG size used for thin-pool validation. Now matched by basename (stable across the symlink canonicalization).pvcreateguard.extendFileDevicesIfNeededtrusted the PV snapshot captured at the top of the reconcile; a loop that became a canonical/dev/loopNPV afterwards was re-pvcreated and failed "already a PV", wedging the condition. It now takes the union of that snapshot and a fresh cache read before deciding (mirroring the create path).Verified on a fresh cluster (deckhouse
pr302build): the exact scenario that previously doubled the thin-on-file VG now produces a correct single-PV VG (one loop, one PV), with nopvcreateerror, no rollback and no second loop in the agent log. All four combinations thin/thick × block/file reachReadyand provision real LVMLogicalVolumes.Why do we need it, and what problem does it solve?
When dedicated block devices are not available on a node, there is no way to create an LVM Volume Group using the current
sds-node-configurator. This feature allows users to allocate part of an existing filesystem for LVM by specifying a directory and size inspec.fileDevices, enabling LVM-based storage provisioning without dedicated disks.What is the expected result?
LVMVolumeGroupwithspec.fileDevicesprovisions preallocated files, attaches them as loop devices, creates PVs and a VGstatus.nodes[].fileDevicesreflects the file path, canonical loop device, PV size, and PV UUID for each file deviceLVMVolumeGroupwith file devices detaches loop devices and removes backing filesfileDevicesandblockDeviceSelectorcan be used together in a single VG; thin-pool sizing accounts for file-backed capacity on both the create and update pathsLVMVolumeGroupSetwhose template specifiesfileDevicesprovisions file-backed childLVMVolumeGroupresources on the selected nodes100%(e.g. on top of a single file device or block device) reachesVGConfigurationApplied=True/Readyinstead of being rejected at validation or flapping on a no-op resize1Giand per-VG backing-file uniqueness — are validated by the agent and surfaced on theVGConfigurationAppliedcondition (consistent with how the other size fields in these CRDs are validated). The CRDsizepattern only constrains the format.Checklist