Skip to content

fix(kernelio): create-or-replace for drop-in writes (live-caught bug)#115

Merged
remyluslosius merged 1 commit into
mainfrom
fix/kernelio-create-or-replace
Jun 20, 2026
Merged

fix(kernelio): create-or-replace for drop-in writes (live-caught bug)#115
remyluslosius merged 1 commit into
mainfrom
fix/kernelio-create-or-replace

Conversation

@remyluslosius

Copy link
Copy Markdown
Contributor

Live agent-mode validation on the test fleet (RHEL 9.6, 211) caught a real bug in the merged Phase 6/7 kernel-IO handlers. They used fsatomic.AtomicReplace to write drop-in files and AtomicRemove to remove them on rollback — but fsatomic splits the cases: AtomicWrite errors ErrAlreadyExists on an existing target; AtomicReplace/AtomicRemove error ErrNotExist on a missing one. A drop-in is usually new on first apply, so:

sysctl_set: persist write failed: fsatomic: target does not exist: /etc/sysctl.d/99-kensa.conf

On the live host this made apply fail → the transaction rolled back → and the rollback's AtomicRemove also errored ErrNotExist on the never-created file, so rollback returned early and left the runtime sysctl value unrestored (changed but not rolled back). The in-memory fake modeled AtomicReplace/Remove as a plain map set/delete (no existence semantics), so every unit test passed — the fake masked the bug; only the real fsatomic primitives on a real host surfaced it.

Fix

  • internal/agent/kernelio: WriteFile (create-OR-replace; AtomicWrite, fall back to AtomicReplace on ErrAlreadyExists; race-tolerant) and RemoveFile (AtomicRemove, ErrNotExist tolerated as no-op). Direct tests.
  • The 4 affected handlers (sysctl_set, kernel_module_disable, audit_rule_set, dconf_set) switch their drop-in writes/removes to WriteFile/RemoveFile. mount_option_set is unaffected (/etc/fstab always exists).
  • FakeSysctlTransport now models the real fsatomic existence semantics (ErrNotExist/ErrAlreadyExists) — so the unit tests catch this class going forward (the pre-fix handlers now FAIL the fake).

Live re-validation on 211 (sysctl_set, benign log_martians, restored after each run)

  • Apply now commits (FIXED) with correct runtime value + drop-in content.
  • A forced apply-time rollback (2-step rule whose 2nd step the kernel rejects) restores the runtime value and removes the drop-in byte-perfect. Before the fix, the same scenario left the value stuck.

Verification

go test ./... green; go build ./... clean; golangci-lint 0; comment-lint clean (post-commit); specter sync all pass.

Failure-mode analysis

  1. Wrong in prod? This is the production bug — agent-mode remediation of sysctl/module/audit/dconf would fail to apply and fail to fully roll back. The fix corrects both paths; the hardened fake prevents regression. Shell-path (non-agent) hosts were never affected.
  2. Captured-state sufficiency: unchanged (fix is in the write/remove primitives). Rollback now actually completes, so the captured runtime value is restored (the pre-fix early-return prevented it).
  3. Edge case / gated: WriteFile's two-step is race-tolerant. Live-validated for sysctl; the other three handlers use the identical helper + primitives.

🤖 Generated with Claude Code

LIVE agent-mode validation on the test fleet (RHEL 9.6, 211) caught a
real bug in the merged Phase 6/7 kernel-IO handlers: they used
fsatomic.AtomicReplace to write drop-in files (sysctl persist, modprobe
blacklist, audit rules.d, dconf profile/snippet/lock) and AtomicRemove to
remove them on rollback. But fsatomic deliberately splits the cases —
AtomicWrite errors ErrAlreadyExists on an existing target, AtomicReplace
and AtomicRemove error ErrNotExist on a missing one. A drop-in file is
usually NEW on first apply, so AtomicReplace failed:

    sysctl_set: persist write failed: fsatomic: target does not exist:
    /etc/sysctl.d/99-kensa.conf

On the live host this caused apply to fail → the transaction rolled back →
and the rollback's AtomicRemove ALSO errored ErrNotExist on the
never-created file, so rollback returned early and left the runtime sysctl
value UNRESTORED (changed but not rolled back). The in-memory test fake
modeled AtomicReplace/Remove as a plain map set/delete (no existence
semantics), so every unit test passed — the fake masked the bug; only the
real fsatomic primitives on a real host surfaced it.

Fix:
- internal/agent/kernelio: WriteFile (create-OR-replace: AtomicWrite, fall
  back to AtomicReplace on ErrAlreadyExists; race-tolerant) and RemoveFile
  (AtomicRemove, ErrNotExist tolerated as a no-op success). Direct tests.
- The 4 affected handlers (sysctl_set, kernel_module_disable,
  audit_rule_set, dconf_set) switch their drop-in writes/removes to
  WriteFile/RemoveFile. mount_option_set is unaffected (/etc/fstab always
  exists; AtomicReplace is correct there).
- FakeSysctlTransport now MODELS the real fsatomic existence semantics:
  AtomicReplace/AtomicRemove return ErrNotExist on a missing file,
  AtomicWrite returns ErrAlreadyExists on an existing one. This is what
  makes the unit tests catch this class of bug going forward — with the
  hardened fake, the pre-fix handlers FAIL.

Live re-validation on 211 (sysctl_set, benign log_martians fixture,
restored after each run): apply now COMMITS with the correct runtime
value + drop-in content; a forced apply-time rollback (2-step rule whose
2nd step the kernel rejects) restores the runtime value AND removes the
drop-in BYTE-PERFECT. Both the apply-commit and apply-then-rollback paths
verified on the real host.

Failure-mode analysis:
1. What could this do wrong in production? This IS the production bug —
   agent-mode remediation of sysctl/module/audit/dconf would fail to apply
   (and, worse, fail to fully roll back, leaving runtime state changed).
   The fix makes both paths correct and the hardened fake prevents
   regression. Shell-path (non-agent) hosts were never affected (printf >
   file create-or-replaces fine).
2. Captured-state sufficiency: unchanged — the fix is in the write/remove
   primitives, not the captured PreState. Rollback now actually completes
   (RemoveFile tolerates absence) so the captured runtime value IS
   restored, which the pre-fix early-return prevented.
3. Edge case not safe for / gated: WriteFile's two-step (AtomicWrite then
   AtomicReplace) is race-tolerant against a concurrent writer
   creating/removing the file between attempts. Live-validated for sysctl;
   the other three handlers use the identical helper + fsatomic primitives.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@remyluslosius remyluslosius merged commit 6010e1d into main Jun 20, 2026
18 checks passed
@remyluslosius remyluslosius deleted the fix/kernelio-create-or-replace branch June 20, 2026 05:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant