Skip to content

file_lock attacker-DoS mitigation (sidecar-lock TTL + observability; PR #641 R2-S-3 follow-up) #645

@michael-wojcik

Description

@michael-wojcik

Background

During PR #641 Round-2 blind review (R2-S-3 finding), security-engineer-r2 flagged that shared/claude_md_manager.py::file_lock has a 5-second timeout with fail-open behavior. An attacker (same-user, since this is local-trust scope) holding the sidecar .{filename}.lock file for >5 seconds causes legitimate writers to silently degrade to unlocked writes.

# pact-plugin/hooks/shared/claude_md_manager.py (file_lock)
# Acquire fcntl.flock on .{filename}.lock with 5s timeout
# On timeout or any disk error → fail-open (proceed without lock)

Fail-open is correct under current threat models (same-user-trust, where the attacker has equivalent capability via direct os tooling anyway), but the timeout is short and there's no observability for "how often does fail-open fire?"

Symptoms attacker could induce

  • Concurrent CLAUDE.md writes that race past lock ordering → torn writes (PACT_MANAGED block partial overwrite by routing-block update, or vice versa).
  • Race window during which strip_orphan_routing_markers runs without holding the lock → potential to lose a writer's append.

The likelihood is low (same-user is the only adversary) but the surface is enumerable.

Proposal — defense-in-depth

Option A — TTL on sidecar lock

# Before fcntl.flock acquisition, check the sidecar lock's mtime.
# If mtime > now - LOCK_TTL_SECONDS (e.g., 30s), the previous holder is presumed dead;
# unlink the sidecar and reacquire.
LOCK_TTL_SECONDS = 30

def file_lock(target: Path):
    sidecar = target.parent / f".{target.name}.lock"
    if sidecar.exists():
        age = time.time() - sidecar.stat().st_mtime
        if age > LOCK_TTL_SECONDS:
            sidecar.unlink(missing_ok=True)  # presumed dead holder
    # ... existing acquisition with 5s timeout ...

This bounds the worst-case wait at LOCK_TTL_SECONDS (30s) for legitimate writers, instead of leaving them indefinitely waiting on a hung process or fail-open silently after 5s.

Option B — fail-open observability

# On fail-open, emit a journal event so audits can enumerate fail-open frequency
def file_lock(target: Path):
    try:
        # ... acquisition ...
        yield
    except (fcntl.error, TimeoutError) as e:
        _journal_lock_fail_open(str(target), e.errno, e.args[0])
        yield  # proceed without lock

session_journal.py accepts a new lock_fail_open event type. Doesn't change behavior; gives security review observability.

Recommendation

Implement both: Option B first (observability is cheap), Option A as the security hardening once we have data on actual fail-open frequency.

Acceptance criteria

  • lock_fail_open journal event emitted on every file_lock fail-open path.
  • LOCK_TTL_SECONDS sidecar-lock TTL implemented; lock files older than 30s are presumed dead and unlinked.
  • Tests cover: TTL-based unlink + reacquire path, fail-open observability path.
  • Existing fail-open tests still pass (TTL logic does not break the contract).

Cross-references

  • PR Restore session-startup ritual (#628) #641 R2-S-3 (security-engineer-r2 finding, deferred to follow-up)
  • shared/claude_md_manager.py::file_lock (5s timeout, fail-open)
  • Pinned: file_lock is the canonical helper for shared-file read-mutate-write
  • session_journal.py — canonical event-persistence layer

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions