
Refactor garbage collection to prevent premature pruning and panic vu…#69

Open
MokshonWork wants to merge 1 commit into EpicGames:main from MokshonWork:feat/gc-improvements


@MokshonWork


Garbage Collection Concurrency & Stability Overhaul

What changes did I make?

  1. Configurable GC Thresholds (lore/src/repository.rs):
    • I updated the LoreRepositoryGcArgs struct to introduce two new configuration fields: grace_period_sec and prune_threshold.
  2. Atomic Sweeping & Safe Leases (lore-storage/src/maintenance.rs):
    • I refactored the core gc execution loop to return an idiomatic Result<(), GcError> instead of blindly swallowing errors or crashing.
    • I replaced the standard store.evict() call with store.evict_with_grace(), passing down the new grace period threshold.
    • I added an exclusive cross-process staging lease (gc_lease::acquire_exclusive_sweep()) that must be acquired before any eviction can occur.
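
A minimal sketch of how the pieces above could fit together, not the code in this PR. Everything except the names quoted above is an assumption: the `Store` type is an in-memory stand-in, the lease is modelled with an in-process `Mutex` rather than a real cross-process lease, `GcError::LeaseUnavailable` is a hypothetical variant, and `prune_threshold` is carried but not interpreted because its meaning is not spelled out here.

```rust
use std::collections::HashMap;
use std::fmt;
use std::sync::{Mutex, MutexGuard, TryLockError};
use std::time::Duration;

/// Hypothetical mirror of the two new fields; the real struct has more.
#[derive(Debug, Clone)]
pub struct LoreRepositoryGcArgs {
    pub grace_period_sec: u64,
    pub prune_threshold: usize, // meaning not specified here; unused below
}

/// Hypothetical error type; the real GcError variants are not shown in this PR text.
#[derive(Debug, PartialEq)]
pub enum GcError {
    LeaseUnavailable,
}

impl fmt::Display for GcError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            GcError::LeaseUnavailable => write!(f, "sweep lease is held elsewhere"),
        }
    }
}

impl std::error::Error for GcError {}

/// In-memory stand-in for the store: fragment id -> (age, is_referenced).
pub struct Store {
    pub fragments: HashMap<String, (Duration, bool)>,
}

impl Store {
    /// Removes fragments that are unreferenced AND at least `grace` old.
    pub fn evict_with_grace(&mut self, grace: Duration) -> usize {
        let before = self.fragments.len();
        self.fragments
            .retain(|_, (age, referenced)| *referenced || *age < grace);
        before - self.fragments.len()
    }
}

/// Stand-in for gc_lease::acquire_exclusive_sweep(): fails fast if the lease is held.
fn acquire_exclusive_sweep(lease: &Mutex<()>) -> Result<MutexGuard<'_, ()>, GcError> {
    match lease.try_lock() {
        Ok(guard) => Ok(guard),
        Err(TryLockError::Poisoned(poisoned)) => Ok(poisoned.into_inner()),
        Err(TryLockError::WouldBlock) => Err(GcError::LeaseUnavailable),
    }
}

/// Takes the lease first, then evicts; errors propagate to the caller via `?`.
pub fn gc(
    store: &mut Store,
    lease: &Mutex<()>,
    args: &LoreRepositoryGcArgs,
) -> Result<(), GcError> {
    let _lease = acquire_exclusive_sweep(lease)?; // released when `_lease` drops
    store.evict_with_grace(Duration::from_secs(args.grace_period_sec));
    Ok(())
}

fn main() {
    let args = LoreRepositoryGcArgs { grace_period_sec: 300, prune_threshold: 0 };
    let lease = Mutex::new(());
    let mut store = Store { fragments: HashMap::new() };
    store.fragments.insert("old-loose".into(), (Duration::from_secs(900), false));
    store.fragments.insert("fresh-loose".into(), (Duration::from_secs(5), false));
    store.fragments.insert("old-referenced".into(), (Duration::from_secs(900), true));

    match gc(&mut store, &lease, &args) {
        Ok(()) => println!(
            "sweep ok, {} fragments remain (prune_threshold = {})",
            store.fragments.len(),
            args.prune_threshold
        ),
        Err(e) => eprintln!("sweep skipped: {e}"),
    }
}
```

In this sketch only `old-loose` is evicted: the fresh fragment is protected by the grace period and the referenced one is never a candidate. A caller that finds the lease held gets `Err(GcError::LeaseUnavailable)` and can retry later, and the store is left untouched.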

Why were these changes made?
These modifications directly address two systemic weaknesses in the repository's storage backend:

  • Race Conditions: Previously, a background GC pass could prune loose, unreferenced fragments while another workspace's staging operation was still writing them. This TOCTOU (Time-of-Check to Time-of-Use) race could corrupt data.
  • Panic Vulnerabilities: The legacy garbage collector either swallowed errors or panicked. If a background sweep panicked or was interrupted, it could leave orphaned lock files and corrupted metadata indices behind.
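
The orphaned-lock-file problem can be illustrated with a small RAII guard. This is a sketch under assumptions, not the mechanism in this PR: `SweepLock`, `sweep`, and the lock-file path are hypothetical names, and the lock is a plain file created with `create_new`.

```rust
use std::fs::{self, OpenOptions};
use std::io;
use std::panic;
use std::path::{Path, PathBuf};

/// Hypothetical lock-file guard: the file exists only while the guard is alive.
pub struct SweepLock {
    path: PathBuf,
}

impl SweepLock {
    pub fn acquire(path: &Path) -> io::Result<SweepLock> {
        // create_new fails with AlreadyExists if another holder created the file.
        OpenOptions::new().write(true).create_new(true).open(path)?;
        Ok(SweepLock { path: path.to_path_buf() })
    }
}

impl Drop for SweepLock {
    fn drop(&mut self) {
        // Runs on normal return, on early `?` return, and during panic unwinding.
        let _ = fs::remove_file(&self.path);
    }
}

/// Stand-in sweep that holds the lock and can simulate a failure part-way through.
pub fn sweep(lock_path: &Path, fail: bool) -> io::Result<()> {
    let _lock = SweepLock::acquire(lock_path)?;
    if fail {
        panic!("simulated failure mid-sweep");
    }
    Ok(())
}

fn main() {
    let lock_path =
        std::env::temp_dir().join(format!("gc-sketch-{}.lock", std::process::id()));
    let path_for_sweep = lock_path.clone();

    // The sweep panics, but unwinding drops the guard and removes the lock file.
    let outcome = panic::catch_unwind(move || sweep(&path_for_sweep, true));
    assert!(outcome.is_err());
    assert!(!lock_path.exists());
    println!("lock file cleaned up after panic");
}
```

A `Drop`-based guard only covers unwinding. If the process is killed, or the binary is built with `panic = "abort"`, `drop` never runs, so some form of stale-lock detection on startup is still needed.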

How is this useful?
This refactor makes the garbage collection pipeline safe to run alongside concurrent staging operations.
The grace period and the staging lease work together so that a chunk is not deleted while it is still "in-flight" during a workspace transition: the lease keeps a sweep from running while staging is active, and the grace period protects fragments that were written recently. Returning a Result also means a failed sweep reports its error to the caller instead of panicking, so an interrupted sweep is meant to be recoverable on the next startup rather than leaving permanent metadata corruption. This makes the tool better suited to large, highly concurrent CI environments.

…lnerabilities

Signed-off-by: Moksh Goyal <221651574+MokshonWork@users.noreply.github.com>
@MokshonWork force-pushed the feat/gc-improvements branch from 88b0a2f to 948dda3 on June 25, 2026 13:14