Skip to content

Add run-scoped tensor offload temp cleanup#4522

Open
pcnudde wants to merge 1 commit intoNVIDIA:mainfrom
pcnudde:codex/run-scoped-temp-root
Open

Add run-scoped tensor offload temp cleanup#4522
pcnudde wants to merge 1 commit intoNVIDIA:mainfrom
pcnudde:codex/run-scoped-temp-root

Conversation

@pcnudde
Copy link
Copy Markdown
Collaborator

@pcnudde pcnudde commented May 5, 2026

Summary

Catchall solution for disk tensor offload cleanup without having to find each memory leak

  • create a per-FedAvg-run tensor disk offload root when disk offload is enabled
  • store that root in FOBS context so disk tensor downloads create temp dirs under the run-owned root
  • remove the run root during FedAvg cleanup so retained result/lazy references cannot keep disk files behind after run teardown

@pcnudde pcnudde force-pushed the codex/run-scoped-temp-root branch from 51ac0db to 8c86a95 Compare May 5, 2026 23:06
@pcnudde
Copy link
Copy Markdown
Collaborator Author

pcnudde commented May 5, 2026

/build

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 5, 2026

Greptile Summary

This PR introduces a run-scoped temp directory for tensor disk offload in FedAvg. A root dir is created at the start of FedAvg.run() inside a try/finally, registered in the cell's FOBS context so all download_tensors_to_disk calls place their per-download temp dirs beneath it, and then shutil.rmtreed at teardown — eliminating the prior disk-leak problem regardless of how many lazy references remain.

  • fedavg.py: mkdtemp is now inside the try block (fixing the prior review feedback), and the finally clears both the FOBS context and the root dir tree.
  • tensor_disk_offload_context.py: apply now requires root_dir when enabling (raises ValueError instead of the previously unsafe assert), and restore sets TENSOR_DISK_OFFLOAD_ROOT_DIR to None. However, it does not save and restore a pre-existing TENSOR_DISK_OFFLOAD_ROOT_DIR, so a future nested-scope scenario would silently break.
  • tensor_downloader.py: download_tensors_to_disk reads TENSOR_DISK_OFFLOAD_ROOT_DIR from FOBS context and raises RuntimeError if absent, replacing the prior assert.

Confidence Score: 5/5

Safe to merge — the core cleanup mechanism is correct and all three prior review concerns have been addressed.

The root dir is created, registered in FOBS context, and destroyed entirely within the same try/finally guard. download_tensors_to_disk now raises a clear RuntimeError instead of silently falling back to the system temp dir. The one structural gap — restore writing TENSOR_DISK_OFFLOAD_ROOT_DIR=None without saving a pre-existing value — only matters in nested-scope scenarios that do not exist in the current codebase.

The small nesting edge-case in tensor_disk_offload_context.py is worth tracking if the disk-offload API is ever extended beyond FedAvg.

Important Files Changed

Filename Overview
nvflare/app_common/utils/tensor_disk_offload_context.py Adds TENSOR_DISK_OFFLOAD_ROOT_DIR to FOBS context; replaces unsafe asserts with ValueError; restore path correctly clears root_dir but doesn't save/restore a pre-existing root_dir value
nvflare/app_common/workflows/fedavg.py Moves mkdtemp inside try block (addressing prior review), registers root_dir in FOBS context, and rmtrees the root in finally — cleanup is correct and robust
nvflare/app_opt/pt/tensor_downloader.py Reads TENSOR_DISK_OFFLOAD_ROOT_DIR from FOBS context and raises RuntimeError (replacing unsafe assert) if not set; creates temp_dir under root so rmtree at run teardown covers it
tests/unit_test/app_common/workflow/fedavg_test.py New test verifies rmtree wipes run-scoped root including retained sub-dirs; existing tests updated for new root_dir parameter
tests/unit_test/app_opt/pt/test_disk_tensor_consumer.py All existing tests updated to supply _FakeCell with root_dir; new test verifies temp dir is created inside the scoped root
tests/unit_test/app_common/utils/tensor_disk_offload_context_test.py Tests updated to pass root_dir; new test verifies apply sets and restore clears TENSOR_DISK_OFFLOAD_ROOT_DIR correctly
tests/unit_test/apis/impl/controller_test.py BlockingDownloadCell updated with get_fobs_context to supply root_dir; fake_mkdtemp signature updated for new dir kwarg

Sequence Diagram

sequenceDiagram
    participant FA as FedAvg.run()
    participant CTX as tensor_disk_offload_context
    participant CELL as Cell FOBS context
    participant TD as tensor_downloader
    participant FS as Filesystem

    FA->>FS: tempfile.mkdtemp() -> disk_offload_root_dir
    FA->>CTX: apply_enable_tensor_disk_offload(enabled=True, root_dir)
    CTX->>CELL: update_fobs_context({enable=True, root_dir=root_dir})

    loop Each round
        FA->>TD: send_model triggers download_tensors_to_disk
        TD->>CELL: get_fobs_context()[TENSOR_DISK_OFFLOAD_ROOT_DIR]
        CELL-->>TD: root_dir
        TD->>FS: tempfile.mkdtemp(dir=root_dir) -> temp_dir
        TD->>TD: DiskTensorConsumer writes chunks to temp_dir
        TD-->>FA: LazyTensorDict(key_to_file, temp_dir)
    end

    FA->>CTX: restore_enable_tensor_disk_offload(previous, root_dir)
    CTX->>CELL: update_fobs_context({enable=previous, root_dir=None})
    FA->>FS: shutil.rmtree(disk_offload_root_dir, ignore_errors=True)
    Note over FS: All temp_dirs under root_dir are wiped
Loading

Reviews (3): Last reviewed commit: "Add run-scoped tensor offload temp clean..." | Re-trigger Greptile

Comment thread nvflare/app_common/utils/tensor_disk_offload_context.py
Comment thread nvflare/app_opt/pt/tensor_downloader.py
@pcnudde pcnudde force-pushed the codex/run-scoped-temp-root branch from 8c86a95 to 940f543 Compare May 5, 2026 23:11
Comment thread nvflare/app_common/workflows/fedavg.py Outdated
@pcnudde pcnudde force-pushed the codex/run-scoped-temp-root branch from 940f543 to 4ebf761 Compare May 5, 2026 23:17
@pcnudde
Copy link
Copy Markdown
Collaborator Author

pcnudde commented May 5, 2026

/build

@pcnudde pcnudde requested a review from YuanTingHsieh May 5, 2026 23:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant