Add run-scoped tensor offload temp cleanup#4522
Conversation
51ac0db to
8c86a95
Compare
|
/build |
Greptile SummaryThis PR introduces a run-scoped temp directory for tensor disk offload in FedAvg. A root dir is created at the start of
Confidence Score: 5/5Safe to merge — the core cleanup mechanism is correct and all three prior review concerns have been addressed. The root dir is created, registered in FOBS context, and destroyed entirely within the same try/finally guard. download_tensors_to_disk now raises a clear RuntimeError instead of silently falling back to the system temp dir. The one structural gap — restore writing TENSOR_DISK_OFFLOAD_ROOT_DIR=None without saving a pre-existing value — only matters in nested-scope scenarios that do not exist in the current codebase. The small nesting edge-case in tensor_disk_offload_context.py is worth tracking if the disk-offload API is ever extended beyond FedAvg. Important Files Changed
Sequence DiagramsequenceDiagram
participant FA as FedAvg.run()
participant CTX as tensor_disk_offload_context
participant CELL as Cell FOBS context
participant TD as tensor_downloader
participant FS as Filesystem
FA->>FS: tempfile.mkdtemp() -> disk_offload_root_dir
FA->>CTX: apply_enable_tensor_disk_offload(enabled=True, root_dir)
CTX->>CELL: update_fobs_context({enable=True, root_dir=root_dir})
loop Each round
FA->>TD: send_model triggers download_tensors_to_disk
TD->>CELL: get_fobs_context()[TENSOR_DISK_OFFLOAD_ROOT_DIR]
CELL-->>TD: root_dir
TD->>FS: tempfile.mkdtemp(dir=root_dir) -> temp_dir
TD->>TD: DiskTensorConsumer writes chunks to temp_dir
TD-->>FA: LazyTensorDict(key_to_file, temp_dir)
end
FA->>CTX: restore_enable_tensor_disk_offload(previous, root_dir)
CTX->>CELL: update_fobs_context({enable=previous, root_dir=None})
FA->>FS: shutil.rmtree(disk_offload_root_dir, ignore_errors=True)
Note over FS: All temp_dirs under root_dir are wiped
Reviews (3): Last reviewed commit: "Add run-scoped tensor offload temp clean..." | Re-trigger Greptile |
8c86a95 to
940f543
Compare
940f543 to
4ebf761
Compare
|
/build |
Summary
Catchall solution for disk tensor offload cleanup without having to find each memory leak