Clear retained training result after accept events#4520
Clear retained training result after accept events#4520pcnudde wants to merge 1 commit intoNVIDIA:mainfrom
Conversation
|
/build |
Greptile SummaryThis PR fixes a memory retention bug where the raw client
Confidence Score: 4/5Safe to merge for the main The core fix in
Important Files Changed
Sequence DiagramsequenceDiagram
participant RT as result_received_cb
participant PR as _process_result
participant AT as _accept_train_result
participant EV as event bus
participant FL as FLContext
RT->>PR: client_task, fl_ctx
PR->>EV: BEFORE_CONTRIBUTION_ACCEPT
PR->>AT: client_name, result, fl_ctx
AT->>FL: set_prop(TRAINING_RESULT, result)
AT-->>PR: accepted=True
PR->>EV: AFTER_CONTRIBUTION_ACCEPT
Note over EV,FL: Handlers can read TRAINING_RESULT here
PR->>FL: _clear_training_result() [finally]
FL-->>FL: set_prop(TRAINING_RESULT, None)
PR->>PR: FLModelUtils.from_shareable(result)
PR->>RT: callback(result_model)
PR->>RT: client_task.result = None
|
|
Thanks for the review. The This change is focused on the normal FedAvg result-processing path where accepted client training results are converted, aggregated, and then should be released immediately. In that path, I agree the unknown-task path can be evaluated separately, but it is not important for this specific FedAvg memory-retention fix. |
|
/build |
|
/build |
What changed
This PR clears
AppConstants.TRAINING_RESULTfrom the server-sideFLContextimmediately after the contribution-accept event window completes._accept_train_result()still stores the raw clientShareableinFLContextso existingBEFORE_CONTRIBUTION_ACCEPT/AFTER_CONTRIBUTION_ACCEPThandlers can inspect it. After those handlers run,_process_result()now resets the property value toNonewhile preserving the existing private/sticky attributes.Bug
In a normal FedAvg-style workflow, the server receives a client model update as a raw
Shareable.BaseModelController._accept_train_result()stores that raw result underAppConstants.TRAINING_RESULTfor contribution-accept event handlers.After the result is accepted,
_process_result()converts the rawShareableinto anFLModeland FedAvg aggregates it immediately. The communication/task path then clearsclient_task.result, but the same rawShareablecan still be reachable through:That keeps the client update live even though FedAvg has already aggregated it. With disk tensor offload disabled, this can retain a full in-memory client model update. For example, with a 5 GB model, built-in streaming FedAvg can remain roughly one extra model-sized update higher than necessary after each accepted result, and can briefly hold the previous retained update plus the next incoming update during result processing.
This is a lifetime bug, not allocator caching: the tensor objects are still strongly reachable from
FLContext, so garbage collection cannot reclaim them.Benefit
The raw client result is now scoped to the event window where it is intended to be visible. After that point, only the objects FedAvg actually needs remain reachable, such as the aggregation accumulator and the updated/global model state.
For large in-memory models, this avoids retaining an unnecessary extra client update after aggregation. In the 5 GB model example, steady memory after aggregating a result can drop from roughly
global model + accumulator + retained raw resultto justglobal model + accumulator.