
Clear retained training result after accept events#4520

Open
pcnudde wants to merge 1 commit into NVIDIA:main from pcnudde:codex/clear-training-result-leak

Conversation

Collaborator

@pcnudde pcnudde commented May 5, 2026

What changed

This PR clears AppConstants.TRAINING_RESULT from the server-side FLContext immediately after the contribution-accept event window completes.

_accept_train_result() still stores the raw client Shareable in FLContext so existing BEFORE_CONTRIBUTION_ACCEPT / AFTER_CONTRIBUTION_ACCEPT handlers can inspect it. After those handlers run, _process_result() now resets the property value to None while preserving the existing private/sticky attributes.
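The clearing behavior described above can be sketched with a simplified stand-in for nvflare's FLContext. This is illustrative only: the real FLContext stores prop attributes differently, and `SimpleFLContext`/`clear_training_result` are hypothetical names, not the actual classes or helpers in the PR.

```python
class SimpleFLContext:
    """Simplified stand-in for nvflare's FLContext: each prop carries
    private/sticky flags alongside its value."""

    def __init__(self):
        self._props = {}

    def set_prop(self, key, value, private=True, sticky=True):
        self._props[key] = {"value": value, "private": private, "sticky": sticky}

    def get_prop(self, key, default=None):
        entry = self._props.get(key)
        return entry["value"] if entry is not None else default


def clear_training_result(fl_ctx, key="TRAINING_RESULT"):
    # Reset the value to None but keep the existing private/sticky
    # attributes, mirroring the behavior the PR describes.
    entry = fl_ctx._props.get(key)
    if entry is not None:
        fl_ctx.set_prop(key, None, private=entry["private"], sticky=entry["sticky"])
```

After `clear_training_result` runs, `get_prop` returns `None`, but the prop entry itself (and its private/sticky flags) survives for any later writer.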

Bug

In a normal FedAvg-style workflow, the server receives a client model update as a raw Shareable. BaseModelController._accept_train_result() stores that raw result under AppConstants.TRAINING_RESULT for contribution-accept event handlers.

After the result is accepted, _process_result() converts the raw Shareable into an FLModel and FedAvg aggregates it immediately. The communication/task path then clears client_task.result, but the same raw Shareable can still be reachable through:

controller.fl_ctx -> FLContext.props -> AppConstants.TRAINING_RESULT -> Shareable -> tensors

That keeps the client update alive even though FedAvg has already aggregated it. With disk tensor offload disabled, this can retain a full in-memory client model update. For a 5 GB model, for example, built-in streaming FedAvg can sit roughly one model-sized update (~5 GB) above what is necessary after each accepted result, and can briefly hold both the previously retained update and the next incoming update during result processing.

This is a lifetime bug, not allocator caching: the tensor objects are still strongly reachable from FLContext, so garbage collection cannot reclaim them.
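The lifetime argument can be demonstrated with a small, self-contained sketch: a plain dict plays the role of FLContext props, and a weakref shows that the object stays alive exactly as long as the prop pins it. `FakeTensor` is a hypothetical stand-in for a large client update, not an nvflare type.

```python
import gc
import weakref


class FakeTensor:
    """Stands in for a large in-memory client model update."""

    def __init__(self, nbytes):
        self.data = bytearray(nbytes)


props = {}                             # plays the role of FLContext props
update = FakeTensor(1024)
props["TRAINING_RESULT"] = update      # what storing the raw result effectively does
ref = weakref.ref(update)

del update                             # aggregation done, local reference dropped
gc.collect()
alive_before_clear = ref() is not None  # still pinned by the retained prop

props["TRAINING_RESULT"] = None        # the fix: clear the retained prop
gc.collect()
alive_after_clear = ref() is not None  # nothing pins it anymore
```

The weakref stays live after `del update` because the props dict still holds a strong reference; only clearing the prop lets the collector reclaim the tensor.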

Benefit

The raw client result is now scoped to the event window where it is intended to be visible. After that point, only the objects FedAvg actually needs remain reachable, such as the aggregation accumulator and the updated/global model state.

For large in-memory models, this avoids retaining an unnecessary extra client update after aggregation. In the 5 GB model example, steady memory after aggregating a result can drop from roughly global model + accumulator + retained raw result to just global model + accumulator.
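The steady-state accounting in the 5 GB example can be written out explicitly. The assumption that the aggregation accumulator is roughly model-sized is mine, for illustration; actual sizes depend on the aggregator and model format.

```python
GB = 1024 ** 3

model_size = 5 * GB      # global model held by the server
accumulator = 5 * GB     # assume the aggregation accumulator is ~model-sized
retained_raw = 5 * GB    # raw client Shareable pinned via TRAINING_RESULT

steady_before_fix = model_size + accumulator + retained_raw  # ~15 GB
steady_after_fix = model_size + accumulator                  # ~10 GB
savings = steady_before_fix - steady_after_fix               # one model-sized update
```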

@pcnudde pcnudde requested a review from YuanTingHsieh May 5, 2026 19:59
@pcnudde pcnudde marked this pull request as ready for review May 5, 2026 19:59
Collaborator Author

pcnudde commented May 5, 2026

/build

Contributor

greptile-apps Bot commented May 5, 2026

Greptile Summary

This PR fixes a memory retention bug where the raw client Shareable stored under AppConstants.TRAINING_RESULT remained strongly reachable via FLContext props long after FedAvg had already aggregated it. The fix wraps the contribution-accept event window in a try/finally and introduces _clear_training_result to null out the prop immediately after handlers finish, while preserving the original private/sticky metadata.

  • _clear_training_result is a new @staticmethod that reads existing prop attributes before overwriting the value with None, keeping attribute semantics consistent.
  • _process_result now guarantees cleanup even when BEFORE_CONTRIBUTION_ACCEPT, _accept_train_result, or AFTER_CONTRIBUTION_ACCEPT raise exceptions.
  • process_result_of_unknown_task shares the same _accept_train_result call path that sets TRAINING_RESULT, but does not invoke the new helper, leaving the raw Shareable retained on that code path.
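The guaranteed-cleanup flow in the first two bullets can be sketched as follows. The callables are injected stubs for readability; in nvflare these are methods on BaseModelController and the event names are framework constants, so treat this as a shape sketch rather than the actual implementation.

```python
def process_result(fl_ctx, result, fire_event, accept_train_result, clear_training_result):
    """Simplified sketch of the try/finally event window described above."""
    try:
        fire_event("BEFORE_CONTRIBUTION_ACCEPT", fl_ctx)
        accepted = accept_train_result(result, fl_ctx)  # stores TRAINING_RESULT
        fire_event("AFTER_CONTRIBUTION_ACCEPT", fl_ctx)
        return accepted
    finally:
        # Runs even if acceptance or a handler raises, so the raw
        # Shareable never outlives the contribution-accept window.
        clear_training_result(fl_ctx)
```

The point of the `finally` is the exception path: a handler that raises during `AFTER_CONTRIBUTION_ACCEPT` still leaves the context with the result cleared.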

Confidence Score: 4/5

Safe to merge for the main _process_result path; the fix is correct and well-tested. The process_result_of_unknown_task path still retains TRAINING_RESULT after acceptance, but this is a pre-existing gap rather than a regression introduced here.

The core fix in _process_result is correct and covered by two targeted tests. The one gap is that process_result_of_unknown_task calls _accept_train_result — which sets TRAINING_RESULT — but never calls _clear_training_result, so the same object lifetime issue the PR describes persists on that alternate path.

nvflare/app_common/workflows/base_model_controller.py — specifically the process_result_of_unknown_task method, which was not updated to call _clear_training_result.

Important Files Changed

  • nvflare/app_common/workflows/base_model_controller.py — Adds the _clear_training_result static helper and wraps the BEFORE/AFTER_CONTRIBUTION_ACCEPT event window in a try/finally to clear AppConstants.TRAINING_RESULT after every accepted result. process_result_of_unknown_task uses the same _accept_train_result but does not call the new helper, leaving the original retention issue alive on that code path.
  • tests/unit_test/app_common/workflow/fedavg_test.py — Adds two targeted regression tests: one verifying the TRAINING_RESULT prop is None after _process_result while still visible inside AFTER_CONTRIBUTION_ACCEPT, and one verifying weakref-based tensor reclamation after gc.collect(). Both tests are well-scoped and cover the intended fix.

Sequence Diagram

sequenceDiagram
    participant RT as result_received_cb
    participant PR as _process_result
    participant AT as _accept_train_result
    participant EV as event bus
    participant FL as FLContext

    RT->>PR: client_task, fl_ctx
    PR->>EV: BEFORE_CONTRIBUTION_ACCEPT
    PR->>AT: client_name, result, fl_ctx
    AT->>FL: set_prop(TRAINING_RESULT, result)
    AT-->>PR: accepted=True
    PR->>EV: AFTER_CONTRIBUTION_ACCEPT
    Note over EV,FL: Handlers can read TRAINING_RESULT here
    PR->>FL: _clear_training_result() [finally]
    FL-->>FL: set_prop(TRAINING_RESULT, None)
    PR->>PR: FLModelUtils.from_shareable(result)
    PR->>RT: callback(result_model)
    PR->>RT: client_task.result = None

Comments Outside Diff (1)

  1. nvflare/app_common/workflows/base_model_controller.py, lines 300-310

    P2 process_result_of_unknown_task does not clear TRAINING_RESULT

    process_result_of_unknown_task calls the same _accept_train_result that sets AppConstants.TRAINING_RESULT in fl_ctx on acceptance, but never calls _clear_training_result. The raw client Shareable will therefore remain reachable through fl_ctx's TRAINING_RESULT prop for late/unknown-task results — exactly the lifetime issue this PR intends to fix. Since _clear_training_result is now a standalone utility, applying it here would close the gap.

Reviews (1): Last reviewed commit: "Clear retained training result after acc..."

Collaborator Author

pcnudde commented May 5, 2026

Thanks for the review. The process_result_of_unknown_task() path is a different, late/unknown-task handling path and is intentionally out of scope for this PR.

This change is focused on the normal FedAvg result-processing path where accepted client training results are converted, aggregated, and then should be released immediately. In that path, TRAINING_RESULT remains available for the contribution-accept event window and is cleared right after, which fixes the large in-memory model retention case described in the PR.

I agree the unknown-task path can be evaluated separately, but it is not important for this specific FedAvg memory-retention fix.

@pcnudde pcnudde requested a review from nvidianz May 5, 2026 21:13
@pcnudde
Collaborator Author

pcnudde commented May 5, 2026

/build

@pcnudde pcnudde changed the title from "[codex] Clear retained training result after accept events" to "Clear retained training result after accept events" May 5, 2026
@pcnudde
Collaborator Author

pcnudde commented May 5, 2026

/build
