
Clear retained training result after accept events#4520

Open
pcnudde wants to merge 1 commit into NVIDIA:main from pcnudde:codex/clear-training-result-leak

Conversation

Collaborator

@pcnudde pcnudde commented May 5, 2026

What changed

This PR clears AppConstants.TRAINING_RESULT from the server-side FLContext immediately after the contribution-accept event window completes.

_accept_train_result() still stores the raw client Shareable in FLContext so existing BEFORE_CONTRIBUTION_ACCEPT / AFTER_CONTRIBUTION_ACCEPT handlers can inspect it. After those handlers run, _process_result() now resets the property value to None while preserving the existing private/sticky attributes.
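The clearing behavior described above can be sketched with a simplified stand-in for nvflare's FLContext. This is illustrative only: the real FLContext stores prop attributes differently, and `SimpleFLContext`/`clear_training_result` are hypothetical names, not the actual classes or helpers in the PR.

```python
class SimpleFLContext:
    """Simplified stand-in for nvflare's FLContext: each prop carries
    private/sticky flags alongside its value."""

    def __init__(self):
        self._props = {}

    def set_prop(self, key, value, private=True, sticky=True):
        self._props[key] = {"value": value, "private": private, "sticky": sticky}

    def get_prop(self, key, default=None):
        entry = self._props.get(key)
        return entry["value"] if entry is not None else default


def clear_training_result(fl_ctx, key="TRAINING_RESULT"):
    # Reset the value to None but keep the existing private/sticky
    # attributes, mirroring the behavior the PR describes.
    entry = fl_ctx._props.get(key)
    if entry is not None:
        fl_ctx.set_prop(key, None, private=entry["private"], sticky=entry["sticky"])
```

After `clear_training_result` runs, `get_prop` returns `None`, but the prop entry itself (and its private/sticky flags) survives for any later writer.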

Bug

In a normal FedAvg-style workflow, the server receives a client model update as a raw Shareable. BaseModelController._accept_train_result() stores that raw result under AppConstants.TRAINING_RESULT for contribution-accept event handlers.

After the result is accepted, _process_result() converts the raw Shareable into an FLModel and FedAvg aggregates it immediately. The communication/task path then clears client_task.result, but the same raw Shareable can still be reachable through:

controller.fl_ctx -> FLContext.props -> AppConstants.TRAINING_RESULT -> Shareable -> tensors

That keeps the client update alive even though FedAvg has already aggregated it. With disk tensor offload disabled, this can retain a full in-memory client model update. For a 5 GB model, for example, built-in streaming FedAvg can sit roughly one model-sized update (~5 GB) above what is necessary after each accepted result, and can briefly hold both the previously retained update and the next incoming update during result processing.

This is a lifetime bug, not allocator caching: the tensor objects are still strongly reachable from FLContext, so garbage collection cannot reclaim them.
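The lifetime argument can be demonstrated with a small, self-contained sketch: a plain dict plays the role of FLContext props, and a weakref shows that the object stays alive exactly as long as the prop pins it. `FakeTensor` is a hypothetical stand-in for a large client update, not an nvflare type.

```python
import gc
import weakref


class FakeTensor:
    """Stands in for a large in-memory client model update."""

    def __init__(self, nbytes):
        self.data = bytearray(nbytes)


props = {}                             # plays the role of FLContext props
update = FakeTensor(1024)
props["TRAINING_RESULT"] = update      # what storing the raw result effectively does
ref = weakref.ref(update)

del update                             # aggregation done, local reference dropped
gc.collect()
alive_before_clear = ref() is not None  # still pinned by the retained prop

props["TRAINING_RESULT"] = None        # the fix: clear the retained prop
gc.collect()
alive_after_clear = ref() is not None  # nothing pins it anymore
```

The weakref stays live after `del update` because the props dict still holds a strong reference; only clearing the prop lets the collector reclaim the tensor.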

Benefit

The raw client result is now scoped to the event window where it is intended to be visible. After that point, only the objects FedAvg actually needs remain reachable, such as the aggregation accumulator and the updated/global model state.

For large in-memory models, this avoids retaining an unnecessary extra client update after aggregation. In the 5 GB model example, steady memory after aggregating a result can drop from roughly global model + accumulator + retained raw result to just global model + accumulator.
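The steady-state accounting in the 5 GB example can be written out explicitly. The assumption that the aggregation accumulator is roughly model-sized is mine, for illustration; actual sizes depend on the aggregator and model format.

```python
GB = 1024 ** 3

model_size = 5 * GB      # global model held by the server
accumulator = 5 * GB     # assume the aggregation accumulator is ~model-sized
retained_raw = 5 * GB    # raw client Shareable pinned via TRAINING_RESULT

steady_before_fix = model_size + accumulator + retained_raw  # ~15 GB
steady_after_fix = model_size + accumulator                  # ~10 GB
savings = steady_before_fix - steady_after_fix               # one model-sized update
```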

@pcnudde pcnudde requested a review from YuanTingHsieh May 5, 2026 19:59
@pcnudde pcnudde marked this pull request as ready for review May 5, 2026 19:59
Collaborator Author

pcnudde commented May 5, 2026

/build

Contributor

greptile-apps Bot commented May 5, 2026

Greptile Summary

This PR fixes a memory retention bug where the raw client Shareable stored under AppConstants.TRAINING_RESULT remained strongly reachable via FLContext props long after FedAvg had already aggregated it. The fix wraps the contribution-accept event window in a try/finally and introduces _clear_training_result to null out the prop immediately after handlers finish, while preserving the original private/sticky metadata.

  • _clear_training_result is a new @staticmethod that reads existing prop attributes before overwriting the value with None, keeping attribute semantics consistent.
  • _process_result now guarantees cleanup even when BEFORE_CONTRIBUTION_ACCEPT, _accept_train_result, or AFTER_CONTRIBUTION_ACCEPT raise exceptions.
  • process_result_of_unknown_task shares the same _accept_train_result call path that sets TRAINING_RESULT, but does not invoke the new helper, leaving the raw Shareable retained on that code path.
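The guaranteed-cleanup flow in the first two bullets can be sketched as follows. The callables are injected stubs for readability; in nvflare these are methods on BaseModelController and the event names are framework constants, so treat this as a shape sketch rather than the actual implementation.

```python
def process_result(fl_ctx, result, fire_event, accept_train_result, clear_training_result):
    """Simplified sketch of the try/finally event window described above."""
    try:
        fire_event("BEFORE_CONTRIBUTION_ACCEPT", fl_ctx)
        accepted = accept_train_result(result, fl_ctx)  # stores TRAINING_RESULT
        fire_event("AFTER_CONTRIBUTION_ACCEPT", fl_ctx)
        return accepted
    finally:
        # Runs even if acceptance or a handler raises, so the raw
        # Shareable never outlives the contribution-accept window.
        clear_training_result(fl_ctx)
```

The point of the `finally` is the exception path: a handler that raises during `AFTER_CONTRIBUTION_ACCEPT` still leaves the context with the result cleared.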

Confidence Score: 4/5

Safe to merge for the main _process_result path; the fix is correct and well-tested. The process_result_of_unknown_task path still retains TRAINING_RESULT after acceptance, but this is a pre-existing gap rather than a regression introduced here.

The core fix in _process_result is correct and covered by two targeted tests. The one gap is that process_result_of_unknown_task calls _accept_train_result — which sets TRAINING_RESULT — but never calls _clear_training_result, so the same object lifetime issue the PR describes persists on that alternate path.

nvflare/app_common/workflows/base_model_controller.py — specifically the process_result_of_unknown_task method, which was not updated to call _clear_training_result.

Important Files Changed

  • nvflare/app_common/workflows/base_model_controller.py — Adds the _clear_training_result static helper and wraps the BEFORE/AFTER_CONTRIBUTION_ACCEPT event window in a try/finally to clear AppConstants.TRAINING_RESULT after every accepted result. process_result_of_unknown_task uses the same _accept_train_result but does not call the new helper, leaving the original retention issue alive on that code path.
  • tests/unit_test/app_common/workflow/fedavg_test.py — Adds two targeted regression tests: one verifying the TRAINING_RESULT prop is None after _process_result while still visible inside AFTER_CONTRIBUTION_ACCEPT, and one verifying weakref-based tensor reclamation after gc.collect(). Both tests are well-scoped and cover the intended fix.

Sequence Diagram

sequenceDiagram
    participant RT as result_received_cb
    participant PR as _process_result
    participant AT as _accept_train_result
    participant EV as event bus
    participant FL as FLContext

    RT->>PR: client_task, fl_ctx
    PR->>EV: BEFORE_CONTRIBUTION_ACCEPT
    PR->>AT: client_name, result, fl_ctx
    AT->>FL: set_prop(TRAINING_RESULT, result)
    AT-->>PR: accepted=True
    PR->>EV: AFTER_CONTRIBUTION_ACCEPT
    Note over EV,FL: Handlers can read TRAINING_RESULT here
    PR->>FL: _clear_training_result() [finally]
    FL-->>FL: set_prop(TRAINING_RESULT, None)
    PR->>PR: FLModelUtils.from_shareable(result)
    PR->>RT: callback(result_model)
    PR->>RT: client_task.result = None

Comments Outside Diff (1)

  1. nvflare/app_common/workflows/base_model_controller.py, lines 300-310

    P2 process_result_of_unknown_task does not clear TRAINING_RESULT

    process_result_of_unknown_task calls the same _accept_train_result that sets AppConstants.TRAINING_RESULT in fl_ctx on acceptance, but never calls _clear_training_result. The raw client Shareable will therefore remain reachable through fl_ctx's TRAINING_RESULT prop for late/unknown-task results — exactly the lifetime issue this PR intends to fix. Since _clear_training_result is now a standalone utility, applying it here would close the gap.

Reviews (1): Last reviewed commit: "Clear retained training result after acc..."

Collaborator Author

pcnudde commented May 5, 2026

Thanks for the review. The process_result_of_unknown_task() path is a different, late/unknown-task handling path and is intentionally out of scope for this PR.

This change is focused on the normal FedAvg result-processing path where accepted client training results are converted, aggregated, and then should be released immediately. In that path, TRAINING_RESULT remains available for the contribution-accept event window and is cleared right after, which fixes the large in-memory model retention case described in the PR.

I agree the unknown-task path can be evaluated separately, but it is not important for this specific FedAvg memory-retention fix.

@pcnudde pcnudde requested a review from nvidianz May 5, 2026 21:13
@pcnudde
Collaborator Author

pcnudde commented May 5, 2026

/build

@pcnudde pcnudde changed the title from "[codex] Clear retained training result after accept events" to "Clear retained training result after accept events" May 5, 2026
@pcnudde
Collaborator Author

pcnudde commented May 5, 2026

/build
