
Release temp .mlpackage eagerly in MLModel.__del__ #2671

Open

john-rocky wants to merge 1 commit into apple:main from
john-rocky:fix-mlmodel-temp-package-leak

Conversation

@john-rocky
Contributor

Release temp .mlpackage eagerly in MLModel

Fixes #2670

What this changes

Adds MLModel.close() and MLModel.__del__ so the temporary
.mlpackage allocated by _create_mlpackage (when an MLModel is
constructed from a Model_pb2 spec) is released when the Python
object goes out of scope, rather than only at interpreter exit.

The existing atexit.register(cleanup, self.package_path) is kept as
a safety net for long-lived objects that are still alive at shutdown.

Why

coremltools.optimize.coreml.experimental.linear_quantize_activations
constructs one temp MLModel per op-group batch during calibration
(see _ModelDebugger.predict_intermediate_outputs). Because cleanup
was tied to atexit, a single calibration run on a medium-sized
transformer leaks tens of gigabytes of $TMPDIR before the process
finishes — often enough to fill the disk and crash the calibration
halfway through.

Reproduced on coremltools==9.0, macOS 15 (Darwin 25.0.0), Apple
Silicon, Python 3.12. The leak measures +6 temp .mlpackage /
+402.9 MB on a toy 8-layer, 2048-dim reproducer with 4 calibration
samples, and ~38 GB in realistic Gemma 4 E2B chunk2 runs. With
the patch applied, the same reproducer leaks 0 directories / 0 MB.

Approach

  • Drop self.__proxy__ first so the CoreML runtime can close its mmap
    of the compiled .mlmodelc before the Python side removes the
    backing directory.
  • rmtree(self.package_path, ignore_errors=True) only when
    is_temp_package is true — user-provided paths are never touched.
  • close() is idempotent and safe to call more than once.
  • __del__ swallows all exceptions (interpreter-shutdown-safe).
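The steps above might be sketched roughly as follows. This is a simplified stand-in, not the actual coremltools class: the constructor, the proxy placeholder, and the attribute handling are assumptions made so the sketch is self-contained, while `package_path`, `is_temp_package`, and `__proxy__` are the attribute names the PR description refers to.

```python
import shutil

class MLModel:
    """Simplified sketch of the cleanup behavior (not the real class)."""

    def __init__(self, package_path, is_temp_package=False):
        self.package_path = package_path
        self.is_temp_package = is_temp_package
        self.__proxy__ = object()  # stands in for the CoreML runtime proxy

    def close(self):
        # Drop the proxy first so the runtime can close its mmap of the
        # compiled .mlmodelc before the backing directory is removed.
        self.__proxy__ = None
        # Only ever delete directories this object created itself.
        if self.is_temp_package and self.package_path is not None:
            shutil.rmtree(self.package_path, ignore_errors=True)
            self.package_path = None
            self.is_temp_package = False  # makes close() idempotent

    def __del__(self):
        try:
            self.close()
        except Exception:
            # Attributes may already be torn down at interpreter shutdown.
            pass
```

The ordering matters: removing the directory while the runtime still maps the compiled artifact could leave the process holding references to unlinked files, so the proxy is released before `rmtree`.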

Test plan

New file coremltools/test/modelpackage/test_mlmodel_temp_cleanup.py
(included in this PR) covers:

  • close() removes the temp package and is idempotent.
  • __del__ removes the temp package when the Python object goes out
    of scope.
  • A loop of 5 MLModel constructions + del leaves the tmp-dir count
    unchanged (the regression test for the activation-calibration bug).
  • User-owned .mlpackage paths loaded from disk are never flagged
    is_temp_package and must survive del.

Plus the existing suite: pytest coremltools/test/modelpackage -q.

Backwards compatibility

close() and __del__ are additive. The only observable change is
that temp packages disappear sooner than they used to. Any user code
that was relying on the temp path being readable after the MLModel
was destroyed (there should be none — the directory was already
documented as internal) should switch to MLModel.save(path).

@TobyRoseman
Collaborator

What happens if someone tries to use a model after calling .close()? I think maybe close should be an underscore method.

class TestMLModelTempCleanup:
    def test_close_removes_temp_package(self, tmp_path, monkeypatch):
        """`MLModel.close()` deletes the temp .mlpackage and is idempotent."""
        monkeypatch.setenv("TMPDIR", str(tmp_path))
Collaborator


Is there another way we could do this? If we ever run tests in parallel, I think this will cause issues.

Contributor Author


Good catch. Rewrote the tests to never touch $TMPDIR or tempfile.tempdir; instead each test captures the specific model.package_path it just created and asserts cleanup on that path only. Counting tmp*.mlpackage in the shared tmpdir was both racy under parallel tests and weaker than necessary: the new test_no_leak_across_many_mlmodel_constructions collects every model's own path into a list and asserts each one is gone after del, which is a strictly stronger claim.

Verified under pytest -n 4 (pytest-xdist) locally: 4 passed, no races.

test_user_provided_package_path_is_not_deleted still uses tmp_path, but only as the destination for MLModel.save() — it's the per-test unique directory pytest hands out, not a $TMPDIR override, so still parallel-safe.
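The path-capture pattern described above might look roughly like this. `make_temp_mlmodel` and `_FakeModel` are hypothetical stand-ins (the real test would build an `MLModel` from a `Model_pb2` spec); only the capture-and-assert shape reflects the PR description.

```python
import os
import shutil
import tempfile

def make_temp_mlmodel():
    """Stand-in for constructing an MLModel from a spec: creates a
    throwaway .mlpackage directory and an object that removes it in
    __del__, mirroring the patched behavior."""
    class _FakeModel:
        def __init__(self):
            self.package_path = tempfile.mkdtemp(suffix=".mlpackage")
        def __del__(self):
            shutil.rmtree(self.package_path, ignore_errors=True)
    return _FakeModel()

# Capture each model's own package_path instead of counting entries in a
# shared tmpdir -- no global state, so safe under pytest-xdist.
created = []
for _ in range(5):
    m = make_temp_mlmodel()
    created.append(m.package_path)
    del m

leaked = [p for p in created if os.path.exists(p)]
```

Asserting on `leaked` being empty checks every model's directory individually rather than a directory count that another worker could perturb.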

`MLModel` constructed from a `Model_pb2` spec allocates a temporary
`.mlpackage` under `$TMPDIR` via `_create_mlpackage`. Historically
the only cleanup path was `atexit.register(cleanup, self.package_path)`,
so the directory survived until the Python process exited.

In workflows that construct many short-lived MLModel objects — most
visibly `coremltools.optimize.coreml.experimental.linear_quantize_activations`,
which builds one temporary model per op-group batch in
`_ModelDebugger.predict_intermediate_outputs` — this caused tens of
gigabytes of leaked temp directories.

Measured on macOS 15 / Apple Silicon / coremltools 9.0 / Python 3.12:

  - 8-layer 2048-dim toy model, 4 calibration samples:
      leaks +6 `.mlpackage` dirs / +402.9 MB per call.
  - Gemma 4 E2B chunk2 (~300 MB per package):
      ~38 GB retained before the process finishes; fills a 49 GB disk.

With this patch both numbers drop to 0.

This change adds `MLModel.close()` and `MLModel.__del__` that release
the temp package as soon as the Python object goes out of scope.
The C++ proxy is dropped first so the compiled `.mlmodelc` mmap is
released before `rmtree`. User-owned `.mlpackage` paths loaded from
disk are never flagged `is_temp_package` and are never touched.

The existing `atexit` hook is kept as a safety net for objects that
are still alive at interpreter shutdown.

Includes regression tests under
`coremltools/test/modelpackage/test_mlmodel_temp_cleanup.py` that
cover `close()` idempotency, `__del__` cleanup, no-leak across many
constructions, and that user-provided paths are preserved.
@john-rocky
Contributor Author

Thanks for the review. Agreed — renamed to _close in 8846dbd8. Rationale for the underscore:

  • The real fix for the motivating workload (activation-statistics collection) is __del__, which needs no public API. Nobody has to call _close directly for the leak to go away.
  • After _close returns the instance can't be used for predict (proxy is dropped, backing temp dir is gone). Making it public would force us to define "is this model still usable?" checks across every method, which is a lot of surface for a marginal use case.
  • Underscore signals "internal: use del model or let GC do it" without closing the door on power-user calls.

Updated the docstring to say so explicitly. Force-pushed rather than adding a fixup commit since the PR is still a single logical change — let me know if you'd prefer fixup commits kept around for easier re-review.

Also rebased on top of latest main (8147ec18) while I was in there.


Development

Successfully merging this pull request may close these issues.

linear_quantize_activations leaks ~1 temp .mlpackage per calibration op-group (tens of GB on realistic models)

2 participants