Reduce CoreML conversion peak + ANE steady-state memory (port PR #27 C++ levers to master) #28
Closed
ChinChangYang wants to merge 2 commits into
Cuts memory during the on-device KataGo -> CoreML conversion and while running the ANE/CoreML path, with byte-identical converter output:

- The converter's weight tensors become non-owning views into the parsed model instead of owning extra FP32 copies; derived/transposed tensors keep an owned buffer. This drops redundant resident weight copies during conversion. CoreML model serialization is made deterministic (SetSerializationDeterministic) so the output is byte-stable.
- The KataGo model parser streams the gzip through a bounded ~1 MB refill buffer instead of decompressing the whole file into memory, while preserving the existing NaN/Inf weight validation.
- ModelDesc gains releaseWeights(), which frees the in-memory weight arrays (keeping scalar shape metadata). The Metal backend calls it on the ANE (CoreML) path after converting from the model file on disk, gated by a new ComputeContext::aneOnly flag so it only fires when every configured device is ANE -- the GPU/MPSGraph path keeps its weights. The call is serialized under computeHandleMutex and only scalar dims are read afterward.

Measured on b18c384nbt (19x19) over the ANE path: idle steady-state RSS 0.59 GB -> 0.19 GB; peak (load+convert) 0.87 GB -> 0.48 GB. Cross-backend parity vs an Eigen reference is unchanged on both the GPU and ANE paths.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
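The view-versus-owned distinction in the first bullet can be pictured with a minimal sketch. `WeightTensor` and `transposed` are hypothetical names made up for illustration, not the converter's actual types; the sketch only assumes that parsed weights live in a buffer that outlives the view.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical illustration type; NOT the converter's real class.
class WeightTensor {
public:
  // Non-owning view: no copy is made. The caller must keep `data` alive
  // for as long as this tensor is used.
  static WeightTensor view(const float* data, std::size_t count) {
    WeightTensor t;
    t.view_ = data;
    t.count_ = count;
    t.owned_ = false;
    return t;
  }

  // Owning tensor: for derived data that has no backing store in the
  // parsed model (for example a transposed copy).
  static WeightTensor owned(std::vector<float> values) {
    WeightTensor t;
    t.count_ = values.size();
    t.storage_ = std::move(values);
    t.owned_ = true;
    return t;
  }

  // Resolved on every call, so copies and moves never leave a stale pointer.
  const float* data() const { return owned_ ? storage_.data() : view_; }
  std::size_t size() const { return count_; }
  bool ownsBuffer() const { return owned_; }

private:
  WeightTensor() = default;
  const float* view_ = nullptr;
  std::vector<float> storage_;
  std::size_t count_ = 0;
  bool owned_ = false;
};

// Transposes a row-major rows x cols matrix. The result is new data,
// so it has to own its buffer.
WeightTensor transposed(const WeightTensor& src, std::size_t rows, std::size_t cols) {
  std::vector<float> out(rows * cols);
  for (std::size_t r = 0; r < rows; r++) {
    for (std::size_t c = 0; c < cols; c++) {
      out[c * rows + r] = src.data()[r * cols + c];
    }
  }
  return WeightTensor::owned(std::move(out));
}
```

In this sketch, a tensor that is passed through unchanged costs one pointer and one size, while only derived tensors allocate. The lifetime requirement on the viewed buffer is the price of that saving.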
The build-macos-metal job cached CMakeCache.txt/build.ninja keyed only on CMakeLists.txt hashes. Those cached files bake in version-pinned Homebrew Cellar paths from pkg-config (e.g. -L/opt/homebrew/Cellar/protobuf/<ver>/lib). When Homebrew bumps protobuf/abseil, those paths vanish; an incremental `cmake .` does not refresh the cached pkg-config vars, so the re-link fails with "ld: library 'protobuf' not found".

Include the installed protobuf/abseil versions in the cache key and scope restore-keys to those versions, so a dependency bump invalidates the stale configure and forces a fresh pkg-config resolution instead of resurrecting dead Cellar paths.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Moved to lightvector#1202
Summary
Ports the C++ memory levers of #27 onto `master` to cut KataGo→CoreML conversion peak memory and ANE steady-state RSS. Output stays numerically equivalent (cross-backend parity preserved). Applied as three independent, separately-verified levers:

- The converter's weight tensors become non-owning views into the parsed model instead of extra FP32 copies (derived/transposed tensors keep an owned buffer), and CoreML serialization is made deterministic (`SetSerializationDeterministic(true)`).
- `KataGoParser` streams the gzip via a bounded ~1 MB refill buffer instead of holding the entire decompressed file; master's NaN/Inf weight validation is preserved.
- `ModelDesc::releaseWeights()` frees the engine's in-memory weight vectors after the ANE/CoreML path converts from disk, guarded by a new `ComputeContext::aneOnly` flag so it fires only when every configured device is `METAL_MUX_ANE` (the GPU/MPSGraph path keeps its weights). Serialized under `computeHandleMutex`; idempotent; only scalar dims are read afterward.

Why this differs from #27
#27 is based on the diverged `ios-dev`; this is a hand-port onto `master`. Excluded by request: iOS entitlements (A4), the new `runcoremlconverttests` suite, and #27's docs. A few review-driven hardenings were added on top of the verbatim port (gzFile leak-on-`bad_alloc` fix, `clearWeights`/`m_owned` consistency, documented load-bearing invariants).

Measured result (b18c384nbt, 19x19, ANE path)
- Idle steady-state RSS: 0.59 GB -> 0.19 GB
- Peak (load+convert): 0.87 GB -> 0.48 GB

Steady-state drop is A3 freeing the FP32 weights; peak drop combines A1+A2+A3. Larger nets (e.g. b40c768) scale the absolute savings up.

Test plan
- Metal backend build (`cmake -DUSE_BACKEND=METAL -G Ninja && ninja`)
- `./katago runtests` and `./katago runnnlayertests` pass
- `./katago testgpuerror` vs Eigen reference: GPU path parity (~0.00002% winrate err) and ANE path parity (avg 0.082% / max 0.186% winrate err, no NaN), proving the converter changes + weight-freeing produce correct inference
- … (`aneOnly` false)

Implementation plan included at `docs/superpowers/plans/2026-05-30-port-coreml-conversion-memory-levers.md`.

🤖 Generated with Claude Code
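The gating rule for the weight-release lever (release only when every configured device is ANE, under a mutex, idempotently, keeping scalar metadata) can be sketched as follows. Every type and function here is a hypothetical stand-in whose name loosely echoes the PR text; none of it is KataGo's actual code, and the real `ModelDesc` holds far more than a single vector.

```cpp
#include <algorithm>
#include <mutex>
#include <vector>

// Hypothetical stand-ins for illustration only.
enum class Device { GPU, ANE };

struct ModelDescSketch {
  int numChannels = 0;         // scalar shape metadata, kept after release
  std::vector<float> weights;  // bulk weight storage

  // Frees the weight storage. Calling it again is a harmless no-op.
  void releaseWeights() {
    // Swapping with an empty temporary hands the old buffer to the
    // temporary, which frees it when it is destroyed.
    std::vector<float>().swap(weights);
  }
};

struct ComputeContextSketch {
  bool aneOnly = false;

  // aneOnly is true only if at least one device is configured and all are ANE.
  explicit ComputeContextSketch(const std::vector<Device>& devices)
    : aneOnly(!devices.empty() &&
              std::all_of(devices.begin(), devices.end(),
                          [](Device d) { return d == Device::ANE; })) {}
};

// Assumed to be the same lock that guards compute-handle creation.
std::mutex computeHandleMutexSketch;

// Returns true when the gate allowed the release, false when weights were kept.
bool maybeReleaseWeights(const ComputeContextSketch& ctx, ModelDescSketch& desc) {
  std::lock_guard<std::mutex> lock(computeHandleMutexSketch);
  if (!ctx.aneOnly) {
    return false;  // a GPU path may still need the weights
  }
  desc.releaseWeights();
  return true;
}
```

A mixed ANE+GPU configuration leaves `aneOnly` false in this sketch, so the weights stay resident for the GPU path. Code that runs after a release may read `numChannels` but not `weights`.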