Reduce CoreML conversion peak + ANE steady-state memory (port PR #27 C++ levers to master) by ChinChangYang · Pull Request #28 · ChinChangYang/KataGo

ChinChangYang · 2026-05-30T07:25:57Z

Summary

Ports the C++ memory levers of #27 onto master to cut KataGo→CoreML conversion peak memory and ANE steady-state RSS. Output stays numerically equivalent (cross-backend parity preserved). Applied as three independent, separately-verified levers:

A1: converter weight tensors become non-owning views into the parsed model, which drops 2 of the 3 resident FP32 weight copies; derived/transposed tensors keep their data in an owned-buffer deque. Protobuf serialization is also made deterministic (SetSerializationDeterministic(true)).
A2: KataGoParser streams the gzip through a bounded ~1 MB refill buffer instead of holding the entire decompressed file in memory; master's NaN/Inf weight validation is preserved.
A3: ModelDesc::releaseWeights() frees the engine's in-memory weight vectors after the ANE/CoreML path converts from disk. It is guarded by a new ComputeContext::aneOnly flag so it fires only when every configured device is METAL_MUX_ANE (the GPU/MPSGraph path keeps its weights). The call is serialized under computeHandleMutex and is idempotent, and only scalar dims are read afterward.
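To illustrate the A1 idea, here is a minimal sketch of a non-owning weight view next to a deque of owned buffers for derived tensors. It is not the PR's code: `WeightView`, `OwnedBuffers`, `viewOf`, and `transposed` are hypothetical names invented for this example. The one load-bearing fact it relies on is that `std::deque` does not relocate existing elements when new ones are appended at the end, so views handed out earlier stay valid.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <utility>
#include <vector>

// Hypothetical stand-in for a converter tensor: it points at floats it does not own.
struct WeightView {
  const float* data = nullptr;
  std::size_t size = 0;
};

// Hypothetical pool for buffers the converter has to create itself (e.g. transposes).
// Appending to a std::deque leaves references to existing elements valid, so a view
// into an earlier buffer is not invalidated by later adopt() calls.
class OwnedBuffers {
 public:
  WeightView adopt(std::vector<float> buf) {
    owned_.push_back(std::move(buf));
    const std::vector<float>& stored = owned_.back();
    return WeightView{stored.data(), stored.size()};
  }
  std::size_t count() const { return owned_.size(); }

 private:
  std::deque<std::vector<float>> owned_;
};

// View straight into the parsed model's storage: no copy is made.
// The parsed vector must outlive the returned view.
inline WeightView viewOf(const std::vector<float>& parsed) {
  return WeightView{parsed.data(), parsed.size()};
}

// Derived tensor: transpose a row-major [rows x cols] matrix into a new owned buffer.
inline WeightView transposed(const WeightView& src, std::size_t rows,
                             std::size_t cols, OwnedBuffers& pool) {
  assert(src.size == rows * cols);
  std::vector<float> out(src.size);
  for (std::size_t r = 0; r < rows; ++r) {
    for (std::size_t c = 0; c < cols; ++c) {
      out[c * rows + r] = src.data[r * cols + c];
    }
  }
  return pool.adopt(std::move(out));
}
```

In this sketch only tensors that need a different layout cost extra memory; everything else aliases the parsed model. The price is a lifetime rule: the parsed model and the pool must both outlive every view.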

Why this differs from #27

#27 is based on the diverged ios-dev; this is a hand-port onto master. Excluded by request: iOS entitlements (A4), the new runcoremlconverttests suite, and #27's docs. A few review-driven hardenings were added on top of the verbatim port (gzFile leak-on-bad_alloc fix, clearWeights/m_owned consistency, documented load-bearing invariants).
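The bounded-refill reading used by A2, together with the preserved NaN/Inf check, can be sketched roughly as follows. This is an illustration under assumptions, not the ported parser: `BoundedReader` and `readValidatedFloats` are hypothetical names, and the byte source is abstracted as a callback. In real code that callback would presumably wrap zlib's `gzread`, with the `gzFile` held by an RAII owner so that an exception such as `std::bad_alloc` cannot leak the handle.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstring>
#include <functional>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical bounded reader: pulls bytes from `refill` in chunks of at most
// `capacity`, so resident memory for the input stays fixed regardless of file size.
class BoundedReader {
 public:
  // Returns the number of bytes written to dst (0 means end of data).
  using Refill = std::function<std::size_t(char* dst, std::size_t maxBytes)>;

  BoundedReader(Refill refill, std::size_t capacity)
      : refill_(std::move(refill)), buf_(capacity) {
    assert(capacity > 0);
  }

  // Copies exactly n bytes into dst, refilling the fixed buffer as needed.
  void readExact(char* dst, std::size_t n) {
    while (n > 0) {
      if (pos_ == end_) {
        end_ = refill_(buf_.data(), buf_.size());
        pos_ = 0;
        if (end_ == 0) throw std::runtime_error("unexpected end of model data");
      }
      const std::size_t take = std::min(n, end_ - pos_);
      std::memcpy(dst, buf_.data() + pos_, take);
      pos_ += take;
      dst += take;
      n -= take;
    }
  }

 private:
  Refill refill_;
  std::vector<char> buf_;
  std::size_t pos_ = 0;
  std::size_t end_ = 0;
};

// Reads `count` native-endian floats and rejects NaN/Inf values.
inline std::vector<float> readValidatedFloats(BoundedReader& in, std::size_t count) {
  std::vector<float> out(count);
  in.readExact(reinterpret_cast<char*>(out.data()), count * sizeof(float));
  for (float v : out) {
    if (!std::isfinite(v)) throw std::runtime_error("non-finite weight");
  }
  return out;
}
```

The destination vector is still sized to the tensor being read; what stays bounded is the staging buffer between the decompressor and the parser.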

Measured result (b18c384nbt, 19x19, ANE path)

| Metric | master | this branch | Δ |
| --- | --- | --- | --- |
| Idle steady-state RSS | 0.588 GB | 0.194 GB | −67% |
| Peak RSS (load+convert) | 0.865 GB | 0.483 GB | −44% |

Steady-state drop is A3 freeing the FP32 weights; peak drop combines A1+A2+A3. Larger nets (e.g. b40c768) scale the absolute savings up.
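The A3 behaviour (free the weights once, only on an all-ANE configuration, under a lock, keeping scalar dims) might look like the sketch below. All names are hypothetical simplifications: `ModelDescSketch` stands in for ModelDesc, `computeAneOnly` for however the aneOnly flag is derived, and the ANE device index is passed in as a parameter because the actual value of METAL_MUX_ANE is not stated here. Treating an empty device list as "not ANE-only" is an assumption of this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <mutex>
#include <vector>

// Hypothetical, heavily simplified model description: one weight array plus a scalar dim.
struct ModelDescSketch {
  int numChannels = 0;         // scalar metadata, still readable after release
  std::vector<float> weights;  // bulk FP32 data
  bool weightsReleased = false;

  // Idempotent: a second call is a no-op.
  void releaseWeights() {
    if (weightsReleased) return;
    // Swap with an empty vector so the allocation is returned, not just size reset.
    std::vector<float>().swap(weights);
    weightsReleased = true;
    assert(weights.empty());
  }
};

// True only when every configured device is the ANE device.
// Assumption of this sketch: an empty device list is treated as not ANE-only.
inline bool computeAneOnly(const std::vector<int>& deviceIdxs, int aneIdx) {
  if (deviceIdxs.empty()) return false;
  return std::all_of(deviceIdxs.begin(), deviceIdxs.end(),
                     [aneIdx](int d) { return d == aneIdx; });
}

// Serializes the release so concurrent handle creation cannot race on the weights.
inline void releaseIfAneOnly(bool aneOnly, ModelDescSketch& desc, std::mutex& m) {
  std::lock_guard<std::mutex> lock(m);
  if (aneOnly) desc.releaseWeights();
}
```

The all-devices check matters because a mixed configuration still needs the FP32 weights for the GPU path; releasing them there would leave that path with nothing to upload.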

Test plan

Clean Metal rebuild (cmake -DUSE_BACKEND=METAL -G Ninja && ninja)
./katago runtests and ./katago runnnlayertests pass
./katago testgpuerror vs Eigen reference — GPU path parity (~0.00002% winrate err) and ANE path parity (avg 0.082% / max 0.186% winrate err, no NaN), proving the converter changes + weight-freeing produce correct inference
ANE-path RSS A/B (above); GPU path unaffected by A3 (aneOnly false)
Optional: re-measure on a 40-block net for the full conversion-peak win

Implementation plan included at docs/superpowers/plans/2026-05-30-port-coreml-conversion-memory-levers.md.

🤖 Generated with Claude Code

Cuts memory during the on-device KataGo -> CoreML conversion and while running the ANE/CoreML path, with byte-identical converter output:

- The converter's weight tensors become non-owning views into the parsed model instead of owning extra FP32 copies; derived/transposed tensors keep an owned buffer. This drops redundant resident weight copies during conversion. CoreML model serialization is made deterministic (SetSerializationDeterministic) so the output is byte-stable.
- The KataGo model parser streams the gzip through a bounded ~1 MB refill buffer instead of decompressing the whole file into memory, while preserving the existing NaN/Inf weight validation.
- ModelDesc gains releaseWeights(), which frees the in-memory weight arrays (keeping scalar shape metadata). The Metal backend calls it on the ANE (CoreML) path after converting from the model file on disk, gated by a new ComputeContext::aneOnly flag so it only fires when every configured device is ANE -- the GPU/MPSGraph path keeps its weights. The call is serialized under computeHandleMutex and only scalar dims are read afterward.

Measured on b18c384nbt (19x19) over the ANE path: idle steady-state RSS 0.59 GB -> 0.19 GB; peak (load+convert) 0.87 GB -> 0.48 GB. Cross-backend parity vs an Eigen reference is unchanged on both the GPU and ANE paths.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The build-macos-metal job cached CMakeCache.txt/build.ninja keyed only on CMakeLists.txt hashes. Those cached files bake in version-pinned Homebrew Cellar paths from pkg-config (e.g. -L/opt/homebrew/Cellar/protobuf/<ver>/lib). When Homebrew bumps protobuf/abseil, those paths vanish; an incremental `cmake .` does not refresh the cached pkg-config vars, so the re-link fails with "ld: library 'protobuf' not found".

Include the installed protobuf/abseil versions in the cache key and scope restore-keys to those versions, so a dependency bump invalidates the stale configure and forces a fresh pkg-config resolution instead of resurrecting dead Cellar paths.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

ChinChangYang · 2026-05-30T09:15:33Z

Moved to lightvector#1202

ChinChangYang force-pushed the feature/coreml-conversion-memory-levers branch from 6b52a56 to d694da2 on May 30, 2026 07:30

ChinChangYang and others added 2 commits May 30, 2026 17:11

ChinChangYang force-pushed the feature/coreml-conversion-memory-levers branch from 85031b3 to eda3c06 on May 30, 2026 09:12

ChinChangYang closed this May 30, 2026

ChinChangYang wants to merge 2 commits into master from feature/coreml-conversion-memory-levers
