Reduce CoreML conversion peak + ANE steady-state memory (port PR #27 C++ levers to master) #28

Closed
ChinChangYang wants to merge 2 commits into master from feature/coreml-conversion-memory-levers

@ChinChangYang

Owner

Summary

Ports the C++ memory levers of #27 onto master to cut KataGo→CoreML conversion peak memory and ANE steady-state RSS. Output stays numerically equivalent (cross-backend parity preserved). Applied as three independent, separately-verified levers:

  • A1: converter weight tensors become non-owning views into the parsed model, which drops 2 of the 3 resident FP32 weight copies; derived and transposed tensors are backed by an owned-buffer deque. Protobuf serialization is also made deterministic (SetSerializationDeterministic(true)).
  • A2: KataGoParser streams the gzip through a bounded ~1 MB refill buffer instead of holding the entire decompressed file in memory; master's NaN/Inf weight validation is preserved.
  • A3: ModelDesc::releaseWeights() frees the engine's in-memory weight vectors after the ANE/CoreML path converts from disk. It is guarded by a new ComputeContext::aneOnly flag so it fires only when every configured device is METAL_MUX_ANE (the GPU/MPSGraph path keeps its weights). The call is serialized under computeHandleMutex and is idempotent; only scalar dims are read afterward.
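
As a rough illustration of the A1 idea (not the converter's actual code), the sketch below hands out pointer-plus-length views into weights the parsed model already holds, and keeps owned storage only for derived tensors such as a transpose. All type and function names here are hypothetical. The owned buffers live in a std::deque because appending to a deque does not relocate the elements already in it, so earlier views stay valid.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <utility>
#include <vector>

// Hypothetical names throughout; this is not the converter's real API.

// Non-owning view of FP32 weights: a pointer and a length, no copy.
struct WeightView {
    const float* data;
    std::size_t size;
};

class WeightStore {
public:
    // Borrow weights that already live in the parsed model.
    // The parsed model must outlive every view handed out here.
    WeightView borrow(const std::vector<float>& parsed) const {
        return WeightView{parsed.data(), parsed.size()};
    }

    // A derived tensor (here a rows x cols transpose) has no backing storage
    // in the parsed model, so the store owns its buffer.
    WeightView transposed(const WeightView& src, std::size_t rows, std::size_t cols) {
        assert(src.size == rows * cols);
        std::vector<float> out(src.size);
        for (std::size_t r = 0; r < rows; ++r) {
            for (std::size_t c = 0; c < cols; ++c) {
                out[c * rows + r] = src.data[r * cols + c];
            }
        }
        // push_back on a deque leaves existing elements in place,
        // so views returned earlier remain valid.
        owned_.push_back(std::move(out));
        const std::vector<float>& kept = owned_.back();
        return WeightView{kept.data(), kept.size()};
    }

    std::size_t ownedCount() const { return owned_.size(); }

private:
    std::deque<std::vector<float>> owned_;
};
```

With this shape, only tensors that genuinely need new data cost extra memory; everything else is a borrowed pointer whose lifetime is tied to the parsed model.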

Why this differs from #27

#27 is based on the diverged ios-dev; this is a hand-port onto master. Excluded by request: iOS entitlements (A4), the new runcoremlconverttests suite, and #27's docs. A few review-driven hardenings were added on top of the verbatim port (gzFile leak-on-bad_alloc fix, clearWeights/m_owned consistency, documented load-bearing invariants).
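
A minimal sketch of how a bounded refill buffer and a leak-safe file handle might fit together, assuming nothing about the real KataGoParser. To keep it buildable without zlib, std::FILE and std::fread stand in for zlib's gzFile and gzread; the class and type names are hypothetical. The handle is held by a std::unique_ptr member declared before the buffer, so if allocating the buffer throws std::bad_alloc the file is still closed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical sketch: std::FILE / std::fread stand in for gzFile / gzread.

struct FileCloser {
    void operator()(std::FILE* f) const { if (f) std::fclose(f); }
};
using FileHandle = std::unique_ptr<std::FILE, FileCloser>;

class BoundedReader {
public:
    // handle_ is initialized before buf_. If the buffer allocation throws,
    // the already-constructed handle_ is destroyed and the file is closed.
    BoundedReader(FileHandle handle, std::size_t capacity)
        : handle_(std::move(handle)), buf_(capacity) {
        assert(capacity > 0);
    }

    // Copy exactly n bytes into dst, refilling the fixed buffer as needed.
    // Returns false if the source ends before n bytes were delivered.
    bool read(unsigned char* dst, std::size_t n) {
        while (n > 0) {
            if (pos_ == len_ && !refill()) return false;
            const std::size_t take = std::min(n, len_ - pos_);
            std::memcpy(dst, buf_.data() + pos_, take);
            pos_ += take;
            dst += take;
            n -= take;
        }
        return true;
    }

    std::size_t capacity() const { return buf_.size(); }

private:
    // Overwrite the same buffer with the next chunk of the stream.
    bool refill() {
        len_ = std::fread(buf_.data(), 1, buf_.size(), handle_.get());
        pos_ = 0;
        return len_ > 0;
    }

    FileHandle handle_;               // declared first, see constructor note
    std::vector<unsigned char> buf_;  // fixed capacity, reused on every refill
    std::size_t pos_ = 0;
    std::size_t len_ = 0;
};
```

A caller would construct the reader with a capacity of about 1 MB (for example `std::size_t{1} << 20`) and pull each tensor's bytes through `read`, so resident memory for the input stream stays at the buffer size regardless of how large the decompressed model is.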

Measured result (b18c384nbt, 19x19, ANE path)

| Metric | master | this branch | Δ |
| --- | --- | --- | --- |
| Idle steady-state RSS | 0.588 GB | 0.194 GB | −67% |
| Peak RSS (load+convert) | 0.865 GB | 0.483 GB | −44% |

The steady-state drop comes from A3 freeing the FP32 weights; the peak drop combines A1, A2, and A3. Larger nets (e.g. b40c768) should scale the absolute savings up.

Test plan

  • Clean Metal rebuild (cmake -DUSE_BACKEND=METAL -G Ninja && ninja)
  • ./katago runtests and ./katago runnnlayertests pass
  • ./katago testgpuerror vs Eigen reference — GPU path parity (~0.00002% winrate err) and ANE path parity (avg 0.082% / max 0.186% winrate err, no NaN), proving the converter changes + weight-freeing produce correct inference
  • ANE-path RSS A/B (above); GPU path unaffected by A3 (aneOnly false)
  • Optional: re-measure on a 40-block net for the full conversion-peak win

Implementation plan included at docs/superpowers/plans/2026-05-30-port-coreml-conversion-memory-levers.md.

🤖 Generated with Claude Code

@ChinChangYang ChinChangYang force-pushed the feature/coreml-conversion-memory-levers branch from 6b52a56 to d694da2 Compare May 30, 2026 07:30
ChinChangYang and others added 2 commits May 30, 2026 17:11
Cuts memory during the on-device KataGo -> CoreML conversion and while
running the ANE/CoreML path, with byte-identical converter output:

- The converter's weight tensors become non-owning views into the parsed
  model instead of owning extra FP32 copies; derived/transposed tensors keep
  an owned buffer. This drops redundant resident weight copies during
  conversion. CoreML model serialization is made deterministic
  (SetSerializationDeterministic) so the output is byte-stable.

- The KataGo model parser streams the gzip through a bounded ~1 MB refill
  buffer instead of decompressing the whole file into memory, while
  preserving the existing NaN/Inf weight validation.

- ModelDesc gains releaseWeights(), which frees the in-memory weight arrays
  (keeping scalar shape metadata). The Metal backend calls it on the ANE
  (CoreML) path after converting from the model file on disk, gated by a new
  ComputeContext::aneOnly flag so it only fires when every configured device
  is ANE -- the GPU/MPSGraph path keeps its weights. The call is serialized
  under computeHandleMutex and only scalar dims are read afterward.
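
The gating described in the last item can be sketched roughly as follows. This is a simplified illustration, not the Metal backend's code: only the names releaseWeights and aneOnly come from the description above, and every other type, field, and function is hypothetical.

```cpp
#include <cassert>
#include <mutex>
#include <vector>

// Hypothetical, simplified types for illustration only.

struct LayerDesc {
    int inChannels = 0;          // scalar shape metadata, kept after release
    int outChannels = 0;
    std::vector<float> weights;  // FP32 weights, freed on release
};

struct ModelDescSketch {
    std::vector<LayerDesc> layers;
    bool weightsReleased = false;

    void releaseWeights() {
        if (weightsReleased) return;  // idempotent: a second call is a no-op
        for (LayerDesc& layer : layers) {
            // Swapping with an empty temporary hands the allocation to the
            // temporary, which frees it; clear() alone keeps the capacity.
            std::vector<float>().swap(layer.weights);
        }
        weightsReleased = true;
    }
};

enum class Device { Gpu, Ane };

// True only when at least one device is configured and all of them are ANE.
inline bool computeAneOnly(const std::vector<Device>& devices) {
    if (devices.empty()) return false;
    for (Device d : devices) {
        if (d != Device::Ane) return false;
    }
    return true;
}

struct ComputeContextSketch {
    bool aneOnly = false;
};

// Stand-in for the mutex that serializes handle creation.
inline std::mutex& handleMutex() {
    static std::mutex m;
    return m;
}

// Called after the CoreML model has been converted from the file on disk.
inline void afterConvert(const ComputeContextSketch& ctx, ModelDescSketch& model) {
    std::lock_guard<std::mutex> lock(handleMutex());
    if (ctx.aneOnly) {
        model.releaseWeights();
    }
}
```

Because the flag is computed over all configured devices, a mixed GPU and ANE setup leaves the weights in place for the GPU path, and only the scalar fields are touched once the weights are gone.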

Measured on b18c384nbt (19x19) over the ANE path: idle steady-state RSS
0.59 GB -> 0.19 GB; peak (load+convert) 0.87 GB -> 0.48 GB. Cross-backend
parity vs an Eigen reference is unchanged on both the GPU and ANE paths.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The build-macos-metal job cached CMakeCache.txt/build.ninja keyed only on
CMakeLists.txt hashes. Those cached files bake in version-pinned Homebrew
Cellar paths from pkg-config (e.g. -L/opt/homebrew/Cellar/protobuf/<ver>/lib).
When Homebrew bumps protobuf/abseil, those paths vanish; an incremental
`cmake .` does not refresh the cached pkg-config vars, so the re-link fails
with "ld: library 'protobuf' not found".

Include the installed protobuf/abseil versions in the cache key and scope
restore-keys to those versions, so a dependency bump invalidates the stale
configure and forces a fresh pkg-config resolution instead of resurrecting
dead Cellar paths.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@ChinChangYang ChinChangYang force-pushed the feature/coreml-conversion-memory-levers branch from 85031b3 to eda3c06 Compare May 30, 2026 09:12
@ChinChangYang

Owner Author

Moved to lightvector#1202