Apple Silicon (MPS) support + pipeline fixes and small perf wins #122

Open
maxwbuckley wants to merge 7 commits into Tencent:main from maxwbuckley:add-mps-support

@maxwbuckley maxwbuckley commented Apr 16, 2026

Summary

  • MPS backend support: adds Apple Silicon detection across inference.py, predict.py, dwpose_detector.py, pipeline_mimicmotion.py, and wholebody.py. Before this branch, MimicMotion did not run on Apple Silicon, for two reasons: device detection hard-coded cuda/cpu, and the pipeline offloaded models to CPU between stages (a strict loss on unified memory). After this branch it runs.
  • Latent bug fix on main: mimicmotion/utils/loader.py was calling torch.serialization.safe_globals(*allowed_modules), which raises TypeError: _safe_globals.__init__() takes 2 positional arguments but 3 were given on any modern torch (verified on 2.8 and 2.11). This means the safe_globals code path has been dead on main for any reasonably recent torch. Fixed to safe_globals(allowed_modules).
  • macOS dependency fallbacks: decord (x86-only wheels) and torchvision.io.write_video (ffmpeg dependency flaky on Apple Silicon) now gracefully fall back to OpenCV. Cross-verified on a CUDA host: the cv2 fallback produces the same frame shapes and effective strides as the decord path, and per-channel means that match within codec tolerance.
  • Small, bit-identical perf + hygiene improvements (details below).
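
The decord-to-OpenCV fallback summarized above can be sketched as a small backend chooser. This is a minimal illustration rather than the code in the branch: `pick_reader` and `strided_indices` are hypothetical names, and reading "stride" as "keep every Nth frame starting at frame 0" is an assumption.

```python
import importlib.util


def pick_reader(available=None):
    """Return "decord" if it can be imported, else "cv2", else raise.

    `available` is an optional set of module names (hypothetical test hook)
    so the choice can be exercised without installing either library.
    """
    def has(name):
        if available is not None:
            return name in available
        return importlib.util.find_spec(name) is not None

    if has("decord"):
        return "decord"
    if has("cv2"):
        return "cv2"
    raise ImportError("need either decord or opencv-python to read video")


def strided_indices(num_frames, stride):
    """Frame indices kept when sampling every `stride`-th frame from 0 (assumed semantics)."""
    if stride < 1:
        raise ValueError("stride must be >= 1")
    return list(range(0, num_frames, stride))
```

Keeping the index computation separate from the decoder makes it straightforward to check that both readers select the same frames.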

What actually changes

MPS / compatibility

| File | Change |
| --- | --- |
| inference.py, predict.py, mimicmotion/dwpose/dwpose_detector.py | Device detection now picks cuda > mps > cpu |
| mimicmotion/dwpose/wholebody.py | Provider selection handles torch.device objects and prefers CoreMLExecutionProvider when on MPS (falls back to CPU EP if unavailable) |
| mimicmotion/pipelines/pipeline_mimicmotion.py | Skips .cpu() / .to(device) round-trips outside CUDA: on unified memory these are strict loss; on CPU they're no-ops |
| mimicmotion/dwpose/preprocess.py | decord import is now optional; cv2 fallback reads + decodes frames, BGR→RGB, preserves stride semantics |
| mimicmotion/utils/utils.py | torchvision.io.write_video import optional; cv2 VideoWriter fallback for save_to_mp4 |
| environment_macos.yaml, configs/test_mps.yaml | New env file + a minimal test config tuned for MPS smoke-testing |
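
The device priority and the ONNX Runtime provider choice listed above could look roughly like the sketch below. The function names are hypothetical and the code is not taken from the branch; mapping CUDA to `CUDAExecutionProvider` is an assumption, since only the MPS-to-CoreML preference with CPU fallback is stated here.

```python
def pick_device(cuda_available, mps_available):
    """Apply the cuda > mps > cpu priority to two availability flags."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device():
    """Query PyTorch for the flags; torch is imported lazily on purpose."""
    import torch

    mps_ok = hasattr(torch.backends, "mps") and torch.backends.mps.is_available()
    return pick_device(torch.cuda.is_available(), mps_ok)


def pick_onnx_providers(device, available_providers):
    """Choose ONNX Runtime execution providers for a device.

    `device` may be a string such as "cuda:0" or an object with a `.type`
    attribute, as torch.device has.
    """
    device_type = getattr(device, "type", None) or str(device).split(":")[0]
    preferred = {
        "cuda": "CUDAExecutionProvider",  # assumption, not stated in the PR
        "mps": "CoreMLExecutionProvider",
    }.get(device_type)
    if preferred in available_providers:
        return [preferred, "CPUExecutionProvider"]
    return ["CPUExecutionProvider"]
```

In practice the second argument would come from `onnxruntime.get_available_providers()`; passing it in explicitly keeps the selection logic testable on any host.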

Bug fixes

  • loader.py safe_globals: safe_globals(*allowed_modules) → safe_globals(allowed_modules). The splat form has been broken on main for any torch where safe_globals exists as a context manager; with the list form the call succeeds and the codepath is now actually reachable.
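
Why the splat fails can be shown without torch, using a stand-in context manager whose constructor takes a single list. `SafeGlobalsStandIn` is hypothetical and is not torch's implementation; it only mirrors the call shape, and the list entries are placeholders.

```python
class SafeGlobalsStandIn:
    """Stand-in for a context manager that accepts one list argument.

    NOT torch's implementation; it only mirrors the constructor shape.
    """

    def __init__(self, safe_globals):
        self.safe_globals = list(safe_globals)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        return False


allowed_modules = [dict, list]  # placeholder entries for illustration

# Correct: the whole list is passed as one positional argument.
with SafeGlobalsStandIn(allowed_modules) as ctx:
    accepted = ctx.safe_globals

# Broken: the splat passes each entry as its own positional argument.
try:
    SafeGlobalsStandIn(*allowed_modules)
    splat_error = None
except TypeError as err:
    splat_error = str(err)
```

With a one-element list the splat would pass exactly one argument and not raise at the call site, so the failure only shows up once two or more entries are allowed.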

Memory / loading improvements (required for MPS, beneficial on CUDA)

These changes were necessary to fit the full pipeline within the 48 GB unified memory of an Apple M3 Max — without them, peak memory during model loading exceeded what was available. They also reduce peak memory on CUDA hosts.

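  • load_state_dict(..., assign=True) in loader.py: assigns the checkpoint tensors to the module's parameters directly instead of copying their values into the already-allocated parameters, so the randomly initialized parameters and the checkpoint tensors are not both held in memory.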
  • del checkpoint after load_state_dict and del mimicmotion_models after pipeline construction: releases intermediate references earlier so Python can free them.
    • Caveat: with assign=True, parameter dtype follows the checkpoint rather than the module constructor default. The standard inference.py path is unaffected, because it calls torch.set_default_dtype(torch.float16) before create_pipeline, so parameters end up fp16 both before and after this branch. Anyone calling create_pipeline directly without setting the default dtype may get different parameter dtypes than before.

Small perf wins (bit-identical outputs)

  • Hoist triangular tile-blend weight out of the denoise loop (pipeline_mimicmotion.py): the tensor weight = torch.minimum(w, 2 - w), with w = (torch.arange(tile_size) + 0.5) * 2 / tile_size, depends only on tile_size and device, both fixed for the whole call. It was being rebuilt on every denoise iteration and is now computed once.
  • Drop redundant pose_pixels.copy() (inference.py): the array comes directly from np.concatenate, which already returns a fresh contiguous array. Removes one full-frame allocation per inference.
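
The hoisting change can be illustrated with a stripped-down loop. `tile_blend_weight` and `denoise_sketch` are hypothetical names, and the loop body is a placeholder for the real per-step tile work; only the weight formula follows the description above.

```python
import torch


def tile_blend_weight(tile_size, device="cpu"):
    """Triangular weight: rises towards the tile centre, then falls symmetrically."""
    w = (torch.arange(tile_size, device=device) + 0.5) * 2.0 / tile_size
    return torch.minimum(w, 2.0 - w)


def denoise_sketch(num_steps, tile_size):
    # Hoisted: the weight depends only on tile_size and device, so build it once.
    weight = tile_blend_weight(tile_size)
    accumulated = torch.zeros(tile_size)
    for _ in range(num_steps):
        # Placeholder for the per-step tile work; only the reuse of
        # `weight` matters for this illustration.
        accumulated += weight
    return accumulated
```

Since the same tensor values are used on every iteration either way, moving the construction out of the loop leaves the outputs unchanged.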

End-to-end validation on CUDA (RTX 5090, torch 2.11.0+cu128)

Ran python inference.py --inference_config configs/test.yaml against pose1.mp4 (530 frames) + demo1.jpg at 576×1024 with 25 denoise steps.

  • Exit code: 0
  • Output: outputs/pose1_*.mp4, 11.18 MB, 530 frames, 576×1024 @ 15 fps, all frames decode cleanly
  • Wall clock: 15:50.62
  • Zero occurrences of Error, Traceback, OOM, CUDA error in the full log
  • Per-step denoise time stable at ~3.78 s across all 200 loop iterations — no drift, no leak

Resource trace (784 one-second samples)

| metric | p50 | p95 | peak |
| --- | --- | --- | --- |
| GPU util | 100 % | 100 % | 100 % |
| VRAM | 17.7 GB | 25.2 GB | 25.2 GB |
| Power | 571 W | 579 W | 583 W |
| GPU temp | 75 °C | 77 °C | 80 °C |
| System RAM (process RSS max) | | | 22.76 GB |

VRAM peak (~25 GB) lands during VAE decode, not denoising; denoising is stable at ~17.7 GB. Worth noting for anyone running on a 24 GB card: this workload fits on 32 GB, but the measured 25.2 GB peak is above what a 24 GB card provides.
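
For readers who want to produce the same kind of summary from their own one-second samples, a minimal sketch follows. How the percentiles in this PR were computed is not stated, so the nearest-rank definition used here is an assumption, and all function names are hypothetical.

```python
import math


def nearest_rank(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples):
    """Collapse a trace into the p50 / p95 / peak columns."""
    return {
        "p50": nearest_rank(samples, 50),
        "p95": nearest_rank(samples, 95),
        "peak": max(samples),
    }


def fits(peak_gb, card_gb):
    """True when the measured peak is within the card's nominal capacity."""
    return peak_gb <= card_gb
```

Applied to the figures above, `fits(25.2, 32)` holds and `fits(25.2, 24)` does not; usable VRAM is typically somewhat below the nominal capacity, so the check is optimistic.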

Fuzz tests run on this rig

Beyond the end-to-end check:

  • save_to_mp4 cv2 fallback: 30 randomized shape/fps/length combos + pathlib.Path input — 30/30 pass, all outputs round-trip through cv2.
  • get_video_pose cv2 branch: 20 randomized synthetic videos — verified frame count, shape, BGR→RGB ordering, stride semantics, end-to-end with a stubbed DWPose detector. 20/20 pass.
  • Cross-backend comparison: decord vs cv2 reader on the same inputs — 20/20 match on shape, effective stride, and per-channel means (within mp4v/YUV 4:2:0 codec noise).
  • load_state_dict(assign=True) semantics: partial state dicts preserve untouched params, Parameter identity preserved, requires_grad preserved, fp16/fp32 forward passes work on CUDA.
  • Device detection + offload gating for cpu / cuda / mps / torch.device objects.
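
The offload gating check in the last item can be sketched as a pure function plus a table of cases. `device_type` and `should_offload` are hypothetical names, and `FakeDevice` stands in for a torch.device so the check runs without torch installed.

```python
def device_type(device):
    """Normalize "cuda:0", "mps", or a torch.device-like object to its type string."""
    kind = getattr(device, "type", None)
    if kind is None:
        kind = str(device).split(":")[0]
    return kind


def should_offload(device):
    """Only CUDA runs move idle models to CPU; mps and cpu skip the round-trip."""
    return device_type(device) == "cuda"


class FakeDevice:
    """Minimal stand-in exposing the `.type` attribute that torch.device has."""

    def __init__(self, kind):
        self.type = kind


CASES = [
    ("cuda", True),
    ("cuda:1", True),
    ("mps", False),
    ("cpu", False),
    (FakeDevice("cuda"), True),
    (FakeDevice("mps"), False),
]
results = [should_offload(dev) == expected for dev, expected in CASES]
```

Every entry in `results` should be true; adding a new backend means adding one row to the table.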

Known limitations / things intentionally not in scope

  • End-to-end validation covers two machines: an Apple M3 Max (MPS backend) and a Windows machine with an RTX 5090 (CUDA backend).
  • Several additional optimization opportunities were identified during this work (PoseNet output caching across timesteps, CFG UNet batching, latent pre-allocation, torch.compile, VAE decode streaming, etc.) and are intentionally not part of this PR so it stays focused and reviewable. They'll be handled in follow-ups.
  • No changes to scheduler, UNet architecture, or training code.

Test plan

  • Branch imports cleanly on a host with neither decord nor torchvision preinstalled
  • Static syntax check (py_compile) on all touched files
  • Fuzz test save_to_mp4 cv2 fallback (30/30)
  • Fuzz test get_video_pose cv2 fallback (20/20)
  • Cross-verify decord vs cv2 reader produces equivalent frame tensors (20/20)
  • Verify load_state_dict(assign=True) semantics on a synthetic module with mixed dtypes
  • Verify device detection + _offload gating for cpu / cuda / mps / torch.device
  • End-to-end inference.py run on CUDA (RTX 5090, Windows) with real weights — exit 0, valid mp4 output
  • End-to-end inference.py run on MPS (M3 Max, macOS) with real weights
  • Monitor wall clock, GPU util, VRAM, power, RAM throughout the CUDA run
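
The static syntax check in the plan can be reproduced with the standard library alone. The helper below is a sketch with a hypothetical name; the list of touched files would be supplied by the caller.

```python
import py_compile


def check_syntax(paths):
    """Byte-compile each file; return a list of (path, error message) failures."""
    failures = []
    for path in paths:
        try:
            py_compile.compile(path, doraise=True)
        except py_compile.PyCompileError as err:
            failures.append((path, err.msg))
    return failures
```

An empty return value means every file compiled; this catches syntax errors only, not missing imports or runtime failures.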

🤖 Generated with Claude Code

maxwbuckley and others added 7 commits April 9, 2026 10:46
Enable MimicMotion to run on macOS with Apple Silicon GPUs via PyTorch MPS.

- Add MPS device detection alongside CUDA in inference.py, predict.py, and
  dwpose_detector.py
- Map MPS to CoreMLExecutionProvider (with CPU fallback) for ONNX-based
  DWPose inference
- Guard torch.cuda.empty_cache() behind device type check in pipeline
- Add cv2 fallback for decord (no macOS ARM wheels) and write_video
  (removed in newer torchvision)
- Fix safe_globals call in loader.py (takes list, not unpacked args)
- Add environment_macos.yaml and configs/test_mps.yaml for Mac testing

Verified end-to-end: 530-frame video generated at 384x640 on M3 Max.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Delete checkpoint dict immediately after load_state_dict to free ~1.5GB
- Delete MimicMotionModel wrapper after extracting submodels
- Skip model CPU offloading on MPS (unified memory makes it pointless
  and wastes time on .to() calls)
- Update test_mps.yaml with minimal config (stride=64, 2 steps, 256p)
  for fast iteration

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
load_state_dict(assign=True) directly swaps parameter tensors from the
checkpoint into the model instead of copying data. This avoids holding
both the random-init UNet params and checkpoint params simultaneously.

Measured: RSS at model load drops from 6.04 GB to 3.89 GB (-2.15 GB).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Internal analysis document — not intended for upstream PR.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…t copy

- pipeline_mimicmotion.py: the triangular tile-blend `weight` tensor depends
  only on `tile_size` and `device`, both fixed for the whole call. It was
  being recomputed (arange + minimum) on every one of the 25 denoise steps.
  Move it above the timestep loop. Bit-identical output.
- inference.py: `pose_pixels` comes directly from `np.concatenate`, which
  already returns a fresh contiguous array, so the `.copy()` before
  `torch.from_numpy` was a duplicate allocation. Remove it.

Verified on RTX 5090 with configs/test.yaml end-to-end: 530-frame output
mp4 produced, exit 0, no errors, steady 3.78 s/step across all 200 denoise
iterations.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Kept locally outside the repo; the content is not part of this PR's
shippable changes and is better tracked separately from the code.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
cv2 was imported unconditionally but missing from environment_macos.yaml.
write_video import succeeds without PyAV but fails at runtime — now
catches that and falls through to the cv2 path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>