Apple Silicon (MPS) support + pipeline fixes and small perf wins by maxwbuckley · Pull Request #122 · Tencent/MimicMotion

maxwbuckley · 2026-04-16T09:34:20Z

Summary

MPS backend support: adds Apple Silicon detection across inference.py, predict.py, dwpose_detector.py, pipeline_mimicmotion.py, and wholebody.py. Before this branch, attempting to run MimicMotion on an Apple Silicon machine simply didn't work — both because device detection hard-coded cuda/cpu and because the pipeline tried to offload models to CPU between stages (a strict loss on unified memory). After this branch it runs.
Latent bug fix on main: mimicmotion/utils/loader.py was calling torch.serialization.safe_globals(*allowed_modules), which raises TypeError: _safe_globals.__init__() takes 2 positional arguments but 3 were given on any modern torch (verified on 2.8 and 2.11). This means the safe_globals code path has been dead on main for any reasonably recent torch. Fixed to safe_globals(allowed_modules).
macOS dependency fallbacks: decord (x86-only wheels) and torchvision.io.write_video (ffmpeg dependency flaky on Apple Silicon) now gracefully fall back to OpenCV. Cross-verified on a CUDA host where the cv2 fallback produces identical frame shapes / effective strides / per-channel means vs. the decord path within codec tolerance.
Small, bit-identical perf + hygiene improvements (details below).

What actually changes

MPS / compatibility

File	Change
`inference.py`, `predict.py`, `mimicmotion/dwpose/dwpose_detector.py`	Device detection now picks cuda > mps > cpu
`mimicmotion/dwpose/wholebody.py`	Provider selection handles `torch.device` objects and prefers `CoreMLExecutionProvider` when on MPS (falls back to CPU EP if unavailable)
`mimicmotion/pipelines/pipeline_mimicmotion.py`	Skips `.cpu()` / `.to(device)` round-trips outside CUDA — on unified memory these are strict loss; on CPU they're no-ops
`mimicmotion/dwpose/preprocess.py`	`decord` import is now optional; cv2 fallback reads + decodes frames, BGR→RGB, preserves stride semantics
`mimicmotion/utils/utils.py`	`torchvision.io.write_video` import optional; cv2 `VideoWriter` fallback for `save_to_mp4`
`environment_macos.yaml`, `configs/test_mps.yaml`	New env file + a minimal test config tuned for MPS smoke-testing
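
The device priority, the CUDA-only offload gating, and the provider preference from the table above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: the helper names (`pick_device_name`, `detect_device_name`, `should_offload_to_cpu`, `pick_onnx_providers`) are hypothetical, and whether the branch passes one or several providers to ONNX Runtime is not stated in the description. Only `torch.cuda.is_available()`, `torch.backends.mps.is_available()`, and the execution provider names are real.

```python
def pick_device_name(cuda_available: bool, mps_available: bool) -> str:
    """Apply the cuda > mps > cpu priority to two availability flags."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device_name() -> str:
    """Ask PyTorch which backends are usable; report "cpu" if torch is missing."""
    try:
        import torch
    except ImportError:
        return "cpu"
    # torch.backends.mps is absent on very old builds, so look it up defensively.
    mps_backend = getattr(torch.backends, "mps", None)
    mps_available = mps_backend is not None and mps_backend.is_available()
    return pick_device_name(torch.cuda.is_available(), mps_available)


def should_offload_to_cpu(device) -> bool:
    """Offload models between stages only on CUDA.

    Accepts a string such as "cuda:0" or any object with a `.type`
    attribute (for example a torch.device).
    """
    device_type = getattr(device, "type", None) or str(device).split(":")[0]
    return device_type == "cuda"


def pick_onnx_providers(device_type: str, available: list[str]) -> list[str]:
    """Prefer the accelerator provider when present, otherwise use the CPU provider."""
    preferred = {
        "cuda": "CUDAExecutionProvider",
        "mps": "CoreMLExecutionProvider",
    }.get(device_type)
    if preferred in available:
        return [preferred, "CPUExecutionProvider"]
    return ["CPUExecutionProvider"]
```

In a real setup the `available` list would typically come from `onnxruntime.get_available_providers()`.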

Bug fixes

loader.py safe_globals: safe_globals(*allowed_modules) → safe_globals(allowed_modules). The splat form has been broken on main for any torch where safe_globals exists as a context manager; the branch value works and the codepath is now actually reachable.
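
The failure mode can be reproduced without PyTorch. The sketch below uses a hypothetical stand-in class, `SafeGlobalsStandIn`, whose constructor takes a single list argument, as the error message quoted in the summary implies. It is not the real `torch.serialization.safe_globals`, and `ModuleA` / `ModuleB` are placeholder classes.

```python
class SafeGlobalsStandIn:
    """Hypothetical stand-in: a context manager whose constructor takes one list."""

    def __init__(self, safe_globals):
        self.safe_globals = list(safe_globals)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        return False  # do not swallow exceptions


class ModuleA:
    pass


class ModuleB:
    pass


allowed_modules = [ModuleA, ModuleB]

# Splat form: each list element becomes its own positional argument,
# so a two-element list passes three arguments including self.
try:
    SafeGlobalsStandIn(*allowed_modules)
    splat_error = None
except TypeError as err:
    splat_error = str(err)

# List form: the whole list is passed as the single expected argument.
with SafeGlobalsStandIn(allowed_modules) as ctx:
    registered = ctx.safe_globals
```

With the real API the corrected call has the same shape: `with torch.serialization.safe_globals(allowed_modules):` wrapped around the `torch.load(...)` call.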

Memory / loading improvements (required for MPS, beneficial on CUDA)

These changes were necessary to fit the full pipeline within the 48 GB unified memory of an Apple M3 Max — without them, peak memory during model loading exceeded what was available. They also reduce peak memory on CUDA hosts.

load_state_dict(..., assign=True) in loader.py: assigns the checkpoint tensors to the module's parameters directly instead of copying them into the already-allocated ones, so the random-init weights and the checkpoint weights are not both held in memory at once.
del checkpoint after load_state_dict and del mimicmotion_models after pipeline construction: releases intermediate references earlier so Python can free them.
- Caveat to be aware of: assign=True causes parameter dtype to follow the checkpoint rather than the module constructor default. This is fine for the standard inference.py path (which sets torch.set_default_dtype(torch.float16) before create_pipeline, so both pre-branch and post-branch paths end up fp16), but anyone calling create_pipeline directly without setting the default dtype should be aware.

Small perf wins (bit-identical outputs)

Hoist triangular tile-blend weight out of the denoise loop (pipeline_mimicmotion.py): the weight = torch.minimum((torch.arange(tile_size)+0.5)*2/tile_size, 2-w) tensor depends only on tile_size and device, both fixed for the whole call. It was being rebuilt on every denoise iteration. Now computed once.
Drop redundant pose_pixels.copy() (inference.py): the array comes directly from np.concatenate, which already returns a fresh contiguous array. Removes one full-frame allocation per inference.
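
The hoisting change can be illustrated in plain Python. This sketch assumes that `2-w` in the expression above refers to the ramp `(arange(tile_size)+0.5)*2/tile_size` itself, which gives a triangular profile; the function names and the small `tile_size` are illustrative, and lists stand in for tensors.

```python
def triangular_weight(tile_size: int) -> list[float]:
    """Ramp up to the tile centre and back down (assumed reading of `2-w`)."""
    ramp = [(i + 0.5) * 2 / tile_size for i in range(tile_size)]
    return [min(w, 2 - w) for w in ramp]


def weights_rebuilt_each_step(num_steps: int, tile_size: int) -> list[list[float]]:
    """Pattern before the change: the weight is recomputed inside the loop."""
    seen = []
    for _ in range(num_steps):
        weight = triangular_weight(tile_size)  # depends only on tile_size
        seen.append(weight)
    return seen


def weights_hoisted(num_steps: int, tile_size: int) -> list[list[float]]:
    """Pattern after the change: compute once, reuse on every step."""
    weight = triangular_weight(tile_size)
    return [weight for _ in range(num_steps)]
```

Because the weight is a pure function of `tile_size` (and, in the real pipeline, the device), both patterns yield the same values on every step, which is consistent with the bit-identical claim.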

End-to-end validation on CUDA (RTX 5090, torch 2.11.0+cu128)

Ran python inference.py --inference_config configs/test.yaml against pose1.mp4 (530 frames) + demo1.jpg at 576×1024 with 25 denoise steps.

Exit code: 0
Output: outputs/pose1_*.mp4, 11.18 MB, 530 frames, 576×1024 @ 15 fps, all frames decode cleanly
Wall clock: 15:50.62
Zero occurrences of Error, Traceback, OOM, CUDA error in the full log
Per-step denoise time stable at ~3.78 s across all 200 loop iterations — no drift, no leak

Resource trace (784 one-second samples)

metric	p50	p95	peak
GPU util	100 %	100 %	100 %
VRAM	17.7 GB	25.2 GB	25.2 GB
Power	571 W	579 W	583 W
GPU temp	75 °C	77 °C	80 °C
System RAM (process RSS max)	—	—	22.76 GB

VRAM peak (~25 GB) lands during VAE decode, not denoising. Denoising is stable at ~17.7 GB. Worth noting for anyone running on a 24 GB card — this workload fits on 32 GB but not comfortably on 24 GB.

Fuzz tests run on this rig

Beyond the end-to-end check:

save_to_mp4 cv2 fallback: 30 randomized shape/fps/length combos + pathlib.Path input — 30/30 pass, all outputs round-trip through cv2.
get_video_pose cv2 branch: 20 randomized synthetic videos — verified frame count, shape, BGR→RGB ordering, stride semantics, end-to-end with a stubbed DWPose detector. 20/20 pass.
Cross-backend comparison: decord vs cv2 reader on the same inputs — 20/20 match on shape, effective stride, and per-channel means (within mp4v/YUV 4:2:0 codec noise).
load_state_dict(assign=True) semantics: partial state dicts preserve untouched params, Parameter identity preserved, requires_grad preserved, fp16/fp32 forward passes work on CUDA.
Device detection + offload gating for cpu / cuda / mps / torch.device objects.
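
For the decord vs cv2 comparison listed above, both readers have to select the same frames and deliver them in RGB order. A minimal sketch of an optional-import reader with a cv2 fallback is shown below. The names `frame_indices` and `read_rgb_frames` are hypothetical, and any fps-based stride adjustment the real `preprocess.py` may apply is not modeled here.

```python
try:
    import decord  # optional: wheels are not available on every platform
except ImportError:
    decord = None


def frame_indices(num_frames: int, stride: int) -> list[int]:
    """Indices of the frames kept when sampling every `stride`-th frame."""
    return list(range(0, num_frames, stride))


def read_rgb_frames(video_path: str, stride: int = 1) -> list:
    """Return every `stride`-th frame as an RGB array, using decord when present."""
    if decord is not None:
        reader = decord.VideoReader(video_path)
        batch = reader.get_batch(frame_indices(len(reader), stride)).asnumpy()
        return list(batch)  # decord already yields RGB frames

    import cv2  # imported lazily so this module loads even without OpenCV

    capture = cv2.VideoCapture(video_path)
    frames, index = [], 0
    try:
        while True:
            ok, frame_bgr = capture.read()
            if not ok:
                break
            if index % stride == 0:
                # OpenCV decodes to BGR; convert to match the decord path.
                frames.append(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
            index += 1
    finally:
        capture.release()
    return frames
```

Decoding sequentially and keeping frames where `index % stride == 0` selects the same positions as `frame_indices`, which is the property a shape and stride comparison between the two backends would check.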

Known limitations / things intentionally not in scope

End-to-end validation covers only an Apple M3 Max (MPS backend) and a Windows machine with an RTX 5090 (CUDA backend).
Several additional optimization opportunities were identified during this work (PoseNet output caching across timesteps, CFG UNet batching, latent pre-allocation, torch.compile, VAE decode streaming, etc.) and are intentionally not part of this PR so it stays focused and reviewable. They'll be handled in follow-ups.
No changes to scheduler, UNet architecture, or training code.

Test plan

🤖 Generated with Claude Code

Enable MimicMotion to run on macOS with Apple Silicon GPUs via PyTorch MPS.
- Add MPS device detection alongside CUDA in inference.py, predict.py, and dwpose_detector.py
- Map MPS to CoreMLExecutionProvider (with CPU fallback) for ONNX-based DWPose inference
- Guard torch.cuda.empty_cache() behind device type check in pipeline
- Add cv2 fallback for decord (no macOS ARM wheels) and write_video (removed in newer torchvision)
- Fix safe_globals call in loader.py (takes list, not unpacked args)
- Add environment_macos.yaml and configs/test_mps.yaml for Mac testing
Verified end-to-end: 530-frame video generated at 384x640 on M3 Max.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Delete checkpoint dict immediately after load_state_dict to free ~1.5GB
- Delete MimicMotionModel wrapper after extracting submodels
- Skip model CPU offloading on MPS (unified memory makes it pointless and wastes time on .to() calls)
- Update test_mps.yaml with minimal config (stride=64, 2 steps, 256p) for fast iteration
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

load_state_dict(assign=True) directly swaps parameter tensors from the checkpoint into the model instead of copying data. This avoids holding both the random-init UNet params and checkpoint params simultaneously. Measured: RSS at model load drops from 6.04 GB to 3.89 GB (-2.15 GB). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Internal analysis document — not intended for upstream PR. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…t copy
- pipeline_mimicmotion.py: the triangular tile-blend `weight` tensor depends only on `tile_size` and `device`, both fixed for the whole call. It was being recomputed (arange + minimum) on every one of the 25 denoise steps. Move it above the timestep loop. Bit-identical output.
- inference.py: `pose_pixels` comes directly from `np.concatenate`, which already returns a fresh contiguous array, so the `.copy()` before `torch.from_numpy` was a duplicate allocation. Remove it.
Verified on RTX 5090 with configs/test.yaml end-to-end: 530-frame output mp4 produced, exit 0, no errors, steady 3.78 s/step across all 200 denoise iterations.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Kept locally outside the repo; the content is not part of this PR's shippable changes and is better tracked separately from the code. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

cv2 was imported unconditionally but missing from environment_macos.yaml. write_video import succeeds without PyAV but fails at runtime — now catches that and falls through to the cv2 path. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

maxwbuckley and others added 7 commits April 9, 2026 10:46

[docs] add performance analysis and benchmark results

f428e61

Internal analysis document — not intended for upstream PR. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

[docs] remove PERF_ANALYSIS.md from branch

3b4b9b9

Kept locally outside the repo; the content is not part of this PR's shippable changes and is better tracked separately from the code. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

maxwbuckley wants to merge 7 commits into Tencent:main from maxwbuckley:add-mps-support
