Apple Silicon (MPS) support + pipeline fixes and small perf wins#122
Open
maxwbuckley wants to merge 7 commits into
Enable MimicMotion to run on macOS with Apple Silicon GPUs via PyTorch MPS.

- Add MPS device detection alongside CUDA in inference.py, predict.py, and dwpose_detector.py
- Map MPS to CoreMLExecutionProvider (with CPU fallback) for ONNX-based DWPose inference
- Guard torch.cuda.empty_cache() behind device type check in pipeline
- Add cv2 fallback for decord (no macOS ARM wheels) and write_video (removed in newer torchvision)
- Fix safe_globals call in loader.py (takes list, not unpacked args)
- Add environment_macos.yaml and configs/test_mps.yaml for Mac testing

Verified end-to-end: 530-frame video generated at 384x640 on M3 Max.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
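The provider mapping described in this commit could look roughly like the sketch below. It is not the code from `dwpose_detector.py` or `wholebody.py`; `device_kind` and `onnx_providers` are hypothetical helper names, and the `available` list is assumed to come from `onnxruntime.get_available_providers()` in real use.

```python
def device_kind(device) -> str:
    """Return 'cuda', 'mps' or 'cpu' for a torch.device-like object or a string."""
    # torch.device exposes .type; plain strings such as "cuda:0" are split on ":".
    return str(getattr(device, "type", device)).split(":")[0]


def onnx_providers(device, available):
    """Build an ordered ONNX Runtime provider list for the given device.

    `available` is the list of providers the local onnxruntime build supports
    (in real use: onnxruntime.get_available_providers()).
    """
    preferred = {
        "cuda": "CUDAExecutionProvider",
        "mps": "CoreMLExecutionProvider",
    }.get(device_kind(device))
    # Use the accelerator provider only if this build actually ships it.
    providers = [preferred] if preferred in available else []
    # The CPU provider is always appended as the fallback.
    providers.append("CPUExecutionProvider")
    return providers
```

The resulting list can be passed as the `providers` argument when constructing an `onnxruntime.InferenceSession`, which tries the entries in order.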
- Delete checkpoint dict immediately after load_state_dict to free ~1.5GB
- Delete MimicMotionModel wrapper after extracting submodels
- Skip model CPU offloading on MPS (unified memory makes it pointless and wastes time on .to() calls)
- Update test_mps.yaml with minimal config (stride=64, 2 steps, 256p) for fast iteration

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
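As a rough illustration of the offload gating in this commit, the sketch below offloads a module to CPU only when the device is CUDA. The names `should_offload` and `run_stage` are hypothetical and do not correspond to the pipeline's actual helpers; any object with `.to()` and `.cpu()` methods works as `module`.

```python
def device_type(device) -> str:
    """Return 'cuda', 'mps' or 'cpu' for a torch.device-like object or a string."""
    return str(getattr(device, "type", device)).split(":")[0]


def should_offload(device) -> bool:
    # Moving weights to host RAM only frees memory when the accelerator has
    # its own VRAM. On MPS the memory is shared, and on CPU the move does nothing.
    return device_type(device) == "cuda"


def run_stage(module, device, stage):
    """Run `stage(module)` on `device`, then offload only where it pays off."""
    module.to(device)
    result = stage(module)
    if should_offload(device):
        module.cpu()
    return result
```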
load_state_dict(assign=True) directly swaps parameter tensors from the checkpoint into the model instead of copying data. This avoids holding both the random-init UNet params and checkpoint params simultaneously.

Measured: RSS at model load drops from 6.04 GB to 3.89 GB (-2.15 GB).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Internal analysis document, not intended for upstream PR.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…t copy

- pipeline_mimicmotion.py: the triangular tile-blend `weight` tensor depends only on `tile_size` and `device`, both fixed for the whole call. It was being recomputed (arange + minimum) on every one of the 25 denoise steps. Move it above the timestep loop. Bit-identical output.
- inference.py: `pose_pixels` comes directly from `np.concatenate`, which already returns a fresh contiguous array, so the `.copy()` before `torch.from_numpy` was a duplicate allocation. Remove it.

Verified on RTX 5090 with configs/test.yaml end-to-end: 530-frame output mp4 produced, exit 0, no errors, steady 3.78 s/step across all 200 denoise iterations.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
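The hoisting change in this commit can be sketched as follows. This is an illustration rather than the pipeline code: `tile_weight`, `blend_per_step`, and `blend_hoisted` are hypothetical names, and the triangular weight is written as `min(w, 2 - w)`, which is an assumed reading of the arange + minimum computation.

```python
import torch


def tile_weight(tile_size: int, device="cpu") -> torch.Tensor:
    """Triangular blend weight: ramps up to the tile centre, then back down."""
    w = (torch.arange(tile_size, device=device) + 0.5) * 2.0 / tile_size
    return torch.minimum(w, 2.0 - w)


def blend_per_step(tiles_per_step, tile_size):
    # Before: the weight is rebuilt inside the loop on every step.
    return [tiles * tile_weight(tile_size) for tiles in tiles_per_step]


def blend_hoisted(tiles_per_step, tile_size):
    # After: the weight depends only on tile_size and device, so build it once.
    weight = tile_weight(tile_size)
    return [tiles * weight for tiles in tiles_per_step]


steps = [torch.randn(3, 8) for _ in range(5)]
same = all(
    torch.equal(a, b)
    for a, b in zip(blend_per_step(steps, 8), blend_hoisted(steps, 8))
)
print(same)
```

Because the same values are multiplied in the same order, the two variants should agree exactly, which is what "bit-identical" refers to.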
Kept locally outside the repo; the content is not part of this PR's shippable changes and is better tracked separately from the code.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
cv2 was imported unconditionally but missing from environment_macos.yaml. The write_video import succeeds without PyAV but the call fails at runtime; that failure is now caught and falls through to the cv2 path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
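A generic sketch of the two-level fallback described here: an optional import, plus a runtime fall-through when a writer that imported fine still fails when called. `optional_import` and `save_with_fallback` are hypothetical helpers, and the assumption that the runtime failure surfaces as `ImportError` or `RuntimeError` is not taken from the PR.

```python
import importlib


def optional_import(name):
    """Return the module if it can be imported, else None."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


def save_with_fallback(writers, *args, **kwargs):
    """Call each (name, writer) in order; return the name of the first that succeeds."""
    errors = []
    for name, writer in writers:
        if writer is None:  # backend was not importable at all
            continue
        try:
            writer(*args, **kwargs)
            return name
        except (ImportError, RuntimeError) as exc:
            # Import succeeded earlier but the call failed: try the next backend.
            errors.append(f"{name}: {exc}")
    raise RuntimeError("no video writer available: " + "; ".join(errors))
```

In use, the first entry would wrap the torchvision writer and the second a cv2-based one, so the cv2 path is reached both when the import fails and when the call fails.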
Summary
- Apple Silicon (MPS) support in `inference.py`, `predict.py`, `dwpose_detector.py`, `pipeline_mimicmotion.py`, and `wholebody.py`. Before this branch, MimicMotion did not run on Apple Silicon at all: device detection hard-coded `cuda`/`cpu`, and the pipeline offloaded models to CPU between stages (a strict loss on unified memory). After this branch it runs.
- `mimicmotion/utils/loader.py` was calling `torch.serialization.safe_globals(*allowed_modules)`, which raises `TypeError: _safe_globals.__init__() takes 2 positional arguments but 3 were given` on any modern torch (verified on 2.8 and 2.11). The `safe_globals` code path has therefore been dead on `main` for any reasonably recent torch. Fixed to `safe_globals(allowed_modules)`.
- `decord` (x86-only wheels) and `torchvision.io.write_video` (ffmpeg dependency flaky on Apple Silicon) now fall back gracefully to OpenCV. Cross-verified on a CUDA host: the cv2 fallback produces identical frame shapes, effective strides, and per-channel means to the decord path, within codec tolerance.

What actually changes
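To make the `safe_globals` failure mode concrete, the sketch below uses a stand-in class, `SafeGlobalsLike`, that mimics only the call shape of a context manager taking a single list argument. It is not torch's implementation, and the two entries in `allowed_modules` are placeholders.

```python
from collections import OrderedDict, defaultdict


class SafeGlobalsLike:
    """Stand-in context manager that takes ONE positional argument: a list."""

    def __init__(self, safe_globals):
        self.safe_globals = list(safe_globals)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        return False


allowed_modules = [OrderedDict, defaultdict]  # placeholder entries

# Splat form: each list element becomes its own positional argument,
# so a two-element list turns into too many arguments.
try:
    SafeGlobalsLike(*allowed_modules)
    splat_error = None
except TypeError as exc:
    splat_error = exc

# List form: the whole list is passed as the single expected argument.
with SafeGlobalsLike(allowed_modules) as ctx:
    registered = ctx.safe_globals

print(type(splat_error).__name__, len(registered))
```

With a one-element list the splat form would not raise, but it would pass the element itself rather than a list, which is still not what the callee expects.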
MPS / compatibility
| File(s) | Change |
|---|---|
| `inference.py`, `predict.py`, `mimicmotion/dwpose/dwpose_detector.py` | MPS device detection alongside CUDA |
| `mimicmotion/dwpose/wholebody.py` | … `torch.device` objects and prefers `CoreMLExecutionProvider` when on MPS (falls back to CPU EP if unavailable) |
| `mimicmotion/pipelines/pipeline_mimicmotion.py` | Skips `.cpu()`/`.to(device)` round-trips outside CUDA: on unified memory these are a strict loss; on CPU they're no-ops |
| `mimicmotion/dwpose/preprocess.py` | `decord` import is now optional; cv2 fallback reads and decodes frames, converts BGR→RGB, and preserves stride semantics |
| `mimicmotion/utils/utils.py` | `torchvision.io.write_video` import is optional; cv2 `VideoWriter` fallback for `save_to_mp4` |
| `environment_macos.yaml`, `configs/test_mps.yaml` | New environment and config files for Mac testing |

Bug fixes
- `loader.py` `safe_globals`: `safe_globals(*allowed_modules)` → `safe_globals(allowed_modules)`. The splat form has been broken on main for any torch where `safe_globals` exists as a context manager; the fixed form works and the code path is now actually reachable.

Memory / loading improvements (required for MPS, beneficial on CUDA)
These changes were necessary to fit the full pipeline within the 48 GB unified memory of an Apple M3 Max — without them, peak memory during model loading exceeded what was available. They also reduce peak memory on CUDA hosts.
- `load_state_dict(..., assign=True)` in `loader.py`: swaps the checkpoint tensors directly into the model's parameters instead of copying their data into the existing ones, so the randomly initialized parameters and the checkpoint parameters are not both held in memory.
- `del checkpoint` after `load_state_dict` and `del mimicmotion_models` after pipeline construction: releases intermediate references earlier so Python can free them.
- `assign=True` causes parameter dtype to follow the checkpoint rather than the module constructor default. This is fine for the standard `inference.py` path (which sets `torch.set_default_dtype(torch.float16)` before `create_pipeline`, so both the pre-branch and post-branch paths end up fp16), but anyone calling `create_pipeline` directly without setting the default dtype should be aware.

Small perf wins (bit-identical outputs)
- Tile-blend weight hoisted out of the denoise loop (`pipeline_mimicmotion.py`): the `weight = torch.minimum((torch.arange(tile_size)+0.5)*2/tile_size, 2-w)` tensor depends only on `tile_size` and `device`, both fixed for the whole call. It was being rebuilt on every denoise iteration. Now computed once.
- Removed redundant `pose_pixels.copy()` (`inference.py`): the array comes directly from `np.concatenate`, which already returns a fresh contiguous array. Removes one full-frame allocation per inference.

End-to-end validation on CUDA (RTX 5090, torch 2.11.0+cu128)
Ran `python inference.py --inference_config configs/test.yaml` against `pose1.mp4` (530 frames) + `demo1.jpg` at 576×1024 with 25 denoise steps.

- Output: `outputs/pose1_*.mp4`, 11.18 MB, 530 frames, 576×1024 @ 15 fps, all frames decode cleanly
- No `Error`, `Traceback`, `OOM`, or `CUDA error` in the full log

Resource trace (784 one-second samples)
VRAM peak (~25 GB) lands during VAE decode, not denoising. Denoising is stable at ~17.7 GB. Worth noting for anyone running on a 24 GB card — this workload fits on 32 GB but not comfortably on 24 GB.
Fuzz tests run on this rig
Beyond the end-to-end check:
- `save_to_mp4` cv2 fallback: 30 randomized shape/fps/length combos + `pathlib.Path` input. 30/30 pass, all outputs round-trip through cv2.
- `get_video_pose` cv2 branch: 20 randomized synthetic videos. Verified frame count, shape, BGR→RGB ordering, and stride semantics, end-to-end with a stubbed DWPose detector. 20/20 pass.
- `load_state_dict(assign=True)` semantics: partial state dicts preserve untouched params, Parameter identity preserved, `requires_grad` preserved, fp16/fp32 forward passes work on CUDA.
- `_offload` gating: cpu / cuda / mps / `torch.device` objects.

Known limitations / things intentionally not in scope
- … `torch.compile`, VAE decode streaming, etc.) and are intentionally not part of this PR so it stays focused and reviewable. They'll be handled in follow-ups.

Test plan
- … `decord` nor `torchvision` preinstalled
- … (`py_compile`) on all touched files
- `save_to_mp4` cv2 fallback (30/30)
- `get_video_pose` cv2 fallback (20/20)
- `load_state_dict(assign=True)` semantics on a synthetic module with mixed dtypes
- `_offload` gating for cpu / cuda / mps / `torch.device`
- `inference.py` run on CUDA (RTX 5090, Windows) with real weights: exit 0, valid mp4 output
- `inference.py` run on MPS (M3 Max, macOS) with real weights

🤖 Generated with Claude Code