perf: VRAM & inference optimizations with CLI feature flags#54
perf: VRAM & inference optimizations with CLI feature flags#54cmoyates wants to merge 24 commits intonikopueringer:mainfrom
Conversation
|
Cool stuff. Some questions- I thought I was already casting the model to fp16. Perhaps i was doing so incorrectly? Or is this casting a different aspect of the process to fp16? The model is almost 400MB at fp32, and fp16 knocked it down to ~300MB I believe, based on my tests. Since the weights are so small, I came to the conclusion that trying to further quantization on the weights led to diminishing returns. Likewise, is the CNN taking up that much memory? From what I've seen, it's pretty light. I am dubious of claude's estimations here for vram savings. Have you run tests while observing nvidia-smi? Thanks for taking a crack at optimization! Let me know your thoughts! |
Neat thing about what you can see there: those numbers aren't estimations; those are benchmarks that I ran here on my MacBook. It is running the MPS compelled compatibility thing on the main model, but these are real numbers. As far as casting the model to fp16, I'm not sure. I'll be completely honest with you: I have just been asking various AI things, "What potential paths of optimization do they see in this?" and doing what they suggest and benchmarking the results. |
|
Here are the exact numbers from the benchmarks: https://github.com/cmoyates/CorridorKey/blob/feature/misc-optimizations/benchmarks/RESULTS.md. I'm assuming claude got the numbers in the PR description from there. Also here's the exact plan I followed for those first few optimizations: https://github.com/cmoyates/CorridorKey/blob/feature/misc-optimizations/docs/plans/2026-03-07-feat-vram-performance-optimizations-plan.md |
|
Mm, interesting. Sounds like this is Mac focused at the moment. I’m hesitant to dive in too deep on to Mac optimization a day before public launch, but something like this could be extremely useful to people. have you verified the results with your eyeballs? Compared frames between outputs? if there are optimization gains to be made, it’s great that you’re highlighting how they could be achieved. But since I don’t have a Mac platform to test on, I’m gonna take this a bit slow and really rely on others to test and confirm this stuff. I’m still pondering if we should split off a Mac-focused version since this goes pretty deep into Mac architecture. what are your thoughts? |
|
It should be cross platform (I am working on an equivalent set of changes for the MLX repo atm). Only reason it appears to be mac focused is because that's what I'm running it on. I'll send out some messages in the discord for have some people with Nvidia GPUs run it to double check once I've cleaned it up a bit. |
f200aba to
450cdcf
Compare
|
Okay, awesome. If this is cross platform, that’s great. |
…remove per-tile empty_cache - Extract add_optimization_args() to eliminate duplicated CLI flags - Fix _clamp() ignoring its min parameter - Remove expensive per-tile torch.cuda.empty_cache() calls - Fix lossy threshold max_abs_err 0.02→0.06 (was stricter than fp16) - Remove hardcoded absolute path from .claude/settings.json - Document bicubic vs Lanczos4 tradeoff in GPU postprocess - Replace vague humility clamp comment with rationale Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…remove per-tile empty_cache - Extract add_optimization_args() to eliminate duplicated CLI flags - Fix _clamp() ignoring its min parameter - Remove expensive per-tile torch.cuda.empty_cache() calls - Fix lossy threshold max_abs_err 0.02→0.06 (was stricter than fp16) - Remove hardcoded absolute path from .claude/settings.json - Document bicubic vs Lanczos4 tradeoff in GPU postprocess - Replace vague humility clamp comment with rationale Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
f5a93f2 to
50a152e
Compare
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
4-phase plan: FP16 weights, GPU-side post-processing, backbone/refiner resolution decoupling, tiled refiner. Includes quality validation strategy w/ pixel-diff gates. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Restructure from 4-phase to 5-phase (0-4). Phase 0 covers benchmark script, baseline capture, quality gate tests, pixel diff reporting, and results table — all before any optimization work begins. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- benchmarks/bench_phase.py: timing, memory, pixel diff CLI - tests/test_quality_gate.py: per-channel quality gates + CI smoke tests - .gitignore: baseline .npy files excluded - Plan updated w/ reference clip paths Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- torch.mps.driver_allocated_size -> driver_allocated_memory - Relax color range test tolerance for Lanczos4 ringing (~0.08 overshoot) - All Phase 0 acceptance criteria checked off Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add model.half() after load_state_dict. Autocast already ran FP16 activations; aligning weight storage halves static footprint and reduces activation memory under autocast. Quality gate thresholds updated: added fp16 tier between lossless and lossy — FP16 rounding is not bit-exact but 0% pixels > 1e-2. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- F.interpolate(bicubic) replaces cv2.resize(Lanczos4) for prediction upscaling — stays on GPU - despill/srgb_to_linear/premultiply/compositing all run as GPU tensor ops - .cpu().numpy() deferred to final return (single batch transfer) - clean_matte stays CPU (cv2.connectedComponents) — only alpha transferred - Checkerboard + linear variant cached per resolution as GPU tensors - Dilation kernel cached in module-level dict - Bicubic clamp added (can overshoot [0,1] unlike Lanczos4) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Median 5.42s (-31% vs baseline, -5% vs P1), 26.10 GB mem (-6.13 GB vs baseline). FG max err 0.083 from Lanczos4→bicubic change but 0.06% pixels > 1e-2. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Encoder runs at backbone_size (e.g. 1024) while refiner stays at full img_size (2048). Decoder outputs upsampled to full res before refiner. Defaults to None (no change). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1.53s median (+5.3x), 8.18 GB mem (-74.6%). Quality lossy w/o retrain — viable as fast preview mode. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
512x512 tiles, 64px overlap, CPU accumulator. Per-tile GPU cache flush (MPS/CUDA-aware). Tent weight normalization prevents seam artifacts. Default enabled in CorridorKeyEngine (tile_size=512). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Mix-and-match flags for quality/perf tradeoffs. Includes wizard presets and benchmark matrix script. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
64px→96px overlap: ~12% better MAE at zero cost. Errors at subject edges not tile seams (grid analysis). 15.36 GB vs 32.23 GB baseline (-52.3%). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Wire --fp16, --gpu-postprocess, --backbone-size, --refiner-tile-size, --refiner-tile-overlap through CLI/wizard/backend to engine. Add wizard preset menu (Quality/Fast Preview/Low VRAM/Legacy/Custom). Add benchmarks/bench_matrix.py for cross-preset comparison. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…remove per-tile empty_cache - Extract add_optimization_args() to eliminate duplicated CLI flags - Fix _clamp() ignoring its min parameter - Remove expensive per-tile torch.cuda.empty_cache() calls - Fix lossy threshold max_abs_err 0.02→0.06 (was stricter than fp16) - Remove hardcoded absolute path from .claude/settings.json - Document bicubic vs Lanczos4 tradeoff in GPU postprocess - Replace vague humility clamp comment with rationale Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Runs each benchmark preset on a single frame, saves all output passes as PNGs for manual side-by-side review. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
.half() + autocast + FP32 input caused type conflicts that hung on Windows 4090. Autocast alone handles FP16 correctly. Also make autocast conditional on fp16 flag and add pre-warmup progress print. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
PXR24 deadlocks on some Windows CUDA setups. ZIP is universally compatible with negligible size difference (~10KB/frame). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Upstream pins torch to cu126 index which has no macOS wheels. Add sys_platform markers so Linux/Windows use CUDA index, macOS falls back to PyPI for MPS-compatible builds. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
58978c1 to
cc7a92c
Compare
|
going through PRs, just wanted to let you know that I'm still aware this is open. There have been a few other optimizations that I'm going through. Let me know if there's an update here I should merge |
|
I'm 99% sure the stuff the other guys have been doing includes all of this but I'll ask one of them to take a look once things start to slow down |
|
Who will go through 3000+ lines to review this PR? Using AI to write codes are fine, but the PR needs to be of reasonable size so that others can review it. And using AI to review PRs is a bad precedence. Perhaps ask Claude (or whichever AI you are using) to be smarter and suggest simple and effective changes. And if you still need all these, then split those into multiple PRs so that we can review it independently. |
Have you taken a look at the P/R? 3000 lines of code change in a P/R can be a reasonable amount and it seems like you did not even bother to look into what's in the P/R before writting this comment. I assume you saw that this user wrote that they were using a LLM to help them out and you assumed it was a bad P/R because of it. After skimming through it briefly you can quickly see that there is not in fact 3000 lines of code since a lot of the new lines in here are documentation, only about 1500 lines remain of actual code change which i see quite often in P/Rs that used no AI. This is not my expertise so i will not be reviewing it but this comment seems to be bad faith AI hate. |
Summary
Phased performance optimization of the inference pipeline, reducing VRAM usage and improving throughput. All optimizations are behind CLI flags — defaults match pre-optimization behavior.
--fp16) — 7.2 GB VRAM savings, 27% faster--gpu-color) — eliminates CPU↔GPU transfers--refiner-res) — run refiner at lower res--tile-size,--tile-overlap) — 52% VRAM reductionWhy
Large-resolution frames (2K+) exhaust GPU memory on consumer cards. These changes make CorridorKey usable on 8–12 GB GPUs without sacrificing output quality at default settings.
Test plan
uv run pytest -m "not gpu"passesuv run ruff check && uv run ruff format --checkcleanmain--fp16 --gpu-color --tile-size 512reduces VRAM significantly🤖 Generated with Claude Code