
perf: VRAM & inference optimizations with CLI feature flags #54

Draft
cmoyates wants to merge 24 commits into nikopueringer:main from cmoyates:feature/misc-optimizations

Conversation

@cmoyates
Contributor

@cmoyates cmoyates commented Mar 7, 2026

Summary

Phased performance optimization of the inference pipeline, reducing VRAM usage and improving throughput. All optimizations are behind CLI flags — defaults match pre-optimization behavior.

  • FP16 weight casting (--fp16) — 7.2 GB VRAM savings, 27% faster
  • GPU-side color math (--gpu-color) — eliminates CPU↔GPU transfers
  • Decoupled backbone/refiner resolutions (--refiner-res) — run refiner at lower res
  • Tiled CNN refiner with tent blending (--tile-size, --tile-overlap) — 52% VRAM reduction
  • Benchmark infrastructure + quality gate tests
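Since everything above is opt-in, the shared flag registration could look like the sketch below. The flag names are the ones this PR wires through later (`--fp16`, `--gpu-postprocess`, `--backbone-size`, `--refiner-tile-size`, `--refiner-tile-overlap`); the exact signature and defaults are assumptions for illustration, not the repo's actual code:

```python
import argparse

def add_optimization_args(parser: argparse.ArgumentParser) -> None:
    """Shared opt-in flags; every default reproduces pre-optimization behaviour."""
    parser.add_argument("--fp16", action="store_true",
                        help="cast model weights to half precision")
    parser.add_argument("--gpu-postprocess", action="store_true",
                        help="run colour math on the GPU instead of the CPU")
    parser.add_argument("--backbone-size", type=int, default=None,
                        help="run the encoder at this resolution (None = full res)")
    parser.add_argument("--refiner-tile-size", type=int, default=None,
                        help="tile size for the tiled CNN refiner (None = untiled)")
    parser.add_argument("--refiner-tile-overlap", type=int, default=96,
                        help="overlap between refiner tiles, in pixels")

parser = argparse.ArgumentParser()
add_optimization_args(parser)
args = parser.parse_args([])   # no flags given: all optimizations stay off
```

Registering the flags in one helper is what lets the CLI, wizard, and backend share identical defaults instead of drifting apart.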

Why

Large-resolution frames (2K+) exhaust GPU memory on consumer cards. These changes make CorridorKey usable on 8–12 GB GPUs without sacrificing output quality at default settings.

Test plan

  • uv run pytest -m "not gpu" passes
  • uv run ruff check && uv run ruff format --check clean
  • Default flags produce identical output to main
  • --fp16 --gpu-color --tile-size 512 reduces VRAM significantly

🤖 Generated with Claude Code

@cmoyates cmoyates marked this pull request as draft March 7, 2026 21:52
@nikopueringer
Owner

Cool stuff. Some questions-

I thought I was already casting the model to fp16. Perhaps I was doing so incorrectly? Or is this casting a different aspect of the process to fp16? The model is almost 400MB at fp32, and fp16 knocked it down to ~300MB I believe, based on my tests. Since the weights are so small, I came to the conclusion that further quantizing the weights led to diminishing returns.

Likewise, is the CNN taking up that much memory? From what I've seen, it's pretty light. I am dubious of Claude's estimations here for VRAM savings. Have you run tests while observing nvidia-smi?

Thanks for taking a crack at optimization! Let me know your thoughts!

@cmoyates
Contributor Author

cmoyates commented Mar 7, 2026

> Cool stuff. Some questions-
>
> I thought I was already casting the model to fp16. Perhaps I was doing so incorrectly? Or is this casting a different aspect of the process to fp16? The model is almost 400MB at fp32, and fp16 knocked it down to ~300MB I believe, based on my tests. Since the weights are so small, I came to the conclusion that further quantizing the weights led to diminishing returns.
>
> Likewise, is the CNN taking up that much memory? From what I've seen, it's pretty light. I am dubious of Claude's estimations here for VRAM savings. Have you run tests while observing nvidia-smi?
>
> Thanks for taking a crack at optimization! Let me know your thoughts!

Neat thing about what you can see there: those numbers aren't estimations; they're benchmarks that I ran here on my MacBook. It's running the MPS compatibility fallback on the main model, but these are real numbers.

As far as casting the model to fp16, I'm not sure. I'll be completely honest with you: I have just been asking various AI tools what potential paths of optimization they see in this, then doing what they suggest and benchmarking the results.

@cmoyates
Contributor Author

cmoyates commented Mar 7, 2026

Here are the exact numbers from the benchmarks: https://github.com/cmoyates/CorridorKey/blob/feature/misc-optimizations/benchmarks/RESULTS.md. I'm assuming Claude got the numbers in the PR description from there.

Also here's the exact plan I followed for those first few optimizations: https://github.com/cmoyates/CorridorKey/blob/feature/misc-optimizations/docs/plans/2026-03-07-feat-vram-performance-optimizations-plan.md

@nikopueringer
Owner

Mm, interesting. Sounds like this is Mac-focused at the moment. I'm hesitant to dive too deep into Mac optimization a day before public launch, but something like this could be extremely useful to people.

Have you verified the results with your eyeballs? Compared frames between outputs?

If there are optimization gains to be made, it's great that you're highlighting how they could be achieved. But since I don't have a Mac platform to test on, I'm gonna take this a bit slow and really rely on others to test and confirm this stuff. I'm still pondering whether we should split off a Mac-focused version, since this goes pretty deep into Mac architecture.

What are your thoughts?

@cmoyates
Contributor Author

cmoyates commented Mar 7, 2026

It should be cross-platform (I'm working on an equivalent set of changes for the MLX repo at the moment). The only reason it appears Mac-focused is that that's what I'm running it on. I'll send out some messages in the Discord to have some people with Nvidia GPUs run it to double-check once I've cleaned it up a bit.

@cmoyates cmoyates force-pushed the feature/misc-optimizations branch from f200aba to 450cdcf Compare March 8, 2026 00:01
@nikopueringer
Owner

Okay, awesome. If this is cross platform, that’s great.

cmoyates added a commit to cmoyates/CorridorKey that referenced this pull request Mar 8, 2026
…remove per-tile empty_cache

- Extract add_optimization_args() to eliminate duplicated CLI flags
- Fix _clamp() ignoring its min parameter
- Remove expensive per-tile torch.cuda.empty_cache() calls
- Fix lossy threshold max_abs_err 0.02→0.06 (was stricter than fp16)
- Remove hardcoded absolute path from .claude/settings.json
- Document bicubic vs Lanczos4 tradeoff in GPU postprocess
- Replace vague humility clamp comment with rationale

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@cmoyates cmoyates force-pushed the feature/misc-optimizations branch from f5a93f2 to 50a152e Compare March 8, 2026 13:33
cmoyates and others added 19 commits March 8, 2026 18:02
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
4-phase plan: FP16 weights, GPU-side post-processing,
backbone/refiner resolution decoupling, tiled refiner.
Includes quality validation strategy w/ pixel-diff gates.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Restructure from 4-phase to 5-phase (0-4). Phase 0 covers benchmark script, baseline capture, quality gate tests, pixel diff reporting, and results table — all before any optimization work begins.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- benchmarks/bench_phase.py: timing, memory, pixel diff CLI
- tests/test_quality_gate.py: per-channel quality gates + CI smoke tests
- .gitignore: baseline .npy files excluded
- Plan updated w/ reference clip paths

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- torch.mps.driver_allocated_size -> driver_allocated_memory
- Relax color range test tolerance for Lanczos4 ringing (~0.08 overshoot)
- All Phase 0 acceptance criteria checked off

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add model.half() after load_state_dict. Autocast already ran FP16
activations; aligning weight storage halves static footprint and
reduces activation memory under autocast.

Quality gate thresholds updated: added fp16 tier between lossless
and lossy — FP16 rounding is not bit-exact but 0% pixels > 1e-2.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
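The storage arithmetic behind that `model.half()` change can be illustrated without touching the model itself. This NumPy sketch (array shape arbitrary, a stand-in for a weight tensor) shows the halved static footprint and why fp16 rounding stays far below the 1e-2 quality-gate threshold for normalized values:

```python
import numpy as np

# Stand-in for an fp32 weight tensor; values in [0, 1) like normalized pixels.
rng = np.random.default_rng(0)
weights_fp32 = rng.random((1024, 1024), dtype=np.float32)

weights_fp16 = weights_fp32.astype(np.float16)   # analogous to model.half()

# Static footprint halves: 4 bytes per element becomes 2 bytes per element.
assert weights_fp16.nbytes == weights_fp32.nbytes // 2

# Rounding is not bit-exact, but for values in [0, 1) the worst-case error
# (roughly half an fp16 ulp, about 5e-4 near 1.0) sits well under 1e-2.
max_err = np.abs(weights_fp16.astype(np.float32) - weights_fp32).max()
assert max_err < 1e-2
```

The same reasoning is why the commit adds a separate "fp16" gate tier between lossless and lossy: outputs differ at the last bits, but no pixel moves by more than the tolerance.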
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- F.interpolate(bicubic) replaces cv2.resize(Lanczos4) for prediction upscaling — stays on GPU
- despill/srgb_to_linear/premultiply/compositing all run as GPU tensor ops
- .cpu().numpy() deferred to final return (single batch transfer)
- clean_matte stays CPU (cv2.connectedComponents) — only alpha transferred
- Checkerboard + linear variant cached per resolution as GPU tensors
- Dilation kernel cached in module-level dict
- Bicubic clamp added (can overshoot [0,1] unlike Lanczos4)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
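The GPU-side colour path above moves per-pixel math like `srgb_to_linear` onto the device as vectorised tensor ops. A NumPy sketch of the two pieces that are easy to get wrong, the standard sRGB transfer function and the post-bicubic clamp (function names are illustrative, not the repo's):

```python
import numpy as np

def srgb_to_linear(s: np.ndarray) -> np.ndarray:
    # Standard sRGB EOTF as a single vectorised expression; on a GPU tensor
    # library the same formula runs device-side with no CPU round-trip.
    return np.where(s <= 0.04045, s / 12.92, ((s + 0.055) / 1.055) ** 2.4)

def upscale_clamp(pred: np.ndarray) -> np.ndarray:
    # Bicubic interpolation can overshoot the valid range, so clamp back
    # to [0, 1] before compositing (the "bicubic clamp" in the commit).
    return np.clip(pred, 0.0, 1.0)
```

Keeping every step in this chain on the GPU is what lets `.cpu().numpy()` be deferred to a single transfer at the end of the frame.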
Median 5.42s (-31% vs baseline, -5% vs P1), 26.10 GB mem (-6.13 GB vs baseline).
FG max err 0.083 from Lanczos4→bicubic change but 0.06% pixels > 1e-2.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Encoder runs at backbone_size (e.g. 1024) while refiner
stays at full img_size (2048). Decoder outputs upsampled
to full res before refiner. Defaults to None (no change).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
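Resolution decoupling amounts to downscaling before the encoder and upsampling its output back to full resolution before the refiner. A toy NumPy sketch of the control flow, assuming integer resize factors and single-channel frames (all function names hypothetical):

```python
import numpy as np

def run_decoupled(frame, backbone, refiner, backbone_size=None):
    """Run the backbone at reduced resolution, refiner at full resolution.

    Assumes frame height is divisible by backbone_size and width by the
    resulting factor; real code would use a proper interpolated resize.
    """
    h, w = frame.shape
    if backbone_size is None:            # default: behaviour unchanged
        coarse = backbone(frame)
    else:
        f = h // backbone_size           # integer downscale factor
        small = frame.reshape(backbone_size, f, w // f, f).mean(axis=(1, 3))
        coarse = backbone(small)
        coarse = coarse.repeat(f, axis=0).repeat(f, axis=1)  # nearest upsample
    return refiner(frame, coarse)
```

The `None` default is the important design point: leaving the flag unset takes exactly the old single-resolution path.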
1.53s median (+5.3x), 8.18 GB mem (-74.6%). Quality lossy
w/o retrain — viable as fast preview mode.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
512x512 tiles, 64px overlap, CPU accumulator.
Per-tile GPU cache flush (MPS/CUDA-aware).
Tent weight normalization prevents seam artifacts.
Default enabled in CorridorKeyEngine (tile_size=512).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
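The tiled refiner with tent-weight normalization can be sketched in NumPy. Each tile's result is weighted by a 2-D triangular window, accumulated, and divided by the summed weights, which is what prevents visible seams at tile boundaries (`tiled_refine` and `tent_weights` are illustrative names, not the PR's actual code; assumes tile fits inside the image and exceeds the overlap):

```python
import numpy as np

def tent_weights(size: int) -> np.ndarray:
    # Triangular ("tent") weights: ramp up to the tile centre, then back down.
    ramp = np.minimum(np.arange(size) + 1, size - np.arange(size))
    return ramp / ramp.max()

def tile_starts(total: int, tile: int, stride: int) -> list[int]:
    s = list(range(0, total - tile + 1, stride))
    if s[-1] + tile < total:        # ensure the final tile reaches the edge
        s.append(total - tile)
    return s

def tiled_refine(image: np.ndarray, process, tile: int = 512, overlap: int = 96) -> np.ndarray:
    """Run `process` on overlapping tiles of a (H, W) array, tent-blended."""
    h, w = image.shape
    stride = tile - overlap
    acc = np.zeros((h, w))   # weighted sum of processed tiles
    norm = np.zeros((h, w))  # sum of weights, for normalization
    for y in tile_starts(h, tile, stride):
        for x in tile_starts(w, tile, stride):
            out = process(image[y:y + tile, x:x + tile])
            wgt = tent_weights(out.shape[0])[:, None] * tent_weights(out.shape[1])[None, :]
            acc[y:y + tile, x:x + tile] += out * wgt
            norm[y:y + tile, x:x + tile] += wgt
    return acc / norm
```

Because the weights are strictly positive and normalized per pixel, an identity `process` reconstructs the input exactly; any residual error therefore comes from the refiner seeing limited context per tile, not from the blending itself.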
Mix-and-match flags for quality/perf tradeoffs.
Includes wizard presets and benchmark matrix script.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
64px→96px overlap: ~12% better MAE at zero cost.
Errors at subject edges not tile seams (grid analysis).
15.36 GB vs 32.23 GB baseline (-52.3%).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Wire --fp16, --gpu-postprocess, --backbone-size, --refiner-tile-size,
--refiner-tile-overlap through CLI/wizard/backend to engine. Add wizard
preset menu (Quality/Fast Preview/Low VRAM/Legacy/Custom). Add
benchmarks/bench_matrix.py for cross-preset comparison.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
cmoyates and others added 5 commits March 8, 2026 18:02
…remove per-tile empty_cache

- Extract add_optimization_args() to eliminate duplicated CLI flags
- Fix _clamp() ignoring its min parameter
- Remove expensive per-tile torch.cuda.empty_cache() calls
- Fix lossy threshold max_abs_err 0.02→0.06 (was stricter than fp16)
- Remove hardcoded absolute path from .claude/settings.json
- Document bicubic vs Lanczos4 tradeoff in GPU postprocess
- Replace vague humility clamp comment with rationale

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Runs each benchmark preset on a single frame, saves all output passes as PNGs for manual side-by-side review.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
.half() + autocast + FP32 input caused type conflicts that hung on
Windows 4090. Autocast alone handles FP16 correctly. Also make
autocast conditional on fp16 flag and add pre-warmup progress print.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
PXR24 deadlocks on some Windows CUDA setups. ZIP is universally
compatible with negligible size difference (~10KB/frame).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Upstream pins torch to cu126 index which has no macOS wheels.
Add sys_platform markers so Linux/Windows use CUDA index, macOS
falls back to PyPI for MPS-compatible builds.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
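The marker-based split could look roughly like this in `pyproject.toml` (index name and exact table layout are assumptions; uv's `tool.uv.sources` supports per-entry `marker` keys, and macOS falls through to PyPI because no source entry matches it):

```toml
[[tool.uv.index]]
name = "pytorch-cu126"
url = "https://download.pytorch.org/whl/cu126"
explicit = true

[tool.uv.sources]
torch = [
  # Only Linux and Windows pull from the CUDA index; macOS resolves from
  # PyPI, whose wheels include MPS support.
  { index = "pytorch-cu126", marker = "sys_platform == 'linux' or sys_platform == 'win32'" },
]
```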
@cmoyates cmoyates force-pushed the feature/misc-optimizations branch from 58978c1 to cc7a92c Compare March 8, 2026 20:32
@cmoyates cmoyates changed the title VRAM & performance optimizations (5 phases) perf: VRAM & inference optimizations with CLI feature flags Mar 8, 2026
@nikopueringer
Owner

Going through PRs, just wanted to let you know that I'm still aware this is open. There have been a few other optimizations that I'm going through. Let me know if there's an update here I should merge.

@cmoyates
Contributor Author

cmoyates commented Mar 9, 2026

I'm 99% sure the stuff the other guys have been doing includes all of this, but I'll ask one of them to take a look once things start to slow down.

@HYP3R00T
Contributor

Who will go through 3000+ lines to review this PR? Using AI to write code is fine, but the PR needs to be of a reasonable size so that others can review it. And using AI to review PRs sets a bad precedent. Perhaps ask Claude (or whichever AI you are using) to be smarter and suggest simple and effective changes. And if you still need all of this, then split it into multiple PRs so that we can review each one independently.

@tonioss22

> Who will go through 3000+ lines to review this PR? Using AI to write code is fine, but the PR needs to be of a reasonable size so that others can review it. And using AI to review PRs sets a bad precedent. Perhaps ask Claude (or whichever AI you are using) to be smarter and suggest simple and effective changes. And if you still need all of this, then split it into multiple PRs so that we can review each one independently.

Have you taken a look at the PR? 3000 lines of change in a PR can be a reasonable amount, and it seems like you did not even bother to look into what's in the PR before writing this comment. I assume you saw that this user wrote that they were using an LLM to help them out, and you assumed it was a bad PR because of it.

After skimming through it briefly you can quickly see that there are not in fact 3000 lines of code, since a lot of the new lines are documentation; only about 1500 lines of actual code change remain, which I see quite often in PRs that used no AI. This is not my expertise so I will not be reviewing it, but this comment seems to be bad-faith AI hate.
