
perf: VRAM & inference optimizations with CLI feature flags #54

Draft
cmoyates wants to merge 24 commits into nikopueringer:main from cmoyates:feature/misc-optimizations

Conversation

@cmoyates
Contributor

@cmoyates cmoyates commented Mar 7, 2026

Summary

Phased performance optimization of the inference pipeline, reducing VRAM usage and improving throughput. All optimizations are behind CLI flags — defaults match pre-optimization behavior.

  • FP16 weight casting (--fp16) — 7.2 GB VRAM savings, 27% faster
  • GPU-side color math (--gpu-color) — eliminates CPU↔GPU transfers
  • Decoupled backbone/refiner resolutions (--refiner-res) — run refiner at lower res
  • Tiled CNN refiner with tent blending (--tile-size, --tile-overlap) — 52% VRAM reduction
  • Benchmark infrastructure + quality gate tests
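Since everything above is opt-in, the shared flag registration could look like the sketch below. The flag names are the ones this PR wires through later (`--fp16`, `--gpu-postprocess`, `--backbone-size`, `--refiner-tile-size`, `--refiner-tile-overlap`); the exact signature and defaults are assumptions for illustration, not the repo's actual code:

```python
import argparse

def add_optimization_args(parser: argparse.ArgumentParser) -> None:
    """Shared opt-in flags; every default reproduces pre-optimization behaviour."""
    parser.add_argument("--fp16", action="store_true",
                        help="cast model weights to half precision")
    parser.add_argument("--gpu-postprocess", action="store_true",
                        help="run colour math on the GPU instead of the CPU")
    parser.add_argument("--backbone-size", type=int, default=None,
                        help="run the encoder at this resolution (None = full res)")
    parser.add_argument("--refiner-tile-size", type=int, default=None,
                        help="tile size for the tiled CNN refiner (None = untiled)")
    parser.add_argument("--refiner-tile-overlap", type=int, default=96,
                        help="overlap between refiner tiles, in pixels")

parser = argparse.ArgumentParser()
add_optimization_args(parser)
args = parser.parse_args([])   # no flags given: all optimizations stay off
```

Registering the flags in one helper is what lets the CLI, wizard, and backend share identical defaults instead of drifting apart.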

Why

Large-resolution frames (2K+) exhaust GPU memory on consumer cards. These changes make CorridorKey usable on 8–12 GB GPUs without sacrificing output quality at default settings.

Test plan

  • uv run pytest -m "not gpu" passes
  • uv run ruff check && uv run ruff format --check clean
  • Default flags produce identical output to main
  • --fp16 --gpu-color --tile-size 512 reduces VRAM significantly

🤖 Generated with Claude Code

@cmoyates cmoyates marked this pull request as draft March 7, 2026 21:52
@nikopueringer
Owner

Cool stuff. Some questions-

I thought I was already casting the model to fp16. Perhaps I was doing so incorrectly? Or is this casting a different aspect of the process to fp16? The model is almost 400MB at fp32, and fp16 knocked it down to ~300MB I believe, based on my tests. Since the weights are so small, I came to the conclusion that further quantizing the weights led to diminishing returns.

Likewise, is the CNN taking up that much memory? From what I've seen, it's pretty light. I am dubious of Claude's estimations here for VRAM savings. Have you run tests while observing nvidia-smi?

Thanks for taking a crack at optimization! Let me know your thoughts!

@cmoyates
Contributor Author

cmoyates commented Mar 7, 2026

> Cool stuff. Some questions-
>
> I thought I was already casting the model to fp16. Perhaps I was doing so incorrectly? Or is this casting a different aspect of the process to fp16? The model is almost 400MB at fp32, and fp16 knocked it down to ~300MB I believe, based on my tests. Since the weights are so small, I came to the conclusion that further quantizing the weights led to diminishing returns.
>
> Likewise, is the CNN taking up that much memory? From what I've seen, it's pretty light. I am dubious of Claude's estimations here for VRAM savings. Have you run tests while observing nvidia-smi?
>
> Thanks for taking a crack at optimization! Let me know your thoughts!

Neat thing about what you can see there: those numbers aren't estimations; they're benchmarks that I ran here on my MacBook. It's running the MPS compatibility fallback on the main model, but these are real numbers.

As far as casting the model to fp16, I'm not sure. I'll be completely honest with you: I have just been asking various AI tools what potential paths of optimization they see in this, then doing what they suggest and benchmarking the results.

@cmoyates
Contributor Author

cmoyates commented Mar 7, 2026

Here are the exact numbers from the benchmarks: https://github.com/cmoyates/CorridorKey/blob/feature/misc-optimizations/benchmarks/RESULTS.md. I'm assuming Claude got the numbers in the PR description from there.

Also here's the exact plan I followed for those first few optimizations: https://github.com/cmoyates/CorridorKey/blob/feature/misc-optimizations/docs/plans/2026-03-07-feat-vram-performance-optimizations-plan.md

@nikopueringer
Owner

Mm, interesting. Sounds like this is Mac-focused at the moment. I'm hesitant to dive too deep into Mac optimization a day before public launch, but something like this could be extremely useful to people.

Have you verified the results with your eyeballs? Compared frames between outputs?

If there are optimization gains to be made, it's great that you're highlighting how they could be achieved. But since I don't have a Mac platform to test on, I'm gonna take this a bit slow and really rely on others to test and confirm this stuff. I'm still pondering whether we should split off a Mac-focused version, since this goes pretty deep into Mac architecture.

What are your thoughts?

@cmoyates
Contributor Author

cmoyates commented Mar 7, 2026

It should be cross-platform (I'm working on an equivalent set of changes for the MLX repo at the moment). The only reason it appears Mac-focused is that that's what I'm running it on. I'll send out some messages in the Discord to have some people with Nvidia GPUs run it to double-check once I've cleaned it up a bit.

@cmoyates cmoyates force-pushed the feature/misc-optimizations branch from f200aba to 450cdcf Compare March 8, 2026 00:01
@nikopueringer
Owner

Okay, awesome. If this is cross platform, that’s great.

cmoyates added a commit to cmoyates/CorridorKey that referenced this pull request Mar 8, 2026
…remove per-tile empty_cache

- Extract add_optimization_args() to eliminate duplicated CLI flags
- Fix _clamp() ignoring its min parameter
- Remove expensive per-tile torch.cuda.empty_cache() calls
- Fix lossy threshold max_abs_err 0.02→0.06 (was stricter than fp16)
- Remove hardcoded absolute path from .claude/settings.json
- Document bicubic vs Lanczos4 tradeoff in GPU postprocess
- Replace vague humility clamp comment with rationale

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@cmoyates cmoyates force-pushed the feature/misc-optimizations branch from f5a93f2 to 50a152e Compare March 8, 2026 13:33
cmoyates and others added 19 commits March 8, 2026 18:02
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
4-phase plan: FP16 weights, GPU-side post-processing,
backbone/refiner resolution decoupling, tiled refiner.
Includes quality validation strategy w/ pixel-diff gates.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Restructure from 4-phase to 5-phase (0-4). Phase 0 covers benchmark script, baseline capture, quality gate tests, pixel diff reporting, and results table — all before any optimization work begins.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- benchmarks/bench_phase.py: timing, memory, pixel diff CLI
- tests/test_quality_gate.py: per-channel quality gates + CI smoke tests
- .gitignore: baseline .npy files excluded
- Plan updated w/ reference clip paths

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- torch.mps.driver_allocated_size -> driver_allocated_memory
- Relax color range test tolerance for Lanczos4 ringing (~0.08 overshoot)
- All Phase 0 acceptance criteria checked off

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add model.half() after load_state_dict. Autocast already ran FP16
activations; aligning weight storage halves static footprint and
reduces activation memory under autocast.

Quality gate thresholds updated: added fp16 tier between lossless
and lossy — FP16 rounding is not bit-exact but 0% pixels > 1e-2.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
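The storage arithmetic behind that `model.half()` change can be illustrated without touching the model itself. This NumPy sketch (array shape arbitrary, a stand-in for a weight tensor) shows the halved static footprint and why fp16 rounding stays far below the 1e-2 quality-gate threshold for normalized values:

```python
import numpy as np

# Stand-in for an fp32 weight tensor; values in [0, 1) like normalized pixels.
rng = np.random.default_rng(0)
weights_fp32 = rng.random((1024, 1024), dtype=np.float32)

weights_fp16 = weights_fp32.astype(np.float16)   # analogous to model.half()

# Static footprint halves: 4 bytes per element becomes 2 bytes per element.
assert weights_fp16.nbytes == weights_fp32.nbytes // 2

# Rounding is not bit-exact, but for values in [0, 1) the worst-case error
# (roughly half an fp16 ulp, about 5e-4 near 1.0) sits well under 1e-2.
max_err = np.abs(weights_fp16.astype(np.float32) - weights_fp32).max()
assert max_err < 1e-2
```

The same reasoning is why the commit adds a separate "fp16" gate tier between lossless and lossy: outputs differ at the last bits, but no pixel moves by more than the tolerance.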
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- F.interpolate(bicubic) replaces cv2.resize(Lanczos4) for prediction upscaling — stays on GPU
- despill/srgb_to_linear/premultiply/compositing all run as GPU tensor ops
- .cpu().numpy() deferred to final return (single batch transfer)
- clean_matte stays CPU (cv2.connectedComponents) — only alpha transferred
- Checkerboard + linear variant cached per resolution as GPU tensors
- Dilation kernel cached in module-level dict
- Bicubic clamp added (can overshoot [0,1] unlike Lanczos4)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
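The GPU-side colour path above moves per-pixel math like `srgb_to_linear` onto the device as vectorised tensor ops. A NumPy sketch of the two pieces that are easy to get wrong, the standard sRGB transfer function and the post-bicubic clamp (function names are illustrative, not the repo's):

```python
import numpy as np

def srgb_to_linear(s: np.ndarray) -> np.ndarray:
    # Standard sRGB EOTF as a single vectorised expression; on a GPU tensor
    # library the same formula runs device-side with no CPU round-trip.
    return np.where(s <= 0.04045, s / 12.92, ((s + 0.055) / 1.055) ** 2.4)

def upscale_clamp(pred: np.ndarray) -> np.ndarray:
    # Bicubic interpolation can overshoot the valid range, so clamp back
    # to [0, 1] before compositing (the "bicubic clamp" in the commit).
    return np.clip(pred, 0.0, 1.0)
```

Keeping every step in this chain on the GPU is what lets `.cpu().numpy()` be deferred to a single transfer at the end of the frame.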
Median 5.42s (-31% vs baseline, -5% vs P1), 26.10 GB mem (-6.13 GB vs baseline).
FG max err 0.083 from Lanczos4→bicubic change but 0.06% pixels > 1e-2.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Encoder runs at backbone_size (e.g. 1024) while refiner
stays at full img_size (2048). Decoder outputs upsampled
to full res before refiner. Defaults to None (no change).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
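Resolution decoupling amounts to downscaling before the encoder and upsampling its output back to full resolution before the refiner. A toy NumPy sketch of the control flow, assuming integer resize factors and single-channel frames (all function names hypothetical):

```python
import numpy as np

def run_decoupled(frame, backbone, refiner, backbone_size=None):
    """Run the backbone at reduced resolution, refiner at full resolution.

    Assumes frame height is divisible by backbone_size and width by the
    resulting factor; real code would use a proper interpolated resize.
    """
    h, w = frame.shape
    if backbone_size is None:            # default: behaviour unchanged
        coarse = backbone(frame)
    else:
        f = h // backbone_size           # integer downscale factor
        small = frame.reshape(backbone_size, f, w // f, f).mean(axis=(1, 3))
        coarse = backbone(small)
        coarse = coarse.repeat(f, axis=0).repeat(f, axis=1)  # nearest upsample
    return refiner(frame, coarse)
```

The `None` default is the important design point: leaving the flag unset takes exactly the old single-resolution path.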
1.53s median (+5.3x), 8.18 GB mem (-74.6%). Quality lossy
w/o retrain — viable as fast preview mode.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
512x512 tiles, 64px overlap, CPU accumulator.
Per-tile GPU cache flush (MPS/CUDA-aware).
Tent weight normalization prevents seam artifacts.
Default enabled in CorridorKeyEngine (tile_size=512).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
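The tiled refiner with tent-weight normalization can be sketched in NumPy. Each tile's result is weighted by a 2-D triangular window, accumulated, and divided by the summed weights, which is what prevents visible seams at tile boundaries (`tiled_refine` and `tent_weights` are illustrative names, not the PR's actual code; assumes tile fits inside the image and exceeds the overlap):

```python
import numpy as np

def tent_weights(size: int) -> np.ndarray:
    # Triangular ("tent") weights: ramp up to the tile centre, then back down.
    ramp = np.minimum(np.arange(size) + 1, size - np.arange(size))
    return ramp / ramp.max()

def tile_starts(total: int, tile: int, stride: int) -> list[int]:
    s = list(range(0, total - tile + 1, stride))
    if s[-1] + tile < total:        # ensure the final tile reaches the edge
        s.append(total - tile)
    return s

def tiled_refine(image: np.ndarray, process, tile: int = 512, overlap: int = 96) -> np.ndarray:
    """Run `process` on overlapping tiles of a (H, W) array, tent-blended."""
    h, w = image.shape
    stride = tile - overlap
    acc = np.zeros((h, w))   # weighted sum of processed tiles
    norm = np.zeros((h, w))  # sum of weights, for normalization
    for y in tile_starts(h, tile, stride):
        for x in tile_starts(w, tile, stride):
            out = process(image[y:y + tile, x:x + tile])
            wgt = tent_weights(out.shape[0])[:, None] * tent_weights(out.shape[1])[None, :]
            acc[y:y + tile, x:x + tile] += out * wgt
            norm[y:y + tile, x:x + tile] += wgt
    return acc / norm
```

Because the weights are strictly positive and normalized per pixel, an identity `process` reconstructs the input exactly; any residual error therefore comes from the refiner seeing limited context per tile, not from the blending itself.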
Mix-and-match flags for quality/perf tradeoffs.
Includes wizard presets and benchmark matrix script.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
64px→96px overlap: ~12% better MAE at zero cost.
Errors at subject edges not tile seams (grid analysis).
15.36 GB vs 32.23 GB baseline (-52.3%).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Wire --fp16, --gpu-postprocess, --backbone-size, --refiner-tile-size,
--refiner-tile-overlap through CLI/wizard/backend to engine. Add wizard
preset menu (Quality/Fast Preview/Low VRAM/Legacy/Custom). Add
benchmarks/bench_matrix.py for cross-preset comparison.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
cmoyates and others added 5 commits March 8, 2026 18:02
…remove per-tile empty_cache

- Extract add_optimization_args() to eliminate duplicated CLI flags
- Fix _clamp() ignoring its min parameter
- Remove expensive per-tile torch.cuda.empty_cache() calls
- Fix lossy threshold max_abs_err 0.02→0.06 (was stricter than fp16)
- Remove hardcoded absolute path from .claude/settings.json
- Document bicubic vs Lanczos4 tradeoff in GPU postprocess
- Replace vague humility clamp comment with rationale

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Runs each benchmark preset on a single frame, saves all output passes as PNGs for manual side-by-side review.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
.half() + autocast + FP32 input caused type conflicts that hung on
Windows 4090. Autocast alone handles FP16 correctly. Also make
autocast conditional on fp16 flag and add pre-warmup progress print.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
PXR24 deadlocks on some Windows CUDA setups. ZIP is universally
compatible with negligible size difference (~10KB/frame).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Upstream pins torch to cu126 index which has no macOS wheels.
Add sys_platform markers so Linux/Windows use CUDA index, macOS
falls back to PyPI for MPS-compatible builds.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
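The marker-based split could look roughly like this in `pyproject.toml` (index name and exact table layout are assumptions; uv's `tool.uv.sources` supports per-entry `marker` keys, and macOS falls through to PyPI because no source entry matches it):

```toml
[[tool.uv.index]]
name = "pytorch-cu126"
url = "https://download.pytorch.org/whl/cu126"
explicit = true

[tool.uv.sources]
torch = [
  # Only Linux and Windows pull from the CUDA index; macOS resolves from
  # PyPI, whose wheels include MPS support.
  { index = "pytorch-cu126", marker = "sys_platform == 'linux' or sys_platform == 'win32'" },
]
```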
@cmoyates cmoyates force-pushed the feature/misc-optimizations branch from 58978c1 to cc7a92c Compare March 8, 2026 20:32
@cmoyates cmoyates changed the title VRAM & performance optimizations (5 phases) perf: VRAM & inference optimizations with CLI feature flags Mar 8, 2026
@nikopueringer
Owner

Going through PRs, just wanted to let you know that I'm still aware this is open. There have been a few other optimizations that I'm going through. Let me know if there's an update here I should merge.

@cmoyates
Contributor Author

cmoyates commented Mar 9, 2026

I'm 99% sure the stuff the other guys have been doing includes all of this, but I'll ask one of them to take a look once things start to slow down.

@HYP3R00T
Contributor

Who will go through 3000+ lines to review this PR? Using AI to write code is fine, but the PR needs to be of a reasonable size so that others can review it. And using AI to review PRs sets a bad precedent. Perhaps ask Claude (or whichever AI you are using) to be smarter and suggest simple and effective changes. And if you still need all of this, then split it into multiple PRs so that we can review each one independently.

@tonioss22

> Who will go through 3000+ lines to review this PR? Using AI to write code is fine, but the PR needs to be of a reasonable size so that others can review it. And using AI to review PRs sets a bad precedent. Perhaps ask Claude (or whichever AI you are using) to be smarter and suggest simple and effective changes. And if you still need all of this, then split it into multiple PRs so that we can review each one independently.

Have you taken a look at the PR? 3000 lines of change in a PR can be a reasonable amount, and it seems like you did not even bother to look into what's in the PR before writing this comment. I assume you saw that this user wrote that they were using an LLM to help them out, and you assumed it was a bad PR because of it.

After skimming through it briefly you can quickly see that there are not in fact 3000 lines of code, since a lot of the new lines are documentation; only about 1500 lines of actual code change remain, which I see quite often in PRs that used no AI. This is not my expertise so I will not be reviewing it, but this comment seems to be bad-faith AI hate.
