
perf: implement multithreaded I/O optimization for high-core-count CPUs#124

Open
SpaceMarty wants to merge 1 commit into nikopueringer:main from SpaceMarty:feat/multithreaded-io-optimization

Conversation

@SpaceMarty

Refactor optimization logic into a reusable RuntimeThreadPool context manager in device_utils.py.
Applied this refactor to clip_manager.py and backend/service.py to improve inference throughput on systems like Threadripper by overlapping I/O with GPU inference.

@nikopueringer
Owner

This seems like it would be super useful. Looks like some tests failed; if the tests themselves have issues, feel free to let me know.

@HYP3R00T
Contributor

@SpaceMarty, I tried executing the CLI for this PR locally and ran into an error. Can you please look into it?

uv run corridorkey wizard .\ClipsForInference\sample1\
[05:00:32] INFO     Auto-selected device: cuda                                                                                                                                  device_utils.py:22
           INFO     Using device: cuda                                                                                                                                      corridorkey_cli.py:216
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ CORRIDOR KEY — SMART WIZARD                                                                                                                                                                    │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
Windows Path: .\ClipsForInference\sample1\
Running locally: .\ClipsForInference\sample1[/bold]

Found 1 potential clip folders.
                Status Report                 
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━┓
┃ Category                   ┃ Count ┃ Clips ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━┩
│ Ready (AlphaHint)          │     1 │ —     │
├────────────────────────────┼───────┼───────┤
│ Masked (VideoMaMaMaskHint) │     0 │ —     │
├────────────────────────────┼───────┼───────┤
│ Raw (Input only)           │     0 │ —     │
└────────────────────────────┴───────┴───────┘
╭─────────────────────────────────────────────────────────────────────────────────────────── Actions ────────────────────────────────────────────────────────────────────────────────────────────╮
│ i — Run Inference (1 ready clips)                                                                                                                                                              │
│ r — Re-scan folders                                                                                                                                                                            │
│ q — Quit                                                                                                                                                                                       │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
Select action [v/g/i/r/q] (q): i
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Corridor Key Inference                                                                                                                                                                         │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Inference Settings                                                                                                                                                                             │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
Input colorspace [linear/srgb] (srgb):
Despill strength (0–10, 10 = max despill) (5):
Enable auto-despeckle (removes tracking dots)? [y/n] (y):
Despeckle size (min pixels for a spot) (400):
Refiner strength multiplier (experimental) (1.0):
[05:00:37] INFO     Found 1 clips ready for inference.                                                                                                                         clip_manager.py:549
[05:00:39] INFO     Not Apple Silicon — using torch backend                                                                                                                          backend.py:53
           INFO     Torch engine loaded: CorridorKey.pth (device=cuda)                                                                                                              backend.py:238
           INFO     Loading CorridorKey from D:\CorridorKey\CorridorKeyModule\checkpoints\CorridorKey.pth                                                                   inference_engine.py:52
           INFO     Initializing hiera_base_plus_224.mae_in1k_ft_in1k (img_size=2048)                                                                                     model_transformer.py:162
[05:00:40] INFO     Skipped downloading base weights (relying on custom checkpoint)                                                                                       model_transformer.py:167
           INFO     Patched input layer: 3 → 4 channels (extra initialized to 0)                                                                                          model_transformer.py:243
           INFO     Feature channels: [112, 224, 448, 896]                                                                                                                model_transformer.py:180
[05:00:41] INFO     Running Inference on:                                                                                                                                      clip_manager.py:568
           INFO       Input frames: 1, Alpha frames: 1 -> Processing 1 frames                                                                                                  clip_manager.py:583
⠋  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/1 0:00:00libpng warning: iCCP: profile 'ICC Profile': 'RGB ': RGB color space not permitted on grayscale PNG
           ERROR    Frame processing failed: 'CorridorKeyEngine' object has no attribute 'preprocess'                                                                          clip_manager.py:753
           INFO     Clip  Complete.                                                                                                                                            clip_manager.py:759
⠋  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/1 0:00:00
Inference batch complete. Press Enter to re-scan:

@shezmic
Contributor

shezmic commented Mar 14, 2026

Code Review — Community Contribution

Hey @SpaceMarty, thanks for tackling I/O parallelisation — this is a real bottleneck on high-core-count systems. The RuntimeThreadPool context manager is a clean abstraction. However, I found several issues that would cause incorrect results or runtime failures on the current codebase.


1. engine.preprocess() / engine.infer() / engine.postprocess() don't exist

Both clip_manager.py and backend/service.py diffs call:

inp_t, h, w = engine.preprocess(img_srgb, mask_linear, input_is_linear=input_is_linear)
pred_alpha, pred_fg = engine.infer(inp_t, refiner_scale=refiner_scale)
res = engine.postprocess(pred_alpha, pred_fg, h, w, ...)

But the current CorridorKeyEngine only exposes process_frame() as a single method. These three methods don't exist in inference_engine.py. This would fail immediately with AttributeError at runtime. The split is a good idea for enabling GPU/CPU overlap, but the engine API would need to be refactored first — and inference_engine.py is listed as an absolute no-go zone for contributors since the resize/normalise/tensor handling is calibrated to the model's training distribution.
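For what it's worth, the I/O/compute overlap can be prototyped without splitting the engine API at all, by prefetching frames on a reader thread into a bounded queue while the main thread keeps calling the existing single-method engine. This is a hedged sketch, not the PR's code: `read_frame` and `process_frame` are hypothetical stand-ins, not functions from this codebase.

```python
import queue
import threading

# Sketch (assumed names): overlap frame I/O with compute via a bounded
# prefetch queue, keeping the existing single-method engine API intact.
def overlapped_inference(read_frame, process_frame, num_frames, depth=4):
    frames = queue.Queue(maxsize=depth)  # bounds RAM held by decoded frames

    def reader():
        # All reads stay on one thread, so sequential VideoCapture.read()
        # semantics are preserved.
        for i in range(num_frames):
            frames.put((i, read_frame(i)))
        frames.put(None)  # sentinel: stream finished

    threading.Thread(target=reader, daemon=True).start()
    results = []
    while True:
        item = frames.get()
        if item is None:
            break
        i, frame = item
        results.append(process_frame(i, frame))  # compute overlaps the next read
    return results
```

The bounded queue gives back-pressure: the reader can only run `depth` frames ahead of inference, so memory stays flat on long clips.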

2. as_completed breaks frame ordering

In clip_manager.py, frames are submitted as futures and processed via as_completed:

futures = [executor.submit(_process_single_frame, i) for i in range(num_frames)]
for fut in as_completed(futures):
    ...

as_completed returns futures in the order they finish, not the order they were submitted. This means:

  • on_frame_complete(i, num_frames) fires out of order — progress bars would jump erratically
  • More critically: VideoCapture.read() is sequential. Even though is_video=True limits to 1 thread, all range(num_frames) futures are submitted at once but the closure captures input_cap/alpha_cap which advance sequentially. The single-thread executor processes them in FIFO order, which happens to work — but this is fragile and non-obvious.
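One way to fix the ordering issue sketched above (assumed names, not the PR's code): drop as_completed and consume the futures list in submission order, so `on_frame_complete` fires 0, 1, 2, ... while later frames still execute concurrently in the pool.

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch: iterate futures in submission order instead of as_completed,
# so results and progress callbacks stay in frame order while later
# frames keep running concurrently in the background.
def run_ordered(worker, num_frames, on_frame_complete, max_workers=4):
    results = [None] * num_frames
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(worker, i) for i in range(num_frames)]
        for i, fut in enumerate(futures):
            results[i] = fut.result()  # blocks until frame i is done
            on_frame_complete(i, num_frames)
    return results
```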

3. break → return False changes error recovery semantics

The original code uses break when a video read fails (cap.read() returns False), which terminates the frame loop for that clip and moves to cleanup. The refactored version uses return False from the thread worker, but the main loop just catches it with:

for fut in as_completed(futures):
    try:
        fut.result()
    except Exception as e:
        logger.error(f"Frame processing failed: {e}")

This silently logs and continues, meaning subsequent frames are still attempted even after a video read failure. The original break semantics intended that a failed video read means the stream is exhausted — continuing is wrong.
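A sketch of how the break semantics could be preserved in a threaded version (the `StreamExhausted` marker exception is my invention, not part of the codebase): consume futures in submission order, and on the first exhausted read cancel anything not yet started and stop, rather than logging and continuing.

```python
from concurrent.futures import ThreadPoolExecutor

class StreamExhausted(Exception):
    """Hypothetical marker a worker raises when cap.read() returns False."""

def run_until_exhausted(worker, num_frames):
    # Consume in submission order; stop at the first exhausted read,
    # mirroring the original `break` instead of swallowing the failure.
    processed = 0
    with ThreadPoolExecutor(max_workers=1) as executor:
        futures = [executor.submit(worker, i) for i in range(num_frames)]
        for fut in futures:
            try:
                fut.result()
            except StreamExhausted:
                for pending in futures:
                    pending.cancel()  # no-op for finished/running futures
                break
            processed += 1
    return processed
```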

4. Missing despill_strength in clip_manager.py postprocess call

In clip_manager.py, despill_strength is unpacked from settings:

despill_strength = settings.despill_strength

But it's never passed to engine.postprocess():

res = engine.postprocess(
    pred_alpha, pred_fg, h, w,
    fg_is_straight=USE_STRAIGHT_MODEL,
    auto_despeckle=settings.auto_despeckle,
    despeckle_size=settings.despeckle_size,
    refiner_scale=settings.refiner_scale,   # ← despill_strength missing
)

This would mean despill is silently disabled for all frames processed through clip_manager.

5. Raw print() left in worker

def _process_single_frame(i: int):
    if i % 10 == 0:
        print(f"  Frame {i}/{num_frames}...", end="\r")

Should use logger.info() — the project has been moving from print() to logging (PR #154 specifically addresses this in clip_manager).

6. cv2.setNumThreads(0) concern

Setting OpenCV threads to 0 means "let OpenCV decide" on some builds, not "disable threading". The intent is to constrain, but the actual behaviour is platform-dependent. cv2.setNumThreads(1) would be a safer choice for the intended effect.


Recommendation

The core idea — overlapping I/O with GPU compute — is sound and would be valuable. But it requires:

  1. The inference engine to expose separate preprocess/infer/postprocess methods (which is an engine refactor outside contributor scope)
  2. Careful ordering guarantees to preserve frame sequence
  3. Proper error propagation to maintain break semantics

It might be worth discussing with the maintainer whether the engine API split is on the roadmap, and building this on top of that change rather than against the current API.

