
perf: implement multithreaded I/O optimization for high-core-count CPUs#124

Open
SpaceMarty wants to merge 1 commit into nikopueringer:main from SpaceMarty:feat/multithreaded-io-optimization

Conversation

@SpaceMarty

Refactor optimization logic into a reusable RuntimeThreadPool context manager in device_utils.py.
Applied this refactor to clip_manager.py and backend/service.py to improve inference throughput on systems like Threadripper by overlapping I/O with GPU inference.

@nikopueringer
Owner

This seems like it would be super useful. Looks like some tests failed; if the tests themselves have issues, feel free to let me know.

@HYP3R00T
Contributor

@SpaceMarty, I tried executing the CLI for this PR locally and ran into an error. Can you please look into it?

uv run corridorkey wizard .\ClipsForInference\sample1\
[05:00:32] INFO     Auto-selected device: cuda                                                                                                                                  device_utils.py:22
           INFO     Using device: cuda                                                                                                                                      corridorkey_cli.py:216
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ CORRIDOR KEY — SMART WIZARD                                                                                                                                                                    │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
Windows Path: .\ClipsForInference\sample1\
Running locally: .\ClipsForInference\sample1[/bold]

Found 1 potential clip folders.
                Status Report                 
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━┓
┃ Category                   ┃ Count ┃ Clips ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━┩
│ Ready (AlphaHint)          │     1 │ —     │
├────────────────────────────┼───────┼───────┤
│ Masked (VideoMaMaMaskHint) │     0 │ —     │
├────────────────────────────┼───────┼───────┤
│ Raw (Input only)           │     0 │ —     │
└────────────────────────────┴───────┴───────┘
╭─────────────────────────────────────────────────────────────────────────────────────────── Actions ────────────────────────────────────────────────────────────────────────────────────────────╮
│ i — Run Inference (1 ready clips)                                                                                                                                                              │
│ r — Re-scan folders                                                                                                                                                                            │
│ q — Quit                                                                                                                                                                                       │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
Select action [v/g/i/r/q] (q): i
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Corridor Key Inference                                                                                                                                                                         │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Inference Settings                                                                                                                                                                             │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
Input colorspace [linear/srgb] (srgb):
Despill strength (0–10, 10 = max despill) (5):
Enable auto-despeckle (removes tracking dots)? [y/n] (y):
Despeckle size (min pixels for a spot) (400):
Refiner strength multiplier (experimental) (1.0):
[05:00:37] INFO     Found 1 clips ready for inference.                                                                                                                         clip_manager.py:549
[05:00:39] INFO     Not Apple Silicon — using torch backend                                                                                                                          backend.py:53
           INFO     Torch engine loaded: CorridorKey.pth (device=cuda)                                                                                                              backend.py:238
           INFO     Loading CorridorKey from D:\CorridorKey\CorridorKeyModule\checkpoints\CorridorKey.pth                                                                   inference_engine.py:52
           INFO     Initializing hiera_base_plus_224.mae_in1k_ft_in1k (img_size=2048)                                                                                     model_transformer.py:162
[05:00:40] INFO     Skipped downloading base weights (relying on custom checkpoint)                                                                                       model_transformer.py:167
           INFO     Patched input layer: 3 → 4 channels (extra initialized to 0)                                                                                          model_transformer.py:243
           INFO     Feature channels: [112, 224, 448, 896]                                                                                                                model_transformer.py:180
[05:00:41] INFO     Running Inference on:                                                                                                                                      clip_manager.py:568
           INFO       Input frames: 1, Alpha frames: 1 -> Processing 1 frames                                                                                                  clip_manager.py:583
⠋  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/1 0:00:00libpng warning: iCCP: profile 'ICC Profile': 'RGB ': RGB color space not permitted on grayscale PNG
           ERROR    Frame processing failed: 'CorridorKeyEngine' object has no attribute 'preprocess'                                                                          clip_manager.py:753
           INFO     Clip  Complete.                                                                                                                                            clip_manager.py:759
⠋  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/1 0:00:00
Inference batch complete. Press Enter to re-scan:

@shezmic
Contributor

shezmic commented Mar 14, 2026

Code Review — Community Contribution

Hey @SpaceMarty, thanks for tackling I/O parallelisation — this is a real bottleneck on high-core-count systems. The RuntimeThreadPool context manager is a clean abstraction. However, I found several issues that would cause incorrect results or runtime failures on the current codebase.


1. engine.preprocess() / engine.infer() / engine.postprocess() don't exist

Both clip_manager.py and backend/service.py diffs call:

inp_t, h, w = engine.preprocess(img_srgb, mask_linear, input_is_linear=input_is_linear)
pred_alpha, pred_fg = engine.infer(inp_t, refiner_scale=refiner_scale)
res = engine.postprocess(pred_alpha, pred_fg, h, w, ...)

But the current CorridorKeyEngine only exposes process_frame() as a single method. These three methods don't exist in inference_engine.py. This would fail immediately with AttributeError at runtime. The split is a good idea for enabling GPU/CPU overlap, but the engine API would need to be refactored first — and inference_engine.py is listed as an absolute no-go zone for contributors since the resize/normalise/tensor handling is calibrated to the model's training distribution.
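For what it's worth, the I/O/compute overlap can be prototyped without splitting the engine API at all, by prefetching frames on a reader thread into a bounded queue while the main thread keeps calling the existing single-method engine. This is a hedged sketch, not the PR's code: `read_frame` and `process_frame` are hypothetical stand-ins, not functions from this codebase.

```python
import queue
import threading

# Sketch (assumed names): overlap frame I/O with compute via a bounded
# prefetch queue, keeping the existing single-method engine API intact.
def overlapped_inference(read_frame, process_frame, num_frames, depth=4):
    frames = queue.Queue(maxsize=depth)  # bounds RAM held by decoded frames

    def reader():
        # All reads stay on one thread, so sequential VideoCapture.read()
        # semantics are preserved.
        for i in range(num_frames):
            frames.put((i, read_frame(i)))
        frames.put(None)  # sentinel: stream finished

    threading.Thread(target=reader, daemon=True).start()
    results = []
    while True:
        item = frames.get()
        if item is None:
            break
        i, frame = item
        results.append(process_frame(i, frame))  # compute overlaps the next read
    return results
```

The bounded queue gives back-pressure: the reader can only run `depth` frames ahead of inference, so memory stays flat on long clips.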

2. as_completed breaks frame ordering

In clip_manager.py, frames are submitted as futures and processed via as_completed:

futures = [executor.submit(_process_single_frame, i) for i in range(num_frames)]
for fut in as_completed(futures):
    ...

as_completed returns futures in the order they finish, not the order they were submitted. This means:

  • on_frame_complete(i, num_frames) fires out of order — progress bars would jump erratically
  • More critically: VideoCapture.read() is sequential. Even though is_video=True limits to 1 thread, all range(num_frames) futures are submitted at once but the closure captures input_cap/alpha_cap which advance sequentially. The single-thread executor processes them in FIFO order, which happens to work — but this is fragile and non-obvious.
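One way to fix the ordering issue sketched above (assumed names, not the PR's code): drop as_completed and consume the futures list in submission order, so `on_frame_complete` fires 0, 1, 2, ... while later frames still execute concurrently in the pool.

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch: iterate futures in submission order instead of as_completed,
# so results and progress callbacks stay in frame order while later
# frames keep running concurrently in the background.
def run_ordered(worker, num_frames, on_frame_complete, max_workers=4):
    results = [None] * num_frames
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(worker, i) for i in range(num_frames)]
        for i, fut in enumerate(futures):
            results[i] = fut.result()  # blocks until frame i is done
            on_frame_complete(i, num_frames)
    return results
```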

3. break → return False changes error recovery semantics

The original code uses break when a video read fails (cap.read() returns False), which terminates the frame loop for that clip and moves to cleanup. The refactored version uses return False from the thread worker, but the main loop just catches it with:

for fut in as_completed(futures):
    try:
        fut.result()
    except Exception as e:
        logger.error(f"Frame processing failed: {e}")

This silently logs and continues, meaning subsequent frames are still attempted even after a video read failure. The original break semantics intended that a failed video read means the stream is exhausted — continuing is wrong.
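A sketch of how the break semantics could be preserved in a threaded version (the `StreamExhausted` marker exception is my invention, not part of the codebase): consume futures in submission order, and on the first exhausted read cancel anything not yet started and stop, rather than logging and continuing.

```python
from concurrent.futures import ThreadPoolExecutor

class StreamExhausted(Exception):
    """Hypothetical marker a worker raises when cap.read() returns False."""

def run_until_exhausted(worker, num_frames):
    # Consume in submission order; stop at the first exhausted read,
    # mirroring the original `break` instead of swallowing the failure.
    processed = 0
    with ThreadPoolExecutor(max_workers=1) as executor:
        futures = [executor.submit(worker, i) for i in range(num_frames)]
        for fut in futures:
            try:
                fut.result()
            except StreamExhausted:
                for pending in futures:
                    pending.cancel()  # no-op for finished/running futures
                break
            processed += 1
    return processed
```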

4. Missing despill_strength in clip_manager.py postprocess call

In clip_manager.py, despill_strength is unpacked from settings:

despill_strength = settings.despill_strength

But it's never passed to engine.postprocess():

res = engine.postprocess(
    pred_alpha, pred_fg, h, w,
    fg_is_straight=USE_STRAIGHT_MODEL,
    auto_despeckle=settings.auto_despeckle,
    despeckle_size=settings.despeckle_size,
    refiner_scale=settings.refiner_scale,   # ← despill_strength missing
)

This would mean despill is silently disabled for all frames processed through clip_manager.

5. Raw print() left in worker

def _process_single_frame(i: int):
    if i % 10 == 0:
        print(f"  Frame {i}/{num_frames}...", end="\r")

Should use logger.info() — the project has been moving from print() to logging (PR #154 specifically addresses this in clip_manager).

6. cv2.setNumThreads(0) concern

Setting OpenCV threads to 0 means "let OpenCV decide" on some builds, not "disable threading". The intent is to constrain, but the actual behaviour is platform-dependent. cv2.setNumThreads(1) would be a safer choice for the intended effect.


Recommendation

The core idea — overlapping I/O with GPU compute — is sound and would be valuable. But it requires:

  1. The inference engine to expose separate preprocess/infer/postprocess methods (which is an engine refactor outside contributor scope)
  2. Careful ordering guarantees to preserve frame sequence
  3. Proper error propagation to maintain break semantics

It might be worth discussing with the maintainer whether the engine API split is on the roadmap, and building this on top of that change rather than against the current API.

