Skip to content

Reader fails on non-default CUDA device under multi-rank training (DataParallel pinned to cuda:0) #1456

@eitanporat

Description

@eitanporat

Summary

Reader(gpu='cuda:N') (or any non-default CUDA device) crashes whenever the host process is not using cuda:0. The recognizer is unconditionally wrapped in torch.nn.DataParallel(model) without device_ids, so DataParallel falls back to all visible GPUs and pins the primary to cuda:0. On a multi-rank training job (DDP/FSDP), every rank with local_rank != 0 errors out.

This is the same root cause as #295 (closed in 2022 without fix). It is still present in the current master.

Reproduction

import torch
import easyocr

# Pretend we're rank 1 in a multi-rank job — current device is cuda:1.
torch.cuda.set_device(1)

reader = easyocr.Reader(['en'], gpu='cuda:1')   # also fails with gpu=True
img = '<any image>'
reader.readtext(img)

Crash:

RuntimeError: module must have its parameters and buffers on device cuda:0
(device_ids[0]) but found one of them on device: cuda:1

We hit this on a 4×GB200 node running an FSDP fine-tune; ranks 1/2/3 all errored on every frame.

Root cause

In easyocr/recognition.py:

model = torch.nn.DataParallel(model).to(device)

(also easyocr/detection.py per #295)

Without device_ids=[idx], DataParallel picks range(torch.cuda.device_count()) and treats device_ids[0] (cuda:0) as authoritative. The subsequent .to(device) doesn't override DataParallel's device_ids.

Reader.__init__ already supports gpu='cuda:N' (it goes straight into self.device), but the DP wrap nullifies it.

Suggested fix

Either:

  1. Drop the DataParallel wrap entirely (the model is already moved to the right device with .to(device); DP makes no sense for inference and conflicts with parent-process distributed training):

    model = model.to(device)
  2. Honor the requested device when DP is kept:

    device_ids = [torch.device(device).index] if 'cuda' in str(device) else None
    model = torch.nn.DataParallel(model, device_ids=device_ids).to(device)

Option 1 is preferable — DataParallel is deprecated by PyTorch in favor of DDP, and inference doesn't benefit from it.

Workaround (current)

Stub nn.DataParallel to a passthrough during Reader init so the model loads straight onto the rank-local device:

import torch, easyocr
orig_dp = torch.nn.DataParallel
torch.nn.DataParallel = lambda m, *a, **kw: m
try:
    reader = easyocr.Reader(['en'], gpu=f'cuda:{torch.cuda.current_device()}')
finally:
    torch.nn.DataParallel = orig_dp

Happy to send a PR if a maintainer confirms which option is preferred.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions