fix: guard file unlink in audio extraction to prevent crash in library mode by enwaiax · Pull Request #1637 · NVIDIA/NeMo-Retriever

enwaiax · 2026-03-16T15:33:46Z

Summary

Fix OSError: [Errno 36] File name too long crash in _extract_from_audio() when processing audio files in library mode.

Problem

PR #1119 (commit f511b315, 2025-11-13, "Dataloader video ingest pipeline support") introduced a regression that breaks all audio extraction in library mode.

The root cause is in api/src/nv_ingest_api/internal/extract/audio/audio_extraction.py lines 61-80. PR #1119 added logic to support content being either:

Dataloader/V2 API path: content = base64(file_path_string) → decode path → read file → delete temp file
Library mode (SimpleBroker) path: content = base64(audio_binary) → use directly

However, Path(base64_file_path).unlink(missing_ok=True) was placed unconditionally outside both branches. When the except branch catches UnicodeDecodeError (audio binary cannot be decoded as UTF-8), base64_file_path still holds the original base64 string (~2MB for a 1.5MB WAV file), so unlink() is called with that string as a filename. missing_ok=True only suppresses FileNotFoundError, so the call raises OSError: [Errno 36] File name too long.
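
The failure mode can be reproduced in isolation. The following is a minimal sketch, not the code from audio_extraction.py: the helper names decodes_as_utf8 and unlink_errno are hypothetical, the "audio" is synthetic bytes chosen so they are not valid UTF-8, and the errno check assumes a POSIX filesystem such as Linux.

```python
import base64
import errno
from pathlib import Path

# Synthetic stand-in for a 1.5MB WAV file (assumption: real audio would differ).
# 0x80 is a UTF-8 continuation byte, so it is invalid at the start of a character.
fake_audio = b"RIFF" + b"\x80" * (1_500_000 - 4)
base64_audio = base64.b64encode(fake_audio).decode("ascii")  # 2_000_000 characters

# What the Dataloader/V2 API path sends instead: a base64-encoded file path.
path_payload = base64.b64encode(b"/tmp/chunk_0001.mp3").decode("ascii")


def decodes_as_utf8(b64_content: str) -> bool:
    """Return True if the base64 payload decodes to valid UTF-8 text."""
    try:
        base64.b64decode(b64_content).decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def unlink_errno(candidate: str):
    """Try to unlink `candidate` as a path; return the errno on failure, else None."""
    try:
        # missing_ok=True only suppresses FileNotFoundError, not other OSErrors.
        Path(candidate).unlink(missing_ok=True)
    except OSError as exc:
        return exc.errno
    return None


binary_is_text = decodes_as_utf8(base64_audio)   # audio bytes: not UTF-8
path_is_text = decodes_as_utf8(path_payload)     # file path: valid UTF-8
long_name_errno = unlink_errno(base64_audio)     # base64 string used as a filename
```

On Linux, errno.ENAMETOOLONG is 36, which matches the error reported in this PR. The base64 string is four thirds the size of the input, which is where the ~2MB figure for a 1.5MB file comes from.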

Why only library mode is affected

| Mode | Client | Data path | Content value | Result |
| --- | --- | --- | --- | --- |
| K8s / Docker Compose | `RestClient` (default) | V2 API intercepts WAV/MP3 in `v2/ingest.py:963-1000`, writes to temp file, passes file path as content | `base64("/tmp/chunk_0001.mp3")` | Decodes as UTF-8 ✅, file exists ✅, unlink works ✅ |
| Library mode | `SimpleClient` | Bypasses V2 API, raw base64 binary goes directly into pipeline | `base64(<1.5MB WAV binary>)` | UTF-8 decode fails → `except: pass` → unlink on 2MB string → OSError ❌ |

Fix

Replace the unconditional unlink with a source_file_path sentinel variable that is only set when content is actually resolved from an on-disk file:

# Before (buggy):
base64_file_path = base64_audio
# ... try/except ...
Path(base64_file_path).unlink(missing_ok=True)  # always runs, crashes on long base64

# After (fixed):
source_file_path = None
# ... only set when Path(decoded_path).exists() ...
if source_file_path is not None:
    Path(source_file_path).unlink(missing_ok=True)  # only runs for real file paths
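
For readers who want to see the sentinel pattern end to end, here is a self-contained sketch under stated assumptions. The function name load_audio_base64 is hypothetical and is not the signature used in audio_extraction.py. The sketch also uses is_file() rather than exists() so that directories are rejected, which is a choice made here and not something this PR states.

```python
import base64
import tempfile
from pathlib import Path


def load_audio_base64(content: str) -> str:
    """Hypothetical helper: return base64 audio for either content shape."""
    base64_audio = content
    source_file_path = None  # sentinel: stays None unless a real file was read

    try:
        decoded = base64.b64decode(content).decode("utf-8")
        candidate = Path(decoded)
        # is_file() is this sketch's choice; it also rejects directories.
        if candidate.is_file():
            base64_audio = base64.b64encode(candidate.read_bytes()).decode("ascii")
            source_file_path = candidate
    except (UnicodeDecodeError, ValueError, OSError):
        # Not a usable path: treat content as base64 audio, leave the sentinel unset.
        pass

    # Cleanup runs only when the content was actually resolved from an on-disk file.
    if source_file_path is not None:
        source_file_path.unlink(missing_ok=True)
    return base64_audio


# Dataloader-style content: base64 of a temp file path.
with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
    tmp.write(b"RIFF\x80\x80demo")
temp_path = Path(tmp.name)
path_content = base64.b64encode(str(temp_path).encode("utf-8")).decode("ascii")
from_path = load_audio_base64(path_content)

# Library-mode-style content: base64 of the audio bytes themselves.
binary_content = base64.b64encode(b"RIFF" + b"\x80" * 300_000).decode("ascii")
from_binary = load_audio_base64(binary_content)

# Decodable but nonexistent path: passed through unchanged, nothing is unlinked.
missing_content = base64.b64encode(b"/no/such/dir/file.wav").decode("ascii")
from_missing = load_audio_base64(missing_content)
```

Catching OSError around the path check is deliberate: on some Python versions a very long decoded string can make the filesystem probe itself raise, and the sketch treats that the same as "not a file path".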

Reproduction

Note: extract_method="audio" must be specified explicitly. Without it, the client defaults to extract_method="pdfium" for WAV files (interface.py:1130-1138), and the audio extractor stage is never invoked. This matches the audio.md documentation.

import os, time
os.environ["NVIDIA_API_KEY"] = "nvapi-..."

from nv_ingest.framework.orchestration.ray.util.pipeline.pipeline_runners import run_pipeline
from nv_ingest_api.util.message_brokers.simple_message_broker import SimpleClient
from nv_ingest_client.client import Ingestor, NvIngestClient

pipeline = run_pipeline(block=False, disable_dynamic_scaling=True, run_in_subprocess=True)
time.sleep(20)

client = NvIngestClient(
    message_client_allocator=SimpleClient,
    message_client_port=7671,
    message_client_hostname="localhost",
)
ingestor = (
    Ingestor(client=client)
    .files("data/multimodal_test.wav")
    .extract(document_type="wav", extract_method="audio")
)
results, failures = ingestor.ingest(return_failures=True)
# Before fix: failures[0] → OSError: [Errno 36] File name too long
# After fix:  pipeline reaches RIVA ASR call successfully

Test plan

10/10 existing unit tests pass (api_tests/internal/extract/audio/test_audio_extraction.py)
Library mode path — base64 audio binary: infer() called with correct data, no OSError
K8s/Dataloader path — base64 file path: temp file read → infer() called → temp file deleted
Edge cases: empty content, None content, non-AUDIO type, decodable-but-nonexistent path
End-to-end library mode with real 1.5MB WAV (data/multimodal_test.wav): pipeline reaches RIVA ASR

Fixes: NVBug 5984261

Description

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.
If adjusting docker-compose.yaml environment variables, have you ensured those are mimicked in the Helm values.yaml file?

…y mode

In library mode, audio content arrives as base64-encoded binary data (not a file path). PR NVIDIA#1119 added file-path support for Dataloader but left Path.unlink() unconditional, causing OSError (ENAMETOOLONG) when the base64 string (~2MB) is treated as a filename.

Use a `source_file_path` sentinel so unlink only runs when content was actually resolved from an on-disk file (Dataloader/V2 API path).

Fixes: NVBug 5984261
Made-with: Cursor

copy-pr-bot · 2026-03-16T15:33:50Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

enwaiax requested a review from a team as a code owner March 16, 2026 15:33

enwaiax requested a review from charlesbluca March 16, 2026 15:33

enwaiax changed the title ~~fix: guard file unlink in audio extraction to prevent crash in librar…~~ fix: guard file unlink in audio extraction to prevent crash in library mode Mar 16, 2026

jperez999 approved these changes Mar 16, 2026

View reviewed changes

enwaiax added 2 commits March 17, 2026 07:22

Merge branch 'main' into fix/audio-extraction-libmode-unlink-crash

7a39432

Merge branch 'main' into fix/audio-extraction-libmode-unlink-crash

47ee927

enwaiax wants to merge 3 commits into NVIDIA:main from enwaiax:fix/audio-extraction-libmode-unlink-crash