fix: guard file unlink in audio extraction to prevent crash in library mode #1637

Open
enwaiax wants to merge 3 commits into NVIDIA:main from enwaiax:fix/audio-extraction-libmode-unlink-crash

@enwaiax enwaiax commented Mar 16, 2026

Summary

Fix OSError: [Errno 36] File name too long crash in _extract_from_audio() when processing audio files in library mode.

Problem

PR #1119 (commit f511b315, 2025-11-13, "Dataloader video ingest pipeline support") introduced a regression that breaks all audio extraction in library mode.

The root cause is in api/src/nv_ingest_api/internal/extract/audio/audio_extraction.py lines 61-80. PR #1119 added logic to support content being either:

  • Dataloader/V2 API path: content = base64(file_path_string) → decode path → read file → delete temp file
  • Library mode (SimpleBroker) path: content = base64(audio_binary) → use directly

However, Path(base64_file_path).unlink(missing_ok=True) was placed unconditionally outside both branches. When the except catches UnicodeDecodeError (audio binary can't decode as UTF-8), base64_file_path still holds the original base64 string (~2MB for a 1.5MB WAV file), and unlink() triggers OSError: [Errno 36] File name too long.
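
The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not the code in audio_extraction.py: `fake_audio`, `try_decode_as_path`, and `unlink_errno` are hypothetical names introduced for illustration. It shows that binary content does not decode as UTF-8, and that `missing_ok=True` only suppresses a missing-file error, so an over-long "path" still raises.

```python
import base64
import errno
from pathlib import Path

# Stand-in for audio bytes (assumption): every byte value repeated,
# which is not valid UTF-8, just like real WAV/MP3 data in practice.
fake_audio = bytes(range(256)) * 64
content = base64.b64encode(fake_audio).decode("ascii")


def try_decode_as_path(b64_content):
    """Return the decoded path string, or None if the payload is not UTF-8 text."""
    try:
        return base64.b64decode(b64_content).decode("utf-8")
    except UnicodeDecodeError:
        return None


def unlink_errno(path_like):
    """Attempt the unlink and return the errno of any OSError, else None."""
    try:
        # missing_ok=True only swallows FileNotFoundError, not other OSErrors.
        Path(path_like).unlink(missing_ok=True)
    except OSError as exc:
        return exc.errno
    return None


decoded = try_decode_as_path(content)   # None: binary, not a path
err = unlink_errno(content)             # on Linux this is errno.ENAMETOOLONG
print(decoded, err == errno.ENAMETOOLONG)
```

The base64 string here is only about 21 KB, which already exceeds the usual Linux path limit; the 2MB string from the report fails the same way.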

Why only library mode is affected

| Mode | Client | Data path | Content value | Result |
| --- | --- | --- | --- | --- |
| K8s / Docker Compose | RestClient (default) | V2 API intercepts WAV/MP3 in v2/ingest.py:963-1000, writes to temp file, passes file path as content | base64("/tmp/chunk_0001.mp3") | Decodes as UTF-8 ✅, file exists ✅, unlink works ✅ |
| Library mode | SimpleClient | Bypasses V2 API, raw base64 binary goes directly into pipeline | base64(<1.5MB WAV binary>) | UTF-8 decode fails → except: pass → unlink on 2MB string → OSError |
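
The size figures quoted above can be checked with a short calculation. This is an illustrative sketch: `base64_length` is a hypothetical helper, and the `NAME_MAX` / `PATH_MAX` constants are the typical Linux limits, assumed here rather than read from the system.

```python
import base64
import math


def base64_length(n_bytes: int) -> int:
    """Length of standard padded base64 output for n_bytes of input."""
    return 4 * math.ceil(n_bytes / 3)


wav_bytes = 1_500_000                      # roughly the 1.5MB WAV from the report
encoded_len = base64_length(wav_bytes)

# Cross-check the formula against the real encoder on a zero-filled buffer.
assert encoded_len == len(base64.b64encode(bytes(wav_bytes)))

NAME_MAX = 255    # typical Linux limit per path component (assumption)
PATH_MAX = 4096   # typical Linux limit for a whole path (assumption)

print(encoded_len, encoded_len > PATH_MAX)
```

Base64 emits 4 characters per 3 input bytes, so the encoded content is about 2MB, several hundred times longer than any path the kernel will accept.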

Fix

Replace the unconditional unlink with a source_file_path sentinel variable that is only set when content is actually resolved from an on-disk file:

# Before (buggy):
base64_file_path = base64_audio
# ... try/except ...
Path(base64_file_path).unlink(missing_ok=True)  # always runs, crashes on long base64

# After (fixed):
source_file_path = None
# ... only set when Path(decoded_path).exists() ...
if source_file_path is not None:
    Path(source_file_path).unlink(missing_ok=True)  # only runs for real file paths
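
For readers who want to see the guard end to end, here is a self-contained sketch of the sentinel pattern. It is an assumption-laden illustration, not the actual body of _extract_from_audio(): `resolve_audio_content` and `cleanup` are hypothetical helpers, and the sketch uses `is_file()` where the PR comment mentions `exists()`.

```python
import base64
import binascii
import tempfile
from pathlib import Path
from typing import Optional, Tuple


def resolve_audio_content(content_b64: str) -> Tuple[str, Optional[Path]]:
    """Return (base64 audio, temp file to delete later or None)."""
    source_file_path: Optional[Path] = None
    audio_b64 = content_b64
    try:
        candidate = Path(base64.b64decode(content_b64).decode("utf-8"))
        if candidate.is_file():
            # Content was a file path: load the audio and remember the file.
            audio_b64 = base64.b64encode(candidate.read_bytes()).decode("ascii")
            source_file_path = candidate
    except (binascii.Error, UnicodeDecodeError, ValueError, OSError):
        # Not a usable path: treat the content as base64 audio binary.
        pass
    return audio_b64, source_file_path


def cleanup(source_file_path: Optional[Path]) -> None:
    """Delete the temp file only when one was actually resolved."""
    if source_file_path is not None:
        source_file_path.unlink(missing_ok=True)


raw = b"RIFF\x00\xff fake audio"  # stand-in bytes, not a real WAV

# File-path style content: base64 of a path to a temp file.
with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
    tmp.write(raw)
tmp_path = Path(tmp.name)
path_content = base64.b64encode(str(tmp_path).encode("utf-8")).decode("ascii")
audio, src = resolve_audio_content(path_content)
cleanup(src)

# Library-mode style content: base64 of the audio bytes themselves.
binary_content = base64.b64encode(raw).decode("ascii")
audio2, src2 = resolve_audio_content(binary_content)
cleanup(src2)  # no-op, because no file was resolved
```

Catching `OSError` around the path check is a defensive choice in this sketch: a payload that happens to decode as a very long UTF-8 string could otherwise fail inside the existence check on some Python versions.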

Reproduction

Note: extract_method="audio" must be specified explicitly. Without it, the client defaults to extract_method="pdfium" for WAV files (interface.py:1130-1138), and the audio extractor stage is never invoked. This matches the audio.md documentation.

import os
import time

# Placeholder key; substitute a real NVIDIA API key.
os.environ["NVIDIA_API_KEY"] = "nvapi-..."

from nv_ingest.framework.orchestration.ray.util.pipeline.pipeline_runners import run_pipeline
from nv_ingest_api.util.message_brokers.simple_message_broker import SimpleClient
from nv_ingest_client.client import Ingestor, NvIngestClient

# Start the library-mode pipeline in a subprocess, then give it time to come up.
pipeline = run_pipeline(block=False, disable_dynamic_scaling=True, run_in_subprocess=True)
time.sleep(20)

client = NvIngestClient(
    message_client_allocator=SimpleClient,
    message_client_port=7671,
    message_client_hostname="localhost",
)
ingestor = (
    Ingestor(client=client)
    .files("data/multimodal_test.wav")
    .extract(document_type="wav", extract_method="audio")
)
results, failures = ingestor.ingest(return_failures=True)
# Before fix: failures[0] → OSError: [Errno 36] File name too long
# After fix:  pipeline reaches RIVA ASR call successfully

Test plan

  • 10/10 existing unit tests pass (api_tests/internal/extract/audio/test_audio_extraction.py)
  • Library mode path — base64 audio binary: infer() called with correct data, no OSError
  • K8s/Dataloader path — base64 file path: temp file read → infer() called → temp file deleted
  • Edge cases: empty content, None content, non-AUDIO type, decodable-but-nonexistent path
  • End-to-end library mode with real 1.5MB WAV (data/multimodal_test.wav): pipeline reaches RIVA ASR

Fixes: NVBug 5984261

Description

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • If adjusting docker-compose.yaml environment variables, have you ensured those are mirrored in the Helm values.yaml file?

…y mode

In library mode, audio content arrives as base64-encoded binary data
(not a file path). PR NVIDIA#1119 added file-path support for Dataloader but
left Path.unlink() unconditional, causing OSError (ENAMETOOLONG) when
the base64 string (~2MB) is treated as a filename.

Use a `source_file_path` sentinel so unlink only runs when content was
actually resolved from an on-disk file (Dataloader/V2 API path).

Fixes: NVBug 5984261
Made-with: Cursor
@enwaiax enwaiax requested a review from a team as a code owner March 16, 2026 15:33
@enwaiax enwaiax requested a review from charlesbluca March 16, 2026 15:33

copy-pr-bot bot commented Mar 16, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@enwaiax enwaiax changed the title fix: guard file unlink in audio extraction to prevent crash in librar… fix: guard file unlink in audio extraction to prevent crash in library mode Mar 16, 2026