Skip to content

Add fsspec-backed remote storage support for bucket URLs#134

Merged
alexkroman merged 3 commits into
mainfrom
claude/dazzling-faraday-la380g
Jun 12, 2026
Merged

Add fsspec-backed remote storage support for bucket URLs#134
alexkroman merged 3 commits into
mainfrom
claude/dazzling-faraday-la380g

Conversation

@alexkroman

Copy link
Copy Markdown
Collaborator

Summary

Adds support for transcribing audio from fsspec-addressable remote storage (S3, Google Cloud Storage, Azure Blob Storage, SFTP, etc.). Users can now pass bucket URLs like s3://bucket/calls/*.mp3 or gs://bucket/calls/ to assembly transcribe for both single-file and batch operations.

Key Changes

  • New remotefs module (aai_cli/remotefs.py): Provides a clean abstraction over fsspec for downloading remote files and expanding globs/folder listings. Normalizes backend exceptions into user-friendly CLIError messages with install hints when optional protocol backends (s3fs, gcsfs, etc.) are missing.

  • Batch mode expansion for remote URLs (aai_cli/transcribe_batch.py):

    • expand_sources() now recognizes bucket URLs and delegates to _remote_sources()
    • Supports glob patterns (s3://bucket/*.mp3) and folder scans with trailing slash (s3://bucket/calls/)
    • Filters results by audio extensions and excludes sidecar files, mirroring local directory behavior
    • Generates deterministic sidecar paths using URL hash to avoid collisions
  • Single-file remote transcription (aai_cli/transcribe_exec.py):

    • run_transcription() downloads remote files to a temporary directory before uploading to the API
    • check_source_exists() validates remote URLs exist before transcription starts
    • Passes allow_remote=True to resolve_audio_source() to permit bucket URLs
  • Source validation updates (aai_cli/client.py):

    • resolve_audio_source() gains allow_remote parameter to opt-in to bucket URL support
    • Transcribe and stream commands have different policies: transcribe allows remotes, stream rejects them
  • Help text and examples (aai_cli/commands/transcribe.py):

    • Updated docstring to document bucket URL support for single files and batches
    • Added example: assembly transcribe "s3://bucket/calls/*.mp3"
  • Comprehensive test coverage (tests/test_remotefs.py, tests/test_transcribe_batch_sources.py, tests/test_transcribe.py, tests/test_source_validation.py, tests/test_transcribe_show_code.py):

    • Tests use fsspec's in-process memory:// filesystem to exercise real glob/find/download code paths offline
    • Covers glob expansion, folder recursion, sidecar resume, missing files, permission errors, and missing backends
    • Validates that --show-code rejects bucket URLs (generated SDK code can't fetch them)
  • Dependencies (pyproject.toml):

    • Added fsspec>=2026.4.0 as a core dependency (protocol backends like s3fs remain optional user installs)

Implementation Details

  • Remote file downloads use temporary directories with aai-remote- prefix to avoid polluting the working directory
  • Sidecar paths for remote URLs are derived from a hash of the full URL to ensure determinism across runs
  • Backend errors (auth, permissions, etc.) are normalized to single-line CLIError messages; missing backends surface fsspec's own install hints
  • The memory_fs pytest fixture resets fsspec's global state between tests to prevent cross-test pollution
  • Batch mode treats bucket URLs consistently with local paths: globs expand to matches, trailing-slash folders scan recursively for audio files

https://claude.ai/code/session_01KKPpFcWPk3tyqVRBApdgcj

A new aai_cli/remotefs.py (mirroring youtube.py's lazy-import shape) fetches
audio from any fsspec-addressable store. fsspec core ships with the CLI;
protocol backends (s3fs, gcsfs, adlfs, …) stay user-installed extras surfaced
via fsspec's own install hint as a clean CLIError.

- Single files: a bucket URL is downloaded to a temp dir and uploaded, like
  the YouTube path. resolve_audio_source gains an opt-in allow_remote flag, so
  stream/agent keep rejecting bucket URLs as missing files.
- Batch mode: a remote glob (s3://bucket/calls/*.mp3) or trailing-slash folder
  (filtered by AUDIO_EXTENSIONS, recursive) expands like its local equivalent;
  sidecar naming now treats any bucket URL like a web URL (slug + hash in cwd).
- --show-code rejects bucket URLs up front instead of emitting a script the
  SDK can't run.

Tests drive the real fsspec code paths via its in-process memory:// filesystem
(shared memory_fs fixture), so pytest-socket stays armed.

https://claude.ai/code/session_01KKPpFcWPk3tyqVRBApdgcj
@alexkroman alexkroman enabled auto-merge June 12, 2026 22:04
@alexkroman alexkroman added this pull request to the merge queue Jun 12, 2026
Merged via the queue into main with commit 5715998 Jun 12, 2026
15 checks passed
@alexkroman alexkroman deleted the claude/dazzling-faraday-la380g branch June 12, 2026 22:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants