Add HuggingFace storage bucket support #484

Merged
JoelNiklaus merged 1 commit into huggingface:main from JoelNiklaus:feat/hf-bucket-support on May 5, 2026

@JoelNiklaus (Contributor) commented Apr 29, 2026

Problem

Datatrove had no first-class support for HF storage buckets (hf://buckets/...). While HfFileSystem (fsspec) already handles bucket URLs for basic reads, the writer path had gaps: DiskWriter.close_file() would crash on retry for bucket paths (it assumed Git-repo attributes like path_in_repo), and there was no high-level writer equivalent to HuggingFaceDatasetWriter for staged uploads to buckets.

Solution

New HuggingFaceBucketWriter — a ParquetWriter subclass that stages files locally and uploads them to a bucket via batch_bucket_files() (Xet-optimized). Key features:

  • Auto-creates the bucket on first upload (create_bucket(exist_ok=True))
  • Uploads completed files immediately on rotation (max_file_size) and on close()
  • overwrite=True mode: lists and deletes existing files at the prefix before the first upload
  • No commits/revisions (buckets are mutable object storage)
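The lifecycle described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `FakeBucket` and `StagedWriter` are hypothetical stand-ins, the in-memory bucket replaces the real Hub calls (`create_bucket`, `batch_bucket_files`), and raw bytes replace Parquet so the example runs without extra dependencies.

```python
import os
import tempfile


class FakeBucket:
    """In-memory stand-in for a bucket client (hypothetical, not the Hub API)."""

    def __init__(self):
        self.files = {}

    def list(self, prefix):
        return [key for key in self.files if key.startswith(prefix)]

    def delete(self, keys):
        for key in keys:
            self.files.pop(key, None)

    def upload(self, local_path, key):
        with open(local_path, "rb") as f:
            self.files[key] = f.read()


class StagedWriter:
    """Stages data in local files and uploads each file once it is complete."""

    def __init__(self, bucket, prefix, max_file_size, overwrite=False):
        self.bucket = bucket
        self.prefix = prefix.rstrip("/")
        self.max_file_size = max_file_size
        self.overwrite = overwrite
        self._cleaned = False  # the overwrite cleanup runs at most once
        self._staging = tempfile.mkdtemp()
        self._index = 0
        self._handle = None

    def _current_name(self):
        return f"{self._index:05d}.bin"

    def write(self, data: bytes):
        if self._handle is None:
            path = os.path.join(self._staging, self._current_name())
            self._handle = open(path, "wb")
        self._handle.write(data)
        # Rotation: once the staged file is large enough, upload it right away.
        if self._handle.tell() >= self.max_file_size:
            self._upload_current()
            self._index += 1

    def _upload_current(self):
        if self._handle is None:
            return
        self._handle.close()
        self._handle = None
        # overwrite=True: delete existing files under the prefix before the first upload.
        if self.overwrite and not self._cleaned:
            self.bucket.delete(self.bucket.list(self.prefix + "/"))
            self._cleaned = True
        name = self._current_name()
        local = os.path.join(self._staging, name)
        self.bucket.upload(local, f"{self.prefix}/{name}")
        os.remove(local)  # the staged copy is no longer needed

    def close(self):
        # Upload whatever is still staged; a no-op if the last write rotated.
        self._upload_current()
```

With `max_file_size=4`, writing `b"abcd"` triggers an immediate upload of the first file, and a later `close()` uploads the remaining partial file. Any stale object under the prefix is removed only once, just before the first upload.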

Fixed DiskWriter.close_file() retry path — now detects HfFileSystemResolvedBucketPath and uses batch_bucket_files() instead of upload_file() for retries, so JsonlWriter/ParquetWriter writing directly to hf://buckets/... won't crash on transient errors.
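As a rough sketch of the dispatch idea behind this fix, under stated assumptions: `RepoPath` and `BucketPath` are hypothetical stand-ins for the resolved path types (the real bucket type is `HfFileSystemResolvedBucketPath`, whose attributes are not shown here), the two upload callables are injected rather than taken from `huggingface_hub`, and `TransientError` stands in for whatever errors the real code treats as retryable.

```python
import time
from dataclasses import dataclass


@dataclass
class RepoPath:
    """Stand-in for a resolved Git-repo path (hypothetical)."""
    repo_id: str
    path_in_repo: str


@dataclass
class BucketPath:
    """Stand-in for a resolved bucket path (hypothetical attribute names)."""
    bucket_id: str
    path: str


class TransientError(Exception):
    """Stand-in for a retryable Hub error."""


def retry_upload(resolved, local_file, upload_to_repo, upload_to_bucket,
                 max_retries=3, base_delay=0.0):
    """Retry an upload, choosing the upload call from the resolved path type."""
    for attempt in range(max_retries):
        try:
            if isinstance(resolved, BucketPath):
                # Bucket paths have no path_in_repo, so use the bucket upload call.
                return upload_to_bucket(resolved.bucket_id, local_file, resolved.path)
            return upload_to_repo(resolved.repo_id, local_file, resolved.path_in_repo)
        except TransientError:
            if attempt == max_retries - 1:
                raise  # retries exhausted: surface the last error
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff
```

Errors other than `TransientError` propagate immediately, which mirrors the "non-retryable" case listed under Testing.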

Bumped huggingface-hub>=1.5.0 (from >=1.0.0) for bucket API support.

Documentation — README updated to position buckets as the recommended storage for raw/intermediate data, with datasets for published data. New "HF Storage Buckets" section covers all four access patterns: HuggingFaceBucketWriter, direct fsspec, hf-mount, and HF Jobs volume mounts.

Testing

  • 15 new unit tests for HuggingFaceBucketWriter (init, upload, overwrite, cleanup, close, file switch, full write cycle)
  • 4 new tests for DiskWriter bucket retry path (transient error, non-retryable, max retries, regression)
  • 4 new tests for DataFolder with hf://buckets/ paths (resolve, list, shard, is_local)
  • 1 integration test: Parquet round-trip through HuggingFaceBucketWriter and ParquetReader
  • 3 manual test scripts under tests/manual/ for live Hub and Slurm integration — bucket writer round-trip and bucket-backed logging_dir verified against live Hub
  • Full pytest suite: 39 new tests all passing, no regressions introduced
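For readers unfamiliar with what the `is_local` and shard checks above are about, here is a small, self-contained sketch of the two properties. It does not reproduce DataFolder itself: `is_local` and `shard` are hypothetical helpers, the strided slicing is only one plausible sharding scheme, and the bucket path in the usage note is made up for illustration.

```python
from urllib.parse import urlparse


def is_local(path: str) -> bool:
    """Treat paths with no scheme, or the file scheme, as local."""
    return urlparse(path).scheme in ("", "file")


def shard(files, rank, world_size):
    """Give each rank a disjoint, deterministic slice of the sorted file list."""
    return sorted(files)[rank::world_size]
```

Under these assumptions, `is_local("hf://buckets/my-org/my-bucket/data")` is `False`, while `is_local("/tmp/data")` is `True`. Calling `shard` with every rank from `0` to `world_size - 1` covers each file exactly once.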

Made with Cursor


Note

Medium Risk
Touches core writer close/retry behavior and adds new upload logic against Hugging Face Hub bucket APIs, so regressions could impact pipeline output reliability under transient failures. Test coverage is expanded, but real-world behavior depends on Hub/Xet semantics and permissions.

Overview
Adds first-class support for Hugging Face storage buckets (hf://buckets/...) by introducing HuggingFaceBucketWriter, which stages Parquet shards locally and uploads them via batch_bucket_files, with optional overwrite (prefix delete-once) behavior and uploads on rotation/close.

Fixes DiskWriter.close_file() retry handling for bucket-backed fsspec files by detecting HfFileSystemResolvedBucketPath and retrying with the bucket upload API instead of repo upload_file(), preventing crashes on transient Hub errors. Also bumps huggingface-hub to >=1.5.0, updates README guidance (buckets for raw/intermediate, datasets for publish), and adds an example plus unit/integration/manual tests covering bucket paths, logging dirs, and retry behavior.

Reviewed by Cursor Bugbot for commit a06de9b.

New HuggingFaceBucketWriter for staged uploads to HF buckets via
batch_bucket_files(). Fix DiskWriter retry path for bucket URLs.
Bump huggingface-hub>=1.5.0. Update docs to recommend buckets for
raw/intermediate data.

Made-with: Cursor
@JoelNiklaus JoelNiklaus requested review from lewtun and lhoestq April 29, 2026 18:02

@lhoestq (Member) left a comment

looks good !

@JoelNiklaus JoelNiklaus merged commit 23c540f into huggingface:main May 5, 2026
6 checks passed