fix(synthetic): align _SyntheticTextExamplesIterable with HuggingFace…#645

Open
arturofredes wants to merge 1 commit into vllm-project:main from arturofredes:fix/synthetic-iterable-hf-datasets-api

@arturofredes arturofredes commented Mar 19, 2026

… datasets API

  • Add an n_shards property (alias of num_shards). HuggingFace IterableDataset expects n_shards; without it, loading synthetic data raises NotImplementedError.
  • Change the signature to shard_data_sources(worker_id, num_workers) to match the base _BaseExamplesIterable API; the previous signature (num_shards, index, contiguous) caused 'unexpected keyword argument worker_id' when using multiple DataLoader workers.

Fixes the synthetic data loader when used with datasets.IterableDataset.
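For illustration only, the two changes described above could look roughly like the following standalone sketch. It does not import datasets and is not the GuideLLM implementation: the class name SyntheticIterableSketch, the shard_ids attribute, and the round-robin split are all hypothetical choices made for the example. Only the n_shards/num_shards names and the (worker_id, num_workers) parameters come from the description.

```python
from __future__ import annotations

from typing import Iterator


class SyntheticIterableSketch:
    """Hypothetical stand-in for _SyntheticTextExamplesIterable (not GuideLLM code)."""

    def __init__(self, shard_ids: list[int]) -> None:
        # Each integer stands for one independent source of synthetic examples.
        self.shard_ids = shard_ids

    @property
    def num_shards(self) -> int:
        return len(self.shard_ids)

    @property
    def n_shards(self) -> int:
        # Alias of num_shards, so callers using either name see the same value.
        return self.num_shards

    def shard_data_sources(
        self, worker_id: int, num_workers: int
    ) -> SyntheticIterableSketch:
        # Assumed strategy: worker k keeps shards k, k + num_workers, ...
        return SyntheticIterableSketch(self.shard_ids[worker_id::num_workers])

    def __iter__(self) -> Iterator[tuple[int, dict[str, str]]]:
        for shard_id in self.shard_ids:
            yield shard_id, {"text": f"synthetic example from shard {shard_id}"}


if __name__ == "__main__":
    source = SyntheticIterableSketch([0, 1, 2, 3, 4])
    for worker_id in range(2):
        part = source.shard_data_sources(worker_id=worker_id, num_workers=2)
        print(worker_id, part.n_shards, part.shard_ids)
```

Whether these names match the installed datasets release depends on its version, so the sketch should be read as a picture of the proposed shape rather than of the library's API.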

Summary

Details

  • [ ]

Test Plan

Related Issues

  • Resolves #

  • "I certify that all code in this PR is my own, except as noted below."

Use of AI

  • Includes AI-assisted code completion
  • Includes code generated by an AI application
  • Includes AI-generated tests (NOTE: AI written tests should have a docstring that includes ## WRITTEN BY AI ##)

Signed-off-by: Arturo Fredes <arturofredesc@gmail.com>
Collaborator

dbutenhof commented Mar 19, 2026

Can you tell us what version of the datasets package you're using?

The latest, 4.8.2, aligns with the code in the current GuideLLM. The change you're proposing seems to revert a datasets change made in October 2024, which renamed n_shards to num_shards (although it retained n_shards as an alias for backwards compatibility) and changed the signature of shard_data_sources from what you're proposing to what GuideLLM now uses.

GuideLLM currently declares its dependency on the datasets package without a minimum version, but in practice it seems to require a newer version than the one you're using. (That change appears to have first shipped in datasets 3.1.0.)

Is there another constraint requiring you to use an old version of datasets? And, if not, can you try upgrading?
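One way to answer the version question is a small check like the sketch below. It assumes the 3.1.0 threshold mentioned above, which is itself stated tentatively in this thread, and the helper names (major_minor, has_renamed_sharding_api, installed_datasets_version) are hypothetical. It uses only the standard library and compares just the major and minor components.

```python
from __future__ import annotations

import re
from importlib import metadata

# Assumption taken from the comment above: the rename first shipped in 3.1.0.
ASSUMED_FIRST_RENAMED = (3, 1)


def major_minor(version: str) -> tuple[int, int]:
    """Extract (major, minor) from a version string such as '4.8.2'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def has_renamed_sharding_api(version: str) -> bool:
    """True if the version is at or past the assumed rename threshold."""
    return major_minor(version) >= ASSUMED_FIRST_RENAMED


def installed_datasets_version() -> str | None:
    """Return the installed datasets version, or None if it is not installed."""
    try:
        return metadata.version("datasets")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_datasets_version()
    if found is None:
        print("datasets is not installed")
    else:
        print(found, has_renamed_sharding_api(found))
```

If the check reports an older release, upgrading datasets would be the first thing to try before changing the GuideLLM signatures.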

2 participants