data loading timing and disk use #4

@poedator

The dataset loading code takes too long. It downloads entire large datasets (70 GB for wiki, etc.) just to use a handful of examples. Setting split="train[0:2000]" does not help, because slicing happens only after the full download.
Suggestions:

  • Download only the first files of each dataset.
  • Replace c4 with allenai/c4: load_dataset("allenai/c4", "allenai--c4", data_files={"train": "en/c4-train.00000-of-01024.json.gz"}, split="train")
  • Replace wiki with wikitext2: load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
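
A minimal sketch of the single-shard idea for c4, assuming the Hugging Face `datasets` library is installed. The helper names `c4_train_shard` and `load_c4_samples` are hypothetical and not part of this repository. The "allenai--c4" config name is copied from the suggestion above; some `datasets` versions may reject it, in which case passing `config_name=None` might be needed.

```python
from typing import Optional


def c4_train_shard(index: int, total: int = 1024) -> str:
    """Relative path of one English C4 training shard (hypothetical helper).

    Follows the file naming used in the suggestion above,
    e.g. "en/c4-train.00000-of-01024.json.gz" for index 0.
    """
    if not 0 <= index < total:
        raise ValueError(f"shard index must be in [0, {total}), got {index}")
    return f"en/c4-train.{index:05d}-of-{total:05d}.json.gz"


def load_c4_samples(
    n_samples: int = 2000,
    shard_index: int = 0,
    config_name: Optional[str] = "allenai--c4",  # from the issue; may need None
):
    """Load up to n_samples examples from a single C4 shard (hypothetical helper).

    Only the one file named in data_files is requested, rather than
    the whole dataset.
    """
    # Imported lazily so c4_train_shard works without `datasets` installed.
    from datasets import load_dataset

    ds = load_dataset(
        "allenai/c4",
        name=config_name,
        data_files={"train": c4_train_shard(shard_index)},
        split="train",
    )
    # Slice after loading; the single shard is already small compared
    # with the full dataset.
    return ds.select(range(min(n_samples, len(ds))))
```

Calling `load_c4_samples(2000)` would then fetch one shard and keep the first 2000 examples, instead of relying on split="train[0:2000]" against the full dataset.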
