The motivation behind these kind of "datasets" is very odd imo.
Why not just enforce a single dataset class - and call it a day! Anyone can write a simple script to download a dataset from HF and convert it to open-ai format, right? Also, there is almost zero usecase where you just take one dataset and train on it (at least to build high quality aligned models) its always a combination of a huge set of datasets, each with different formats etc. This is all data-munging work that a toolkit should not be entangled in.
Premature notions of "convenience" just end up being code debt.
The motivation behind these kind of "datasets" is very odd imo.
Why not just enforce a single dataset class - and call it a day! Anyone can write a simple script to download a dataset from HF and convert it to open-ai format, right? Also, there is almost zero usecase where you just take one dataset and train on it (at least to build high quality aligned models) its always a combination of a huge set of datasets, each with different formats etc. This is all data-munging work that a toolkit should not be entangled in.
Premature notions of "convenience" just end up being code debt.