codelion posted an update 28 days ago
Want to experiment with pre-training dataset mixtures but don't want to process terabytes of data? We've got you covered.

We're releasing a collection of carefully curated 1B-token dataset samples designed for rapid prototyping and pretraining experiments: https://huggingface.co/collections/codelion/pre-training-dataset-samples

These samples were created with reservoir sampling, an algorithm that produces statistically unbiased random samples from massive source datasets. Results you get at the 1B-token scale are therefore representative of how these datasets behave at 100B+ token scales, letting you iterate quickly without the computational overhead.
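For anyone curious what that means in practice, here is a minimal sketch of the classic reservoir sampling idea (Algorithm R): it keeps a fixed-size sample from a stream of unknown length so that every item seen so far is equally likely to be retained. This is an illustration of the general technique, not the exact pipeline we ran.

```python
import random

def reservoir_sample(stream, k, seed=42):
    """Keep a uniformly random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a reservoir slot with probability k / (i + 1),
            # which keeps every item seen so far equally likely to survive.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir
```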

The collection includes:
- finePDFs-1B: High-quality textbook-style educational content
- DCLM-baseline-1B: Filtered, diverse web content
- FineWeb-Edu-1B: Curated educational web resources
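To poke at any of these without downloading everything, you can stream them with the datasets library. A quick sketch (the exact repo id and the "text" column name are assumptions here; check the collection page for the precise dataset names):

```python
from datasets import load_dataset

# Stream the sample instead of downloading it in full.
ds = load_dataset("codelion/finePDFs-1B", split="train", streaming=True)

# Peek at a few documents.
for example in ds.take(3):
    print(example["text"][:200])
```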

We used these exact samples to run 50+ systematic experiments on dataset mixing strategies, ultimately discovering that a 50-30-20 mixture of finePDFs + DCLM-baseline + FineWeb-Edu achieves 90%+ of GPT-2's performance with just 1/10th the training data.
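If you want to reproduce a mixture like that, one simple way is probabilistic interleaving with the datasets library. A rough sketch, assuming the repo ids above and streaming splits (not necessarily how our experiments were wired up):

```python
from datasets import load_dataset, interleave_datasets

finepdfs = load_dataset("codelion/finePDFs-1B", split="train", streaming=True)
dclm     = load_dataset("codelion/DCLM-baseline-1B", split="train", streaming=True)
fineweb  = load_dataset("codelion/FineWeb-Edu-1B", split="train", streaming=True)

# 50-30-20 mixture: each example is drawn from one of the three sources
# with the given probability.
mixed = interleave_datasets(
    [finepdfs, dclm, fineweb],
    probabilities=[0.5, 0.3, 0.2],
    seed=42,
)
```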

Whether you're researching optimal data mixtures, testing curriculum learning strategies, or just prototyping a quick pretraining run, these samples give you a solid foundation to start experimenting immediately.

Read the full story of how we used these datasets to find the optimal pretraining recipe: https://huggingface.co/blog/codelion/optimal-dataset-mixing
