datablations

https://github.com/huggingface/datablations

AI & ML interests

Scaling Data-Constrained Language Models

Recent Activity

craffel authored a paper 29 days ago
TokSuite: Measuring the Impact of Tokenizer Choice on Language Model Behavior
thomwolf authored a paper 3 months ago
Robot Learning: A Tutorial
thomwolf authored a paper 7 months ago
FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

Team members

Niklas Muennighoff, Teven Le Scao, Nouamane Tazi, Risto Luukkonen, Aleksandra Piktus, Sampo Pyysalo, Colin Raffel, Thomas Wolf, Sasha Rush

datablations's datasets (13)

datablations/scripts

Viewer • Updated Jun 15, 2023 • 3.48M • 491

datablations/oscar-subsets

Viewer • Updated Jun 14, 2023 • 365k • 208

datablations/c4-subsets

Viewer • Updated Jun 14, 2023 • 729k • 296 • 5

datablations/c4-filter-megatron

Updated May 28, 2023 • 85

datablations/oscar-filter-megatron

Updated May 27, 2023 • 25

datablations/python-megatron

Updated May 22, 2023 • 8.27k • 1

datablations/subsets

Viewer • Updated May 10, 2023 • 365k • 98

datablations/oscar-filter

Viewer • Updated May 10, 2023 • 432M • 636

datablations/oscar-dedup-expanded

Viewer • Updated May 10, 2023 • 432M • 102

datablations/mup

Updated Apr 24, 2023 • 1.16k

datablations/c4-filter

Viewer • Updated Feb 1, 2023 • 365M • 415

datablations/c4-filter-small

Viewer • Updated Jan 17, 2023 • 100k • 8

datablations/oscar-filter-small

Viewer • Updated Nov 24, 2022 • 100k