Ettin: an Open Suite of Paired Encoders and Decoders
TL;DR: State-of-the-art paired encoder and decoder models (17M-1B params), trained identically on open data so the two architectures can be compared fairly. The encoders beat ModernBERT; the decoders beat Llama 3.2 and SmolLM2.
Paper | GitHub Repository
This model is part of the Ettin suite - the first collection of paired encoder-only and decoder-only models trained with identical data, architecture, and training recipes. Ettin enables fair comparisons between encoder and decoder architectures across multiple scales, providing state-of-the-art performance for open-data models in their respective size categories.
Table of Contents
- Performance Highlights
- Quick Start
- Model Description
- Training Data
- Model Family
- Research Applications
- Training Details
- Model Architecture
- Citation
Performance Highlights
Encoder Tasks (vs. ModernBERT)
- GLUE Average: 88.9 vs 88.4 (Base), 90.8 vs 90.4 (Large)
- MTEB v2 English Retrieval: 45.7 vs 43.9 (Base), 48.4 vs 47.0 (Large)
- Code Search and Long Context: Superior performance on CodeSearchNet and MLDR
Decoder Tasks (vs. SmolLM2 & Llama 3.2)
- Average Score: 46.2 vs 45.2 (SmolLM2-135M)
- 1B Model: 59.0 vs 56.6 (Llama 3.2-1B)
- Generative Tasks: Competitive across all model sizes
Key Finding
Architecture-specific advantages persist: A 400M encoder outperforms a 1B decoder on classification tasks, while a 400M decoder outperforms a 1B encoder on generation tasks.
Quick Start
Installation
If you haven't already, you can install the Transformers.js JavaScript library from NPM using:
npm i @huggingface/transformers
Usage (Transformers.js)
import { pipeline } from "@huggingface/transformers";
const unmasker = await pipeline("fill-mask", "onnx-community/ettin-encoder-32m-ONNX");
const result = await unmasker("The capital of France is [MASK].");
console.log(result);
// [
// { score: 0.5151872038841248, token: 7785, token_str: ' Paris', sequence: 'The capital of France is Paris.' },
// { score: 0.033725105226039886, token: 42268, token_str: ' Lyon', sequence: 'The capital of France is Lyon.' },
// { score: 0.031234024092555046, token: 23397, token_str: ' Nancy', sequence: 'The capital of France is Nancy.' },
// { score: 0.02075139433145523, token: 30167, token_str: ' Brussels', sequence: 'The capital of France is Brussels.' },
// { score: 0.018962178379297256, token: 31955, token_str: ' Geneva', sequence: 'The capital of France is Geneva.' }
// ]
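For lower-level control, the same checkpoint can also be loaded with AutoTokenizer and AutoModelForMaskedLM to work with raw logits directly. The snippet below is a minimal sketch of that route; ranking the scores at the [MASK] position reproduces the pipeline output above.

import { AutoTokenizer, AutoModelForMaskedLM } from "@huggingface/transformers";

// Load the tokenizer and masked-LM model for the same checkpoint.
const model_id = "onnx-community/ettin-encoder-32m-ONNX";
const tokenizer = await AutoTokenizer.from_pretrained(model_id);
const model = await AutoModelForMaskedLM.from_pretrained(model_id);

// Tokenize a masked sentence and run a forward pass.
const inputs = await tokenizer("The capital of France is [MASK].");
const { logits } = await model(inputs);

// logits.dims is [batch, sequence_length, vocab_size]; the scores at the
// [MASK] position are what the fill-mask pipeline ranks for you.
console.log(logits.dims);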
Model Description
Ettin models are designed to provide a foundation for comparing encoder-only and decoder-only architectures. Unlike previous comparisons that were limited by different training data, architectures, and recipes, Ettin models use:
- Identical training data - Same high-quality mixture across all models
- Open training data - The full mixture is publicly released, including the batch-level data order for each of the 250+ checkpoints
- Matched architectures - Only differing in attention patterns (bidirectional vs causal) and training objectives (MLM vs CLM)
- Consistent training recipe - Three-phase training with 2T tokens
- Multiple scales - From 17M to 1B parameters
This approach allows for true apples-to-apples comparisons between encoder and decoder models, revealing the inherent strengths of each architecture.
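Because each encoder has a decoder twin trained on the same data, switching architectures in practice mostly means switching the task. The sketch below illustrates the pairing with Transformers.js; the decoder repository id is an assumption (it mirrors the encoder's naming pattern), so adjust it to whichever export of the paired decoder you actually use.

import { pipeline } from "@huggingface/transformers";

// Encoder: bidirectional attention + MLM objective -> fill-mask inference.
const unmasker = await pipeline("fill-mask", "onnx-community/ettin-encoder-32m-ONNX");
console.log(await unmasker("The capital of France is [MASK]."));

// Decoder: causal attention + CLM objective -> left-to-right generation.
// NOTE: this repository id is assumed, not confirmed; the original decoder
// checkpoints live under the jhu-clsp organization.
const generator = await pipeline("text-generation", "onnx-community/ettin-decoder-32m-ONNX");
console.log(await generator("The capital of France is", { max_new_tokens: 10 }));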
Training Data
The training data is publicly available and split across different phases:
- Pre-training Data: jhu-clsp/ettin-pretraining-data - 1.7T tokens of diverse data mixture
- Mid-training/Extension Data: jhu-clsp/ettin-extension-data - 250B tokens of higher-quality filtered data
- Decay Phase Data: jhu-clsp/ettin-decay-data - 100B tokens of premium data sources
- Training Data Order: jhu-clsp/ettin-data-order - Batch-level training order (columns: input_ids, step)
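For a quick look at the data-order release without downloading it, the Hugging Face dataset viewer API can stream a few rows. This is a hedged sketch: the config name ("default") and split name ("train") are assumptions, so check the dataset page if the request fails.

// Peek at the first rows of the batch-level data order via the dataset viewer API.
const url = "https://datasets-server.huggingface.co/rows"
  + "?dataset=jhu-clsp%2Fettin-data-order&config=default&split=train&offset=0&length=3";

const response = await fetch(url);
const { rows } = await response.json();

// Each row should expose the `input_ids` and `step` columns described above.
for (const { row } of rows) {
  console.log(row.step, row.input_ids);
}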
Model Family
Encoder Models
| Size | Model | Parameters | Best For |
|---|---|---|---|
| XXS | ettin-encoder-17m | 17M | Mobile/edge devices |
| XS | ettin-encoder-32m | 32M | Fast inference |
| Small | ettin-encoder-68m | 68M | Balanced performance |
| Base | ettin-encoder-150m | 150M | Standard use cases |
| Large | ettin-encoder-400m | 400M | High accuracy needs |
| XL | ettin-encoder-1b | 1B | Best performance |
Decoder Models
| Size | Model | Parameters | Best For |
|---|---|---|---|
| XXS | ettin-decoder-17m | 17M | Lightweight generation |
| XS | ettin-decoder-32m | 32M | Quick prototyping |
| Small | ettin-decoder-68m | 68M | Efficient generation |
| Base | ettin-decoder-150m | 150M | Standard generation |
| Large | ettin-decoder-400m | 400M | Quality generation |
| XL | ettin-decoder-1b | 1B | Best generation |
Cross-Objective Models
These models demonstrate what happens when you continue training encoders as decoders (and vice versa). Important: Load these models using the architecture they were converted to, not their original architecture.
Encoders Trained from Decoders (Decoder → MLM)
Load these as encoders using AutoModel or AutoModelForMaskedLM (a loading sketch follows the table):
| Size | Model | Parameters | Description |
|---|---|---|---|
| XXS | ettin-encoder-from-decoder-17m | 17M | Decoder → MLM continued training |
| XS | ettin-encoder-from-decoder-32m | 32M | Decoder → MLM continued training |
| Small | ettin-encoder-from-decoder-68m | 68M | Decoder → MLM continued training |
| Base | ettin-encoder-from-decoder-150m | 150M | Decoder → MLM continued training |
| Large | ettin-encoder-from-decoder-400m | 400M | Decoder → MLM continued training |
| XL | ettin-encoder-from-decoder-1b | 1B | Decoder → MLM continued training |
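The minimal loading sketch below shows the intended pattern. It assumes ONNX weights are available for the repository so it can run under Transformers.js; if only the PyTorch weights exist, the same Auto* class names apply in Python transformers.

import { AutoTokenizer, AutoModelForMaskedLM } from "@huggingface/transformers";

// Decoder -> MLM checkpoints are loaded with the encoder (masked-LM) classes,
// i.e. the architecture they were converted to.
const model_id = "jhu-clsp/ettin-encoder-from-decoder-150m";
const tokenizer = await AutoTokenizer.from_pretrained(model_id);
const model = await AutoModelForMaskedLM.from_pretrained(model_id);

const inputs = await tokenizer("Paris is the [MASK] of France.");
const { logits } = await model(inputs); // [batch, sequence_length, vocab_size]
console.log(logits.dims);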
Research Applications
What Makes Ettin Unique
Ettin provides the first controlled comparison of encoder vs. decoder architectures:
- Identical Training Data: Same 2T token mixture across all models
- Matched Architectures: Only attention patterns and objectives differ
- Open Everything: Training data, model weights, and batch-level training order
- Multiple Scales: Fair comparison from 17M to 1B parameters
- 250+ Checkpoints: Complete training trajectory analysis
Use Cases for Researchers
- Architecture Studies: Compare encoder vs decoder capabilities fairly
- Training Dynamics: Analyze 250+ checkpoints with batch-level data ordering
- Scaling Laws: Study how architectural advantages change with scale
- Transfer Learning: Investigate cross-objective training effectiveness
- Replication Studies: The first open replication of the ModernBERT training recipe
Reproducibility
All training artifacts are publicly available:
- Training data with exact batch ordering
- Model checkpoints every 8.5B tokens
- Complete hyperparameter configurations
- Training code and evaluation scripts
Training Details
Data: High-quality mixture including DCLM, Dolma v1.7, scientific papers, code, and curated sources totaling 2T+ tokens
Architecture: Transformer with RoPE, GLU activations, and prenorm layers
Training Phases:
- Pre-training: 1.7T tokens with diverse data mixture
- Mid-training: 250B tokens with higher-quality filtered data and context extension to 8K
- Decay phase: 100B tokens with premium data sources
Key Features:
- Context length: Up to 8K tokens
- Vocabulary: 50,368 tokens (ModernBERT tokenizer)
- Deep but efficient architectures following MobileLLM principles
Model Architecture
| Parameter | 17M | 32M | 68M | 150M | 400M | 1B |
|---|---|---|---|---|---|---|
| Layers | 7 | 10 | 19 | 22 | 28 | 28 |
| Hidden Size | 256 | 384 | 512 | 768 | 1024 | 1792 |
| Intermediate Size | 384 | 576 | 768 | 1152 | 2624 | 3840 |
| Attention Heads | 4 | 6 | 8 | 12 | 16 | 28 |
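These values can be cross-checked against a checkpoint's configuration. The sketch below reads the config for the 32M encoder; the field names follow the standard ModernBERT-style config and should be verified against the repository's config.json.

import { AutoConfig } from "@huggingface/transformers";

// Fetch the configuration and compare it with the 32M column of the table above.
const config = await AutoConfig.from_pretrained("onnx-community/ettin-encoder-32m-ONNX");
console.log({
  layers: config.num_hidden_layers,             // expected: 10
  hidden_size: config.hidden_size,              // expected: 384
  intermediate_size: config.intermediate_size,  // expected: 576
  attention_heads: config.num_attention_heads,  // expected: 6
  vocab_size: config.vocab_size,                // expected: 50368
});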
Citation
If you use Ettin models in your research, please cite our work:
@misc{weller2025seqvsseqopen,
title={Seq vs Seq: An Open Suite of Paired Encoders and Decoders},
author={Orion Weller and Kathryn Ricci and Marc Marone and Antoine Chaffin and Dawn Lawrie and Benjamin Van Durme},
year={2025},
eprint={2507.11412},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2507.11412},
}
License
This project is licensed under the MIT License - see the LICENSE file for details.
Contact: For questions about the models or research, please open an issue or contact the authors.
Base Model
This repository (onnx-community/ettin-encoder-32m-ONNX) is an ONNX conversion of jhu-clsp/ettin-encoder-32m.