Ettin: an Open Suite of Paired Encoders and Decoders

License: MIT | Paper | Models | Data | GitHub

🎯 TL;DR: State-of-the-art paired encoder and decoder models (17M-1B params) trained identically for fair comparison with open data. Encoders beat ModernBERT. Decoders beat Llama 3.2/SmolLM2.

📄 Paper | 🚀 GitHub Repository

This model is part of the Ettin suite - the first collection of paired encoder-only and decoder-only models trained with identical data, architecture, and training recipes. Ettin enables fair comparisons between encoder and decoder architectures across multiple scales, providing state-of-the-art performance for open-data models in their respective size categories.

Performance Highlights

Encoder Tasks (vs. ModernBERT)

  • GLUE Average: 88.9 vs 88.4 (Base), 90.8 vs 90.4 (Large)
  • MTEB v2 English Retrieval: 45.7 vs 43.9 (Base), 48.4 vs 47.0 (Large)
  • Code Search and Long Context: Superior performance on CodeSearchNet and MLDR

Decoder Tasks (vs. SmolLM2 & Llama 3.2)

  • Average Score: 46.2 vs 45.2 (SmolLM2-135M)
  • 1B Model: 59.0 vs 56.6 (Llama 3.2-1B)
  • Generative Tasks: Competitive across all model sizes

Key Finding

Architecture-specific advantages persist: A 400M encoder outperforms a 1B decoder on classification tasks, while a 400M decoder outperforms a 1B encoder on generation tasks.

Quick Start

Installation

If you haven't already, you can install the Transformers.js JavaScript library from NPM using:

npm i @huggingface/transformers

Usage (Transformers.js)

import { pipeline } from "@huggingface/transformers";

const unmasker = await pipeline("fill-mask", "onnx-community/ettin-encoder-32m-ONNX");
const result = await unmasker("The capital of France is [MASK].");
console.log(result);
// [
//   { score: 0.5151872038841248, token: 7785, token_str: ' Paris', sequence: 'The capital of France is Paris.' },
//   { score: 0.033725105226039886, token: 42268, token_str: ' Lyon', sequence: 'The capital of France is Lyon.' },
//   { score: 0.031234024092555046, token: 23397, token_str: ' Nancy', sequence: 'The capital of France is Nancy.' },
//   { score: 0.02075139433145523, token: 30167, token_str: ' Brussels', sequence: 'The capital of France is Brussels.' },
//   { score: 0.018962178379297256, token: 31955, token_str: ' Geneva', sequence: 'The capital of France is Geneva.' }
// ]
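
The same ONNX encoder can also serve as a lightweight embedding model via the feature-extraction pipeline. The sketch below mean-pools and normalizes token embeddings, assuming the ONNX export exposes the encoder's hidden states to this pipeline; it uses the raw pretrained encoder rather than a fine-tuned retrieval model, so the similarity score is purely illustrative.

import { pipeline } from "@huggingface/transformers";

const extractor = await pipeline("feature-extraction", "onnx-community/ettin-encoder-32m-ONNX");

// Mean-pool and L2-normalize token embeddings into sentence vectors.
const sentences = ["What is the capital of France?", "Paris is the capital of France."];
const embeddings = await extractor(sentences, { pooling: "mean", normalize: true });
console.log(embeddings.dims); // e.g. [2, 384] when hidden states are exposed (384 = hidden size of the 32M encoder)

// With normalized vectors, cosine similarity is just a dot product.
const [a, b] = embeddings.tolist();
console.log(a.reduce((sum, v, i) => sum + v * b[i], 0));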

Model Description

Ettin models are designed to provide a foundation for comparing encoder-only and decoder-only architectures. Unlike previous comparisons that were limited by different training data, architectures, and recipes, Ettin models use:

  1. Identical training data - Same high-quality mixture across all models
  2. Open training data - Fully released, including the batch-level data order for each of the 250+ checkpoints
  3. Matched architectures - Differing only in attention pattern (bidirectional vs. causal) and training objective (MLM vs. CLM)
  4. Consistent training recipe - Three-phase training on 2T tokens
  5. Multiple scales - From 17M to 1B parameters

This approach allows for true apples-to-apples comparisons between encoder and decoder models, revealing the inherent strengths of each architecture.

Training Data

The training data is publicly available, split by training phase (pre-training, mid-training with context extension, and decay).

Model Family

Encoder Models

| Size | Model | Parameters | Best For |
|------|-------|------------|----------|
| XXS | ettin-encoder-17m | 17M | Mobile/edge devices |
| XS | ettin-encoder-32m | 32M | Fast inference |
| Small | ettin-encoder-68m | 68M | Balanced performance |
| Base | ettin-encoder-150m | 150M | Standard use cases |
| Large | ettin-encoder-400m | 400M | High accuracy needs |
| XL | ettin-encoder-1b | 1B | Best performance |

Decoder Models

| Size | Model | Parameters | Best For |
|------|-------|------------|----------|
| XXS | ettin-decoder-17m | 17M | Lightweight generation |
| XS | ettin-decoder-32m | 32M | Quick prototyping |
| Small | ettin-decoder-68m | 68M | Efficient generation |
| Base | ettin-decoder-150m | 150M | Standard generation |
| Large | ettin-decoder-400m | 400M | Quality generation |
| XL | ettin-decoder-1b | 1B | Best generation |
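
The decoder checkpoints are drop-in causal language models. Below is a minimal Transformers.js text-generation sketch; the ONNX model id is an assumption (substitute whichever decoder export you actually use), and the generation settings are illustrative.

import { pipeline } from "@huggingface/transformers";

// NOTE: assumed ONNX export of a decoder checkpoint; adjust the id to the model you use.
const generator = await pipeline("text-generation", "onnx-community/ettin-decoder-150m-ONNX");

const output = await generator("The capital of France is", { max_new_tokens: 20 });
console.log(output[0].generated_text);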

Cross-Objective Models

These models demonstrate what happens when you continue training encoders as decoders (and vice versa). Important: Load these models using the architecture they were converted to, not their original architecture.

Encoders Trained from Decoders (Decoder → MLM)

Load as encoders using AutoModel or AutoModelForMaskedLM:

| Size | Model | Parameters | Description |
|------|-------|------------|-------------|
| XXS | ettin-encoder-from-decoder-17m | 17M | Decoder → MLM continued training |
| XS | ettin-encoder-from-decoder-32m | 32M | Decoder → MLM continued training |
| Small | ettin-encoder-from-decoder-68m | 68M | Decoder → MLM continued training |
| Base | ettin-encoder-from-decoder-150m | 150M | Decoder → MLM continued training |
| Large | ettin-encoder-from-decoder-400m | 400M | Decoder → MLM continued training |
| XL | ettin-encoder-from-decoder-1b | 1B | Decoder → MLM continued training |
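
Since these checkpoints now behave as encoders, they load through the same fill-mask path as the regular encoders. A minimal sketch, assuming an ONNX export of the 32M cross-objective checkpoint exists under a matching id:

import { pipeline } from "@huggingface/transformers";

// Assumed model id for an ONNX export of the encoder-from-decoder checkpoint.
const unmasker = await pipeline("fill-mask", "onnx-community/ettin-encoder-from-decoder-32m-ONNX");
console.log(await unmasker("The capital of France is [MASK]."));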

🔬 Research Applications

What Makes Ettin Unique

Ettin provides the first controlled comparison of encoder vs. decoder architectures:

  • Identical Training Data: Same 2T token mixture across all models
  • Matched Architectures: Only attention patterns and objectives differ
  • Open Everything: Training data, model weights, and batch-level training order
  • Multiple Scales: Fair comparison from 17M to 1B parameters
  • 250+ Checkpoints: Complete training trajectory analysis

Use Cases for Researchers

  • Architecture Studies: Compare encoder vs decoder capabilities fairly
  • Training Dynamics: Analyze 250+ checkpoints with batch-level data ordering
  • Scaling Laws: Study how architectural advantages change with scale
  • Transfer Learning: Investigate cross-objective training effectiveness
  • Replication Studies: First open replication of ModernBERT training recipe

Reproducibility

All training artifacts are publicly available:

  • Training data with exact batch ordering
  • Model checkpoints every 8.5B tokens
  • Complete hyperparameter configurations
  • Training code and evaluation scripts

Training Details

Data: High-quality mixture including DCLM, Dolma v1.7, scientific papers, code, and curated sources totaling 2T+ tokens

Architecture: Transformer with RoPE, GLU activations, and prenorm layers

Training Phases:

  • Pre-training: 1.7T tokens with diverse data mixture
  • Mid-training: 250B tokens with higher-quality filtered data and context extension to 8K
  • Decay phase: 100B tokens with premium data sources

Key Features:

  • Context length: Up to 8K tokens
  • Vocabulary: 50,368 tokens (ModernBERT tokenizer)
  • Deep but efficient architectures following MobileLLM principles

Model Architecture

| Parameter | 17M | 32M | 68M | 150M | 400M | 1B |
|-----------|-----|-----|-----|------|------|-----|
| Layers | 7 | 10 | 19 | 22 | 28 | 28 |
| Hidden Size | 256 | 384 | 512 | 768 | 1024 | 1792 |
| Intermediate Size | 384 | 576 | 768 | 1152 | 2624 | 3840 |
| Attention Heads | 4 | 6 | 8 | 12 | 16 | 28 |
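
These hyperparameters can also be read straight from a checkpoint's configuration at runtime. A small sketch using Transformers.js AutoConfig; the field names assume a ModernBERT-style config, so verify them against the repo's config.json.

import { AutoConfig } from "@huggingface/transformers";

// Field names assume a ModernBERT-style config; check against the actual config.json.
const config = await AutoConfig.from_pretrained("onnx-community/ettin-encoder-32m-ONNX");
console.log(config.num_hidden_layers, config.hidden_size, config.intermediate_size, config.num_attention_heads);
// Per the table above, the 32M encoder should report: 10 384 576 6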

Citation

If you use Ettin models in your research, please cite our work:

@misc{weller2025seqvsseqopen,
      title={Seq vs Seq: An Open Suite of Paired Encoders and Decoders}, 
      author={Orion Weller and Kathryn Ricci and Marc Marone and Antoine Chaffin and Dawn Lawrie and Benjamin Van Durme},
      year={2025},
      eprint={2507.11412},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.11412}, 
}

License

This project is licensed under the MIT License - see the LICENSE file for details.


Contact: For questions about the models or research, please open an issue or contact the authors.
