# Jina Embeddings v5 Text Small Retrieval - MLX

MLX port of Jina AI's v5-text-small-retrieval embedding model with multiple quantization levels.

Elastic Inference Service | ArXiv | Blog

## Available Quantization Levels

This repository includes three precision levels:

  1. Full Precision (model.safetensors): ~2.28 GB
  2. 4-bit Quantized (model-4bit.safetensors): ~355 MB (6.4x smaller)
  3. 8-bit Quantized (model-8bit.safetensors): ~639 MB (3.6x smaller)

## Installation

```bash
pip install mlx tokenizers huggingface_hub
```
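
The usage examples below assume the repository files are present locally; one way to fetch them is with `huggingface_hub` (the repo id below matches this repository):

```python
from huggingface_hub import snapshot_download

# Download weights, config, and tokenizer to a local cache directory;
# run the examples below from the returned path
local_dir = snapshot_download("jinaai/jina-embeddings-v5-text-small-retrieval-mlx")
print(local_dir)
```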

## Usage

### Via Elastic Inference Service

The fastest way to use v5-text in production is through the Elastic Inference Service (EIS), which provides managed embedding inference with built-in scaling, so you can generate embeddings directly within your Elastic deployment.

```
PUT _inference/text_embedding/jina-v5
{
  "service": "elastic",
  "service_settings": {
    "model_id": "jina-embeddings-v5-text-small"
  }
}
```

See the Elastic Inference Service documentation for setup details.
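
Once the endpoint is created, embeddings can be generated through the same `_inference` API; a minimal sketch, where `jina-v5` is the endpoint id created above and the `input` field follows the Elasticsearch inference API:

```
POST _inference/text_embedding/jina-v5
{
  "input": ["What is machine learning?"]
}
```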

### Full Precision Model

```python
import mlx.core as mx
from tokenizers import Tokenizer
from model import JinaEmbeddingModel
import json

# Load config
with open("config.json") as f:
    config = json.load(f)

# Load model (full precision)
model = JinaEmbeddingModel(config)
weights = mx.load("model.safetensors")
model.load_weights(list(weights.items()))

# Load tokenizer
tokenizer = Tokenizer.from_file("tokenizer.json")

# Encode
texts = ["Query: What is machine learning?", "Document: Machine learning is..."]
embeddings = model.encode(texts, tokenizer, task_type="retrieval.query")
```
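
The two embeddings above can be compared directly; assuming `encode` returns one row per input as an `mx.array`, cosine similarity is a dot product after L2 normalization (a minimal sketch):

```python
import mlx.core as mx

# Normalize rows, then score (sketch; assumes `embeddings` has shape [num_texts, 1024])
normed = embeddings / mx.linalg.norm(embeddings, axis=-1, keepdims=True)
score = (normed[0] @ normed[1]).item()
print(f"query-document similarity: {score:.4f}")
```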

### 4-bit Quantized Model

```python
# Load model (4-bit quantized)
model = JinaEmbeddingModel(config)
weights = mx.load("model-4bit.safetensors")
model.load_weights(list(weights.items()))

# Rest is the same - same API
embeddings = model.encode(texts, tokenizer, task_type="retrieval.query")
```

### 8-bit Quantized Model

```python
# Load model (8-bit quantized)
model = JinaEmbeddingModel(config)
weights = mx.load("model-8bit.safetensors")
model.load_weights(list(weights.items()))

# Rest is the same - same API
embeddings = model.encode(texts, tokenizer, task_type="retrieval.query")
```

## Task Types

For the retrieval variant:

- `retrieval.query` - For search queries
- `retrieval.passage` - For documents/passages
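
For example, queries and passages are encoded with their respective task types and scored by cosine similarity; a minimal sketch, assuming the `encode` API shown above:

```python
import mlx.core as mx

query = ["how do transformers handle long sequences?"]
passages = [
    "Rotary position embeddings extend the context length of transformers.",
    "Gradient boosting builds an ensemble of decision trees.",
]

# Encode each side with its matching task type
q = model.encode(query, tokenizer, task_type="retrieval.query")
p = model.encode(passages, tokenizer, task_type="retrieval.passage")

# Cosine similarity = dot product after L2 normalization
q = q / mx.linalg.norm(q, axis=-1, keepdims=True)
p = p / mx.linalg.norm(p, axis=-1, keepdims=True)
scores = (q @ p.T)[0]
best = int(mx.argmax(scores).item())
print(passages[best], round(scores[best].item(), 4))
```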

## Quantization Details

### 4-bit Quantization

- Method: Affine quantization with group size 64
- Size: ~355 MB (6.4x compression)
- Quality: Cosine similarity ≥0.99 vs full precision
- Use case: Resource-constrained environments, mobile deployment

### 8-bit Quantization

- Method: Affine quantization with group size 64
- Size: ~639 MB (3.6x compression)
- Quality: Cosine similarity ≥0.9999 vs full precision
- Use case: Production deployments with minimal quality loss
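
For reference, these settings match MLX's built-in affine quantization; a hedged sketch of how such a variant could be produced from the full-precision model loaded earlier (`nn.quantize` swaps supported layers for quantized ones in place):

```python
import mlx.core as mx
import mlx.nn as nn
from mlx.utils import tree_flatten

# Affine quantization, group size 64, 4 bits (use bits=8 for the 8-bit variant)
nn.quantize(model, group_size=64, bits=4)

# Persist the quantized parameters as safetensors
mx.save_safetensors("model-4bit.safetensors", dict(tree_flatten(model.parameters())))
```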

## Matryoshka Dimensions

All quantization levels support Matryoshka embedding truncation to 32, 64, 128, 256, 512, 768, or 1024 dimensions:

```python
# Get 256-dim embedding
embeddings_256 = embeddings[:, :256]
```
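
Truncated Matryoshka vectors are generally no longer unit-length, so renormalize before cosine scoring (a minimal sketch continuing the snippet above):

```python
import mlx.core as mx

# Re-normalize so dot products remain valid cosine similarities
embeddings_256 = embeddings_256 / mx.linalg.norm(embeddings_256, axis=-1, keepdims=True)
```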

## Model Details

### jina-embeddings-v5-text Architecture

- Architecture: Qwen3-0.6B with task-specific LoRA adapters (pre-merged)
- Embedding dimension: 1024
- Max sequence length: 32768 tokens
- Optimized for: Apple Silicon (M1/M2/M3/M4) with Metal acceleration

## Performance Comparison

| Quantization | File | Size | Speed | Quality (Cosine Similarity) |
|---|---|---|---|---|
| Full Precision | model.safetensors | 2.28 GB | Baseline | 1.0000 |
| 8-bit | model-8bit.safetensors | 639 MB | ~1.5-2x faster | ≥0.9999 |
| 4-bit | model-4bit.safetensors | 355 MB | ~2-3x faster | ≥0.99 |
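
The speed figures above are approximate and hardware-dependent; a simple way to measure encode latency locally is sketched below (assumes `encode` returns MLX arrays; `mx.eval` forces MLX's lazy computation to finish before the timer stops):

```python
import time
import mlx.core as mx

def time_encode(model, tokenizer, texts, runs=5):
    # Warm-up pass to populate caches before timing
    mx.eval(model.encode(texts, tokenizer, task_type="retrieval.query"))
    start = time.perf_counter()
    for _ in range(runs):
        mx.eval(model.encode(texts, tokenizer, task_type="retrieval.query"))
    return (time.perf_counter() - start) / runs

print(f"avg encode latency: {time_encode(model, tokenizer, texts) * 1e3:.1f} ms")
```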

## Benchmarks

Results on the MMTEB multilingual benchmark, the MTEB English benchmark, and retrieval benchmarks are reported for jina-embeddings-v5-text in the paper cited below.

## Files

```
jina-embeddings-v5-text-small-retrieval-mlx/
├── model.safetensors          # Full precision weights
├── model-4bit.safetensors     # 4-bit quantized weights
├── model-8bit.safetensors     # 8-bit quantized weights
├── model.py                   # Model implementation
├── config.json                # Model configuration
├── tokenizer.json             # Tokenizer
├── tokenizer_config.json
├── vocab.json
├── merges.txt
├── .gitignore
└── README.md
```

Citation

@misc{akram2026jinaembeddingsv5texttasktargetedembeddingdistillation,
      title={jina-embeddings-v5-text: Task-Targeted Embedding Distillation}, 
      author={Mohammad Kalim Akram and Saba Sturua and Nastia Havriushenko and Quentin Herreros and Michael GΓΌnther and Maximilian Werk and Han Xiao},
      year={2026},
      eprint={2602.15547},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2602.15547}, 
}

## License

CC BY-NC 4.0
