# Jina Embeddings v5 Text Small Retrieval - MLX
MLX port of Jina AI's v5-text-small-retrieval embedding model with multiple quantization levels.
Elastic Inference Service | ArXiv | Blog
## Available Quantization Levels

This repository includes three quantization levels:

- **Full Precision** (`model.safetensors`): ~2.28 GB
- **4-bit Quantized** (`model-4bit.safetensors`): ~355 MB (6.4x smaller)
- **8-bit Quantized** (`model-8bit.safetensors`): ~639 MB (3.6x smaller)
## Installation

```bash
pip install mlx tokenizers huggingface_hub
```
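If you are not cloning the repository directly, the weight and tokenizer files can be fetched with `huggingface_hub`. The repository id below is a placeholder; substitute the repository that hosts this model card:

```python
from huggingface_hub import snapshot_download

# Placeholder repo id - replace with the actual repository hosting these files.
local_dir = snapshot_download(repo_id="<org>/jina-embeddings-v5-text-small-retrieval-mlx")
print(local_dir)  # contains model*.safetensors, config.json, tokenizer.json, ...
```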
## Usage

### Via Elastic Inference Service
Elastic Inference Service (EIS) is the fastest way to use v5-text in production: it provides managed embedding inference with built-in scaling, so you can generate embeddings directly within your Elastic deployment.
```
PUT _inference/text_embedding/jina-v5
{
  "service": "elastic",
  "service_settings": {
    "model_id": "jina-embeddings-v5-text-small"
  }
}
```
See the Elastic Inference Service documentation for setup details.
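Once the endpoint exists, text can be embedded through the same inference API; a minimal example, with the request body following the Elasticsearch inference API:

```
POST _inference/text_embedding/jina-v5
{
  "input": ["What is machine learning?"]
}
```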
### Full Precision Model
```python
import json

import mlx.core as mx
from tokenizers import Tokenizer

from model import JinaEmbeddingModel

# Load the configuration and full-precision weights
with open("config.json") as f:
    config = json.load(f)

model = JinaEmbeddingModel(config)
weights = mx.load("model.safetensors")
model.load_weights(list(weights.items()))

tokenizer = Tokenizer.from_file("tokenizer.json")

texts = ["Query: What is machine learning?", "Document: Machine learning is..."]
embeddings = model.encode(texts, tokenizer, task_type="retrieval.query")
```
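To score the two texts against each other, compute cosine similarity between their embeddings. A minimal sketch, assuming `encode` returns an `mx.array` of shape `(batch, dim)`:

```python
# Cosine similarity between the query and document embeddings computed above.
q, d = embeddings[0], embeddings[1]
score = mx.sum(q * d) / (mx.linalg.norm(q) * mx.linalg.norm(d))
print(score.item())
```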
### 4-bit Quantized Model

```python
# Reuse config, tokenizer, and texts from the full-precision example;
# only the weight file changes.
model = JinaEmbeddingModel(config)
weights = mx.load("model-4bit.safetensors")
model.load_weights(list(weights.items()))

embeddings = model.encode(texts, tokenizer, task_type="retrieval.query")
```
### 8-bit Quantized Model

```python
# Same pattern with the 8-bit weight file.
model = JinaEmbeddingModel(config)
weights = mx.load("model-8bit.safetensors")
model.load_weights(list(weights.items()))

embeddings = model.encode(texts, tokenizer, task_type="retrieval.query")
```
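To check quantization quality on your own inputs, encode the same texts with two weight files and compare row-wise cosine similarity. This sketch reuses `config`, `tokenizer`, and `texts` from the full-precision example and assumes `model.py` accepts the quantized weight layout exactly as loaded above:

```python
import mlx.core as mx

def load_and_encode(weight_file):
    # Hypothetical helper wrapping the loading pattern shown above.
    m = JinaEmbeddingModel(config)
    m.load_weights(list(mx.load(weight_file).items()))
    return m.encode(texts, tokenizer, task_type="retrieval.query")

def rowwise_cosine(a, b):
    # Normalize each row, then take the row-wise dot product.
    a = a / mx.linalg.norm(a, axis=-1, keepdims=True)
    b = b / mx.linalg.norm(b, axis=-1, keepdims=True)
    return mx.sum(a * b, axis=-1)

emb_full = load_and_encode("model.safetensors")
emb_4bit = load_and_encode("model-4bit.safetensors")
print(rowwise_cosine(emb_full, emb_4bit))  # expected to stay >= 0.99 (see table below)
```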
## Task Types

For the retrieval variant:

- `retrieval.query` - for search queries
- `retrieval.passage` - for documents/passages
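In practice, each side of a retrieval pair is encoded with its own task type and the results are scored against each other. A short sketch reusing the `model` and `tokenizer` loaded above:

```python
queries = ["What is machine learning?"]
passages = ["Machine learning is a field of AI that learns patterns from data."]

q_emb = model.encode(queries, tokenizer, task_type="retrieval.query")
p_emb = model.encode(passages, tokenizer, task_type="retrieval.passage")

# Dot-product scores (equivalent to cosine similarity if embeddings are unit-normalized).
scores = mx.matmul(q_emb, p_emb.T)
```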
## Quantization Details

### 4-bit Quantization

- Method: Affine quantization with group size 64
- Size: ~355 MB (6.4x compression)
- Quality: Cosine similarity ≥0.99 vs full precision
- Use case: Resource-constrained environments, mobile deployment
### 8-bit Quantization

- Method: Affine quantization with group size 64
- Size: ~639 MB (3.6x compression)
- Quality: Cosine similarity ≥0.9999 vs full precision
- Use case: Production deployments with minimal quality loss
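For reference, affine quantization with group size 64 maps directly onto MLX's built-in `mx.quantize`. The sketch below shows how such weight files are typically produced; it is not necessarily the exact script or key layout used for this repository:

```python
import mlx.core as mx

weights = mx.load("model.safetensors")

quantized = {}
for name, w in weights.items():
    # Quantize 2-D matrices whose last dimension is divisible by the group size;
    # pass everything else (norms, biases, etc.) through unchanged.
    if w.ndim == 2 and w.shape[-1] % 64 == 0:
        wq, scales, biases = mx.quantize(w, group_size=64, bits=4)
        # Key naming is illustrative; the repository's actual layout may differ.
        quantized[name] = wq
        quantized[f"{name}_scales"] = scales
        quantized[f"{name}_biases"] = biases
    else:
        quantized[name] = w

mx.save_safetensors("model-4bit.safetensors", quantized)
```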
## Matryoshka Dimensions

All quantization levels support Matryoshka truncation of embeddings to 32, 64, 128, 256, 512, 768, or 1024 dimensions:

```python
embeddings_256 = embeddings[:, :256]
```
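If truncated embeddings are compared with cosine similarity, it is usually worth re-normalizing after truncation. A minimal sketch, assuming `embeddings` is an `mx.array` as in the snippets above:

```python
import mlx.core as mx

embeddings_256 = embeddings[:, :256]
# Re-normalize each row to unit length after truncation.
embeddings_256 = embeddings_256 / mx.linalg.norm(embeddings_256, axis=-1, keepdims=True)
```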
## Model Details
- Architecture: Qwen3-0.6B with task-specific LoRA adapters (pre-merged)
- Embedding dimension: 1024
- Max sequence length: 32768 tokens
- Optimized for: Apple Silicon (M1/M2/M3/M4) with Metal acceleration
## Performance Comparison

| Quantization | File | Size | Speed | Quality (Cosine Similarity) |
|---|---|---|---|---|
| Full Precision | `model.safetensors` | 2.28 GB | Baseline | 1.0000 |
| 8-bit | `model-8bit.safetensors` | 639 MB | ~1.5-2x faster | ≥0.9999 |
| 4-bit | `model-4bit.safetensors` | 355 MB | ~2-3x faster | ≥0.99 |
## Files

```
jina-embeddings-v5-text-small-retrieval-mlx/
├── model.safetensors          # Full precision weights
├── model-4bit.safetensors     # 4-bit quantized weights
├── model-8bit.safetensors     # 8-bit quantized weights
├── model.py                   # Model implementation
├── config.json                # Model configuration
├── tokenizer.json             # Tokenizer
├── tokenizer_config.json
├── vocab.json
├── merges.txt
├── .gitignore
└── README.md
```
## Citation

```bibtex
@misc{akram2026jinaembeddingsv5texttasktargetedembeddingdistillation,
  title={jina-embeddings-v5-text: Task-Targeted Embedding Distillation},
  author={Mohammad Kalim Akram and Saba Sturua and Nastia Havriushenko and Quentin Herreros and Michael Günther and Maximilian Werk and Han Xiao},
  year={2026},
  eprint={2602.15547},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2602.15547},
}
```
## License

CC BY-NC 4.0

## Links