---
license: apple-amlr
base_model:
- mistralai/Mistral-7B-Instruct-v0.2
tags:
- rag
- compression
- retrieval
- end-to-end
- generation
---
# CLaRa-7B-E2E (Compression-16 & 128)
**CLaRa-7B-E2E** is our fully end-to-end unified RAG model, jointly optimizing retrieval and generation under 16× and 128× document compression.
**Training recipe:** End-to-end finetuning with differentiable top-k retrieval and a unified language-modeling objective.
**Benchmarks:** Strong retrieval-augmented QA performance under aggressive compression.
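
As intuition for the differentiable top-k retrieval used in training, one common relaxation replaces hard document selection with an iterated, temperature-controlled softmax over retrieval scores. The sketch below is a generic illustration of that idea, not the paper's exact formulation; `soft_top_k` and its arguments are hypothetical names:

```python
import torch

def soft_top_k(scores: torch.Tensor, k: int, tau: float = 0.1) -> torch.Tensor:
    """Relaxed top-k: returns soft selection weights summing to ~k.

    Each iteration takes a softmax over the remaining scores, accumulates
    it into the selection mask, and down-weights what was just picked.
    As tau -> 0 the mask approaches a hard top-k indicator while staying
    differentiable w.r.t. the retrieval scores.
    """
    s = scores.clone()
    mask = torch.zeros_like(scores)
    for _ in range(k):
        w = torch.softmax(s / tau, dim=-1)             # soft argmax over remaining docs
        mask = mask + w
        s = s + torch.log1p(-w.clamp(max=1.0 - 1e-6))  # suppress the selected doc
    return mask

# Example: soft weights over 20 candidate documents.
scores = torch.randn(20, requires_grad=True)
values = torch.randn(20)                       # e.g., per-document utilities
loss = (soft_top_k(scores, k=4) * values).sum()
loss.backward()                                # gradients reach the retrieval scores
```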
---
## More Details and Usage Examples
Paper: [CLaRa: Bridging Retrieval and Generation with Continuous Latent Reasoning](https://arxiv.org/abs/2511.18659)
GitHub: https://github.com/apple/ml-clara
---
## Example Usage (End-to-End Inference)
```python
from transformers import AutoModel

# Load the end-to-end checkpoint (16x compression variant shown here;
# point this at your local download or the corresponding Hub repository).
unirag = AutoModel.from_pretrained(
    "/mnt/ceph_rbd/model/CLaRa-7B-E2E/compression-16",
    trust_remote_code=True,
).to("cuda")

# Example documents and question: one candidate pool per question
# (a single snippet repeated 20 times, purely for illustration).
documents = [[
    "Weldenia is a monotypic genus of flowering plant in the family Commelinaceae...",
] * 20]
questions = [
    "Which genus of plant grows originally in Mexico and Guatemala, Phylica or Weldenia?"
]

# End-to-end usage (retrieval + generation).
# The effective top-k is controlled by `generation_top_k` in config.json.
out = unirag.generate_from_questions(
    questions=questions,
    documents=documents,
    max_new_tokens=64,
)
print("Generated answer:", out)
```