Curious about quants for embedding models
I have seen many quants of the Qwen embedding models, and I am curious about the result differences between a general unquantized embedding model and an FP8 version of the same model. Just asking here if anyone has done that comparison before I run my own tests on my codebase and compare the vectors generated for the same set of code chunks.
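Concretely, the comparison I have in mind looks something like this (just a sketch; the base URLs and model names are placeholders for wherever the two versions are served, assuming OpenAI-compatible /v1/embeddings endpoints like the ones sglang exposes):

```python
# Sketch: embed the same code chunks with two servings of the model
# (e.g. unquantized vs. FP8) and compare the resulting vectors.
# Endpoints and model names are placeholders for your own deployments.
import numpy as np
from openai import OpenAI

bf16 = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
fp8 = OpenAI(base_url="http://localhost:30001/v1", api_key="EMPTY")

def embed(client, texts, model):
    resp = client.embeddings.create(model=model, input=texts)
    return np.array([d.embedding for d in resp.data])

chunks = ["def foo(): ...", "class Bar: ..."]  # your code chunks here
a = embed(bf16, chunks, "model-bf16")          # placeholder model name
b = embed(fp8, chunks, "model-fp8")            # placeholder model name

# Per-chunk cosine similarity between the two versions' embeddings;
# values very close to 1.0 mean the quant barely moved the vectors.
a /= np.linalg.norm(a, axis=1, keepdims=True)
b /= np.linalg.norm(b, axis=1, keepdims=True)
print((a * b).sum(axis=1))
```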
I just started downloading the Octen 4B, since the Qwen 8B runs a little slow for me (even at 1024 dims) and fills up the sglang request queue. I see that the Octen 4B does better on benchmarks than the Qwen 8B.
Thanks for putting effort into this model.
For embedding models, the performance loss from quantized versions on downstream retrieval tasks is negligible.
We don’t currently offer an FP8 version, but our INT8 release (Octen-Embedding-8B-INT8) performs very close to the BF16 version on benchmarks, with less than a 1% difference.
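If you want to check this quickly on your own data, comparing the top-k results each version retrieves is more informative than comparing raw vectors. A minimal sketch with stand-in random data (in practice the arrays would come from the BF16 and INT8 servings):

```python
# Sketch: measure how much quantization changes retrieval, given two
# embedding matrices for the same corpus and queries.
import numpy as np

rng = np.random.default_rng(0)
dim, n_docs, n_queries, k = 1024, 500, 20, 10

# Stand-in data: in practice these come from the BF16 and INT8 servings.
docs_bf16 = rng.normal(size=(n_docs, dim)).astype(np.float32)
docs_int8 = docs_bf16 + rng.normal(scale=0.01, size=docs_bf16.shape).astype(np.float32)
queries_bf16 = rng.normal(size=(n_queries, dim)).astype(np.float32)
queries_int8 = queries_bf16 + rng.normal(scale=0.01, size=queries_bf16.shape).astype(np.float32)

def topk(q, c, k):
    # Cosine similarity via normalized dot product, then top-k indices.
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    c = c / np.linalg.norm(c, axis=1, keepdims=True)
    return np.argsort(-(q @ c.T), axis=1)[:, :k]

full = topk(queries_bf16, docs_bf16, k)
quant = topk(queries_int8, docs_int8, k)

# Average fraction of shared top-k results (1.0 = identical retrieval).
overlap = np.mean([len(set(a) & set(b)) / k for a, b in zip(full, quant)])
print(f"top-{k} overlap: {overlap:.3f}")
```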
Thanks for the input. I see Qdrant supports persisting the vectors as int8 via scalar quantization, so I will try the int8 quant.
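For anyone else landing here, enabling int8 scalar quantization when creating a Qdrant collection looks roughly like this (the URL, collection name, and vector size are placeholders for my setup; Qdrant keeps the original float vectors and uses the int8 copies to speed up search):

```python
# Sketch: create a Qdrant collection with int8 scalar quantization.
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

client.create_collection(
    collection_name="code_chunks",
    vectors_config=models.VectorParams(
        size=1024,                       # embedding dimension
        distance=models.Distance.COSINE,
    ),
    quantization_config=models.ScalarQuantization(
        scalar=models.ScalarQuantizationConfig(
            type=models.ScalarType.INT8,
            quantile=0.99,    # clip outliers before quantizing
            always_ram=True,  # keep int8 vectors in RAM for fast search
        ),
    ),
)
```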