alvarobartt (HF Staff) committed · verified
Commit 3cc7935 · Parent(s): fee984e

Clone from jinaai/jina-code-embeddings-1.5b

Files changed (1):
  1. README.md (+293, -0)

README.md ADDED
---
base_model:
- Qwen/Qwen2.5-Coder-1.5B
license: cc-by-nc-4.0
tags:
- feature-extraction
- mteb
- sentence-transformers
inference: false
library_name: transformers
---

<br><br>

<p align="center">
<img src="https://huggingface.co/datasets/jinaai/documentation-images/resolve/main/logo.webp" alt="Jina AI: Your Search Foundation, Supercharged!" width="150px">
</p>

<p align="center">
<b>The code embedding model trained by <a href="https://jina.ai/"><b>Jina AI</b></a>.</b>
</p>

# Jina Code Embeddings: A Small but Performant Code Embedding Model

## Intended Usage & Model Info
`jina-code-embeddings` is an embedding model for code retrieval.
The model supports several kinds of code retrieval (text-to-code, code-to-code, code-to-text, code-to-completion) as well as technical question answering, across 15+ programming languages.

Built on [Qwen/Qwen2.5-Coder-1.5B](https://huggingface.co/Qwen/Qwen2.5-Coder-1.5B), `jina-code-embeddings-1.5b` features:

- **Multilingual support** (15+ programming languages) covering a wide range of domains, including web development, software development, machine learning, data science, and educational coding problems.
- **Task-specific instruction prefixes** for NL2Code, Code2Code, Code2NL, Code2Completion, and Technical QA, selected at inference time.
- **Flexible embedding size**: dense embeddings are 1536-dimensional by default but can be truncated to as few as 128 dimensions with minimal performance loss (see the truncation sketch after the table below).

Summary of features:

| Feature | Jina Code Embeddings 1.5B |
|------------|------------|
| Base Model | Qwen2.5-Coder-1.5B |
| Supported Tasks | `nl2code`, `code2code`, `code2nl`, `code2completion`, `qa` |
| Model DType | BFloat16 |
| Max Sequence Length | 32768 |
| Embedding Vector Dimension | 1536 |
| Matryoshka Dimensions | 128, 256, 512, 1024, 1536 |
| Pooling Strategy | Last-token pooling |
| Attention Mechanism | FlashAttention2 |

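Because the embeddings are Matryoshka-trained, a smaller embedding is obtained by keeping only the leading dimensions and re-normalizing; similarities computed on the truncated vectors closely track those of the full 1536-dimensional ones. A minimal sketch in plain PyTorch (the `truncate_embeddings` helper is illustrative, not part of the model's API), applicable to the embeddings produced by any of the examples in the Usage section below:

```python
import torch
import torch.nn.functional as F

def truncate_embeddings(embeddings: torch.Tensor, dim: int = 256) -> torch.Tensor:
    """Keep the first `dim` Matryoshka dimensions and re-normalize to unit length."""
    return F.normalize(embeddings[:, :dim], p=2, dim=1)

# Stand-in for real model outputs: two full-size 1536-dimensional embeddings
full = torch.randn(2, 1536)
small = truncate_embeddings(full, dim=256)
print(small.shape)  # torch.Size([2, 256])
```
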
## Usage

<details>
<summary>Requirements</summary>

The following Python packages are required:

- `transformers>=4.53.0`
- `torch>=2.7.1`

### Optional / Recommended
- **flash-attention**: Installing [flash-attention](https://github.com/Dao-AILab/flash-attention) is recommended for faster and more memory-efficient inference, but it is not mandatory; see the loading sketch after this list.
- **sentence-transformers**: If you want to use the model via the `sentence-transformers` interface, install this package as well.

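If flash-attention is installed, the model can be loaded in `bfloat16` with the FlashAttention2 backend. A minimal sketch using standard `transformers` loading options; this is optional, and the `transformers` example below also works with the default attention implementation:

```python
import torch
from transformers import AutoModel

# Optional: bfloat16 weights plus the FlashAttention2 kernels on GPU
model = AutoModel.from_pretrained(
    "jinaai/jina-code-embeddings-1.5b",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="cuda",
)
```
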
</details>

<details>
<summary>via <a href="https://huggingface.co/docs/transformers/en/index">transformers</a></summary>

```python
# !pip install transformers>=4.53.0 torch>=2.7.1

import torch
import torch.nn.functional as F

from transformers import AutoModel, AutoTokenizer

INSTRUCTION_CONFIG = {
    "nl2code": {
        "query": "Find the most relevant code snippet given the following query:\n",
        "passage": "Candidate code snippet:\n"
    },
    "qa": {
        "query": "Find the most relevant answer given the following question:\n",
        "passage": "Candidate answer:\n"
    },
    "code2code": {
        "query": "Find an equivalent code snippet given the following code snippet:\n",
        "passage": "Candidate code snippet:\n"
    },
    "code2nl": {
        "query": "Find the most relevant comment given the following code snippet:\n",
        "passage": "Candidate comment:\n"
    },
    "code2completion": {
        "query": "Find the most relevant completion given the following start of code snippet:\n",
        "passage": "Candidate completion:\n"
    }
}

MAX_LENGTH = 8192

def cosine_similarity(x, y):
    x = F.normalize(x, p=2, dim=1)
    y = F.normalize(y, p=2, dim=1)
    return x @ y.T

def last_token_pool(last_hidden_states, attention_mask):
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]

def add_instruction(instruction, query):
    return f'{instruction}{query}'

# The queries and documents to embed
queries = [
    add_instruction(INSTRUCTION_CONFIG["nl2code"]["query"], "print hello world in python"),
    add_instruction(INSTRUCTION_CONFIG["nl2code"]["query"], "initialize array of 5 zeros in c++")
]
documents = [
    add_instruction(INSTRUCTION_CONFIG["nl2code"]["passage"], "print('Hello World!')"),
    add_instruction(INSTRUCTION_CONFIG["nl2code"]["passage"], "int arr[5] = {0, 0, 0, 0, 0};")
]
all_inputs = queries + documents

tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-code-embeddings-1.5b')
model = AutoModel.from_pretrained('jinaai/jina-code-embeddings-1.5b')

batch_dict = tokenizer(
    all_inputs,
    padding=True,
    truncation=True,
    max_length=MAX_LENGTH,
    return_tensors="pt",
)
batch_dict.to(model.device)
outputs = model(**batch_dict)
embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
query_embeddings = embeddings[:2]
passage_embeddings = embeddings[2:]

# Compute the (cosine) similarity between the query and document embeddings
scores = cosine_similarity(query_embeddings, passage_embeddings)
print(scores)
# tensor([[0.7647, 0.1115],
#         [0.0930, 0.6606]], grad_fn=<MmBackward0>)
```
</details>

<details>
<summary>via <a href="https://sbert.net/">sentence-transformers</a></summary>

```python
# !pip install sentence_transformers>=5.0.0 torch>=2.7.1

import torch
from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer(
    "jinaai/jina-code-embeddings-1.5b",
    model_kwargs={
        "torch_dtype": torch.bfloat16,
        "attn_implementation": "flash_attention_2",
        "device_map": "cuda"
    },
    tokenizer_kwargs={"padding_side": "left"},
)

# The queries and documents to embed
queries = [
    "print hello world in python",
    "initialize array of 5 zeros in c++"
]
documents = [
    "print('Hello World!')",
    "int arr[5] = {0, 0, 0, 0, 0};"
]

query_embeddings = model.encode(queries, prompt_name="nl2code_query")
document_embeddings = model.encode(documents, prompt_name="nl2code_document")

# Compute the (cosine) similarity between the query and document embeddings
similarity = model.similarity(query_embeddings, document_embeddings)
print(similarity)
# tensor([[0.7670, 0.1117],
#         [0.0938, 0.6607]])
```

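The Matryoshka dimensions can also be used through this interface by capping the embedding size at load time. A minimal sketch, assuming the `truncate_dim` argument of `SentenceTransformer` (available in recent `sentence-transformers` releases):

```python
from sentence_transformers import SentenceTransformer

# Load with embeddings truncated to the first 256 Matryoshka dimensions
small_model = SentenceTransformer(
    "jinaai/jina-code-embeddings-1.5b",
    truncate_dim=256,
    tokenizer_kwargs={"padding_side": "left"},
)

small_embeddings = small_model.encode(
    ["print hello world in python"],
    prompt_name="nl2code_query",
)
print(small_embeddings.shape)  # expected: (1, 256)
```
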
</details>

<details>
<summary>via <a href="https://github.com/vllm-project/vllm">vLLM</a></summary>

```python
import torch
import torch.nn.functional as F
from vllm import LLM

INSTRUCTION_CONFIG = {
    "nl2code": {
        "query": "Find the most relevant code snippet given the following query:\n",
        "passage": "Candidate code snippet:\n"
    },
    "qa": {
        "query": "Find the most relevant answer given the following question:\n",
        "passage": "Candidate answer:\n"
    },
    "code2code": {
        "query": "Find an equivalent code snippet given the following code snippet:\n",
        "passage": "Candidate code snippet:\n"
    },
    "code2nl": {
        "query": "Find the most relevant comment given the following code snippet:\n",
        "passage": "Candidate comment:\n"
    },
    "code2completion": {
        "query": "Find the most relevant completion given the following start of code snippet:\n",
        "passage": "Candidate completion:\n"
    }
}

def add_instruction(instruction, text):
    return f"{instruction}{text}"

def cosine_similarity(x, y):
    x = F.normalize(x, p=2, dim=1)
    y = F.normalize(y, p=2, dim=1)
    return x @ y.T

# Build the queries and documents
queries = [
    add_instruction(INSTRUCTION_CONFIG["nl2code"]["query"], "print hello world in python"),
    add_instruction(INSTRUCTION_CONFIG["nl2code"]["query"], "initialize array of 5 zeros in c++"),
]
documents = [
    add_instruction(INSTRUCTION_CONFIG["nl2code"]["passage"], "print('Hello World!')"),
    add_instruction(INSTRUCTION_CONFIG["nl2code"]["passage"], "int arr[5] = {0, 0, 0, 0, 0};"),
]
all_inputs = queries + documents

# vLLM embedding model
llm = LLM(
    model="jinaai/jina-code-embeddings-1.5b",
    task="embed"
)

# Encode with vLLM
outputs = llm.encode(all_inputs)

# Collect embeddings into a single tensor
emb_list = []
for out in outputs:
    vec = out.outputs.data.detach()
    emb_list.append(vec)
embeddings = torch.stack(emb_list, dim=0)

# Split into query and passage embeddings
n_q = len(queries)
query_embeddings = embeddings[:n_q]
passage_embeddings = embeddings[n_q:]

# Cosine similarity matrix (queries x documents)
scores = cosine_similarity(query_embeddings, passage_embeddings)
print(scores)
# tensor([[0.7650, 0.1118],
#         [0.0937, 0.6613]])
```

</details>

## Citation

Please refer to the [jina-code-embeddings technical report](https://arxiv.org/abs/2508.21290) for training details and benchmarks. If you find the model useful in your research, please cite the following paper:

```bibtex
@misc{kryvosheieva2025efficientcodeembeddingscode,
      title={Efficient Code Embeddings from Code Generation Models},
      author={Daria Kryvosheieva and Saba Sturua and Michael Günther and Scott Martens and Han Xiao},
      year={2025},
      eprint={2508.21290},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.21290},
}
```

## Contact

Join our [Discord community](https://discord.jina.ai) and chat with other community members about ideas.