---
license: gpl-3.0
language:
  - en
  - zh
pipeline_tag: text-ranking
tags:
  - transformers
  - sentence-transformers
  - binary code
  - binary
  - sentence-similarity
---


# BinSeek: Cross-modal Retrieval Models for Stripped Binary Analysis

BinSeek is the first two-stage cross-modal retrieval framework specifically designed for stripped binary code analysis. It bridges the semantic gap between natural language queries and binary code (decompiled pseudocode), enabling effective retrieval of relevant binary functions from large-scale codebases.

BinSeek closes this gap with a two-stage retrieval strategy:

- **BinSeek-Embedding**: An embedding model trained to learn the semantic relevance between binary code and natural language descriptions, used for efficient first-stage candidate retrieval.
- **BinSeek-Reranker**: A reranking model that scores the relevance of each candidate function to the description, augmented with the function's calling context, for more precise results.
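
The retrieve-then-rerank flow can be sketched in plain Python. This is a minimal illustration, not the BinSeek implementation: `two_stage_search`, `toy_embed_score`, and `toy_rerank_score` are hypothetical names, and the toy scorers are word-overlap stand-ins for the two models.

```python
from typing import Callable, List, Tuple


def two_stage_search(
    query: str,
    corpus: List[str],
    embed_score: Callable[[str, str], float],
    rerank_score: Callable[[str, str], float],
    top_k: int = 10,
) -> List[Tuple[int, float]]:
    """Return (corpus_index, rerank_score) pairs, best first."""
    # Stage 1: cheap scoring over the whole corpus, keep the top_k candidates.
    stage1 = sorted(
        range(len(corpus)),
        key=lambda i: embed_score(query, corpus[i]),
        reverse=True,
    )[:top_k]
    # Stage 2: expensive scoring over the shortlisted candidates only.
    stage2 = [(i, rerank_score(query, corpus[i])) for i in stage1]
    return sorted(stage2, key=lambda pair: pair[1], reverse=True)


# Toy stand-ins for the two models (hypothetical, for illustration only).
def toy_embed_score(query: str, code: str) -> float:
    # Number of words shared by the query and the candidate.
    return float(len(set(query.lower().split()) & set(code.lower().split())))


def toy_rerank_score(query: str, code: str) -> float:
    # Same overlap, penalised by candidate length.
    return toy_embed_score(query, code) / (1 + len(code.split()))


toy_corpus = ["xtea encrypt block", "malloc wrapper", "string compare strcmp"]
results = two_stage_search(
    "xtea encrypt", toy_corpus, toy_embed_score, toy_rerank_score, top_k=2
)
```

The point of the split is cost: the first stage touches every function in the corpus, so it must be cheap, while the second stage only sees `top_k` candidates and can afford a larger model and longer inputs.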

<p align="center">
  <img src="https://raw.githubusercontent.com/XingTuLab/BinSeek/main/assets/binseek.png" alt="Overview of BinSeek" width="95%">
</p>

## Model Information

| Model                                                              | Domain | Parameters | Embedding Dim | Max Tokens |
|:-------------------------------------------------------------------|:------:|:----------:|:-------------:|:----------:|
| [🤗 BinSeek-Embedding](https://huggingface.co/XingTuLab/BinSeek-Embedding) | Binary |    0.3B    |     1024      |    4096    |
| [🤗 BinSeek-Reranker](https://huggingface.co/XingTuLab/BinSeek-Reranker)   | Binary |    0.6B    |       /       |   16384    |


The table below compares BinSeek with Qwen3 baselines on binary code retrieval. The two-stage pipeline (Embedding + Reranker) scores highest on all three metrics:

| Model                    | Model Size | Recall@1 | Recall@3 | MRR@3  |
|:-------------------------|:----------:|:--------:|:--------:|:------:|
| Qwen3-Embedding-8B       | 8B         |  57.50   |  65.00   | 60.75  |
| BinSeek-Embedding        | 0.3B       |  67.00   |  80.50   | 72.83  |
| Qwen3-Reranker-8B        | 8B         |  62.50   |  80.50   | 70.83  |
| BinSeek-Reranker         | 0.6B       |  61.50   |  83.00   | 70.50  |
| BinSeek (Emb + Rerank)   | /          |  76.75   |  84.50   | 80.25  |
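
For readers unfamiliar with the metrics, here is a minimal sketch of how Recall@k and MRR@k are commonly computed. It assumes exactly one relevant function per query, which is an assumption of this sketch and not something stated above; `recall_at_k` and `mrr_at_k` are hypothetical helper names.

```python
from typing import Sequence


def recall_at_k(ranked: Sequence[Sequence[int]], relevant: Sequence[int], k: int) -> float:
    """Fraction of queries whose relevant item appears in the top k results."""
    hits = sum(1 for ranks, gold in zip(ranked, relevant) if gold in ranks[:k])
    return hits / len(relevant)


def mrr_at_k(ranked: Sequence[Sequence[int]], relevant: Sequence[int], k: int) -> float:
    """Mean of 1/rank of the relevant item, counting 0 if it is outside the top k."""
    total = 0.0
    for ranks, gold in zip(ranked, relevant):
        top = list(ranks[:k])
        if gold in top:
            total += 1.0 / (top.index(gold) + 1)
    return total / len(relevant)


# Three toy queries: each inner list is a ranking of corpus indices.
toy_ranked = [[2, 0, 1], [1, 2, 0], [0, 1, 2]]
toy_relevant = [2, 2, 1]  # the single relevant corpus index per query

r1 = recall_at_k(toy_ranked, toy_relevant, 1)
r3 = recall_at_k(toy_ranked, toy_relevant, 3)
mrr3 = mrr_at_k(toy_ranked, toy_relevant, 3)
```

Under this definition MRR@3 always lies between Recall@1 and Recall@3, since a hit at rank 1 counts fully and hits at ranks 2 and 3 count for a half and a third.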


## Model Usage

### Dependencies

```bash
pip install torch "sentence-transformers>=5.1.2" "transformers>=4.57.1"
```

Our models are compatible with the following frameworks. We recommend using the **two-stage pipeline** (Embedding + Reranker) for optimal retrieval performance.

### Sentence-Transformers

```python
import torch
from sentence_transformers import SentenceTransformer, CrossEncoder

# Query and Corpus
query = "A function that implements XTEA encryption algorithm"

# Binary pseudocode corpus (decompiled by IDA Pro)
corpus = [
'''char *__fastcall sub_100000924(__int64 a1, __int64 a2, unsigned int a3)
{
  unsigned int i; // [xsp+1Ch] [xbp-34h]
  char *v5; // [xsp+20h] [xbp-30h]
  unsigned int v6; // [xsp+2Ch] [xbp-24h]
  __int64 v9; // [xsp+40h] [xbp-10h] BYREF

  v6 = a3;
  v9 = 0;
  if ( a3 % 8 )
    v6 = a3 + 8 - a3 % 8;
  v5 = (char *)malloc(v6);
  __memset_chk(v5, 0, v6, -1);
  for ( i = 0; i < v6; i += 8 )
  {
    v9 = *(_QWORD *)(a1 + (int)i);
    sub_100000A68(32, (unsigned int *)&v9, a2);
    __memcpy_chk(&v5[i], &v9, 8, -1);
  }
  return v5;
}''',
'''void *__fastcall sub_401000(size_t size){
    void *ptr = malloc(size);
    if (!ptr) { perror("malloc failed"); exit(1); }
    return ptr;
}''',
'''int __fastcall sub_402000(char *s1, char *s2){
    return strcmp(s1, s2);
}''',
# ... more functions in your corpus
]
# Context for each corpus function: the pseudocode of its selected callees, concatenated into a single string
# (empty string if there is none). See our paper for how callees are selected.
corpus_context = [
'''__int64 __fastcall sub_100000A68(__int64 result, unsigned int *a2, __int64 a3)
{
  unsigned int v3; // [xsp+8h] [xbp-28h]
  unsigned int v4; // [xsp+Ch] [xbp-24h]
  unsigned int v5; // [xsp+10h] [xbp-20h]
  unsigned int i; // [xsp+14h] [xbp-1Ch]

  v5 = *a2;
  v4 = a2[1];
  v3 = 0;
  for ( i = 0; i < (unsigned int)result; ++i )
  {
    v5 += (((v4 >> 5) ^ (16 * v4)) + v4) ^ (v3 + *(_DWORD *)(a3 + 4LL * (v3 & 3)));
    v3 -= 1640531527;
    v4 += (((v5 >> 5) ^ (16 * v5)) + v5) ^ (v3 + *(_DWORD *)(a3 + 4LL * ((v3 >> 11) & 3)));
  }
  *a2 = v5;
  a2[1] = v4;
  return result;
}''',
"",
"",
# ... more context functions in your corpus
]

# Embedding-based Retrieval
embedding_model = SentenceTransformer(
    "XingTuLab/BinSeek-Embedding",
    model_kwargs={"dtype": torch.bfloat16},
    trust_remote_code=True
)

query_embeddings = embedding_model.encode([query])
corpus_embeddings = embedding_model.encode(corpus, batch_size=64)

similarity_matrix = embedding_model.similarity(query_embeddings, corpus_embeddings)
scores = similarity_matrix[0].cpu().float().numpy()
top_k = 10  # Number of candidates to retrieve
top_k_indices = scores.argsort()[::-1][:top_k]
candidates = [corpus[i] for i in top_k_indices]

print("=== Stage 1: Embedding Retrieval Results ===")
for i, idx in enumerate(top_k_indices):
    print(f"Rank {i+1}: Score={scores[idx]:.4f}, Corpus Index={idx}")

def build_candidates_with_context(candidates_ids):
    candidates_with_context = []
    for candidate_id in candidates_ids:
        data = f"<pseudocode>\n{corpus[candidate_id]}\n</pseudocode>\n<context>\n{corpus_context[candidate_id]}\n</context>"
        candidates_with_context.append(data)
    return candidates_with_context

candidates_with_context = build_candidates_with_context(top_k_indices)

# Reranking for Precise Results
reranker = CrossEncoder(
    "XingTuLab/BinSeek-Reranker",
    model_kwargs={"dtype": torch.bfloat16},
    trust_remote_code=True
)

reranked_results = reranker.rank(query, candidates_with_context)

print("\n=== Stage 2: Reranking Results ===")
print(f"Query: {query}")
for i, rank in enumerate(reranked_results):
    # corpus_id indexes into the candidate list, so map it back to the full corpus
    original_idx = top_k_indices[rank['corpus_id']]
    print(f"Rank {i+1}: Score={rank['score']:.4f}, Corpus Index={original_idx}")
```
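
The example above hard-codes `corpus_context`. As a rough sketch of how such a list might be assembled from a call graph, the helper below concatenates callee pseudocode into one string per function. Everything here is hypothetical: `build_context`, the `call_graph` and `pseudocode` dictionaries, and in particular the "first `max_callees` callees" rule, which is a placeholder and not the selection strategy described in the paper.

```python
from typing import Dict, List


def build_context(
    func_name: str,
    call_graph: Dict[str, List[str]],
    pseudocode: Dict[str, str],
    max_callees: int = 3,
) -> str:
    """Concatenate the pseudocode of up to max_callees callees of func_name."""
    callees = call_graph.get(func_name, [])
    # Skip callees with no decompiled body (e.g. imported library functions).
    bodies = [pseudocode[c] for c in callees if c in pseudocode]
    return "\n\n".join(bodies[:max_callees])


# Toy call graph: function name -> names of the functions it calls.
toy_call_graph = {
    "sub_100000924": ["malloc", "sub_100000A68"],
    "sub_401000": ["malloc"],
}
# Toy pseudocode store: only internal functions have a decompiled body.
toy_pseudocode = {
    "sub_100000A68": "__int64 __fastcall sub_100000A68(__int64 result, unsigned int *a2, __int64 a3);",
}

ctx_xtea = build_context("sub_100000924", toy_call_graph, toy_pseudocode)
ctx_malloc_wrapper = build_context("sub_401000", toy_call_graph, toy_pseudocode)
```

Functions whose callees are all library imports end up with an empty context string, which matches the `""` entries in `corpus_context` above.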

### Transformers

```python
import torch
import numpy as np
from transformers import AutoModel, AutoTokenizer, AutoModelForSequenceClassification

# Query and Corpus
query = "A function that implements XTEA encryption algorithm"

# Binary pseudocode corpus (decompiled by IDA Pro)
corpus = [
'''char *__fastcall sub_100000924(__int64 a1, __int64 a2, unsigned int a3)
{
  unsigned int i; // [xsp+1Ch] [xbp-34h]
  char *v5; // [xsp+20h] [xbp-30h]
  unsigned int v6; // [xsp+2Ch] [xbp-24h]
  __int64 v9; // [xsp+40h] [xbp-10h] BYREF

  v6 = a3;
  v9 = 0;
  if ( a3 % 8 )
    v6 = a3 + 8 - a3 % 8;
  v5 = (char *)malloc(v6);
  __memset_chk(v5, 0, v6, -1);
  for ( i = 0; i < v6; i += 8 )
  {
    v9 = *(_QWORD *)(a1 + (int)i);
    sub_100000A68(32, (unsigned int *)&v9, a2);
    __memcpy_chk(&v5[i], &v9, 8, -1);
  }
  return v5;
}''',
'''void *__fastcall sub_401000(size_t size){
    void *ptr = malloc(size);
    if (!ptr) { perror("malloc failed"); exit(1); }
    return ptr;
}''',
'''int __fastcall sub_402000(char *s1, char *s2){
    return strcmp(s1, s2);
}''',
# ... more functions in your corpus
]
# Context for each corpus function: the pseudocode of its selected callees, concatenated into a single string
# (empty string if there is none). See our paper for how callees are selected.
corpus_context = [
'''__int64 __fastcall sub_100000A68(__int64 result, unsigned int *a2, __int64 a3)
{
  unsigned int v3; // [xsp+8h] [xbp-28h]
  unsigned int v4; // [xsp+Ch] [xbp-24h]
  unsigned int v5; // [xsp+10h] [xbp-20h]
  unsigned int i; // [xsp+14h] [xbp-1Ch]

  v5 = *a2;
  v4 = a2[1];
  v3 = 0;
  for ( i = 0; i < (unsigned int)result; ++i )
  {
    v5 += (((v4 >> 5) ^ (16 * v4)) + v4) ^ (v3 + *(_DWORD *)(a3 + 4LL * (v3 & 3)));
    v3 -= 1640531527;
    v4 += (((v5 >> 5) ^ (16 * v5)) + v5) ^ (v3 + *(_DWORD *)(a3 + 4LL * ((v3 >> 11) & 3)));
  }
  *a2 = v5;
  a2[1] = v4;
  return result;
}''',
"",
"",
# ... more context functions in your corpus
]

# Embedding-based Retrieval
embed_tokenizer = AutoTokenizer.from_pretrained(
    "XingTuLab/BinSeek-Embedding", 
    trust_remote_code=True
)
embed_model = AutoModel.from_pretrained(
    "XingTuLab/BinSeek-Embedding",
    dtype=torch.bfloat16,
    trust_remote_code=True
).eval().cuda()

def get_embeddings(texts, tokenizer, model, max_length=4096):
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt", max_length=max_length)
    inputs = {k: v.cuda() for k, v in inputs.items()}
    with torch.no_grad():
        outputs = model(**inputs)
        # Last token pooling: use attention_mask to find last valid token position
        attention_mask = inputs["attention_mask"]
        last_token_indices = attention_mask.sum(dim=1) - 1  # (batch_size,)
        batch_indices = torch.arange(outputs.last_hidden_state.size(0), device=outputs.last_hidden_state.device)
        embeddings = outputs.last_hidden_state[batch_indices, last_token_indices, :]
        embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
    return embeddings.cpu().float().numpy()

query_embedding = get_embeddings([query], embed_tokenizer, embed_model)
corpus_embeddings = get_embeddings(corpus, embed_tokenizer, embed_model)

scores = np.dot(query_embedding, corpus_embeddings.T)[0]
top_k = 10
top_k_indices = np.argsort(scores)[::-1][:min(top_k, len(corpus))]
candidates = [corpus[i] for i in top_k_indices]

print("=== Stage 1: Embedding Retrieval Results ===")
for i, idx in enumerate(top_k_indices):
    print(f"Rank {i+1}: Score={scores[idx]:.4f}, Corpus Index={idx}")

def build_candidates_with_context(candidates_ids):
    candidates_with_context = []
    for candidate_id in candidates_ids:
        data = f"<pseudocode>\n{corpus[candidate_id]}\n</pseudocode>\n<context>\n{corpus_context[candidate_id]}\n</context>"
        candidates_with_context.append(data)
    return candidates_with_context

candidates_with_context = build_candidates_with_context(top_k_indices)

# Reranking for Precise Results
rerank_tokenizer = AutoTokenizer.from_pretrained(
    "XingTuLab/BinSeek-Reranker", 
    trust_remote_code=True
)
rerank_model = AutoModelForSequenceClassification.from_pretrained(
    "XingTuLab/BinSeek-Reranker",
    dtype=torch.bfloat16,
    trust_remote_code=True
).eval().cuda()

def rerank(query, candidates, tokenizer, model, max_length=16384):
    pairs = [[query, cand] for cand in candidates]
    inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors="pt", max_length=max_length)
    inputs = {k: v.cuda() for k, v in inputs.items()}
    with torch.no_grad():
        logits = model(**inputs).logits.squeeze(-1)
        scores = torch.sigmoid(logits).float().cpu().numpy()  # Apply sigmoid activation
    return scores

rerank_scores = rerank(query, candidates_with_context, rerank_tokenizer, rerank_model)
reranked_order = np.argsort(rerank_scores)[::-1]

print("\n=== Stage 2: Reranking Results ===")
print(f"Query: {query}")
for i, idx in enumerate(reranked_order):
    original_idx = top_k_indices[idx]
    print(f"Rank {i+1}: Score={rerank_scores[idx]:.4f}, Corpus Index={original_idx}")
```
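
In `get_embeddings`, the last valid token is located with `attention_mask.sum(dim=1) - 1`, which is correct when the tokenizer pads on the right. Which side the BinSeek tokenizer pads on is not stated here, so if you change `padding_side`, the index needs to be computed differently. The pure-Python sketch below (a hypothetical helper operating on one mask row as a list) shows the logic that covers both cases:

```python
from typing import List


def last_token_index(attention_mask: List[int]) -> int:
    """Index of the last real (non-padding) token in one attention-mask row."""
    if attention_mask[-1] == 1:
        # Left padding (or no padding): the final position is a real token.
        return len(attention_mask) - 1
    # Right padding: real tokens occupy the first sum(mask) positions.
    return sum(attention_mask) - 1


right_padded = last_token_index([1, 1, 1, 0, 0])
left_padded = last_token_index([0, 0, 1, 1, 1])
unpadded = last_token_index([1, 1, 1])
```

With left padding, `sum(mask) - 1` would point into the middle of the sequence rather than at its end, so the pooled embedding would silently come from the wrong token.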


## License

This project is released under the GPL-3.0 License and is intended for research purposes only. Please use it responsibly and in accordance with applicable laws and regulations.

## Citation

If you find our work helpful, please consider citing it:

```bibtex
@misc{chen2025BinSeek,
      title={Cross-modal Retrieval Models for Stripped Binary Analysis}, 
      author={Guoqiang Chen and Lingyun Ying and Ziyang Song and Daguang Liu and Qiang Wang and Zhiqi Wang and Li Hu and Shaoyin Cheng and Weiming Zhang and Nenghai Yu},
      year={2025},
      eprint={2512.10393},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2512.10393}, 
}
```