---
license: gpl-3.0
language:
  - en
  - zh
pipeline_tag: text-ranking
tags:
  - transformers
  - sentence-transformers
  - binary code
  - binary
  - sentence-similarity
---


# BinSeek: Cross-modal Retrieval Models for Stripped Binary Analysis

BinSeek is the first two-stage cross-modal retrieval framework specifically designed for stripped binary code analysis. It bridges the semantic gap between natural language queries and binary code (decompiled pseudocode), enabling effective retrieval of relevant binary functions from large-scale codebases.

BinSeek uses a two-stage retrieval strategy:

- **BinSeek-Embedding**: An embedding model trained to learn the semantic relevance between binary code and natural language descriptions, used for efficient first-stage candidate retrieval.
- **BinSeek-Reranker**: A reranking model that carefully judges the relevance of candidate code to the description with calling context augmentation for more precise results.

<p align="center">
  <img src="https://raw.githubusercontent.com/XingTuLab/BinSeek/main/assets/binseek.png" alt="Overview of BinSeek" width="95%">
</p>

## Model Information

| Model                                                              | Domain | Parameters | Embedding Dim | Max Tokens |
|:-------------------------------------------------------------------|:------:|:----------:|:-------------:|:----------:|
| [🤗 BinSeek-Embedding](https://huggingface.co/XingTuLab/BinSeek-Embedding) | Binary |    0.3B    |     1024      |    4096    |
| [🤗 BinSeek-Reranker](https://huggingface.co/XingTuLab/BinSeek-Reranker)   | Binary |    0.6B    |       /       |   16384    |


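On binary code retrieval, the 0.3B BinSeek-Embedding outperforms Qwen3-Embedding-8B on every reported metric, and the full two-stage pipeline scores highest overall: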

| Model                    | Model Size | Recall@1 | Recall@3 | MRR@3  |
|:-------------------------|:----------:|:--------:|:--------:|:------:|
| Qwen3-Embedding-8B       | 8B         |  57.50   |  65.00   | 60.75  |
| BinSeek-Embedding        | 0.3B       |  67.00   |  80.50   | 72.83  |
| Qwen3-Reranker-8B        | 8B         |  62.50   |  80.50   | 70.83  |
| BinSeek-Reranker         | 0.6B       |  61.50   |  83.00   | 70.50  |
| BinSeek (Emb + Rerank)   | /          |  76.75   |  84.50   | 80.25  |
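
For readers unfamiliar with the metrics, Recall@k and MRR@k are read here in their usual sense: Recall@k is the share of queries whose ground-truth function appears in the top k results, and MRR@k averages the reciprocal rank of the ground-truth function, counting 0 when it falls outside the top k. The sketch below assumes one ground-truth function per query and reports values on a 0-100 scale; the helper names and toy data are illustrative only, and the paper remains the authority on the exact evaluation protocol.

```python
def recall_at_k(ranked_ids, gold_ids, k):
    """Percentage of queries whose gold item is within the top-k ranked ids."""
    hits = sum(1 for ranked, gold in zip(ranked_ids, gold_ids) if gold in ranked[:k])
    return 100.0 * hits / len(gold_ids)


def mrr_at_k(ranked_ids, gold_ids, k):
    """Mean reciprocal rank as a percentage; a miss within top-k contributes 0."""
    total = 0.0
    for ranked, gold in zip(ranked_ids, gold_ids):
        top = list(ranked[:k])
        if gold in top:
            total += 1.0 / (top.index(gold) + 1)
    return 100.0 * total / len(gold_ids)


# Toy example: three queries, each with a ranked list of corpus indices.
ranked_ids = [[4, 2, 9], [7, 1, 3], [5, 6, 8]]
gold_ids = [4, 3, 0]  # hypothetical ground-truth corpus index per query

print(f"Recall@1 = {recall_at_k(ranked_ids, gold_ids, 1):.2f}")
print(f"Recall@3 = {recall_at_k(ranked_ids, gold_ids, 3):.2f}")
print(f"MRR@3    = {mrr_at_k(ranked_ids, gold_ids, 3):.2f}")
```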


## Model Usage

### Dependencies

```bash
pip install torch "sentence-transformers>=5.1.2" "transformers>=4.57.1"
```

Our models are compatible with the following frameworks. We recommend using the **two-stage pipeline** (Embedding + Reranker) for optimal retrieval performance.

### Sentence-Transformers

```python
import torch
from sentence_transformers import SentenceTransformer, CrossEncoder

# Query and Corpus
query = "A function that implements XTEA encryption algorithm"

# Binary pseudocode corpus (decompiled by IDA Pro)
corpus = [
'''char *__fastcall sub_100000924(__int64 a1, __int64 a2, unsigned int a3)
{
  unsigned int i; // [xsp+1Ch] [xbp-34h]
  char *v5; // [xsp+20h] [xbp-30h]
  unsigned int v6; // [xsp+2Ch] [xbp-24h]
  __int64 v9; // [xsp+40h] [xbp-10h] BYREF

  v6 = a3;
  v9 = 0;
  if ( a3 % 8 )
    v6 = a3 + 8 - a3 % 8;
  v5 = (char *)malloc(v6);
  __memset_chk(v5, 0, v6, -1);
  for ( i = 0; i < v6; i += 8 )
  {
    v9 = *(_QWORD *)(a1 + (int)i);
    sub_100000A68(32, (unsigned int *)&v9, a2);
    __memcpy_chk(&v5[i], &v9, 8, -1);
  }
  return v5;
}''',
'''void *__fastcall sub_401000(size_t size){
    void *ptr = malloc(size);
    if (!ptr) { perror("malloc failed"); exit(1); }
    return ptr;
}''',
'''int __fastcall sub_402000(char *s1, char *s2){
    return strcmp(s1, s2);
}''',
# ... more functions in your corpus
]
# Context for each corpus function: its selected callee functions, concatenated into a
# single string (empty string if none). See our paper for how context functions are selected.
corpus_context = [
'''__int64 __fastcall sub_100000A68(__int64 result, unsigned int *a2, __int64 a3)
{
  unsigned int v3; // [xsp+8h] [xbp-28h]
  unsigned int v4; // [xsp+Ch] [xbp-24h]
  unsigned int v5; // [xsp+10h] [xbp-20h]
  unsigned int i; // [xsp+14h] [xbp-1Ch]

  v5 = *a2;
  v4 = a2[1];
  v3 = 0;
  for ( i = 0; i < (unsigned int)result; ++i )
  {
    v5 += (((v4 >> 5) ^ (16 * v4)) + v4) ^ (v3 + *(_DWORD *)(a3 + 4LL * (v3 & 3)));
    v3 -= 1640531527;
    v4 += (((v5 >> 5) ^ (16 * v5)) + v5) ^ (v3 + *(_DWORD *)(a3 + 4LL * ((v3 >> 11) & 3)));
  }
  *a2 = v5;
  a2[1] = v4;
  return result;
}''',
"",
"",
# ... more context functions in your corpus
]

# Embedding-based Retrieval
embedding_model = SentenceTransformer(
    "XingTuLab/BinSeek-Embedding",
    model_kwargs={"dtype": torch.bfloat16},
    trust_remote_code=True
)

query_embeddings = embedding_model.encode([query])
corpus_embeddings = embedding_model.encode(corpus, batch_size=64)

similarity_matrix = embedding_model.similarity(query_embeddings, corpus_embeddings)
scores = similarity_matrix[0].cpu().float().numpy()
top_k = 10  # Number of candidates to retrieve
top_k_indices = scores.argsort()[::-1][:top_k]
candidates = [corpus[i] for i in top_k_indices]

print("=== Stage 1: Embedding Retrieval Results ===")
for i, idx in enumerate(top_k_indices):
    print(f"Rank {i+1}: Score={scores[idx]:.4f}, Corpus Index={idx}")

def build_candidates_with_context(candidates_ids):
    candidates_with_context = []
    for candidate_id in candidates_ids:
        data = f"<pseudocode>\n{corpus[candidate_id]}\n</pseudocode>\n<context>\n{corpus_context[candidate_id]}\n</context>"
        candidates_with_context.append(data)
    return candidates_with_context

candidates_with_context = build_candidates_with_context(top_k_indices)

# Reranking for Precise Results
reranker = CrossEncoder(
    "XingTuLab/BinSeek-Reranker",
    model_kwargs={"dtype": torch.bfloat16},
    trust_remote_code=True
)

reranked_results = reranker.rank(query, candidates_with_context)

print("\n=== Stage 2: Reranking Results ===")
print(f"Query: {query}")
for i, result in enumerate(reranked_results):
    original_idx = top_k_indices[result['corpus_id']]
    print(f"Rank {i+1}: Score={result['score']:.4f}, Corpus Index={original_idx}")
```
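
The `corpus_context` list in the example holds, for each function, its context functions concatenated into a single string, with an empty string when there is no context. A minimal sketch of assembling such a list follows, assuming you already have a mapping from each function name to its pseudocode and to its callee names (for example, exported from your decompiler). The names `build_context`, `pseudocode_by_name`, `callees_by_name`, `max_callees`, and the blank-line separator are hypothetical, and taking the first few callees is only a placeholder: the paper describes how context functions are actually selected.

```python
def build_context(func_name, pseudocode_by_name, callees_by_name, max_callees=3):
    """Concatenate the pseudocode of up to `max_callees` callees into one string.

    Placeholder selection: keeps callees in listed order; not the paper's strategy.
    """
    callees = callees_by_name.get(func_name, [])
    # Imported or library callees have no decompiled body, so they are skipped.
    bodies = [pseudocode_by_name[c] for c in callees if c in pseudocode_by_name]
    return "\n\n".join(bodies[:max_callees])


# Toy data standing in for decompiler output (hypothetical).
pseudocode_by_name = {
    "sub_100000924": "char *sub_100000924(...) { /* calls sub_100000A68 */ }",
    "sub_100000A68": "__int64 sub_100000A68(...) { /* XTEA rounds */ }",
    "sub_402000": "int sub_402000(char *s1, char *s2) { return strcmp(s1, s2); }",
}
callees_by_name = {
    "sub_100000924": ["malloc", "sub_100000A68"],
    "sub_402000": ["strcmp"],
}

names = ["sub_100000924", "sub_100000A68", "sub_402000"]
corpus = [pseudocode_by_name[n] for n in names]
corpus_context = [build_context(n, pseudocode_by_name, callees_by_name) for n in names]

for name, ctx in zip(names, corpus_context):
    print(name, "->", repr(ctx))
```

Keeping `corpus` and `corpus_context` index-aligned, as here, is what lets `build_candidates_with_context` look up both by the same candidate index.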

### Transformers

```python
import torch
import numpy as np
from transformers import AutoModel, AutoTokenizer, AutoModelForSequenceClassification

# Query and Corpus
query = "A function that implements XTEA encryption algorithm"

# Binary pseudocode corpus (decompiled by IDA Pro)
corpus = [
'''char *__fastcall sub_100000924(__int64 a1, __int64 a2, unsigned int a3)
{
  unsigned int i; // [xsp+1Ch] [xbp-34h]
  char *v5; // [xsp+20h] [xbp-30h]
  unsigned int v6; // [xsp+2Ch] [xbp-24h]
  __int64 v9; // [xsp+40h] [xbp-10h] BYREF

  v6 = a3;
  v9 = 0;
  if ( a3 % 8 )
    v6 = a3 + 8 - a3 % 8;
  v5 = (char *)malloc(v6);
  __memset_chk(v5, 0, v6, -1);
  for ( i = 0; i < v6; i += 8 )
  {
    v9 = *(_QWORD *)(a1 + (int)i);
    sub_100000A68(32, (unsigned int *)&v9, a2);
    __memcpy_chk(&v5[i], &v9, 8, -1);
  }
  return v5;
}''',
'''void *__fastcall sub_401000(size_t size){
    void *ptr = malloc(size);
    if (!ptr) { perror("malloc failed"); exit(1); }
    return ptr;
}''',
'''int __fastcall sub_402000(char *s1, char *s2){
    return strcmp(s1, s2);
}''',
# ... more functions in your corpus
]
# Context for each corpus function: its selected callee functions, concatenated into a
# single string (empty string if none). See our paper for how context functions are selected.
corpus_context = [
'''__int64 __fastcall sub_100000A68(__int64 result, unsigned int *a2, __int64 a3)
{
  unsigned int v3; // [xsp+8h] [xbp-28h]
  unsigned int v4; // [xsp+Ch] [xbp-24h]
  unsigned int v5; // [xsp+10h] [xbp-20h]
  unsigned int i; // [xsp+14h] [xbp-1Ch]

  v5 = *a2;
  v4 = a2[1];
  v3 = 0;
  for ( i = 0; i < (unsigned int)result; ++i )
  {
    v5 += (((v4 >> 5) ^ (16 * v4)) + v4) ^ (v3 + *(_DWORD *)(a3 + 4LL * (v3 & 3)));
    v3 -= 1640531527;
    v4 += (((v5 >> 5) ^ (16 * v5)) + v5) ^ (v3 + *(_DWORD *)(a3 + 4LL * ((v3 >> 11) & 3)));
  }
  *a2 = v5;
  a2[1] = v4;
  return result;
}''',
"",
"",
# ... more context functions in your corpus
]

# Embedding-based Retrieval
embed_tokenizer = AutoTokenizer.from_pretrained(
    "XingTuLab/BinSeek-Embedding", 
    trust_remote_code=True
)
embed_model = AutoModel.from_pretrained(
    "XingTuLab/BinSeek-Embedding",
    dtype=torch.bfloat16,
    trust_remote_code=True
).eval().cuda()

def get_embeddings(texts, tokenizer, model, max_length=4096):
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt", max_length=max_length)
    inputs = {k: v.cuda() for k, v in inputs.items()}
    with torch.no_grad():
        outputs = model(**inputs)
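        # Last-token pooling: take the hidden state of each sequence's final real token
        attention_mask = inputs["attention_mask"]
        hidden = outputs.last_hidden_state
        if bool(attention_mask[:, -1].all()):
            # Left padding (or no padding): the final position is always a real token
            embeddings = hidden[:, -1, :]
        else:
            # Right padding: index the last position where attention_mask == 1
            last_token_indices = attention_mask.sum(dim=1) - 1  # (batch_size,)
            batch_indices = torch.arange(hidden.size(0), device=hidden.device)
            embeddings = hidden[batch_indices, last_token_indices, :]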
        embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
    return embeddings.cpu().float().numpy()

query_embedding = get_embeddings([query], embed_tokenizer, embed_model)
corpus_embeddings = get_embeddings(corpus, embed_tokenizer, embed_model)

scores = np.dot(query_embedding, corpus_embeddings.T)[0]
top_k = 10
top_k_indices = np.argsort(scores)[::-1][:min(top_k, len(corpus))]
candidates = [corpus[i] for i in top_k_indices]

print("=== Stage 1: Embedding Retrieval Results ===")
for i, idx in enumerate(top_k_indices):
    print(f"Rank {i+1}: Score={scores[idx]:.4f}, Corpus Index={idx}")

def build_candidates_with_context(candidates_ids):
    candidates_with_context = []
    for candidate_id in candidates_ids:
        data = f"<pseudocode>\n{corpus[candidate_id]}\n</pseudocode>\n<context>\n{corpus_context[candidate_id]}\n</context>"
        candidates_with_context.append(data)
    return candidates_with_context

candidates_with_context = build_candidates_with_context(top_k_indices)

# Reranking for Precise Results
rerank_tokenizer = AutoTokenizer.from_pretrained(
    "XingTuLab/BinSeek-Reranker", 
    trust_remote_code=True
)
rerank_model = AutoModelForSequenceClassification.from_pretrained(
    "XingTuLab/BinSeek-Reranker",
    dtype=torch.bfloat16,
    trust_remote_code=True
).eval().cuda()

def rerank(query, candidates, tokenizer, model, max_length=16384):
    pairs = [[query, cand] for cand in candidates]
    inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors="pt", max_length=max_length)
    inputs = {k: v.cuda() for k, v in inputs.items()}
    with torch.no_grad():
        logits = model(**inputs).logits.squeeze(-1)
        scores = torch.sigmoid(logits).float().cpu().numpy()  # Apply sigmoid activation
    return scores

rerank_scores = rerank(query, candidates_with_context, rerank_tokenizer, rerank_model)
reranked_order = np.argsort(rerank_scores)[::-1]

print("\n=== Stage 2: Reranking Results ===")
print(f"Query: {query}")
for i, idx in enumerate(reranked_order):
    original_idx = top_k_indices[idx]
    print(f"Rank {i+1}: Score={rerank_scores[idx]:.4f}, Corpus Index={original_idx}")
```
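
The Transformers example above passes the entire corpus to `get_embeddings` in a single call, which may exhaust GPU memory on larger corpora. One way to avoid that is to encode in fixed-size batches and stack the results, as sketched below. `encode_in_batches` is a hypothetical helper, and the stand-in `fake_encode` exists only so the snippet runs without downloading a model; in practice you would pass something like `lambda texts: get_embeddings(texts, embed_tokenizer, embed_model)`.

```python
import numpy as np


def encode_in_batches(texts, encode_fn, batch_size=64):
    """Apply `encode_fn` to `texts` in chunks and stack the (n, dim) results.

    Assumes `texts` is non-empty and `encode_fn` returns a 2-D array per chunk.
    """
    chunks = []
    for start in range(0, len(texts), batch_size):
        chunks.append(np.asarray(encode_fn(texts[start:start + batch_size])))
    return np.concatenate(chunks, axis=0)


def fake_encode(texts):
    """Stand-in encoder (not a real model): one L2-normalized 4-d vector per text."""
    vecs = np.array(
        [[len(t), t.count("("), t.count(";"), 1.0] for t in texts],
        dtype=np.float32,
    )
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)


corpus = [f"int sub_{i}(int a) {{ return a + {i}; }}" for i in range(10)]
corpus_embeddings = encode_in_batches(corpus, fake_encode, batch_size=4)
print(corpus_embeddings.shape)
```

Because pooling and normalization are applied per sequence, batching should not change the resulting scores beyond small numerical differences from padding.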


## License

This project is released under the GPL-3.0 License and is intended for research purposes only. Please use it responsibly and in accordance with applicable laws and regulations.

## Citation

If you find our work helpful, please consider citing our paper.

```bibtex
@misc{chen2025BinSeek,
      title={Cross-modal Retrieval Models for Stripped Binary Analysis}, 
      author={Guoqiang Chen and Lingyun Ying and Ziyang Song and Daguang Liu and Qiang Wang and Zhiqi Wang and Li Hu and Shaoyin Cheng and Weiming Zhang and Nenghai Yu},
      year={2025},
      eprint={2512.10393},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2512.10393}, 
}
```