# Model Card for cisco-ai/SecureBERT2.0-biencoder
The SecureBERT 2.0 Bi-Encoder is a cybersecurity-domain sentence-similarity and document-embedding model fine-tuned from SecureBERT 2.0.
It independently encodes queries and documents into a shared vector space for semantic search, information retrieval, and cybersecurity knowledge retrieval.
## Model Details

### Model Description
- Developed by: Cisco AI
- Model type: Bi-Encoder (Sentence Transformer)
- Architecture: ModernBERT backbone with dual encoders
- Max sequence length: 1024 tokens
- Output dimension: 768
- Language: English
- License: Apache-2.0
- Finetuned from: cisco-ai/SecureBERT2.0-base
## Uses

### Direct Use
- Semantic search and document similarity in cybersecurity corpora
- Information retrieval and ranking for threat intelligence reports, advisories, and vulnerability notes
- Document embedding for retrieval-augmented generation (RAG) and clustering
### Downstream Use
- Threat intelligence knowledge graph construction
- Cybersecurity QA and reasoning systems
- Security operations center (SOC) data mining
### Out-of-Scope Use
- Non-technical or general-domain text similarity
- Generative or conversational tasks
## Model Architecture

The bi-encoder encodes queries and documents independently into a shared vector space. Because document embeddings can be precomputed offline, this architecture enables scalable approximate nearest-neighbor (ANN) search for candidate retrieval and semantic ranking.
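As a concrete illustration of this flow, the sketch below precomputes document embeddings once and embeds each query independently at search time. The corpus snippets are made-up placeholders, and `util.semantic_search` performs an exact cosine-similarity search; at production scale it would typically be swapped for an ANN index such as FAISS.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("cisco-ai/SecureBERT2.0-biencoder")

# Offline: embed the document corpus once (placeholder snippets).
corpus = [
    "Amcache analysis provides forensic artifacts for detecting fileless malware.",
    "TLS certificate pinning mitigates man-in-the-middle attacks on mobile apps.",
    "YARA rules match byte patterns to classify malware families.",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

# Online: embed the query independently and retrieve the nearest documents.
query_embedding = model.encode(
    "How can fileless malware be detected?", convert_to_tensor=True
)
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)

for hit in hits[0]:
    print(f"{hit['score']:.3f}  {corpus[hit['corpus_id']]}")
```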
## Datasets

### Fine-Tuning Datasets

| Dataset Category | Number of Records |
|---|---|
| Cybersecurity QA corpus | 43,000 |
| Security governance QA corpus | 60,000 |
| Cybersecurity instruction-response corpus | 25,000 |
| Cybersecurity rules corpus (evaluation) | 5,000 |
### Dataset Descriptions

- Cybersecurity QA corpus: 43k question-answer pairs, reports, and technical documents covering network security, malware analysis, cryptography, and cloud security.
- Security governance QA corpus: 60k expert-curated governance and compliance QA pairs emphasizing clear, validated responses.
- Cybersecurity instruction-response corpus: 25k instructional pairs enabling reasoning and instruction-following.
- Cybersecurity rules corpus: 5k structured policy and guideline records used for evaluation.
## How to Get Started with the Model

### Using Sentence Transformers

```bash
pip install -U sentence-transformers
```
### Run Model to Encode

```python
from sentence_transformers import SentenceTransformer

# Load the bi-encoder from the Hugging Face Hub.
model = SentenceTransformer("cisco-ai/SecureBERT2.0-biencoder")

sentences = [
    "How would you use Amcache analysis to detect fileless malware?",
    "Amcache analysis provides forensic artifacts for detecting fileless malware ...",
    "To capture and display network traffic",
]

# Each sentence is encoded independently into a 768-dimensional vector.
embeddings = model.encode(sentences)
print(embeddings.shape)  # (3, 768)
```
### Compute Similarity

```python
from sentence_transformers import util

# Pairwise cosine similarity between all sentence embeddings (a 3 x 3 matrix).
similarity = util.cos_sim(embeddings, embeddings)
print(similarity)
```
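Scores close to 1.0 indicate strong semantic relatedness; in the example above, the Amcache question and its answer should score noticeably higher against each other than either does against the unrelated third sentence.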
### Framework Versions
- python: 3.10.10
- sentence_transformers: 5.0.0
- transformers: 4.52.4
- PyTorch: 2.7.0+cu128
- accelerate: 1.9.0
- datasets: 3.6.0
## Training Details

### Training Dataset

The model was fine-tuned on cybersecurity-specific paired-sentence data for document embedding and similarity learning.

- Dataset size: 35,705 samples
- Columns: `sentence_0`, `sentence_1`, `label`
#### Example Schema
| Field | Type | Description |
|---|---|---|
| sentence_0 | string | Query or short text input |
| sentence_1 | string | Candidate or document text |
| label | float | Similarity score (1.0 = relevant) |
#### Example Samples
| sentence_0 | sentence_1 | label |
|---|---|---|
| Under what circumstances does attribution bias distort intrusion linking? | Attribution bias in intrusion linking occurs when analysts allow preconceived notions, organizational pressures, or cognitive shortcuts to influence their assessment of attack origins and relationships between incidents... | 1.0 |
| How can you identify store buffer bypass speculation artifacts? | Store buffer bypass speculation artifacts represent side-channel vulnerabilities that exploit speculative execution to leak sensitive information... | 1.0 |
### Training Objective and Loss

The model was optimized with contrastive learning to maximize the semantic similarity of relevant cybersecurity text pairs.

- Loss Function: MultipleNegativesRankingLoss
#### Loss Parameters

```json
{
  "scale": 20.0,
  "similarity_fct": "cos_sim"
}
```
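For orientation, the sketch below shows how this configuration maps onto the sentence-transformers training API. It is a minimal illustration rather than the card's actual training pipeline: the batch size, epoch count, and use of `model.fit` are assumptions, and the two pairs are truncated copies of the Example Samples above. MultipleNegativesRankingLoss treats the other `sentence_1` entries in a batch as negatives, so only positive (label = 1.0) pairs are supplied.

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses, util

# Assumed starting point: the base checkpoint named in this card.
model = SentenceTransformer("cisco-ai/SecureBERT2.0-base")

# Positive (sentence_0, sentence_1) pairs; truncated placeholders here.
train_examples = [
    InputExample(texts=[
        "Under what circumstances does attribution bias distort intrusion linking?",
        "Attribution bias in intrusion linking occurs when analysts allow preconceived notions ...",
    ]),
    InputExample(texts=[
        "How can you identify store buffer bypass speculation artifacts?",
        "Store buffer bypass speculation artifacts represent side-channel vulnerabilities ...",
    ]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)  # assumed batch size

# Loss parameters match the card: scale=20.0 with cosine similarity.
train_loss = losses.MultipleNegativesRankingLoss(
    model, scale=20.0, similarity_fct=util.cos_sim
)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)  # assumed epochs
```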
## Reference

```bibtex
@article{aghaei2025securebert,
  title={SecureBERT 2.0: Advanced Language Model for Cybersecurity Intelligence},
  author={Aghaei, Ehsan and Jain, Sarthak and Arun, Prashanth and Sambamoorthy, Arjun},
  journal={arXiv preprint arXiv:2510.00240},
  year={2025}
}
```
## Model Card Authors

Cisco AI

## Model Card Contact

For inquiries, please contact ai-threat-intel@cisco.com.