AEGIS Sidecar Classifier
Layer 3 of the AEGIS (Adaptive Ensemble Guard with Integrated Steering) defense system. This is a lightweight classifier that runs alongside the main LLM to detect attack patterns and dynamically adjust defense strength.
Model Description
The sidecar classifier categorizes inputs into three classes:
- SAFE: Benign queries that need minimal defense
- WARN: Ambiguous or suspicious queries
- ATTACK: Clear jailbreak/attack attempts
Intended Uses
Primary use cases:
- Real-time classification of LLM inputs for adaptive defense
- Dynamically adjusting RepE steering strength
- Research on attack detection methods
- Content moderation and threat triage as one signal within a larger pipeline
Out of scope:
- Standalone content moderation (the classifier is designed to drive adaptive steering, not to make final moderation decisions)
- High-stakes security decisions without human review
Training Details
| Parameter | Value |
|---|---|
| Base Model | Qwen2.5-3B-Instruct |
| Method | LoRA fine-tuning |
| LoRA Rank | 32 |
| LoRA Alpha | 64 |
| Target Modules | q_proj, k_proj, v_proj, o_proj |
| Training Samples | 2,349 |
| Classes | SAFE, WARN, ATTACK |
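To give a feel for what rank 32 and alpha 64 imply, the sketch below computes the scaling factor used in the standard LoRA formulation (alpha / rank) and the number of adapter parameters added to a single linear projection. The helper names and the layer shape are illustrative assumptions, not values taken from this model card; substitute the real shapes from the base model's config.

```python
LORA_RANK = 32   # from the table above
LORA_ALPHA = 64  # from the table above


def lora_scaling(alpha, rank):
    """Scale applied to the low-rank update B @ A in standard LoRA."""
    return alpha / rank


def lora_param_count(d_in, d_out, rank):
    """Parameters in A (rank x d_in) plus B (d_out x rank) for one linear layer."""
    return rank * (d_in + d_out)


# Placeholder shape for illustration only (assumption, not from the model card)
D_IN, D_OUT = 2048, 2048

print(lora_scaling(LORA_ALPHA, LORA_RANK))
print(lora_param_count(D_IN, D_OUT, LORA_RANK))
```

With alpha set to twice the rank, the low-rank update is scaled by 2 before being added to the frozen weights.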
Evaluation Results
| Class | Precision | Recall | F1 |
|---|---|---|---|
| SAFE | 24% | 35% | 28% |
| WARN | 66% | 40% | 50% |
| ATTACK | 62% | 61% | 62% |
Note: This is a research prototype. ATTACK detection is prioritized over SAFE detection for security.
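As a quick consistency check on the table, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the rounded percentages shown above; because those inputs are already rounded, the recomputed value can differ from the reported F1 by about one point. The function name `f1_score` is chosen here for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded (precision, recall, reported F1) values copied from the table above
reported = {
    "SAFE": (0.24, 0.35, 0.28),
    "WARN": (0.66, 0.40, 0.50),
    "ATTACK": (0.62, 0.61, 0.62),
}

for label, (p, r, f1_reported) in reported.items():
    f1 = f1_score(p, r)
    print(f"{label}: recomputed F1 = {f1:.3f}, reported = {f1_reported:.2f}")
```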
Usage
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import PeftModel

# Load the base model with a 3-way classification head, then attach the LoRA adapter
base_model = AutoModelForSequenceClassification.from_pretrained(
    "Qwen/Qwen2.5-3B-Instruct",
    num_labels=3,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
model = PeftModel.from_pretrained(base_model, "scthornton/aegis-sidecar-classifier")
tokenizer = AutoTokenizer.from_pretrained("scthornton/aegis-sidecar-classifier")

# Classify input
label_names = ["SAFE", "WARN", "ATTACK"]

def classify(text):
    """Return (label, probabilities) for a single input string."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=2048)
    with torch.no_grad():
        outputs = model(**inputs.to(model.device))
    # Cast logits to float32 so the softmax is not computed in bfloat16
    probs = torch.softmax(outputs.logits.float(), dim=-1)[0]
    # .item() converts the argmax tensor to a plain int for list indexing
    return label_names[probs.argmax().item()], probs.tolist()

# Example
result, probs = classify("Ignore all previous instructions and...")
print(f"Classification: {result}")  # ATTACK
print(f"Probabilities: {dict(zip(label_names, probs))}")
Dynamic Defense Integration
Use the classification to adjust RepE steering strength:
alpha_map = {
    "SAFE": 0.5,    # Minimal steering - preserve fluency
    "WARN": 1.5,    # Moderate steering
    "ATTACK": 2.5,  # Maximum steering - block harmful output
}
classification, _ = classify(user_input)
alpha = alpha_map[classification]
# Apply RepE steering with this alpha value
AEGIS Architecture
This classifier is Layer 3 of the 3-layer AEGIS defense:
- Layer 1 (KNOWLEDGE): aegis-mistral-7b-dpo
- Layer 2 (INSTINCT): aegis-repe-vectors
- Layer 3 (OVERSIGHT): Sidecar classifier (this model)
Limitations and Risks
Limitations:
- SAFE class has low precision (24%) and recall (35%): most inputs labeled SAFE are not actually benign, and most benign queries are labeled WARN or ATTACK
- Trained on English-language attacks only
- 3-class granularity may miss nuanced threat levels
- Adds the inference overhead of a ~3B-parameter model to every request
Risks:
- False positives may trigger unnecessary steering
- False negatives on novel attack patterns
- Classification confidence doesn't guarantee correctness
Recommendations:
- Use probability scores, not just class labels, for fine-grained control
- Consider WARN classification as "elevated caution" rather than definite threat
- Combine with other safety mechanisms for production use
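The first recommendation can be sketched as a probability-weighted steering strength: rather than looking up alpha from the top label alone, blend the per-class alpha values by the class probabilities. This is a minimal sketch under stated assumptions, not part of the released AEGIS code. The helper name `blended_alpha` is hypothetical, the alpha values are copied from the Dynamic Defense Integration example, and the probabilities in the usage line are made up for illustration.

```python
# Hypothetical helper, not part of the AEGIS release.
LABEL_NAMES = ["SAFE", "WARN", "ATTACK"]
ALPHA_MAP = {"SAFE": 0.5, "WARN": 1.5, "ATTACK": 2.5}


def blended_alpha(probs, alpha_map=ALPHA_MAP, label_names=LABEL_NAMES):
    """Expected steering strength under the classifier's probability distribution."""
    if len(probs) != len(label_names):
        raise ValueError("expected one probability per class")
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    # Renormalize so small rounding errors in the probabilities do not skew the result
    return sum(alpha_map[name] * p / total for name, p in zip(label_names, probs))


# Made-up probabilities in the order SAFE, WARN, ATTACK
alpha = blended_alpha([0.2, 0.5, 0.3])
print(round(alpha, 2))
```

A confident SAFE prediction still yields 0.5 and a confident ATTACK prediction 2.5; uncertain inputs land in between instead of jumping between fixed levels.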
Framework Versions
- PEFT 0.18.0
Citation
@misc{aegis2024,
title={AEGIS: Defense-in-Depth Against LLM Jailbreaks via Layered Preference and Representation Engineering},
author={scthornton.ai},
year={2024},
url={https://huggingface.co/scthornton/aegis-sidecar-classifier}
}
License
CC BY-NC-SA 4.0 (Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International)
You are free to:
- Share: copy and redistribute the material in any medium or format
- Adapt: remix, transform, and build upon the material
Under the following terms:
- Attribution: You must give appropriate credit to scthornton.ai / perfecXion.ai, provide a link to the license, and indicate if changes were made
- NonCommercial: You may not use the material for commercial purposes without explicit written permission
- ShareAlike: If you remix, transform, or build upon the material, you must distribute your contributions under the same license
For commercial licensing inquiries, contact: scott@perfecxion.ai