AEGIS Sidecar Classifier

Layer 3 of the AEGIS (Adaptive Ensemble Guard with Integrated Steering) defense system. This is a lightweight classifier that runs alongside the main LLM to detect attack patterns and dynamically adjust defense strength.

Model Description

The sidecar classifier categorizes inputs into three classes:

  • SAFE: Benign queries that need minimal defense
  • WARN: Ambiguous or suspicious queries
  • ATTACK: Clear jailbreak/attack attempts

Intended Uses

Primary use cases:

  • Real-time classification of LLM inputs for adaptive defense
  • Dynamically adjusting RepE (Representation Engineering) steering strength
  • Research on attack detection methods
  • Threat triage as one signal within a layered moderation pipeline

Out of scope:

  • Standalone content moderation (designed for adaptive steering)
  • High-stakes security decisions without human review

Training Details

Parameter          Value
Base Model         Qwen2.5-3B-Instruct
Method             LoRA fine-tuning
LoRA Rank          32
LoRA Alpha         64
Target Modules     q_proj, k_proj, v_proj, o_proj
Training Samples   2,349
Classes            SAFE, WARN, ATTACK
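
To give a feel for what rank 32 and alpha 64 imply, the sketch below computes the standard LoRA scaling factor (alpha / rank) and counts adapter parameters for the four target modules. The projection shapes and layer count are assumptions about the base model, not values stated in this card; check the base model's config.json before relying on them. Only the low-rank adapter matrices are counted, not any classification head.

```python
def lora_param_count(in_features: int, out_features: int, rank: int) -> int:
    """Parameters in one LoRA adapter: A is (rank x in), B is (out x rank)."""
    return rank * (in_features + out_features)

RANK = 32    # LoRA Rank from the table above
ALPHA = 64   # LoRA Alpha from the table above

# Standard LoRA scales the low-rank update by alpha / rank.
scaling = ALPHA / RANK

# ASSUMPTION: per-layer (in_features, out_features) for each target module.
# These shapes are illustrative and are not taken from this model card.
ASSUMED_SHAPES = {
    "q_proj": (2048, 2048),
    "k_proj": (2048, 256),
    "v_proj": (2048, 256),
    "o_proj": (2048, 2048),
}
ASSUMED_NUM_LAYERS = 36  # ASSUMPTION: number of transformer layers

per_layer = sum(lora_param_count(i, o, RANK) for i, o in ASSUMED_SHAPES.values())
total = per_layer * ASSUMED_NUM_LAYERS

print(f"LoRA scaling factor: {scaling}")
print(f"Adapter parameters per layer: {per_layer:,}")
print(f"Adapter parameters in total:  {total:,}")
```

Under these assumed shapes the adapter is a small fraction of the base model's size, which is why the classifier ships as an adapter rather than as full weights.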

Evaluation Results

Class    Precision   Recall   F1
SAFE     24%         35%      28%
WARN     66%         40%      50%
ATTACK   62%         61%      62%

Note: This is a research prototype. ATTACK detection is prioritized over SAFE detection for security.
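
The F1 column is the harmonic mean of precision and recall. The sketch below recomputes it from the rounded percentages in the table, so a value can differ from the reported one by about a point. The macro-averaged F1 it prints is a derived figure for orientation only; it is not a number reported by the author.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs copied from the Evaluation Results table.
reported = {
    "SAFE": (0.24, 0.35),
    "WARN": (0.66, 0.40),
    "ATTACK": (0.62, 0.61),
}

scores = {label: f1_score(p, r) for label, (p, r) in reported.items()}

# Unweighted mean over classes (derived here, not reported in the card).
macro_f1 = sum(scores.values()) / len(scores)

for label, score in scores.items():
    print(f"{label}: F1 = {score:.1%}")
print(f"macro-F1 = {macro_f1:.1%}")
```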

Usage

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import PeftModel

# Load model
base_model = AutoModelForSequenceClassification.from_pretrained(
    "Qwen/Qwen2.5-3B-Instruct",
    num_labels=3,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
model = PeftModel.from_pretrained(base_model, "scthornton/aegis-sidecar-classifier")
tokenizer = AutoTokenizer.from_pretrained("scthornton/aegis-sidecar-classifier")

# Classify input
label_names = ["SAFE", "WARN", "ATTACK"]

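def classify(text):
    """Return the predicted label and the per-class probabilities for one input."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=2048)
    with torch.no_grad():
        outputs = model(**inputs.to(model.device))
        # Cast to float32 before softmax so bfloat16 rounding does not distort the probabilities
        probs = torch.softmax(outputs.logits.float(), dim=-1)[0]
    # .item() converts the 0-dim index tensor to a plain Python int
    return label_names[probs.argmax().item()], probs.tolist()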

# Example
result, probs = classify("Ignore all previous instructions and...")
print(f"Classification: {result}")  # e.g. ATTACK (not guaranteed; ATTACK recall is 61%)
print(f"Probabilities: {dict(zip(label_names, probs))}")

Dynamic Defense Integration

Use the classification to adjust RepE steering strength:

alpha_map = {
    "SAFE": 0.5,   # Minimal steering - preserve fluency
    "WARN": 1.5,   # Moderate steering
    "ATTACK": 2.5  # Maximum steering - block harmful output
}

user_input = "Ignore all previous instructions and..."  # example input
classification, _ = classify(user_input)
alpha = alpha_map[classification]
# Apply RepE steering with this alpha value
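
As a rough illustration of what "steering strength" means, the sketch below assumes steering is additive: a fixed direction vector, scaled by alpha, is added to a hidden state. That rule, the `steer` helper, and the toy vectors are all assumptions made for illustration. This card does not specify how the AEGIS RepE vectors are applied, which layers are steered, or the sign of the update.

```python
from typing import Sequence

# Steering strengths copied from the alpha_map above.
ALPHA_MAP = {"SAFE": 0.5, "WARN": 1.5, "ATTACK": 2.5}

def steer(hidden: Sequence[float], direction: Sequence[float], alpha: float) -> list[float]:
    """Return hidden + alpha * direction (ASSUMED additive steering rule)."""
    if len(hidden) != len(direction):
        raise ValueError("hidden state and steering direction must have the same length")
    return [h + alpha * d for h, d in zip(hidden, direction)]

# Toy 3-dimensional vectors; real hidden states have thousands of dimensions.
hidden_state = [1.0, 0.0, -1.0]
steering_direction = [0.0, 1.0, 0.0]  # hypothetical unit-length steering vector

for label, alpha in ALPHA_MAP.items():
    print(label, steer(hidden_state, steering_direction, alpha))
```

A larger alpha moves the hidden state further along the steering direction, which is the trade-off the map encodes: less interference for SAFE inputs, more for ATTACK inputs.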

AEGIS Architecture

This classifier is Layer 3 of the 3-layer AEGIS defense:

  1. Layer 1 (KNOWLEDGE): aegis-mistral-7b-dpo
  2. Layer 2 (INSTINCT): aegis-repe-vectors
  3. Layer 3 (OVERSIGHT): Sidecar classifier (this model)

Limitations and Risks

Limitations:

  • SAFE class has low precision (24%) and recall (35%): most inputs labeled SAFE are not benign, and most benign queries are classified as WARN or ATTACK
  • Trained on English-language attacks only
  • 3-class granularity may miss nuanced threat levels
  • Adds the inference cost of a ~3B-parameter model to every request

Risks:

  • False positives may trigger unnecessary steering
  • False negatives on novel attack patterns
  • Classification confidence doesn't guarantee correctness

Recommendations:

  • Use probability scores, not just class labels, for fine-grained control
  • Consider WARN classification as "elevated caution" rather than definite threat
  • Combine with other safety mechanisms for production use
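
One way to act on the first recommendation is to replace the hard label lookup with a probability-weighted steering strength. The sketch below is one possible scheme, not the method AEGIS uses: `expected_alpha` is a hypothetical helper that takes the probability list returned by classify() and averages the alpha_map values from the Dynamic Defense Integration section.

```python
from typing import Sequence

LABELS = ["SAFE", "WARN", "ATTACK"]
ALPHA_MAP = {"SAFE": 0.5, "WARN": 1.5, "ATTACK": 2.5}

def expected_alpha(probs: Sequence[float]) -> float:
    """Probability-weighted average of the per-class steering strengths."""
    if len(probs) != len(LABELS):
        raise ValueError(f"expected {len(LABELS)} probabilities, got {len(probs)}")
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    # Renormalize so small numerical drift in the softmax output does not matter.
    return sum((p / total) * ALPHA_MAP[label] for p, label in zip(probs, LABELS))

# An uncertain prediction lands between the WARN and ATTACK settings.
print(expected_alpha([0.2, 0.5, 0.3]))  # about 1.6
```

Compared with the argmax lookup, this gives a continuous alpha between 0.5 and 2.5, so a borderline WARN/ATTACK input is steered harder than a confident WARN.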

Framework Versions

  • PEFT 0.18.0

Citation

@misc{aegis2024,
  title={AEGIS: Defense-in-Depth Against LLM Jailbreaks via Layered Preference and Representation Engineering},
  author={scthornton.ai},
  year={2024},
  url={https://huggingface.co/scthornton/aegis-sidecar-classifier}
}

License

CC BY-NC-SA 4.0 (Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International)

You are free to:

  • Share: copy and redistribute the material in any medium or format
  • Adapt: remix, transform, and build upon the material

Under the following terms:

  • Attribution: You must give appropriate credit to scthornton.ai / perfecXion.ai, provide a link to the license, and indicate if changes were made
  • NonCommercial: You may not use the material for commercial purposes without explicit written permission
  • ShareAlike: If you remix, transform, or build upon the material, you must distribute your contributions under the same license

For commercial licensing inquiries, contact: scott@perfecxion.ai
