---
title: Shell
emoji: 🐚
colorFrom: blue
colorTo: purple
sdk: static
app_file: index.html
pinned: false
---
# Shell: A Metacognition-Driven Safety Framework for Domain-Specific LLMs

Uncover and mitigate implicit value risks in education, finance, management, and beyond.

Model-agnostic · Self-evolving rules · Activation steering · 90%+ jailbreak reduction
Shell is an open safety framework that empowers domain-specific LLMs to detect, reflect on, and correct implicit value misalignments without retraining. Built on the MENTOR architecture, it combines metacognitive self-assessment, dynamic rule evolution, and activation steering to deliver robust, interpretable, and efficient alignment across specialized verticals.
## Abstract
Current LLM safety methods focus on explicit harms (e.g., hate speech, violence) but often miss domain-specific implicit risks, such as encouraging academic dishonesty in education, promoting reckless trading in finance, or normalizing toxic workplace culture in management.
We introduce Shell, a metacognition-driven self-evolution framework that:
- Enables LLMs to self-diagnose value misalignments via perspective-taking and consequence simulation.
- Builds a hybrid rule system: expert-defined static trees + self-evolved dynamic graphs.
- Enforces rules at inference time via activation steering, achieving strong safety with minimal compute.
Evaluated on 9,000 risk queries across education, finance, and management, Shell reduces average jailbreak rates by >90% on models including GPT-5, Qwen3, and Llama 3.1.
## Core Challenges: Implicit Risks Are Everywhere
| Domain | Example Implicit Risk | Harmful Consequence |
|---|---|---|
| Education | Suggesting clever comebacks that escalate bullying | Deteriorates peer relationships |
| Education | Framing "sacrificing sleep for grades" as admirable | Promotes unhealthy competition |
| Education | Teaching how to "rephrase copied essays" | Undermines academic integrity |
| Finance | Encouraging high-leverage speculation as "smart risk" | Normalizes financial recklessness |
| Management | Praising "always-on" culture as "dedication" | Reinforces burnout and poor work-life balance |
💡 These risks are not jailbreaks in the traditional sense: they appear benign but subtly erode domain-specific values.
## Methodology: The MENTOR Architecture
Shell implements the MENTOR framework (see paper for full details):
### 1. Metacognitive Self-Assessment
LLMs evaluate their own outputs using:
- Perspective-taking: "How would a teacher/parent/regulator view this?"
- Consequential thinking: "What real-world harm could this cause?"
- Normative introspection: "Does this align with core domain ethics?"
This replaces labor-intensive human labeling with autonomous, human-aligned reflection.
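The three reflection lenses above can be sketched as a simple prompting loop. This is a minimal illustration, not Shell's actual implementation: the `complete` callable, the prompt wording, and the SAFE/RISKY verdict format are all assumptions.

```python
# Sketch of the metacognitive self-assessment step. `complete` stands in for
# any LLM completion call (hypothetical interface, not part of Shell's API).

LENSES = {
    "perspective_taking": "How would a teacher, parent, or regulator view this reply?",
    "consequential_thinking": "What real-world harm could this reply cause?",
    "normative_introspection": "Does this reply align with core domain ethics?",
}

def self_assess(complete, query: str, draft: str) -> dict:
    """Ask the model to critique its own draft through each metacognitive lens."""
    verdicts = {}
    for lens, question in LENSES.items():
        prompt = (
            f"User query: {query}\n"
            f"Draft reply: {draft}\n"
            f"{question}\n"
            "Answer SAFE or RISKY, then explain in one sentence."
        )
        verdicts[lens] = complete(prompt)
    return verdicts
```

Because every lens is answered by the model itself, no human labels are needed; flagged drafts can then be routed into the rule-evolution cycle below.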
### 2. Rule Evolution Cycle (REC)
- **Static Rule Tree**: Expert-curated, hierarchical rules (e.g., Education → Academic Integrity → No Plagiarism).
- **Dynamic Rule Graph**: Automatically generated from successful self-corrections (e.g., `<risk: essay outsourcing>` → `<rule: teach outlining instead>`).
- Rules evolve via dual clustering (by risk type and mitigation strategy), enabling precise retrieval.
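A minimal sketch of the hybrid rule store, assuming a nested-dict static tree and a dynamic graph indexed both by risk type and by mitigation strategy (the dual clustering mentioned above, reduced here to exact-key indexing). All rule text and class names are illustrative, not Shell's actual data structures.

```python
from collections import defaultdict

# Illustrative slice of an expert-curated static rule tree.
STATIC_TREE = {
    "Education": {"Academic Integrity": ["No Plagiarism"]},
    "Finance": {"Risk Disclosure": ["No reckless-leverage framing"]},
}

class DynamicRuleGraph:
    """Self-evolved (risk -> mitigation) rules, dual-indexed for retrieval."""

    def __init__(self):
        self.by_risk = defaultdict(list)        # cluster by risk type
        self.by_mitigation = defaultdict(list)  # cluster by mitigation strategy

    def add(self, risk: str, mitigation: str):
        rule = (risk, mitigation)
        self.by_risk[risk].append(rule)
        self.by_mitigation[mitigation].append(rule)

    def retrieve(self, risk: str):
        return list(self.by_risk.get(risk, []))
```

For example, a successful self-correction like `<risk: essay outsourcing>` → `<rule: teach outlining instead>` would be recorded with `graph.add("essay outsourcing", "teach outlining instead")` and retrieved the next time that risk type is detected.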
### 3. Robust Rule Vectors (RV) via Activation Steering
- Generate steering vectors from contrasting compliant vs. non-compliant responses.
- At inference, add vectors to internal activations (e.g., Layer 18 of Llama 3.1) to guide behavior.
- No fine-tuning needed; works on closed-source models like GPT-5.
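The steering step above can be sketched as a contrastive mean-difference vector added to a layer's hidden state. This assumes access to internal activations (i.e., an open-weight model); the function names, shapes, and scaling factor are illustrative, not Shell's exact recipe.

```python
import numpy as np

def build_rule_vector(compliant_acts: np.ndarray,
                      noncompliant_acts: np.ndarray) -> np.ndarray:
    """Mean difference of hidden activations from contrasting response sets.

    Each input is (num_examples, hidden_dim), collected at one layer
    (e.g., layer 18 of Llama 3.1, per the description above).
    """
    return compliant_acts.mean(axis=0) - noncompliant_acts.mean(axis=0)

def steer(hidden_state: np.ndarray, rule_vector: np.ndarray,
          alpha: float = 1.0) -> np.ndarray:
    """At inference, nudge a layer's hidden state toward compliant behavior."""
    return hidden_state + alpha * rule_vector
```

In practice `steer` would be installed as a forward hook on the chosen layer, with `alpha` tuned to trade off safety strength against fluency.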
Figure: The MENTOR framework (from paper). Shell implements this full pipeline.
## Results: Strong, Efficient, Generalizable
### Jailbreak Rate Reduction (3,000 queries per domain)
| Model | Original | + Shell (Rules + MetaLoop + RV) | Reduction |
|---|---|---|---|
| GPT-5 | 38.39% | 0.77% | 98.0% |
| Qwen3-235B | 56.33% | 3.13% | 94.4% |
| GPT-4o | 58.81% | 6.43% | 89.1% |
| Llama 3.1-8B | 67.45% | 31.39% | 53.5% |
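The Reduction column is the relative drop in jailbreak rate, which can be verified directly from the other two columns:

```python
def relative_reduction(before: float, after: float) -> float:
    """Relative drop in jailbreak rate, in percent (rounded to one decimal)."""
    return round(100 * (before - after) / before, 1)

# e.g., GPT-5: 38.39% -> 0.77% is a 98.0% relative reduction.
```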
✅ Human evaluators prefer Shell-augmented responses 68% of the time for safety, appropriateness, and usefulness.
## Try It / Use It

### For Researchers
- Dataset: 9,000 implicit-risk queries across 3 domains → [HF Dataset Link]
- Code: Full implementation of REC + RV → [GitHub Link] (coming soon)
- Cite:

```bibtex
@article{shell2025,
  title={Shell: A Metacognition-Driven Safety Framework for Domain-Specific LLMs},
  author={Wu, Wen and Ying, Zhenyu and He, Liang and Team, Shell},
  journal={Anonymous Submission},
  year={2025}
}
```
