Collections
Discover the best community collections!
Collections including paper arxiv:2308.09662

- Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique
  Paper • 2408.10701 • Published • 12
- Ruby Teaming: Improving Quality Diversity Search with Memory for Automated Red Teaming
  Paper • 2406.11654 • Published • 6
- Measuring and Enhancing Trustworthiness of LLMs in RAG through Grounded Attributions and Learning to Refuse
  Paper • 2409.11242 • Published • 7
- Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment
  Paper • 2308.09662 • Published • 3

- Large Language Model Alignment: A Survey
  Paper • 2309.15025 • Published • 2
- Aligning Large Language Models with Human: A Survey
  Paper • 2307.12966 • Published • 1
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model
  Paper • 2305.18290 • Published • 63
- SteerLM: Attribute Conditioned SFT as an (User-Steerable) Alternative to RLHF
  Paper • 2310.05344 • Published • 1

- Moral Foundations of Large Language Models
  Paper • 2310.15337 • Published • 1
- Specific versus General Principles for Constitutional AI
  Paper • 2310.13798 • Published • 3
- Contrastive Preference Learning: Learning from Human Feedback without RL
  Paper • 2310.13639 • Published • 25
- RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback
  Paper • 2309.00267 • Published • 52

- Language Models are Homer Simpson! Safety Re-Alignment of Fine-tuned Language Models through Task Arithmetic
  Paper • 2402.11746 • Published • 2
- Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment
  Paper • 2308.09662 • Published • 3
- Language Model Unalignment: Parametric Red-Teaming to Expose Hidden Harms and Biases
  Paper • 2310.14303 • Published • 1
- Ruby Teaming: Improving Quality Diversity Search with Memory for Automated Red Teaming
  Paper • 2406.11654 • Published • 6

- LTD: Low Temperature Distillation for Robust Adversarial Training
  Paper • 2111.02331 • Published • 1
- Interpolated Adversarial Training: Achieving Robust Neural Networks without Sacrificing Too Much Accuracy
  Paper • 1906.06784 • Published • 1
- Pruning Adversarially Robust Neural Networks without Adversarial Examples
  Paper • 2210.04311 • Published • 1
- Mitigating the Accuracy-Robustness Trade-off via Multi-Teacher Adversarial Distillation
  Paper • 2306.16170 • Published • 1