# XLS-R + SLS Classifier for Audio Deepfake Detection
Reproduction of "Audio Deepfake Detection with XLS-R and SLS Classifier" (Zhang et al., ACM Multimedia 2024).
The Selective Layer Summarization (SLS) classifier extracts attention-weighted features from all 24 transformer layers of XLS-R 300M (wav2vec 2.0), then classifies bonafide vs. spoofed speech via a lightweight fully-connected head. RawBoost (algo=3, SSI) data augmentation is applied during training.
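The layer-summarization idea described above can be sketched in a few lines of PyTorch. This is a minimal illustration under stated assumptions, not the authors' exact SLS architecture: the learnable per-layer attention, the pooling, and the head sizes here are simplifications, while the layer count (24) and feature dimension (1024) match XLS-R 300M.

```python
import torch
import torch.nn as nn

class LayerSummarizationHead(nn.Module):
    """Sketch of attention-weighted pooling over SSL layer outputs,
    followed by a small fully-connected classifier. The real SLS
    classifier computes its layer attention differently; this only
    illustrates the overall shape of the approach."""

    def __init__(self, num_layers=24, feat_dim=1024, num_classes=2):
        super().__init__()
        # One learnable attention logit per transformer layer.
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))
        self.head = nn.Sequential(
            nn.Linear(feat_dim, 128),
            nn.ReLU(),
            nn.Linear(128, num_classes),  # bonafide vs. spoof
        )

    def forward(self, layer_feats):
        # layer_feats: (num_layers, batch, time, feat_dim)
        w = torch.softmax(self.layer_logits, dim=0)           # weights over layers
        fused = (w[:, None, None, None] * layer_feats).sum(0)  # (batch, time, feat)
        pooled = fused.mean(dim=1)                             # average over time
        return self.head(pooled)
```

Because the attention weights are learned jointly with the head, training can emphasize whichever transformer layers carry the most spoofing-relevant information.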
## Available Checkpoints

| File | Experiment | Description |
|---|---|---|
| `v1/epoch_2.pth` | v1 (baseline) | Best cross-domain generalization. Patience=1, no validation, 4 epochs. |
| `v2/epoch_16.pth` | v2 (val-based) | Validation early stopping. Patience=10, ASVspoof2019 LA dev validation, 27 epochs. |
**Recommended:** use `v1/epoch_2.pth`; it generalizes better to unseen attack types (DF, In-the-Wild).
### Original authors' pretrained models
The original pretrained checkpoints from Zhang et al. are available from:
- Google Drive
- Baidu Pan (password: shan)
## Results
| Track | Paper EER (%) | v1 EER (%) | v2 EER (%) |
|---|---|---|---|
| ASVspoof 2021 DF | 1.92 | 2.14 | 3.75 |
| ASVspoof 2021 LA | 2.87 | 3.51 | 3.47 |
| In-the-Wild | 7.46 | 7.84 | 12.67 |
v1 closely reproduces the paper's results. v2 improves LA slightly but degrades DF and In-the-Wild, overfitting to the LA validation domain; this is a well-documented cross-domain generalization problem in audio deepfake detection (Müller et al., Interspeech 2022).
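The equal error rate (EER) reported above is the operating point where the false-rejection rate on bonafide trials equals the false-acceptance rate on spoof trials. A minimal pure-Python sketch of the metric follows; the official ASVspoof evaluation tools use a more careful ROC interpolation, so treat this as illustrative.

```python
import numpy as np

def compute_eer(bonafide_scores, spoof_scores):
    """Equal error rate via a simple threshold sweep (no interpolation).
    Convention assumed here: higher score = more bonafide-like."""
    thresholds = np.sort(np.concatenate([bonafide_scores, spoof_scores]))
    best_gap, eer = float("inf"), 0.5
    for t in thresholds:
        frr = np.mean(bonafide_scores < t)   # bonafide wrongly rejected
        far = np.mean(spoof_scores >= t)     # spoof wrongly accepted
        gap = abs(frr - far)
        if gap < best_gap:
            best_gap, eer = gap, (frr + far) / 2
    return eer

# Toy scores: one spoof trial (0.75) overlaps the bonafide range,
# so the sweep settles at FRR = FAR = 0.25.
bona = np.array([0.9, 0.8, 0.95, 0.7])
spoof = np.array([0.1, 0.2, 0.3, 0.75])
```

A lower EER means the bonafide and spoof score distributions are better separated.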
## Training Configuration
Both experiments share the following setup:
| Parameter | Value |
|---|---|
| Training data | ASVspoof2019 LA train (25,380 utterances) |
| Loss | Weighted Cross-Entropy [0.1, 0.9] |
| Optimizer | Adam (lr=1e-6, weight_decay=1e-4) |
| Batch size | 5 |
| RawBoost | algo=3 (SSI) |
| Seed | 1234 |
| SSL backbone | XLS-R 300M (frozen feature extractor) |
| GPU | NVIDIA RTX 4080 (16 GB) |
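The weighted cross-entropy above counteracts the class imbalance in ASVspoof2019 LA (spoofed utterances far outnumber bonafide ones). In PyTorch this is just the `weight` argument of `nn.CrossEntropyLoss`; the `[0.1, 0.9]` values come from the table, while the class-index order (0 = spoof, 1 = bonafide) is an assumption for this sketch.

```python
import torch
import torch.nn as nn

# Class weights from the training config: 0.1 vs. 0.9
# (assumed index order: 0 = spoof, 1 = bonafide).
weight = torch.tensor([0.1, 0.9])
criterion = nn.CrossEntropyLoss(weight=weight)

logits = torch.tensor([[2.0, -1.0], [0.5, 1.5]])  # (batch, num_classes)
labels = torch.tensor([0, 1])                      # spoof, bonafide
loss = criterion(logits, labels)
```

With these weights, errors on the minority bonafide class cost roughly nine times more than errors on the spoof class.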
### v1 specifics
- Early stopping: patience=1 on training loss
- No validation set
- 4 epochs trained, best at epoch 2 (train loss = 0.000661)
### v2 specifics
- Early stopping: patience=10 on validation loss
- Validation: ASVspoof2019 LA dev (24,844 trials)
- 27 epochs trained, best at epoch 16 (val_loss = 0.000468, val_acc = 99.99%)
- Bug fixes: added `torch.no_grad()` to the validation loop, corrected `best_val_loss` tracking
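The two v2 fixes mentioned above can be outlined as follows. This is a hedged sketch, not the repository's actual loop: `validate`, the loader, and the checkpoint filename are placeholders.

```python
import torch

def validate(model, loader, criterion, device):
    """Validation pass with autograd disabled (the torch.no_grad() fix)."""
    model.eval()
    total, n = 0.0, 0
    with torch.no_grad():  # no gradient bookkeeping during validation
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            loss = criterion(model(x), y)
            total += loss.item() * y.size(0)
            n += y.size(0)
    return total / n

# Correct best_val_loss tracking: always compare against the running
# best, not the previous epoch's loss, and reset patience on improvement.
best_val_loss, patience, bad_epochs = float("inf"), 10, 0
# Inside the training loop (sketch):
#     val_loss = validate(model, dev_loader, criterion, device)
#     if val_loss < best_val_loss:
#         best_val_loss, bad_epochs = val_loss, 0
#         torch.save(model.state_dict(), f"epoch_{epoch}.pth")
#     else:
#         bad_epochs += 1
#         if bad_epochs >= patience:
#             break
```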
## Usage

### Download checkpoint
```python
from huggingface_hub import hf_hub_download

# Download v1 checkpoint (recommended)
checkpoint_path = hf_hub_download(
    repo_id="sukhdeveyash/XLS-R-SLS-Deepfake-Detection",
    filename="v1/epoch_2.pth"
)

# Download v2 checkpoint
# checkpoint_path = hf_hub_download(
#     repo_id="sukhdeveyash/XLS-R-SLS-Deepfake-Detection",
#     filename="v2/epoch_16.pth"
# )
```
### Load and run inference
```python
import torch

from model import Model  # from the GitHub repo

device = "cuda" if torch.cuda.is_available() else "cpu"
model = Model(device=device, ssl_cpkt_path="xlsr2_300m.pt")
model.load_state_dict(torch.load(checkpoint_path, map_location=device))
model = model.to(device)
model.eval()
```
Full training and evaluation code: GitHub Repository
## Requirements
- Python 3.7+
- PyTorch 1.13.1 (CUDA 11.7)
- fairseq (commit a54021305d6b3c)
- XLS-R 300M base checkpoint (`xlsr2_300m.pt`) from fairseq
See environment.yml in the GitHub repo for the full environment.
## Citation

```bibtex
@inproceedings{zhang2024audio,
  title={Audio Deepfake Detection with XLS-R and SLS Classifier},
  author={Zhang, Qishan and Wen, Shuangbing and Hu, Tao},
  booktitle={Proceedings of the 32nd ACM International Conference on Multimedia},
  year={2024},
  publisher={ACM}
}
```
## Acknowledgements
- XLS-R (Babu et al., 2022)
- RawBoost (Tak et al., Odyssey 2022)
- ASVspoof Challenge