# XLS-R + SLS Classifier for Audio Deepfake Detection
Reproduction of "Audio Deepfake Detection with XLS-R and SLS Classifier" (Zhang et al., ACM Multimedia 2024).
The Selective Layer Summarization (SLS) classifier extracts attention-weighted features from all 24 transformer layers of XLS-R 300M (wav2vec 2.0), then classifies bonafide vs. spoofed speech via a lightweight fully-connected head. RawBoost (algo=3, SSI) data augmentation is applied during training.
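The layer-summarization idea described above can be sketched in a few lines of PyTorch. This is a minimal illustration under stated assumptions, not the authors' exact SLS architecture: the learnable per-layer attention, the pooling, and the head sizes here are simplifications, while the layer count (24) and feature dimension (1024) match XLS-R 300M.

```python
import torch
import torch.nn as nn

class LayerSummarizationHead(nn.Module):
    """Sketch of attention-weighted pooling over SSL layer outputs,
    followed by a small fully-connected classifier. The real SLS
    classifier computes its layer attention differently; this only
    illustrates the overall shape of the approach."""

    def __init__(self, num_layers=24, feat_dim=1024, num_classes=2):
        super().__init__()
        # One learnable attention logit per transformer layer.
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))
        self.head = nn.Sequential(
            nn.Linear(feat_dim, 128),
            nn.ReLU(),
            nn.Linear(128, num_classes),  # bonafide vs. spoof
        )

    def forward(self, layer_feats):
        # layer_feats: (num_layers, batch, time, feat_dim)
        w = torch.softmax(self.layer_logits, dim=0)           # weights over layers
        fused = (w[:, None, None, None] * layer_feats).sum(0)  # (batch, time, feat)
        pooled = fused.mean(dim=1)                             # average over time
        return self.head(pooled)
```

Because the attention weights are learned jointly with the head, training can emphasize whichever transformer layers carry the most spoofing-relevant information.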
## Available Checkpoints

| File | Experiment | Description |
|---|---|---|
| `v1/epoch_2.pth` | v1 (baseline) | Best cross-domain generalization. Patience=1, no validation, 4 epochs. |
| `v2/epoch_16.pth` | v2 (val-based) | Validation early stopping. Patience=10, ASVspoof2019 LA dev validation, 27 epochs. |
**Recommended:** use `v1/epoch_2.pth`; it generalizes better to unseen attack types (DF, In-the-Wild).
### Original authors' pretrained models
The original pretrained checkpoints from Zhang et al. are available from:
- Google Drive
- Baidu Pan (password: shan)
## Results
| Track | Paper EER (%) | v1 EER (%) | v2 EER (%) |
|---|---|---|---|
| ASVspoof 2021 DF | 1.92 | 2.14 | 3.75 |
| ASVspoof 2021 LA | 2.87 | 3.51 | 3.47 |
| In-the-Wild | 7.46 | 7.84 | 12.67 |
v1 closely reproduces the paper's results. v2 improves LA slightly but degrades DF and In-the-Wild, overfitting to the LA validation domain; this is a well-documented cross-domain generalization problem in audio deepfake detection (Müller et al., Interspeech 2022).
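The equal error rate (EER) reported above is the operating point where the false-rejection rate on bonafide trials equals the false-acceptance rate on spoof trials. A minimal pure-Python sketch of the metric follows; the official ASVspoof evaluation tools use a more careful ROC interpolation, so treat this as illustrative.

```python
import numpy as np

def compute_eer(bonafide_scores, spoof_scores):
    """Equal error rate via a simple threshold sweep (no interpolation).
    Convention assumed here: higher score = more bonafide-like."""
    thresholds = np.sort(np.concatenate([bonafide_scores, spoof_scores]))
    best_gap, eer = float("inf"), 0.5
    for t in thresholds:
        frr = np.mean(bonafide_scores < t)   # bonafide wrongly rejected
        far = np.mean(spoof_scores >= t)     # spoof wrongly accepted
        gap = abs(frr - far)
        if gap < best_gap:
            best_gap, eer = gap, (frr + far) / 2
    return eer

# Toy scores: one spoof trial (0.75) overlaps the bonafide range,
# so the sweep settles at FRR = FAR = 0.25.
bona = np.array([0.9, 0.8, 0.95, 0.7])
spoof = np.array([0.1, 0.2, 0.3, 0.75])
```

A lower EER means the bonafide and spoof score distributions are better separated.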
## Training Configuration
Both experiments share the following setup:
| Parameter | Value |
|---|---|
| Training data | ASVspoof2019 LA train (25,380 utterances) |
| Loss | Weighted Cross-Entropy [0.1, 0.9] |
| Optimizer | Adam (lr=1e-6, weight_decay=1e-4) |
| Batch size | 5 |
| RawBoost | algo=3 (SSI) |
| Seed | 1234 |
| SSL backbone | XLS-R 300M (frozen feature extractor) |
| GPU | NVIDIA RTX 4080 (16 GB) |
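The weighted cross-entropy above counteracts the class imbalance in ASVspoof2019 LA (spoofed utterances far outnumber bonafide ones). In PyTorch this is just the `weight` argument of `nn.CrossEntropyLoss`; the `[0.1, 0.9]` values come from the table, while the class-index order (0 = spoof, 1 = bonafide) is an assumption for this sketch.

```python
import torch
import torch.nn as nn

# Class weights from the training config: 0.1 vs. 0.9
# (assumed index order: 0 = spoof, 1 = bonafide).
weight = torch.tensor([0.1, 0.9])
criterion = nn.CrossEntropyLoss(weight=weight)

logits = torch.tensor([[2.0, -1.0], [0.5, 1.5]])  # (batch, num_classes)
labels = torch.tensor([0, 1])                      # spoof, bonafide
loss = criterion(logits, labels)
```

With these weights, errors on the minority bonafide class cost roughly nine times more than errors on the spoof class.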
### v1 specifics
- Early stopping: patience=1 on training loss
- No validation set
- 4 epochs trained, best at epoch 2 (train loss = 0.000661)
### v2 specifics
- Early stopping: patience=10 on validation loss
- Validation: ASVspoof2019 LA dev (24,844 trials)
- 27 epochs trained, best at epoch 16 (val_loss = 0.000468, val_acc = 99.99%)
- Bug fixes: added `torch.no_grad()` to the validation loop, corrected `best_val_loss` tracking
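The two v2 fixes mentioned above can be outlined as follows. This is a hedged sketch, not the repository's actual loop: `validate`, the loader, and the checkpoint filename are placeholders.

```python
import torch

def validate(model, loader, criterion, device):
    """Validation pass with autograd disabled (the torch.no_grad() fix)."""
    model.eval()
    total, n = 0.0, 0
    with torch.no_grad():  # no gradient bookkeeping during validation
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            loss = criterion(model(x), y)
            total += loss.item() * y.size(0)
            n += y.size(0)
    return total / n

# Correct best_val_loss tracking: always compare against the running
# best, not the previous epoch's loss, and reset patience on improvement.
best_val_loss, patience, bad_epochs = float("inf"), 10, 0
# Inside the training loop (sketch):
#     val_loss = validate(model, dev_loader, criterion, device)
#     if val_loss < best_val_loss:
#         best_val_loss, bad_epochs = val_loss, 0
#         torch.save(model.state_dict(), f"epoch_{epoch}.pth")
#     else:
#         bad_epochs += 1
#         if bad_epochs >= patience:
#             break
```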
## Usage

### Download checkpoint
```python
from huggingface_hub import hf_hub_download

# Download v1 checkpoint (recommended)
checkpoint_path = hf_hub_download(
    repo_id="sukhdeveyash/XLS-R-SLS-Deepfake-Detection",
    filename="v1/epoch_2.pth"
)

# Download v2 checkpoint
# checkpoint_path = hf_hub_download(
#     repo_id="sukhdeveyash/XLS-R-SLS-Deepfake-Detection",
#     filename="v2/epoch_16.pth"
# )
```
### Load and run inference
```python
import torch

from model import Model  # from the GitHub repo

device = "cuda" if torch.cuda.is_available() else "cpu"
model = Model(device=device, ssl_cpkt_path="xlsr2_300m.pt")
model.load_state_dict(torch.load(checkpoint_path, map_location=device))
model = model.to(device)
model.eval()
```
Full training and evaluation code: GitHub Repository
## Requirements
- Python 3.7+
- PyTorch 1.13.1 (CUDA 11.7)
- fairseq (commit a54021305d6b3c)
- XLS-R 300M base checkpoint (`xlsr2_300m.pt`) from fairseq
See environment.yml in the GitHub repo for the full environment.
## Citation

```bibtex
@inproceedings{zhang2024audio,
  title={Audio Deepfake Detection with XLS-R and SLS Classifier},
  author={Zhang, Qishan and Wen, Shuangbing and Hu, Tao},
  booktitle={Proceedings of the 32nd ACM International Conference on Multimedia},
  year={2024},
  publisher={ACM}
}
```
## Acknowledgements
- XLS-R (Babu et al., 2022)
- RawBoost (Tak et al., Odyssey 2022)
- ASVspoof Challenge