DAC-SE1: High-Fidelity Speech Enhancement via Discrete Audio Tokens

This checkpoint was trained specifically for real-world denoising scenarios. DAC-SE1 treats speech enhancement as a sequence modeling task over discrete audio tokens: the model takes a sequence of noisy audio tokens as input, autoregressively generates the corresponding clean tokens, and these are decoded back to high-fidelity audio.
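The sequence-modeling framing can be sketched in a few lines. This is an illustrative example only: the token names (`<BOS>`, `<START_CLEAN>`) are hypothetical placeholders, not the model's actual vocabulary.

```python
# Hedged sketch of the prompt layout used for token-level denoising.
# Placeholder token names; the real special tokens come from the repo's tokenizer.

def build_prompt(noisy_tokens, bos="<BOS>", start_clean="<START_CLEAN>"):
    """Lay out the causal-LM prompt: BOS, the noisy token sequence, then a
    start-of-clean marker after which the model generates clean tokens."""
    return [bos] + list(noisy_tokens) + [start_clean]

noisy = ["<|s0_c0|>", "<|s1_c1|>", "<|s2_c2|>"]
prompt = build_prompt(noisy)
# The model then emits exactly len(noisy) clean tokens after the marker.
```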

Usage

To use this model, you need the inference tools and tokenizers provided in the official GitHub repository.

1. Setup Environment

First, clone the repository to get the necessary helper scripts (DACTools, DACTokenizer, etc.) and navigate into the folder:

git clone https://github.com/ETH-DISCO/DAC-SE1.git
cd DAC-SE1
pip install -r requirements.txt

2. Inference

You can run the following Python script to denoise an audio file.

import torch
from transformers import LlamaForCausalLM, LogitsProcessorList
from inference import DACTools, DACTokenizer, DACConstrainedLogitsProcessor
import re


# Initialize DAC tools for audio encoding/decoding
dac_tools = DACTools()
tokenizer = DACTokenizer(num_layers=9, codebook_size=1024)

# Load denoiser model
model_path = "disco-eth/DAC-SE1"
model = LlamaForCausalLM.from_pretrained(model_path)
model = model.to('cuda')
model.eval()

# Load noisy audio and convert to tokens
noisy_tokens = dac_tools.audio_to_tokens('input.wav')

# Prepare input for model
token_ids = tokenizer.encode(noisy_tokens, add_special_tokens=False)
input_ids = [tokenizer.bos_token_id] + token_ids + [tokenizer.start_clean_token_id]
input_tensor = torch.tensor([input_ids]).cuda()

# Generate clean tokens (the clean sequence has the same length as the noisy one)
num_tokens = len(re.findall(r'<\|s\d+_c\d\|>', noisy_tokens))
logits_processor = LogitsProcessorList([
    DACConstrainedLogitsProcessor(tokenizer=tokenizer, min_tokens=num_tokens)
])

with torch.no_grad():
    outputs = model.generate(
        input_tensor,
        max_new_tokens=num_tokens,
        min_new_tokens=num_tokens,
        logits_processor=logits_processor,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
        do_sample=False,
    )

# Extract generated tokens
generated_ids = outputs[0, len(input_ids):].tolist()
generated_output = tokenizer.decode(generated_ids, skip_special_tokens=True)

# Convert tokens back to audio and save
import soundfile as sf

valid_tokens = re.findall(r'<\|s\d+_c\d\|>', generated_output)
if not valid_tokens:
    raise RuntimeError("No valid audio tokens were generated.")

# DAC emits 9 codebook tokens per frame, so drop any trailing partial frame
remainder = len(valid_tokens) % 9
if remainder != 0:
    valid_tokens = valid_tokens[:len(valid_tokens) - remainder]

denoised_tokens = "".join(valid_tokens)
tokens = dac_tools.string_to_tokens(denoised_tokens)
clean_audio = dac_tools.tokens_to_audio(tokens)

sf.write('output.wav', clean_audio, dac_tools.sample_rate)
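The frame-trimming step at the end of the script can be factored into a small standalone helper. This is a sketch mirroring the logic above; the helper name and the 9-layer default are illustrative, not part of the repository's API.

```python
import re

def trim_to_frames(token_string, num_layers=9):
    """Keep only complete frames. The DAC codec produces num_layers codebook
    tokens per audio frame, so a trailing partial frame cannot be decoded
    and is dropped before converting tokens back to audio."""
    tokens = re.findall(r'<\|s\d+_c\d\|>', token_string)
    usable = len(tokens) - (len(tokens) % num_layers)
    return "".join(tokens[:usable])
```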

Citation

If you use this model, please cite our paper:

@misc{lanzendörfer2025highfidelityspeechenhancementdiscrete,
      title={High-Fidelity Speech Enhancement via Discrete Audio Tokens}, 
      author={Luca A. Lanzendörfer and Frédéric Berdoz and Antonis Asonitis and Roger Wattenhofer},
      year={2025},
      eprint={2510.02187},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2510.02187}, 
}