File size: 9,018 Bytes
c188696 97ba6d8 c188696 d9a8168 c188696 23718a7 c188696 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 |
---
license: apache-2.0
pipeline_tag: audio-text-to-text
library_name: transformers
tags:
- audio-reasoning
- chain-of-thought
- multi-modal
- step-audio-r1
---
## Overview of Step-Audio-R1.1
<a href="https://www.stepfun.com/studio/audio?tab=conversation"><img src="https://img.shields.io/static/v1?label=Space%20Playground&message=Studio&color=yellow"></a> <a href="https://huggingface.co/spaces/stepfun-ai/Step-Audio-R1"><img src="https://img.shields.io/static/v1?label=Space&message=Web&color=green"></a>  
### Introduction
Step-Audio R1.1 (Realtime) is a major upgrade to Step-Audio-R1, designed for interactive spoken dialogue with both **real-time responsiveness** and **strong reasoning capability**.
Unlike conventional streaming speech models that trade intelligence for latency, R1.1 enables *thinking while speaking*, achieving high intelligence without sacrificing speed.
### Mind-Paced Speaking (Low Latency)
Based on the research [*Mind-Paced Speaking*](MPS.pdf), the Realtime variant adopts a **Dual-Brain Architecture**:
- A **Formulation Brain** responsible for high-level reasoning
- An **Articulation Brain** dedicated to speech generation
This decoupling allows the model to perform **Chain-of-Thought reasoning during speech output**, maintaining ultra-low latency while handling complex tasks in real time.
### Acoustic-Grounded Reasoning (High Intelligence)
To address the *inverted scaling* issue鈥攚here reasoning over transcripts can degrade performance鈥擲tep-Audio R1.1 grounds its reasoning directly in acoustic representations rather than text alone.
Through iterative self-distillation, extended deliberation becomes a strength instead of a liability. This enables effective test-time compute scaling and leads to **state-of-the-art performance**, including top-ranking results on the AA benchmark.



## Model Usage
### 馃摐 Requirements
- **GPU**: NVIDIA GPUs with CUDA support (tested on 4脳L40S/H100/H800/H20).
- **Operating System**: Linux.
- **Python**: >= 3.10.0.
### 猬囷笍 Download Model
First, you need to download the Step-Audio-R1 model weights.
**Method A 路 Git LFS**
```bash
git lfs install
git clone https://huggingface.co/stepfun-ai/Step-Audio-R1.1
```
**Method B 路 Hugging Face CLI**
```bash
hf download stepfun-ai/Step-Audio-R1.1 --local-dir ./Step-Audio-R1.1
```
### 馃殌 Deployment and Execution
We provide two ways to serve the model: Docker (recommended) or compiling the customized vLLM backend.
#### 馃惓 Method 1 路 Run with Docker (Recommended)
A customized vLLM image is required.
1. **Pull the image**:
```bash
docker pull stepfun2025/vllm:step-audio-2-v20250909
```
2. **Start the service**:
Assuming the model is downloaded in the `Step-Audio-R1` folder in the current directory.
```bash
docker run --rm -ti --gpus all \
-v $(pwd)/Step-Audio-R1.1:/Step-Audio-R1.1 \
-p 9999:9999 \
stepfun2025/vllm:step-audio-2-v20250909 \
-- vllm serve /Step-Audio-R1.1 \
--served-model-name Step-Audio-R1.1 \
--port 9999 \
--max-model-len 16384 \
--max-num-seqs 32 \
--tensor-parallel-size 4 \
--chat-template '{%- macro render_content(content) -%}{%- if content is string -%}{{- content.replace("<audio_patch>\n", "<audio_patch>") -}}{%- elif content is mapping -%}{{- content['"'"'value'"'"'] if '"'"'value'"'"' in content else content['"'"'text'"'"'] -}}{%- elif content is iterable -%}{%- for item in content -%}{%- if item.type == '"'"'text'"'"' -%}{{- item['"'"'value'"'"'] if '"'"'value'"'"' in item else item['"'"'text'"'"'] -}}{%- elif item.type == '"'"'audio'"'"' -%}<audio_patch>{%- endif -%}{%- endfor -%}{%- endif -%}{%- endmacro -%}{%- if tools -%}{{- '"'"'<|BOT|>system\n'"'"' -}}{%- if messages[0]['"'"'role'"'"'] == '"'"'system'"'"' -%}{{- render_content(messages[0]['"'"'content'"'"']) + '"'"'<|EOT|>'"'"' -}}{%- endif -%}{{- '"'"'<|BOT|>tool_json_schemas\n'"'"' + tools|tojson + '"'"'<|EOT|>'"'"' -}}{%- else -%}{%- if messages[0]['"'"'role'"'"'] == '"'"'system'"'"' -%}{{- '"'"'<|BOT|>system\n'"'"' + render_content(messages[0]['"'"'content'"'"']) + '"'"'<|EOT|>'"'"' -}}{%- endif -%}{%- endif -%}{%- for message in messages -%}{%- if message["role"] == "user" -%}{{- '"'"'<|BOT|>human\n'"'"' + render_content(message["content"]) + '"'"'<|EOT|>'"'"' -}}{%- elif message["role"] == "assistant" -%}{{- '"'"'<|BOT|>assistant\n'"'"' + (render_content(message["content"]) if message["content"] else '"'"''"'"') -}}{%- set is_last_assistant = true -%}{%- for m in messages[loop.index:] -%}{%- if m["role"] == "assistant" -%}{%- set is_last_assistant = false -%}{%- endif -%}{%- endfor -%}{%- if not is_last_assistant -%}{{- '"'"'<|EOT|>'"'"' -}}{%- endif -%}{%- elif message["role"] == "function_output" -%}{%- else -%}{%- if not (loop.first and message["role"] == "system") -%}{{- '"'"'<|BOT|>'"'"' + message["role"] + '"'"'\n'"'"' + render_content(message["content"]) + '"'"'<|EOT|>'"'"' -}}{%- endif -%}{%- endif -%}{%- endfor -%}{%- if add_generation_prompt -%}{{- '"'"'<|BOT|>assistant\n<think>\n'"'"' -}}{%- endif -%}' \
--enable-log-requests \
--interleave-mm-strings \
--trust-remote-code
```
After the service starts, it will listen on `localhost:9999`.
#### 馃惓 Method 2 路 Run from Source (Compile vLLM)
Step-Audio-R1 requires a customized vLLM backend.
1. **Download Source Code**:
```bash
git clone https://github.com/stepfun-ai/vllm.git
cd vllm
```
2. **Prepare Environment**:
```bash
python3 -m venv .venv
source .venv/bin/activate
```
3. **Install and Compile**:
vLLM contains both C++ and Python code. We mainly modified the Python code, so the C++ part can use the pre-compiled version to speed up the process.
```bash
# Use pre-compiled C++ extensions (Recommended)
VLLM_USE_PRECOMPILED=1 pip install -e .
```
4. **Switch Branch**:
After compilation, switch to the branch that supports Step-Audio.
```bash
git checkout feat/step-audio-support
```
5. **Start the Service**:
```bash
# Ensure you are in the vllm directory and the virtual environment is activated
source .venv/bin/activate
python3 -m vllm.entrypoints.openai.api_server \
--model ../Step-Audio-R1.1 \
--served-model-name Step-Audio-R1.1 \
--port 9999 \
--host 0.0.0.0 \
--max-model-len 65536 \
--max-num-seqs 128 \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.85 \
--trust-remote-code \
--enable-log-requests \
--interleave-mm-strings \
--chat-template '{%- macro render_content(content) -%}{%- if content is string -%}{{- content.replace("<audio_patch>\n", "<audio_patch>") -}}{%- elif content is mapping -%}{{- content['"'"'value'"'"'] if '"'"'value'"'"' in content else content['"'"'text'"'"'] -}}{%- elif content is iterable -%}{%- for item in content -%}{%- if item.type == '"'"'text'"'"' -%}{{- item['"'"'value'"'"'] if '"'"'value'"'"' in item else item['"'"'text'"'"'] -}}{%- elif item.type == '"'"'audio'"'"' -%}<audio_patch>{%- endif -%}{%- endfor -%}{%- endif -%}{%- endmacro -%}{%- if tools -%}{{- '"'"'<|BOT|>system\n'"'"' -}}{%- if messages[0]['"'"'role'"'"'] == '"'"'system'"'"' -%}{{- render_content(messages[0]['"'"'content'"'"']) + '"'"'<|EOT|>'"'"' -}}{%- endif -%}{{- '"'"'<|BOT|>tool_json_schemas\n'"'"' + tools|tojson + '"'"'<|EOT|>'"'"' -}}{%- else -%}{%- if messages[0]['"'"'role'"'"'] == '"'"'system'"'"' -%}{{- '"'"'<|BOT|>system\n'"'"' + render_content(messages[0]['"'"'content'"'"']) + '"'"'<|EOT|>'"'"' -}}{%- endif -%}{%- endif -%}{%- for message in messages -%}{%- if message["role"] == "user" -%}{{- '"'"'<|BOT|>human\n'"'"' + render_content(message["content"]) + '"'"'<|EOT|>'"'"' -}}{%- elif message["role"] == "assistant" -%}{{- '"'"'<|BOT|>assistant\n'"'"' + (render_content(message["content"]) if message["content"] else '"'"''"'"') -}}{%- set is_last_assistant = true -%}{%- for m in messages[loop.index:] -%}{%- if m["role"] == "assistant" -%}{%- set is_last_assistant = false -%}{%- endif -%}{%- endfor -%}{%- if not is_last_assistant -%}{{- '"'"'<|EOT|>'"'"' -}}{%- endif -%}{%- elif message["role"] == "function_output" -%}{%- else -%}{%- if not (loop.first and message["role"] == "system") -%}{{- '"'"'<|BOT|>'"'"' + message["role"] + '"'"'\n'"'"' + render_content(message["content"]) + '"'"'<|EOT|>'"'"' -}}{%- endif -%}{%- endif -%}{%- endfor -%}{%- if add_generation_prompt -%}{{- '"'"'<|BOT|>assistant\n<think>\n'"'"' -}}{%- endif -%}'
```
After the service starts, it will listen on `localhost:9999`. |