---
language:
- en
tags:
- audio-text-to-audio-text
- speech-understanding
- audio
- chat
license: apache-2.0
datasets:
- custom
metrics:
- wer
- bleu
- AIR-Bench
---
<div align="center">
<h1>
  EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs
</h1>
</div>

<p align="center">
  <font size="3">
    <a href="https://github.com/FreedomIntelligence/EchoX">🐈‍⬛ Github</a>&nbsp|&nbsp
    <a href="https://arxiv.org/abs/2509.09174">📃 Paper</a>&nbsp|&nbsp
    <a href="https://huggingface.co/spaces/FreedomIntelligence/EchoX">🚀 Space (8B)</a>&nbsp|&nbsp
    <a href="https://huggingface.co/datasets/FreedomIntelligence/EchoX-Dialougues">📊 EchoX-Dialougues</a>&nbsp|&nbsp
    <a href="https://huggingface.co/datasets/KurtDu/EchoX-Dialogues-Plus">📊 EchoX-Dialogues-Plus</a>
  </font>
</p>


## Model Description
EchoX is a speech-to-speech large language model designed to close the acoustic-semantic gap; this repository hosts the 3B version. Through **Echo Training**, EchoX integrates semantic and acoustic learning, mitigating the degradation of reasoning ability observed in existing speech-based LLMs. Trained on only 10k hours of curated data, it delivers state-of-the-art results on knowledge-based question answering and speech interaction tasks.

### Key Features
- Mitigates the acoustic-semantic gap in speech-to-speech LLMs
- Introduces Echo Training with a three-stage pipeline (S2T, T2C, Echo)
- Trained on only 10k hours of curated data for efficiency
- Achieves state-of-the-art performance on knowledge-based QA benchmarks
- Preserves reasoning and knowledge abilities for interactive speech tasks

## Usage
Load the EchoX model and run inference on your own audio files as described in the <a href="https://github.com/FreedomIntelligence/EchoX">GitHub repository</a>.
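
For orientation, below is a minimal, unofficial sketch of what loading and inference might look like. It assumes the checkpoint can be loaded through `transformers` with `trust_remote_code=True`; the repository id, the processor call, and the structure of the `generate` outputs are illustrative placeholders, so follow the scripts in the GitHub repository for the actual entry points.

```python
# Unofficial sketch only -- the real inference code lives in the GitHub repo.
import soundfile as sf
from transformers import AutoModel, AutoProcessor

# Placeholder repo id for the 3B checkpoint hosted by this card.
model_id = "FreedomIntelligence/EchoX-3B"

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True).eval()

# Load a mono 16 kHz WAV containing the spoken question.
audio, sr = sf.read("question.wav")

# Hypothetical preprocessing and generation call; argument names may differ.
inputs = processor(audios=audio, sampling_rate=sr, return_tensors="pt")
outputs = model.generate(**inputs)

# Hypothetical outputs: a text reply plus a synthesized speech waveform.
print(processor.batch_decode(outputs.text_ids, skip_special_tokens=True)[0])
sf.write("answer.wav", outputs.audio.squeeze().cpu().numpy(), 16000)
```

To try the model without a local setup, you can also use the hosted Space (8B) linked above.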

## 📖 Citation
```bibtex
@misc{zhang2025echoxmitigatingacousticsemanticgap,
      title={EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs}, 
      author={Yuhao Zhang and Yuhao Du and Zhanchen Dai and Xiangnan Ma and Kaiqi Kou and Benyou Wang and Haizhou Li},
      year={2025},
      eprint={2509.09174},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.09174}, 
}
```