OmniFlatten

Model Overview

We provide OmniFlatten, a groundbreaking model designed for full-duplex conversation, which effectively mirrors the complexity and dynamics of natural human dialogue. This model leverages a novel multi-stage post-training scheme to adapt a large text-based language model into an integrated speech-text dialogue system that operates in real time. Through progressive fine-tuning, OmniFlatten aligns speech and text modalities without altering the core architecture, ensuring low latency and seamless interactions. This approach paves the way for developing more efficient and natural end-to-end full-duplex spoken dialogue systems.

Experiments

We use a progressive learning approach for model training, adopting Speech-Text Alignment, 4-streaming training, 3-streaming training, and 2-streaming training.

4-Streaming Training

3-Streaming Training

2-Streaming Training

Cases

We will show you some cases:

Metrics

Speech-Text Alignment

ASR results on Librispeech and Wenetspeech test datasets. For Librispeech, we report the WER metric. For Wenetspeech, we report the CER metric.
Model	Librispeech		Wenetspeech
	test-clean	test-other	test-meeting	test-net
Whisper-S	3.13	7.37	25.62	16.66
Whisper-L	1.82	3.5	18.87	10.48
VITA	8.14	18.4	12.15	16.53
OmniFlatten	7.91	19.21	26.1	19.0

TTS results on LibriTTS and AIShell-3. For LibriTTS, we utilize whisper-large-v3 model to perform recognition on TTS outputs and assess the WER metric. For AIShell-3, we deploy the paraformer-zh model to recognize TTS results and evaluate CER metric.
Model	LibriTTS (WER)	AIShell-3 (CER)
Original	2.66	2.52
ChatTTS	8.32	3.87
CosyVoice	2.89	3.82
OmniFlatten	4.51	4.46

Dialogue Capability

Performance results of speech and text chat capabilities in both Chinese and English test datasets.
Model	Params	English		Chinese
Model	Params	Score (Text)	Score (Speech + ASR)	Score (Text)	Score (Speech + ASR)
Qwen2-0.5B-Instruct	0.5B	6.75	-	6.98	-
Qwen2-7B-Instruct	7B	8.37	-	8.09	-
LLaMA-Omni	8B	6.01	5.50	4.17	3.89
Moshi	7B	3.92	3.46	-	-
GLM-Voice	9B	6.97	6.40	7.02	6.69
OmniFlatten directly 3-stream	0.5B	2.99	2.59	4.94	3.95
OmniFlatten 3-stream w/o half-duplex	0.5B	3.89	3.54	5.25	4.76
OmniFlatten 3-stream full process	0.5B	4.88	3.92	5.6	5.15
OmniFlatten 2-stream full process	0.5B	-	2.19	-	3.06
Ground Truth Response	-	7.65	-	6.83	-

Turn-Taking Metrics

Assistant Turn-taking and User Turn-taking accuracy at the k-th token (Acc@K) and Response Time.
Models	Assistant Turn-taking Acc@K 1/5/10/25 (%)	Average Assistant Turn-taking Response Time (ms)	User Turn-taking Acc @K 1/5/10/25 (%)	Average User Turn-taking Response Time (ms)
Moshi	2.9/18.8/38.5/55.1	553	0.0/6.2/14.8/45.7	753
OmniFlatten	20.6/53.6/66.3/71.7	193	10.9/30.9/41.8/51.8	287