OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation

Model Overview

We present OmniFlatten, an end-to-end GPT model for full-duplex conversation that mirrors the complexity and dynamics of natural human dialogue. The model applies a novel multi-stage post-training scheme that adapts a text-based large language model into a unified speech-text dialogue system operating in real time. Through progressive fine-tuning, OmniFlatten aligns the speech and text modalities without altering the core architecture, keeping latency low and interaction seamless. This approach offers a practical path toward more efficient and natural end-to-end full-duplex spoken dialogue systems.

Experiments

We train the model with a progressive learning approach comprising four stages: speech-text alignment, followed by 4-streaming, 3-streaming, and 2-streaming dialogue training. Each dialogue stage flattens the parallel user and assistant speech and text streams into a single token sequence, and later stages progressively remove the text streams to reduce latency, until the final 2-streaming stage operates on speech alone, as the sketch below illustrates.
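
To make the flattening idea concrete, here is a minimal sketch of how parallel token streams can be interleaved, chunk by chunk, into the single flat sequence an autoregressive GPT consumes, and how streams are dropped across the 4-, 3-, and 2-streaming stages. The stream names, chunk size, and token values are illustrative assumptions, not OmniFlatten's exact data format.

    from typing import Dict, List

    def flatten_streams(streams: Dict[str, List[int]],
                        chunk_size: int,
                        order: List[str]) -> List[int]:
        """Interleave parallel token streams, chunk by chunk, into one sequence."""
        flat: List[int] = []
        cursors = {name: 0 for name in order}
        # Round-robin over the streams until every one is exhausted.
        while any(cursors[name] < len(streams[name]) for name in order):
            for name in order:
                start = cursors[name]
                chunk = streams[name][start:start + chunk_size]
                flat.extend(chunk)
                cursors[name] = start + len(chunk)
        return flat

    # The dialogue stages progressively remove text streams (illustrative):
    four_stream  = ["user_speech", "user_text", "assistant_text", "assistant_speech"]
    three_stream = ["user_speech", "assistant_text", "assistant_speech"]  # drop user text
    two_stream   = ["user_speech", "assistant_speech"]                    # speech only

    # Toy example for the final, speech-only stage.
    streams = {
        "user_speech":      [101, 102, 103, 104, 105, 106],
        "assistant_speech": [201, 202, 203, 204, 205, 206],
    }
    print(flatten_streams(streams, chunk_size=3, order=two_stream))
    # -> [101, 102, 103, 201, 202, 203, 104, 105, 106, 204, 205, 206]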

4-Streaming Training

3-Streaming Training

2-Streaming Training

Cases

We present some demo cases below:

Metrics

Speech-Text Alignment

ASR results on the LibriSpeech and WenetSpeech test sets. For LibriSpeech, we report WER (%); for WenetSpeech, we report CER (%).

Model        LibriSpeech               WenetSpeech
             test-clean   test-other   test-meeting   test-net
Whisper-S    3.13         7.37         25.62          16.66
Whisper-L    1.82         3.50         18.87          10.48
VITA         8.14         18.40        12.15          16.53
OmniFlatten  7.91         19.21        26.10          19.00
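
For reference, WER and CER are the same computation at different granularities: the Levenshtein edit distance between the reference transcript and the ASR hypothesis, normalized by the reference length, over words for WER and over characters for CER. A minimal, self-contained sketch:

    from typing import Sequence

    def edit_distance(ref: Sequence, hyp: Sequence) -> int:
        """Levenshtein distance between two token sequences."""
        dp = list(range(len(hyp) + 1))  # distance(ref[:0], hyp[:j]) = j
        for i in range(1, len(ref) + 1):
            prev_diag, dp[0] = dp[0], i
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                prev_diag, dp[j] = dp[j], min(dp[j] + 1,         # deletion
                                              dp[j - 1] + 1,     # insertion
                                              prev_diag + cost)  # substitution
        return dp[-1]

    def wer(reference: str, hypothesis: str) -> float:
        """Word error rate: word-level edit distance / reference word count."""
        ref_words = reference.split()
        return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

    def cer(reference: str, hypothesis: str) -> float:
        """Character error rate; spaces stripped, as is usual for Chinese."""
        ref_chars = reference.replace(" ", "")
        return edit_distance(ref_chars, hypothesis.replace(" ", "")) / len(ref_chars)

    print(f"{wer('the cat sat on the mat', 'the cat sat on a mat'):.3f}")  # 0.167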
TTS results on LibriTTS and AIShell-3. For LibriTTS, we transcribe the TTS outputs with the whisper-large-v3 model and report WER (%); for AIShell-3, we transcribe with the paraformer-zh model and report CER (%).

Model        LibriTTS (WER)   AIShell-3 (CER)
Original     2.66             2.52
ChatTTS      8.32             3.87
CosyVoice    2.89             3.82
OmniFlatten  4.51             4.46
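
The TTS evaluation loop behind these numbers is straightforward: synthesize each test sentence, transcribe the audio with the ASR model, and score the transcript against the input text. Below is a sketch using the open-source openai-whisper and jiwer packages; the synthesize function is a hypothetical stand-in for whichever TTS system is under evaluation, and real evaluations also normalize punctuation and casing more carefully.

    from typing import List

    import whisper          # pip install openai-whisper
    from jiwer import wer   # pip install jiwer

    def synthesize(text: str, out_path: str) -> str:
        """Hypothetical TTS hook: write a wav file for `text`, return its path."""
        raise NotImplementedError("plug in the TTS system under evaluation")

    def evaluate_tts(test_sentences: List[str]) -> float:
        """Mean WER of ASR transcripts of the TTS outputs."""
        asr = whisper.load_model("large-v3")
        scores = []
        for i, text in enumerate(test_sentences):
            wav_path = synthesize(text, f"tts_{i:04d}.wav")
            transcript = asr.transcribe(wav_path)["text"]
            scores.append(wer(text.lower(), transcript.lower().strip()))
        return sum(scores) / len(scores)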

Dialogue Capability

Performance of text and speech chat capabilities on English and Chinese test sets.

Model                                 Params  English                             Chinese
                                              Score (Text)  Score (Speech + ASR)  Score (Text)  Score (Speech + ASR)
Qwen2-0.5B-Instruct                   0.5B    6.75          -                     6.98          -
Qwen2-7B-Instruct                     7B      8.37          -                     8.09          -
LLaMA-Omni                            8B      6.01          5.50                  4.17          3.89
Moshi                                 7B      3.92          3.46                  -             -
GLM-Voice                             9B      6.97          6.40                  7.02          6.69
OmniFlatten directly 3-stream         0.5B    2.99          2.59                  4.94          3.95
OmniFlatten 3-stream w/o half-duplex  0.5B    3.89          3.54                  5.25          4.76
OmniFlatten 3-stream full process     0.5B    4.88          3.92                  5.60          5.15
OmniFlatten 2-stream full process     0.5B    -             2.19                  -             3.06
Ground Truth Response                 -       7.65          -                     6.83          -
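
The Speech + ASR columns follow a common protocol for scoring spoken chat: the model's spoken reply is transcribed by an ASR system, and the transcript is scored in place of a text answer. A minimal sketch of that protocol, in which spoken_dialogue_model, asr_transcribe, and judge_score are hypothetical stand-ins for the model under test, an ASR system, and an LLM-based judge:

    from typing import Callable, List

    def speech_asr_chat_score(
        questions: List[str],
        spoken_dialogue_model: Callable[[str], bytes],  # question -> reply audio
        asr_transcribe: Callable[[bytes], str],         # reply audio -> transcript
        judge_score: Callable[[str, str], float],       # (question, answer) -> score
    ) -> float:
        """Average judge score over ASR transcripts of the model's spoken replies."""
        scores = [judge_score(q, asr_transcribe(spoken_dialogue_model(q)))
                  for q in questions]
        return sum(scores) / len(scores)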

Turn-Taking Metrics

Accuracy of assistant turn-taking and user turn-taking at the k-th token (Acc@K), and average response time.

Model        Assistant Turn-taking                   User Turn-taking
             Acc@K 1/5/10/25 (%)   Resp. Time (ms)   Acc@K 1/5/10/25 (%)   Resp. Time (ms)
Moshi        2.9/18.8/38.5/55.1    553               0.0/6.2/14.8/45.7     753
OmniFlatten  20.6/53.6/66.3/71.7   193               10.9/30.9/41.8/51.8   287
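
One way to read Acc@K: a turn-taking event counts as correct if the model takes (or yields) the turn within the first K tokens after the ground-truth switch point, while response time measures the delay until it does. A minimal sketch under that reading; the per-event offset representation is an assumption for illustration, not the paper's exact protocol.

    from typing import Dict, List, Optional, Sequence

    def acc_at_k(offsets: List[Optional[int]],
                 ks: Sequence[int] = (1, 5, 10, 25)) -> Dict[int, float]:
        """Acc@K: share of events where the turn switch happens within K tokens.

        offsets[i] is the 1-based token position, counted from the ground-truth
        switch point, at which the model switched turns (None = never switched).
        """
        return {k: sum(o is not None and o <= k for o in offsets) / len(offsets)
                for k in ks}

    # Toy events: switches after 1, 3, and 12 tokens, plus one missed turn.
    print(acc_at_k([1, 3, 12, None]))
    # -> {1: 0.25, 5: 0.5, 10: 0.5, 25: 0.75}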