Model Overview
We present OmniFlatten, an end-to-end model for full-duplex conversation that captures the complexity and dynamics of natural human dialogue. The model uses a multi-stage post-training scheme to adapt a text-based large language model into an integrated speech-text dialogue system that operates in real time. Through progressive fine-tuning, OmniFlatten aligns the speech and text modalities without changing the core architecture, keeping latency low and interactions seamless. This approach offers a practical path toward more efficient and natural end-to-end full-duplex spoken dialogue systems.
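To make the flattening idea concrete, here is a minimal sketch of interleaving fixed-size chunks of user speech, assistant text, and assistant speech tokens into a single sequence that a decoder-only model can consume autoregressively. The chunk sizes, stream choice, and helper names are illustrative assumptions, not the exact serialization used by OmniFlatten.

```python
# Illustrative sketch only: chunk sizes, stream names, and layout below are
# assumptions, not the exact data layout used by OmniFlatten.
from typing import List

def flatten_streams(user_speech: List[int],
                    assistant_text: List[int],
                    assistant_speech: List[int],
                    speech_chunk: int = 10,
                    text_chunk: int = 4) -> List[int]:
    """Interleave fixed-size chunks from each stream into one token sequence."""
    flattened: List[int] = []
    u = t = a = 0
    while u < len(user_speech) or t < len(assistant_text) or a < len(assistant_speech):
        # One chunk of incoming user speech tokens.
        flattened.extend(user_speech[u:u + speech_chunk])
        u += speech_chunk
        # One chunk of assistant text tokens (guides the spoken response).
        flattened.extend(assistant_text[t:t + text_chunk])
        t += text_chunk
        # One chunk of assistant speech tokens (what the speech decoder consumes).
        flattened.extend(assistant_speech[a:a + speech_chunk])
        a += speech_chunk
    return flattened

# Example with dummy token IDs: three short streams flattened into one sequence.
seq = flatten_streams(list(range(100, 130)), list(range(200, 212)), list(range(300, 330)))
```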
Experiments
We train the model with a progressive learning approach consisting of four stages: Speech-Text Alignment, 4-streaming training, 3-streaming training, and 2-streaming training.
4-Streaming Training
3-Streaming Training
2-Streaming Training
Cases
Below are some example cases:
Metrics
Speech-Text Alignment
| Model | LibriSpeech test_clean (CER ↓) | LibriSpeech test_other (CER ↓) | WenetSpeech test_meeting (CER ↓) |
|---|---|---|---|
| **ASR** | | | |
| OmniFlatten (Ours) | 9.46 | 22.48 | 31.76 |
| Whisper V3 | 3.71 | 5.74 | 19.91 |
| **TTS** | | | |
| OmniFlatten (Ours) | 10.9 | 12.87 | 50.56 |
| GT Speech Tokens | 5.82 | 12.74 | 40.18 |
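CER above is the character error rate: the character-level edit distance between hypothesis and reference divided by the reference length. A minimal, generic implementation is sketched below; it is not the evaluation script used to produce these numbers.

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance over reference length (generic sketch)."""
    ref, hyp = list(reference), list(hypothesis)
    # Dynamic-programming edit distance (substitutions, insertions, deletions).
    dist = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, dist[0] = dist[0], i
        for j, h in enumerate(hyp, start=1):
            cur = min(dist[j] + 1,        # deletion
                      dist[j - 1] + 1,    # insertion
                      prev + (r != h))    # substitution
            prev, dist[j] = dist[j], cur
    return dist[-1] / max(len(ref), 1)

print(cer("hello world", "hallo world"))  # 1 substitution over 11 chars ~ 0.091
```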
Dialogue Capability
| Model | Test Set Loss ↓ | LLM Score ↑ |
|---|---|---|
| OmniFlatten | 0.8125 | 5.185258 |
| OmniFlatten w/o half-duplex training | 0.8129 | 5.008698 |
| OmniFlatten w/o modality alignment and half-duplex training | 0.8496 | 4.346218 |
| GT Response | - | 7.30685 |
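The LLM Score comes from an LLM-as-judge style evaluation; since the judge model, prompt, and rating scale are not given here, the sketch below only illustrates the general pattern. The `query_llm` helper, the prompt wording, and the 0-10 scale are hypothetical.

```python
import re
from statistics import mean

def judge_score(context: str, response: str, query_llm) -> float:
    """Ask a judge LLM to rate one response; prompt and scale are illustrative only."""
    prompt = (
        "Rate the following assistant response for relevance and naturalness "
        "on a scale from 0 to 10. Reply with a single number.\n"
        f"Dialogue context:\n{context}\n\nResponse:\n{response}\nScore:"
    )
    reply = query_llm(prompt)                 # hypothetical LLM call
    match = re.search(r"\d+(\.\d+)?", reply)  # pull the first number out of the reply
    return float(match.group()) if match else 0.0

def llm_score(examples, query_llm) -> float:
    """Average judge score over (context, response) pairs."""
    return mean(judge_score(c, r, query_llm) for c, r in examples)
```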
Turn-Taking Metrics
| Chunk Size | Assistant Turn-taking Acc@K 1/5/10/25 (%) | Average Assistant Turn-taking Response Time (K / ms) | User Turn-taking Acc@K 1/5/10/25 (%) | Average User Turn-taking Response Time (K / ms) |
|---|---|---|---|---|
| 5 | 29.2/59.4/67.4/71.9 | 3.23 / 129 | 2.1/5.7/8.1/17.0 | 20.55 / 822 |
| 10 | 19.8/55.7/71.3/75.5 | 3.99 / 160 | 5.5/13.4/19.8/30.0 | 20.13 / 805 |
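Our reading of Acc@K is the fraction of turn switches the model takes within K chunks of the reference switch point, with the response-time columns reporting the average delay in chunks (K) and in milliseconds. The sketch below follows that reading with hypothetical inputs and a placeholder chunk duration; it is not the evaluation code behind the table.

```python
from statistics import mean
from typing import List, Optional, Tuple

def turn_taking_metrics(events: List[Tuple[int, Optional[int]]],
                        ks=(1, 5, 10, 25),
                        chunk_ms: float = 40.0):
    """events: (reference switch chunk, predicted switch chunk or None if missed).
    Returns Acc@K for each K plus the average response delay in chunks and ms.
    The 40 ms per-chunk duration is a placeholder, not the real chunk length."""
    delays = [pred - ref for ref, pred in events if pred is not None and pred >= ref]
    acc = {k: sum(d <= k for d in delays) / len(events) for k in ks}
    avg_chunks = mean(delays) if delays else float("nan")
    return acc, avg_chunks, avg_chunks * chunk_ms

# Hypothetical example: four reference turn switches, one of them missed.
acc, avg_k, avg_ms = turn_taking_metrics([(10, 12), (40, 41), (75, 90), (120, None)])
```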
Citations
@misc{zhang2024omniflattenendtoendgptmodel,
title={OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation},
author={Qinglin Zhang and Luyao Cheng and Chong Deng and Qian Chen and Wen Wang and Siqi Zheng and Jiaqing Liu and Hai Yu and Chaohong Tan},
year={2024},
eprint={2410.17799},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2410.17799},
}