OmniFlatten

OmniFlatten is a full-duplex, end-to-end GPT model for seamless voice conversation that converts multi-stream speech and text into a single flattened token sequence, enabling a standard LLM backbone to handle simultaneous listening and speaking.

The work was published in 2025 and accepted at the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025). The model supports natural, low-latency spoken dialogue, including interruptions, overlapping speech, and backchannels, without modifying the underlying LLM architecture.

The Core Idea: Flattening

Most approaches to full-duplex dialogue require custom architectures or multiple separate model components to handle the user and assistant audio streams in parallel. OmniFlatten takes a different path: it converts all streams — user speech, assistant speech, and text — into a single interleaved token sequence through a “flattening” operation.

By flattening multi-modal, multi-speaker data into one unified sequence, a standard GPT-style transformer can learn full-duplex conversational dynamics (simultaneous input/output, barge-in handling, overlap) using the same training objective it already uses for language modelling: next-token prediction. No architectural changes are needed.
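The idea can be illustrated with a minimal sketch. The stream names, their ordering, and the token strings below are assumptions chosen for illustration, not the tokenisation or ordering used by OmniFlatten itself. The sketch only shows how several parallel streams can be interleaved into one sequence and then turned into ordinary next-token prediction examples.

```python
from typing import Dict, List, Tuple

# Hypothetical stream names and ordering, for illustration only.
STREAM_ORDER = ["user_speech", "assistant_text", "assistant_speech"]


def flatten(streams: Dict[str, List[List[str]]]) -> List[str]:
    """Interleave per-step token groups from every stream into one sequence."""
    num_steps = len(streams[STREAM_ORDER[0]])
    flat: List[str] = []
    for step in range(num_steps):
        for name in STREAM_ORDER:
            flat.extend(streams[name][step])
    return flat


def next_token_pairs(flat: List[str]) -> List[Tuple[List[str], str]]:
    """Build (context, target) pairs, as a next-token objective would see them."""
    return [(flat[:i], flat[i]) for i in range(1, len(flat))]


# Toy data: two steps, each stream contributes a small group of tokens per step.
streams = {
    "user_speech": [["u0", "u1"], ["u2", "u3"]],
    "assistant_text": [["t0"], ["t1"]],
    "assistant_speech": [["a0", "a1"], ["a2", "a3"]],
}
flat = flatten(streams)
pairs = next_token_pairs(flat)
```

Because the result is a plain list of tokens, any decoder-only model that is trained to predict the next token can consume it unchanged, which is the point the section makes.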

Multi-Stage Training

OmniFlatten adapts a pre-trained text LLM into a full-duplex speech-text model through three progressive training stages:

  1. Modality alignment — teaching the model to understand and generate speech tokens alongside text tokens, grounding audio representations in linguistic meaning.
  2. Half-duplex dialogue learning — training on turn-based spoken conversations, where the model learns to produce coherent spoken responses to speech input.
  3. Full-duplex dialogue learning — training on flattened, time-chunked data where both user and assistant streams co-occur, enabling the model to listen and speak simultaneously.

All three stages use the same flattening operation and the same GPT backbone, making the pipeline coherent and reusable.
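The progression can be pictured as a simple sequential curriculum in which each stage starts from the checkpoint left by the previous one. The sketch below is an assumption-laden outline: `Stage`, `run_curriculum`, and `fake_train` are hypothetical names, and the training function is a stub that only records the order of stages rather than training anything.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Stage:
    name: str
    data: str  # informal description of the flattened training data


# Stage descriptions paraphrase the list above; the identifiers are hypothetical.
CURRICULUM: List[Stage] = [
    Stage("modality_alignment", "paired speech and text tokens"),
    Stage("half_duplex", "turn-based spoken dialogue"),
    Stage("full_duplex", "time-chunked, co-occurring user and assistant streams"),
]


def run_curriculum(
    stages: List[Stage],
    train_fn: Callable[[str, Stage], str],
    checkpoint: str,
) -> str:
    """Run the stages in order, feeding each stage the previous checkpoint."""
    for stage in stages:
        checkpoint = train_fn(checkpoint, stage)
    return checkpoint


def fake_train(checkpoint: str, stage: Stage) -> str:
    """Stub standing in for a real training run; it only records lineage."""
    return f"{checkpoint}->{stage.name}"


final = run_curriculum(CURRICULUM, fake_train, "text_llm")
```

In a real pipeline `train_fn` would load the checkpoint, train on that stage's flattened data, and save a new checkpoint; only the data changes between stages, not the model code.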

Time Chunking and Synchronisation

To handle the real-time nature of duplex audio, OmniFlatten divides the conversation into fixed-size time chunks and interleaves tokens from each stream within each chunk. Time information is embedded directly into the token sequence, allowing the model to synchronise its output with incoming user audio without explicit timing control logic.
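A minimal sketch of chunk-wise interleaving follows. The token rate, chunk duration, and the use of a single chunk size for both streams are assumptions made for the example, not values taken from OmniFlatten. The sketch shows one way position in the flattened sequence can carry timing: every block of `2 * size` tokens corresponds to one time chunk.

```python
from typing import List


def tokens_per_chunk(token_rate_hz: float, chunk_ms: int) -> int:
    """Number of tokens one stream produces during a single time chunk."""
    return round(token_rate_hz * chunk_ms / 1000)


def split(tokens: List[str], size: int) -> List[List[str]]:
    """Cut a token stream into consecutive chunks of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def interleave_by_chunk(user: List[str], assistant: List[str], size: int) -> List[str]:
    """Emit chunk k of the user stream, then chunk k of the assistant stream."""
    user_chunks = split(user, size)
    asst_chunks = split(assistant, size)
    flat: List[str] = []
    for k in range(max(len(user_chunks), len(asst_chunks))):
        if k < len(user_chunks):
            flat.extend(user_chunks[k])
        if k < len(asst_chunks):
            flat.extend(asst_chunks[k])
    return flat


# Hypothetical settings: 5 tokens per second, 400 ms chunks, so 2 tokens per chunk.
size = tokens_per_chunk(5.0, 400)
user = ["u0", "u1", "u2", "u3"]
assistant = ["a0", "a1", "a2", "a3"]
flat_chunks = interleave_by_chunk(user, assistant, size)
```

With these toy settings the first four tokens of `flat_chunks` cover the first 400 ms of both streams and the next four cover the following 400 ms, so the ordering itself encodes which audio co-occurred.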

This design lets OmniFlatten detect and respond to barge-in (mid-sentence interruptions) with low latency, since it processes new user audio at each chunk boundary rather than waiting for a full utterance.
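The latency argument can be made concrete with a toy simulation. In OmniFlatten the decision to stop speaking is learned by the model; the hand-written rule below is only a stand-in, and `SILENCE`, `duplex_loop`, and the token strings are hypothetical. What the sketch does show is that, because user audio is inspected once per chunk, the reaction delay to an interruption is bounded by roughly one chunk.

```python
from typing import List

SILENCE = "<sil>"  # hypothetical silence token


def duplex_loop(
    user_chunks: List[List[str]],
    planned_reply: List[str],
    chunk_size: int,
) -> List[List[str]]:
    """Toy rule: speak the planned reply until any user chunk contains speech."""
    spoken: List[List[str]] = []
    cursor = 0
    interrupted = False
    for user_chunk in user_chunks:
        # New user audio is checked at every chunk boundary.
        if any(tok != SILENCE for tok in user_chunk):
            interrupted = True
        if interrupted:
            out = [SILENCE] * chunk_size  # yield the floor to the user
        else:
            out = planned_reply[cursor:cursor + chunk_size]
            cursor += len(out)
            out = out + [SILENCE] * (chunk_size - len(out))  # pad a short last chunk
        spoken.append(out)
    return spoken


user_chunks = [
    [SILENCE, SILENCE],
    [SILENCE, SILENCE],
    ["u0", "u1"],  # the user barges in during the third chunk
    [SILENCE, SILENCE],
]
reply = ["a0", "a1", "a2", "a3", "a4", "a5"]
result = duplex_loop(user_chunks, reply, chunk_size=2)
```

Here the assistant speaks for two chunks, then falls silent in the same chunk in which the user starts talking, leaving the rest of its planned reply unspoken.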

Comparison to Contemporaries

System       Approach                                      Architecture change?
OmniFlatten  Flatten all streams into one sequence         No: standard GPT backbone
Moshi        Parallel dual-stream with depth transformer   Yes: custom audio layers
dGSLM        Siamese network with cross-attention          Yes: two separate encoders

Unlike dGSLM, which requires a dedicated Siamese architecture, OmniFlatten’s flattening strategy lets practitioners adapt any capable text LLM into a full-duplex voice model with minimal architectural overhead.

Significance

OmniFlatten demonstrated that full-duplex spoken dialogue does not inherently require bespoke model architectures — the right data representation and training curriculum can unlock the capability in a standard transformer. Its acceptance at ACL 2025 reflects its influence on the growing field of full-duplex spoken language models.