Moshi — AI Glossary | Superteams.ai

Moshi is a speech-text foundation model and full-duplex spoken dialogue system released by Kyutai in September 2024. Conventional voice assistants operate in strict turn-taking mode, waiting for the user to finish before responding. Moshi instead models both sides of a conversation simultaneously, enabling overlapping speech, interruptions, and backchannels in real time. Its theoretical end-to-end latency is 160 ms, and it achieves roughly 200 ms in practice.

The Full-Duplex Difference

Traditional voice AI pipelines work in half-duplex: they detect when the user stops speaking, transcribe, process, then generate a response. This creates unnatural pauses and prevents interruption. Full-duplex systems like Moshi maintain two parallel audio streams — one for the user, one for the model — processed simultaneously, just as humans hear and speak at the same time.

Moshi removes explicit speaker turn boundaries entirely, allowing the model to decide when to speak, stay silent, or respond mid-sentence based on the live audio context.
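The contrast can be made concrete with a toy sketch. This is not Moshi's implementation: frames are stand-in strings rather than audio, and `half_duplex`, `full_duplex`, and `toy_step` are hypothetical names introduced only for illustration. The point is the control flow: the half-duplex loop waits for the whole user turn, while the full-duplex loop emits one model frame (possibly silence) for every user frame.

```python
from typing import Callable, List

SILENCE = "<sil>"  # placeholder for a silent frame (illustrative only)


def half_duplex(user_frames: List[str],
                respond: Callable[[List[str]], List[str]]) -> List[str]:
    """Stay silent for the whole user turn, then reply once it has ended."""
    heard = [f for f in user_frames if f != SILENCE]
    return [SILENCE] * len(user_frames) + respond(heard)


def full_duplex(user_frames: List[str],
                step: Callable[[List[str], List[str]], str]) -> List[str]:
    """Emit one model frame per user frame; the model may speak at any time."""
    model_frames: List[str] = []
    for t in range(len(user_frames)):
        # The model sees the user's audio so far and its own past output.
        model_frames.append(step(user_frames[: t + 1], model_frames))
    return model_frames


def toy_step(heard: List[str], said: List[str]) -> str:
    """Toy policy: backchannel whenever the user pauses, otherwise stay silent."""
    return "mm-hm" if heard[-1] == SILENCE else SILENCE


if __name__ == "__main__":
    user = ["so", "I", "was", SILENCE, "thinking"]
    print(half_duplex(user, lambda heard: ["ok"]))
    print(full_duplex(user, toy_step))
```

In the full-duplex loop there is no end-of-turn detector at all; whether to speak is just another per-frame prediction.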

Architecture

Moshi is built on top of Helium, a 7-billion-parameter text language model pre-trained on over 2.1 trillion tokens of public English data. Helium serves as the reasoning core, while Moshi extends it with audio generation through a hierarchical token system:

Mimi: a streaming neural audio codec that compresses 24 kHz audio to 1.1 kbps at a frame rate of 12.5 Hz, so each frame covers 80 ms of audio. Mimi encodes each frame with residual vector quantization (RVQ), and the resulting codebook tokens form Moshi's audio vocabulary.
Depth Transformer: a smaller secondary transformer that generates the RVQ codebook levels one after another within each time step, which keeps high-quality audio output compatible with streaming.
Temporal Transformer: the main autoregressive backbone (initialised from Helium) that processes the sequence of audio and text tokens across time steps.
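The figures quoted for Mimi and for Moshi's latency are linked by simple arithmetic, sketched below. Two inputs are assumptions that the text above does not state: that Mimi uses 8 codebooks of 2,048 entries each, and that the 160 ms theoretical latency is one 80 ms frame plus one frame of acoustic delay. The function names are illustrative.

```python
import math

# Figures quoted in the text.
SAMPLE_RATE_HZ = 24_000
FRAME_RATE_HZ = 12.5

# Assumptions, not stated in the text above.
NUM_CODEBOOKS = 8          # RVQ levels per frame
CODEBOOK_SIZE = 2048       # entries per codebook, i.e. 11 bits per token
ACOUSTIC_DELAY_FRAMES = 1  # extra frames of delay before audio is emitted


def frame_duration_ms(frame_rate_hz: float) -> float:
    """Duration of one codec frame in milliseconds."""
    return 1000.0 / frame_rate_hz


def samples_per_frame(sample_rate_hz: int, frame_rate_hz: float) -> int:
    """Number of raw audio samples summarised by one frame."""
    return round(sample_rate_hz / frame_rate_hz)


def bitrate_bps(frame_rate_hz: float, num_codebooks: int, codebook_size: int) -> float:
    """Bits per second: frames/s x tokens/frame x bits/token."""
    return frame_rate_hz * num_codebooks * math.log2(codebook_size)


def theoretical_latency_ms(frame_rate_hz: float, delay_frames: int) -> float:
    """One frame to hear the input plus the assumed acoustic delay."""
    return frame_duration_ms(frame_rate_hz) * (1 + delay_frames)


if __name__ == "__main__":
    print(frame_duration_ms(FRAME_RATE_HZ))                              # 80.0
    print(samples_per_frame(SAMPLE_RATE_HZ, FRAME_RATE_HZ))              # 1920
    print(bitrate_bps(FRAME_RATE_HZ, NUM_CODEBOOKS, CODEBOOK_SIZE))      # 1100.0
    print(theoretical_latency_ms(FRAME_RATE_HZ, ACOUSTIC_DELAY_FRAMES))  # 160.0
```

Under these assumptions, 12.5 frames per second times 8 tokens of 11 bits gives 1,100 bits per second, matching the 1.1 kbps figure.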

Moshi runs two audio token streams in parallel: one for the user’s speech and one for Moshi’s own speech. Because both streams are modelled jointly at every time step, Moshi can listen and speak at the same time without a separate turn-detection component.
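One way to picture the parallel streams is as a fixed-width row of tokens per time step. The sketch below is a hypothetical data layout, not Moshi's actual tensor format: `StepTokens` and `build_sequence` are illustrative names, and the count of 8 audio codebooks per speaker is an assumption.

```python
from dataclasses import dataclass
from typing import List

NUM_CODEBOOKS = 8  # assumed number of audio codebook levels per speaker


@dataclass
class StepTokens:
    """All tokens belonging to one time step (one codec frame)."""
    text: int               # text token aligned to this step
    moshi_audio: List[int]  # one token per codebook for Moshi's speech
    user_audio: List[int]   # one token per codebook for the user's speech

    def flatten(self) -> List[int]:
        """Row layout: [text, Moshi codebooks..., user codebooks...]."""
        if len(self.moshi_audio) != NUM_CODEBOOKS or len(self.user_audio) != NUM_CODEBOOKS:
            raise ValueError("each audio stream needs one token per codebook")
        return [self.text, *self.moshi_audio, *self.user_audio]


def build_sequence(text: List[int],
                   moshi: List[List[int]],
                   user: List[List[int]]) -> List[StepTokens]:
    """Zip time-aligned streams into one step-by-step sequence."""
    if not (len(text) == len(moshi) == len(user)):
        raise ValueError("all streams must cover the same number of steps")
    return [StepTokens(t, m, u) for t, m, u in zip(text, moshi, user)]


if __name__ == "__main__":
    step = StepTokens(text=5,
                      moshi_audio=list(range(8)),
                      user_audio=list(range(100, 108)))
    print(len(step.flatten()))  # 17 streams per step under these assumptions
```

Because the user's tokens sit in the same row as Moshi's, the backbone conditions on what the user is saying at every step, including while Moshi itself is talking.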

Inner Monologue

One of Moshi’s key innovations is its Inner Monologue method. At each time step, Moshi predicts a text token aligned to that moment in the conversation before it generates the audio tokens for the same step. This silent “thinking in text” step significantly improves the linguistic quality and coherence of the spoken output, because the model grounds each audio chunk in an explicit textual intent before synthesising it.
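The ordering within a single time step can be sketched as follows. This is a minimal illustration under stated assumptions, not Kyutai's code: `generate_step`, `predict_text`, and `predict_audio` are hypothetical names, the "context" is a plain integer standing in for the backbone's summary of the conversation so far, and 8 codebooks per step is assumed. What matters is the order: text token first, then audio codebook tokens one at a time, each conditioned on what was already produced in that step.

```python
from typing import Callable, List, Tuple

NUM_CODEBOOKS = 8  # assumed number of audio codebook levels per step


def generate_step(
    context: int,
    predict_text: Callable[[int], int],
    predict_audio: Callable[[int, List[int]], int],
) -> Tuple[int, List[int]]:
    """Produce one step: a text token, then audio tokens level by level."""
    # 1. Inner Monologue: commit to a text token before any audio.
    text_token = predict_text(context)

    # 2. Generate each codebook level, conditioned on the text token
    #    and on the levels already generated within this step.
    audio_tokens: List[int] = []
    for _ in range(NUM_CODEBOOKS):
        prefix = [text_token, *audio_tokens]
        audio_tokens.append(predict_audio(context, prefix))
    return text_token, audio_tokens


# Toy stand-ins for the learned predictors (illustrative only).
def toy_predict_text(context: int) -> int:
    return context % 10


def toy_predict_audio(context: int, prefix: List[int]) -> int:
    return len(prefix)  # returns how many tokens it was conditioned on


if __name__ == "__main__":
    text, audio = generate_step(42, toy_predict_text, toy_predict_audio)
    print(text, audio)  # 2 [1, 2, 3, 4, 5, 6, 7, 8]
```

The toy audio predictor returns the length of its prefix, which makes the conditioning visible: the first audio level sees only the text token, the last sees the text token plus seven earlier levels.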

Training Pipeline

Moshi’s training proceeds through four phases:

1. Unsupervised pre-training on large-scale audio data, initialising the Temporal Transformer from Helium.
2. Post-training on diarized data: simulated multi-stream training derived from segmented real conversations.
3. Fisher dataset fine-tuning: using a corpus of real telephone conversations to develop full-duplex capability.
4. Instruction fine-tuning on a custom dataset built from synthetic interactive scripts, to align conversational behaviour.

Training ran on 127 DGX nodes (1,016 H100 GPUs) provided by Scaleway.

Model Variants

Kyutai released two voice variants, Moshika (female voice) and Moshiko (male voice). Both are available in bf16 and int8 precision for PyTorch, alongside an MLX port for Apple Silicon and a Rust/Candle port for edge deployment.

Significance

Moshi was one of the first open-weight full-duplex speech LLMs, establishing a blueprint for real-time conversational AI that has since influenced successor systems including OmniFlatten and the broader field of full-duplex spoken dialogue modelling.