dGSLM (Dialogue Generative Spoken Language Model) — AI Glossary

dGSLM (Dialogue Generative Spoken Language Model) is a full-duplex, speech-only language model designed to generate naturalistic spoken dialogue — including backchannels, overlapping speech, and turn-taking — without relying on text as an intermediate representation. It is one of the foundational systems in the full-duplex spoken dialogue research lineage, predating and influencing later models such as Moshi and OmniFlatten.

Speech-Only, Text-Free

Unlike most voice AI systems that transcribe speech to text, reason in text, then synthesise speech back, dGSLM operates entirely in the speech domain. It learns to generate spoken conversation directly from raw audio features, capturing the prosodic and paralinguistic cues (timing, pitch, rhythm) that text-based systems inherently discard.

This text-free design makes dGSLM particularly sensitive to the acoustic dynamics of conversation — when to interject, when to stay silent, when to produce a backchannel like “mm-hmm” — since all of this information is present in the speech signal itself.
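To make the timing cues above concrete, the following minimal sketch labels each frame of a two-channel recording as silence, single-speaker speech, or overlap. It is an illustration only and is not taken from dGSLM: the function name `label_frames` and the per-frame voice-activity flags are hypothetical, and the model itself learns these patterns from speech units rather than from explicit labels.

```python
from collections import Counter

def label_frames(active_a, active_b):
    """Label each frame from two aligned voice-activity tracks (True = speaking)."""
    if len(active_a) != len(active_b):
        raise ValueError("channels must be frame-aligned")
    labels = []
    for a, b in zip(active_a, active_b):
        if a and b:
            labels.append("overlap")   # both speakers active at once
        elif a:
            labels.append("A")
        elif b:
            labels.append("B")
        else:
            labels.append("silence")   # a pause or a gap between turns
    return labels

# Hypothetical 10-frame excerpt: A talks, B backchannels briefly over A,
# then B takes the turn after a short gap.
a = [True, True, True, True, True, False, False, False, False, False]
b = [False, False, True, False, False, False, True, True, True, False]

labels = label_frames(a, b)
counts = Counter(labels)
print(labels)
print(counts)
```

A transcript of the same excerpt would record only who said which words; the single overlapping frame and the one-frame gap before B's turn exist only in the two-channel timing.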

Siamese Dual-Encoder Architecture

dGSLM models the two sides of a conversation (speaker A and speaker B) as two separate but coupled channels, processed by a Siamese network: two identical encoder branches that share weights and exchange information through cross-attention.

Each branch processes one speaker’s audio stream, represented as a sequence of discrete speech units produced by a HuBERT-based speech tokeniser.
Cross-attention lets each branch condition its predictions on the other speaker’s current state, enabling true bidirectional influence between the two streams.
The model generates speech tokens for both channels jointly, producing realistic turn-taking and overlap behaviour.
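The weight sharing and cross-attention described above can be sketched in a few lines of standard-library Python. This is a toy under stated assumptions, not the dGSLM implementation: the names `tower`, `cross_attend`, and `make_params`, the random parameters, the vocabulary size, and the single attention step are all hypothetical, and a real model would add self-attention, multiple layers, and trained weights. The sketch also assumes that each branch attends only to frames of the other channel up to the current time step.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vec):
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]

def cross_attend(query, keys):
    """Dot-product attention of one query vector over the other channel's vectors."""
    scale = math.sqrt(len(query))
    weights = softmax([sum(q * k for q, k in zip(query, key)) / scale for key in keys])
    return [sum(w * key[i] for w, key in zip(weights, keys)) for i in range(len(query))]

def tower(own_units, other_units, params):
    """One branch: embed both unit streams, attend to the other channel, score next units."""
    emb, out = params["embedding"], params["output"]
    own = [emb[u] for u in own_units]
    other = [emb[u] for u in other_units]
    logits = []
    for t, h in enumerate(own):
        context = cross_attend(h, other[: t + 1])  # assumption: no look-ahead
        mixed = [x + c for x, c in zip(h, context)]
        logits.append(matvec(out, mixed))          # one score per unit in the vocabulary
    return logits

def make_params(vocab_size, dim, seed=0):
    """Random stand-in parameters; a trained model would learn these."""
    rng = random.Random(seed)
    return {
        "embedding": [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(vocab_size)],
        "output": [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(vocab_size)],
    }

params = make_params(vocab_size=8, dim=4)
units_a = [1, 1, 5, 2]  # hypothetical unit IDs for speaker A
units_b = [0, 3, 3, 7]  # hypothetical unit IDs for speaker B

# Siamese property: the same function and the same parameters serve both channels.
logits_a = tower(units_a, units_b, params)
logits_b = tower(units_b, units_a, params)
```

Because both calls use the same `params`, there is only one set of weights to train, and each channel's predictions change whenever the other channel's past units change.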

This architecture is a natural fit for the two-speaker structure of dialogue, but it is purpose-built: a standard single-stream LLM cannot be used in its place without architectural changes.

Turn-Taking and Latency

Benchmarks evaluating full-duplex models have found that dGSLM responds with an average latency of approximately 0.3 seconds — comparable to Moshi — and exhibits a high Takeover Rate, meaning it engages quickly when the other speaker finishes or pauses. This responsiveness reflects the model’s direct speech-level processing: it does not wait for a full utterance to be transcribed before deciding to act.
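As a rough illustration of how such figures can be computed, the sketch below derives a takeover rate and a mean response latency from per-trial timestamps. The definitions are assumptions made for this example and may differ from those used by any particular benchmark: a takeover is counted when the model starts speaking within a fixed window after the user stops, and latency is the gap between those two moments. The function name `turn_taking_stats`, the 2-second window, and the trial data are hypothetical.

```python
def turn_taking_stats(trials, window=2.0):
    """Return (takeover_rate, mean_latency) for a list of trials.

    Each trial is (user_end_time, model_start_time), in seconds, with
    model_start_time set to None when the model never responded.
    """
    latencies = []
    for user_end, model_start in trials:
        if model_start is None:
            continue
        gap = model_start - user_end
        # Simplification: starts before the user has finished (negative gap)
        # or after the window are not counted as takeovers here.
        if 0.0 <= gap <= window:
            latencies.append(gap)
    takeover_rate = len(latencies) / len(trials) if trials else 0.0
    mean_latency = sum(latencies) / len(latencies) if latencies else None
    return takeover_rate, mean_latency

# Hypothetical trials: three takeovers and one missed turn.
trials = [(4.0, 4.25), (9.5, 9.75), (12.0, None), (20.0, 20.5)]
rate, latency = turn_taking_stats(trials)
print(rate, latency)
```

Under these definitions a model can score a high takeover rate simply by responding to every pause, so the two numbers are best read together with measures of whether taking the turn was appropriate.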

Limitations and Successors

dGSLM established the core research problem and a working architecture for speech-only duplex modelling, but it has several practical constraints:

No language grounding — operating without text means dGSLM cannot easily be steered by instructions or integrated with LLM-based reasoning.
Bespoke architecture — the Siamese design is not directly compatible with standard LLM infrastructure, limiting scalability.
Evaluation gap — later benchmarks (Full-Duplex-Bench, FLEXI) revealed persistent challenges in backchannel timing, emergency detection, and cross-turn correction that the dGSLM baseline does not fully address.

Subsequent systems like Moshi and OmniFlatten addressed these gaps by integrating text representations and building on top of large pre-trained language models, while preserving the full-duplex streaming capability that dGSLM pioneered.