OmniVoice is a zero-shot text-to-speech model developed by Xiaomi’s speech research team (k2-fsa) and introduced in 2025. It supports 646 languages from a single unified model — approximately 20× more language coverage than ElevenLabs and 5× more than PlayHT — with particular depth in low-resource languages that most commercial and open-source TTS systems fail to support adequately.
Beyond language coverage, OmniVoice introduces two capabilities that distinguish it from other open TTS models: zero-shot voice cloning from a short reference clip, and voice design — generating a new speaker persona from a text description of desired vocal attributes, without requiring any reference audio at all.
Architecture
OmniVoice uses a hybrid diffusion language model architecture that is neither purely autoregressive nor purely diffusion-based. It combines:
- Diffusion-based synthesis for output quality and speaker fidelity
- LLM-style inference for generation speed and sequence coherence
This combination lets OmniVoice achieve high naturalness and very low latency together, two goals that usually trade off against each other. The model reports a real-time factor (RTF) of 0.025, meaning it synthesises 40 seconds of audio per second of compute (1 / 0.025 = 40).
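The RTF arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the function names are illustrative and not part of any OmniVoice API, and the only figure taken from the article is the RTF of 0.025.

```python
def real_time_factor(compute_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesising / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return compute_seconds / audio_seconds


def speedup_over_real_time(rtf: float) -> float:
    """Seconds of audio produced per second of compute (the inverse of RTF)."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf


def compute_seconds_for(audio_seconds: float, rtf: float) -> float:
    """Estimated compute time needed to synthesise a clip of the given length."""
    return audio_seconds * rtf


rtf = 0.025  # figure quoted in the article
print(f"{speedup_over_real_time(rtf):.1f}x real time")
print(f"{compute_seconds_for(60.0, rtf):.2f} s of compute for a 60 s clip")
```

At an RTF of 0.025, a one-minute clip would take roughly 1.5 seconds of compute. Real-world latency also depends on hardware, batch size, and model loading time, none of which this arithmetic captures.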
Performance
On a 24-language multilingual benchmark:
| Model | Word Error Rate (lower is better) | Speaker Similarity (higher is better) |
|---|---|---|
| OmniVoice | 2.85% | 0.830 |
| ElevenLabs | 10.95% | 0.655 |
| PlayHT | ~8–9% | ~0.70 |
On this benchmark, OmniVoice records a lower word error rate (2.85% vs 10.95%) and higher speaker similarity (0.830 vs 0.655) than ElevenLabs, including on non-English languages where ElevenLabs has historically been strongest.
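For readers unfamiliar with the two metrics, the sketch below shows how they are commonly computed. It rests on assumptions: the article does not describe the benchmark's evaluation pipeline. In typical TTS evaluations, word error rate is the word-level edit distance between the input text and an ASR transcript of the synthesised audio, and speaker similarity is the cosine similarity between speaker embeddings of the reference and generated audio. The embedding vectors here are toy values, not output from a real speaker encoder.

```python
import math
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER: {wer:.3f}")

# Toy embeddings standing in for speaker-encoder output (hypothetical values).
sim = cosine_similarity([0.2, 0.9, 0.1], [0.25, 0.85, 0.05])
print(f"Speaker similarity: {sim:.3f}")
```

Production evaluations usually normalise text (casing, punctuation, numerals) before scoring, which this sketch omits.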
Key Capabilities
- 646-language coverage — including low-resource African, Asian, and Indigenous language families
- Zero-shot voice cloning — clone any speaker from a reference clip of a few seconds
- Voice design — specify a voice using natural language attributes (see below)
- 40× real-time synthesis — RTF 0.025 on standard hardware
- Apache 2.0 license — open for research and commercial deployment
Voice Design: Generating Speakers Without Reference Audio
Most voice cloning systems require a recording of the target speaker. OmniVoice introduces voice design as an alternative: you describe the voice you want using natural language attributes, and the model generates a speaker persona matching that description.
Attributes you can specify include:
- Gender and approximate age
- Pitch register (high, mid, low)
- Speaking rate (slow, natural, fast)
- Accent or regional dialect
- Stylistic character (formal, warm, authoritative, whispered)
This capability matters for product teams that have no reference recording: for instance, when creating a branded AI persona, generating character voices for interactive media, or building voice interfaces for languages where no recordings exist.
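One way an application might organise the attributes listed above before passing them to a voice-design model is sketched below. Everything in it is hypothetical: `VoiceDescription`, its field names, and the prompt wording are illustrative choices, not the OmniVoice interface, which the article does not document. The allowed pitch and rate values mirror the options named in the list.

```python
from dataclasses import dataclass
from typing import Optional

# Value sets taken from the attribute list in the article.
ALLOWED_PITCH = {"high", "mid", "low"}
ALLOWED_RATE = {"slow", "natural", "fast"}


@dataclass(frozen=True)
class VoiceDescription:
    """Hypothetical container for voice-design attributes (not an OmniVoice type)."""
    gender: Optional[str] = None
    age: Optional[str] = None
    pitch: Optional[str] = None
    rate: Optional[str] = None
    accent: Optional[str] = None
    style: Optional[str] = None

    def __post_init__(self) -> None:
        # Reject values outside the registers named in the article.
        if self.pitch is not None and self.pitch not in ALLOWED_PITCH:
            raise ValueError(f"pitch must be one of {sorted(ALLOWED_PITCH)}")
        if self.rate is not None and self.rate not in ALLOWED_RATE:
            raise ValueError(f"rate must be one of {sorted(ALLOWED_RATE)}")

    def to_prompt(self) -> str:
        """Render the attributes that were set as a comma-separated description."""
        parts = []
        if self.gender:
            parts.append(f"{self.gender} speaker")
        if self.age:
            parts.append(f"aged {self.age}")
        if self.pitch:
            parts.append(f"{self.pitch} pitch")
        if self.rate:
            parts.append(f"{self.rate} speaking rate")
        if self.accent:
            parts.append(f"{self.accent} accent")
        if self.style:
            parts.append(f"{self.style} style")
        if not parts:
            raise ValueError("at least one attribute is required")
        return ", ".join(parts)


persona = VoiceDescription(
    gender="female", age="mid-30s", pitch="low",
    rate="natural", accent="Scottish English", style="warm",
)
print(persona.to_prompt())
```

Keeping the description in a structured, validated form makes a branded persona reproducible across releases. The exact prompt format a given model expects should be taken from its own documentation.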
Enterprise Relevance
OmniVoice is most relevant for:
- Global product localisation — voice output in 646 languages from a single model, eliminating the need for per-language TTS vendors
- Low-resource language markets — serving users in languages that commercial APIs do not support
- Privacy-sensitive deployments — on-premise inference under Apache 2.0, no external API dependency
- Branded voice creation — voice design removes the dependency on recording sessions for persona development
- Regulated industries — healthcare, legal, and government applications requiring data sovereignty
References
- OmniVoice paper – arXiv:2604.00688
- OmniVoice GitHub (k2-fsa)
- OmniVoice demo page
- OmniVoice vs ElevenLabs – HM.AI