Kokoro is an open-weight text-to-speech model developed by hexgrad and released in late 2024. At just 82 million parameters, it is an outlier in the TTS landscape: most models competing on quality run at 1B parameters or more. Kokoro produces natural, clean speech that matches or outperforms much larger models in human evaluations, while fitting in roughly 350MB and running on CPU alone.
It became the most-downloaded open-source TTS model on Hugging Face within months of release, largely because it solves a real deployment problem: delivering high-quality voice synthesis without the cost and infrastructure overhead of GPU inference.
How It Works
According to its model card, Kokoro builds on the StyleTTS 2 architecture with an ISTFTNet vocoder, in a decoder-only configuration with no diffusion stage, rather than the autoregressive or diffusion approaches used by many TTS models. Speech is generated in a single forward pass with no iterative sampling steps, which is what makes it so fast. It was trained on a curated dataset of English speech emphasising naturalness and prosodic variety across multiple speaker styles.
The model outputs audio at 24kHz with natural pacing, appropriate sentence-level intonation, and consistent voice quality across long passages — all common failure points in smaller TTS models.
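To make the 24kHz output format concrete, here is a minimal sketch of writing mono samples at that rate to a 16-bit PCM WAV file with Python's standard library. The sine tone is only a stand-in for model output. The helper names `sine_tone` and `write_wav_24k` are hypothetical, and the assumption that samples arrive as floats in [-1, 1] is ours rather than something the source specifies.

```python
import math
import struct
import wave

SAMPLE_RATE = 24_000  # Hz, the output rate stated above


def sine_tone(freq_hz: float, seconds: float) -> list[float]:
    """Placeholder for model output: a mono sine wave as floats in [-1, 1]."""
    n = int(SAMPLE_RATE * seconds)
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n)]


def write_wav_24k(path: str, samples: list[float]) -> float:
    """Write mono float samples as 16-bit PCM; return the clip duration in seconds."""
    # Clamp to [-1, 1], then scale to the signed 16-bit range.
    pcm = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)            # mono
        wav.setsampwidth(2)            # 2 bytes = 16 bits per sample
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
    return len(samples) / SAMPLE_RATE
```

Calling `write_wav_24k("tone.wav", sine_tone(440.0, 0.5))` writes 12,000 frames, since half a second at 24,000 samples per second is 12,000 samples. In a real pipeline the sample list would come from the synthesis step instead of `sine_tone`.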
Key Characteristics
- 82M parameters: runs on CPU, ~350MB footprint, trivial to self-host
- Sub-0.3s latency: single-pass inference with no iterative sampling steps
- Multiple English voices: several pre-built speaker personas with distinct character
- High MOS scores: competitive with 1B+ parameter models on naturalness evaluations
- Apache 2.0 license: permits commercial use and modification
- No voice cloning: fixed speaker set; not a zero-shot model
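The parameter count and the footprint figure can be checked against each other with simple arithmetic. The sketch below assumes the weights are stored as 32-bit floats (4 bytes each), which the source does not state. Under that assumption the raw weights come to about 328 MB, in line with the ~350MB figure once other packaged files are included. The half-precision and 8-bit rows are hypothetical comparisons, not published variants.

```python
PARAMS = 82_000_000  # parameter count stated above

# Bytes per parameter for common storage precisions (an assumption, not from the source).
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_size_mb(params: int, precision: str) -> float:
    """Approximate size of the raw weights in megabytes (1 MB = 10**6 bytes)."""
    return params * BYTES_PER_PARAM[precision] / 1_000_000


for precision in BYTES_PER_PARAM:
    print(f"{precision}: {weight_size_mb(PARAMS, precision):.0f} MB")
# fp32: 328 MB
# fp16: 164 MB
# int8: 82 MB
```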
Strengths and Limitations
Kokoro’s primary limitation is the absence of voice cloning. You cannot provide a reference audio clip and clone that speaker — you are limited to the voices baked into the model. For many production use cases (a fixed assistant voice, a branded persona, a narration style), this is not a constraint at all. For applications requiring personalised or arbitrary voice output, Orpheus TTS or OmniVoice are better fits.
Its strength is reliability at scale. Because inference runs on CPU in under 300 ms, Kokoro can handle high request volumes on standard compute without GPU queuing, autoscaling complexity, or the cost spikes associated with GPU-based TTS services.
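As a rough capacity-planning sketch, Little's law (concurrency = arrival rate × latency) turns the latency figure into a worker count. The 300 ms value is the upper bound quoted above. The assumptions that each CPU worker handles one request at a time and that latency stays flat under load are ours, so treat the results as optimistic bounds, not measurements.

```python
import math

LATENCY_MS = 300  # upper bound on per-request latency quoted above


def max_requests_per_second(workers: int, latency_ms: int = LATENCY_MS) -> float:
    """Throughput ceiling if each worker serves one request at a time."""
    return workers * 1000 / latency_ms


def workers_needed(target_rps: float, latency_ms: int = LATENCY_MS) -> int:
    """Little's law: concurrent requests in flight = arrival rate x latency."""
    return math.ceil(target_rps * latency_ms / 1000)


print(workers_needed(50))                    # 15
print(round(max_requests_per_second(4), 1))  # 13.3
```

Real deployments would need headroom above these numbers for traffic bursts and for variation in input length.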
Deployment Scenarios
Where Kokoro fits best:
- High-throughput document-to-audio pipelines (audiobooks, summaries, reports)
- Edge and on-device applications where GPU is unavailable
- Real-time assistants with strict latency budgets
- Privacy-sensitive environments requiring fully on-premise inference
- Cost-sensitive production APIs where GPU inference is prohibitive at volume
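For the document-to-audio scenario, a common pattern is to split long text into sentence-aligned chunks and synthesize each one separately. The sketch below shows only that chunking step. The `synthesize` callable is a hypothetical stand-in for whatever TTS call you use, the 400-character limit is an arbitrary illustrative value, and the regex-based sentence splitter is deliberately naive.

```python
import re
from typing import Callable, Iterator


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def chunk_text(text: str, max_chars: int = 400) -> Iterator[str]:
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is emitted as its own chunk, unsplit.
    """
    current = ""
    for sentence in split_sentences(text):
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            yield current
            current = sentence
        else:
            current = candidate
    if current:
        yield current


def document_to_audio(
    text: str,
    synthesize: Callable[[str], bytes],  # hypothetical TTS call: text in, audio bytes out
    max_chars: int = 400,
) -> list[bytes]:
    """Synthesize each chunk in order and return the audio segments."""
    return [synthesize(chunk) for chunk in chunk_text(text, max_chars)]
```

Chunking on sentence boundaries keeps each request short, and because the chunks are independent they could also be processed in parallel across CPU workers.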
Comparison with Leading Open-Source Alternatives
| Model | Params | Voice Cloning | Latency / Speed | Languages | License |
|---|---|---|---|---|---|
| Kokoro | 82M | No | <0.3s | English | Apache 2.0 |
| Orpheus TTS | 150M–3B | Yes | ~real-time | English | Apache 2.0 |
| OmniVoice | — | Yes | 40× RT | 646 | Apache 2.0 |
| F5-TTS | ~300M | Yes | ~real-time | Multilingual | MIT |