Kokoro — AI Glossary | Superteams.ai

Kokoro is an open-weight text-to-speech model developed by hexgrad and released in late 2024. At just 82 million parameters, it is an outlier in the TTS landscape, where most models competing on quality run at 1B parameters or more. Kokoro produces natural, clean speech that matches or outperforms much larger models in human evaluations, while fitting in under 350MB and running on CPU without a GPU.

It became the most-downloaded open-source TTS model on Hugging Face within months of release, largely because it solves a real deployment problem: teams need high-quality voice synthesis without the cost and infrastructure overhead of GPU inference.

How It Works

Kokoro is built on the StyleTTS 2 architecture with an ISTFTNet vocoder. According to its Hugging Face model card, the released model is decoder-only and has no diffusion stage. It therefore generates audio in a single forward pass, rather than sampling tokens autoregressively or running iterative denoising steps as many TTS models do, which is what makes it so fast. It was trained on a curated dataset of English speech emphasising naturalness and prosodic variety across multiple speaker styles.

The model outputs audio at 24kHz with natural pacing, appropriate sentence-level intonation, and consistent voice quality across long passages — all common failure points in smaller TTS models.
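As a rough illustration of working with that 24kHz output, the sketch below writes mono float samples to a 16-bit WAV file using only the Python standard library. The `synthesize_to_wav` function shows assumed usage of the third-party `kokoro` package (`KPipeline`, the `af_heart` voice and the `a` language code are taken from the project's published examples and may differ between versions). The helper names `write_wav` and `synthesize_to_wav` are hypothetical.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 24_000  # Kokoro's output rate, per the text above


def write_wav(samples, target, sample_rate=SAMPLE_RATE):
    """Write mono float samples in [-1.0, 1.0] as 16-bit PCM WAV.

    `target` may be a path string or a writable binary file object.
    """
    pcm = bytearray()
    for s in samples:
        clipped = max(-1.0, min(1.0, float(s)))  # guard against overshoot
        pcm += struct.pack("<h", int(clipped * 32767))  # little-endian int16
    with wave.open(target, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(pcm))


def synthesize_to_wav(text, path, voice="af_heart", lang_code="a"):
    """Assumed usage of the `kokoro` package; not called in this sketch."""
    # Imported lazily so write_wav works without the package installed.
    from kokoro import KPipeline  # third-party; API is an assumption

    pipeline = KPipeline(lang_code=lang_code)
    samples = []
    # The pipeline is assumed to yield (graphemes, phonemes, audio) per segment,
    # where audio is an array-like object exposing .tolist().
    for _graphemes, _phonemes, audio in pipeline(text, voice=voice):
        samples.extend(audio.tolist())
    write_wav(samples, path)


# Demo without the model: one second of a 440 Hz tone, written in memory.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]
buffer = io.BytesIO()
write_wav(tone, buffer)
```

With the package installed, a call such as `synthesize_to_wav("Hello from Kokoro.", "hello.wav")` would be the intended entry point. The demo at the bottom runs without any model and only exercises the WAV writer.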

Key Characteristics

82M parameters — runs on CPU, ~350MB footprint, trivial to self-host
Sub-0.3s latency — single forward pass; no iterative sampling steps
Multiple English voices — several pre-built speaker personas with distinct character
High MOS scores — competitive with 1B+ parameter models on naturalness evaluations
Apache 2.0 license — unrestricted commercial use and modification
No voice cloning — fixed speaker set; not a zero-shot model
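The footprint figure can be sanity-checked with simple arithmetic. The sketch below assumes the weights are stored as 32-bit floats (4 bytes per parameter), which is an assumption rather than something stated above, and it ignores voice data and runtime overhead. The function name `weights_size_mb` is hypothetical.

```python
def weights_size_mb(num_params: int, bytes_per_param: int) -> float:
    """Approximate raw weight size in megabytes (1 MB = 10**6 bytes)."""
    return num_params * bytes_per_param / 1_000_000


KOKORO_PARAMS = 82_000_000  # parameter count from the list above

fp32_mb = weights_size_mb(KOKORO_PARAMS, 4)  # assumption: 32-bit float weights
fp16_mb = weights_size_mb(KOKORO_PARAMS, 2)  # assumption: half-precision copy

print(f"fp32: {fp32_mb:.0f} MB, fp16: {fp16_mb:.0f} MB")
# prints "fp32: 328 MB, fp16: 164 MB"
```

Under that assumption the raw weights come to about 328MB, which is consistent with the ~350MB footprint quoted above.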

Strengths and Limitations

Kokoro’s primary limitation is the absence of voice cloning. You cannot provide a reference audio clip and clone that speaker — you are limited to the voices baked into the model. For many production use cases (a fixed assistant voice, a branded persona, a narration style), this is not a constraint at all. For applications requiring personalised or arbitrary voice output, Orpheus TTS or OmniVoice are better fits.

Its strength is reliability at scale. Because inference runs on CPU in under 300 ms, Kokoro can handle high request volumes on standard compute without GPU queuing, autoscaling complexity, or the cost spikes associated with GPU-based TTS services.
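A back-of-the-envelope capacity estimate follows from Little's law (average concurrency = arrival rate × time in system). The sketch below treats 300 ms as a worst-case latency per request and assumes each worker handles one request at a time. The target of 50 requests per second is purely hypothetical, as are the function names.

```python
import math


def workers_needed(requests_per_second: float, latency_ms: float) -> int:
    """Little's law: concurrent requests = arrival rate x latency."""
    return math.ceil(requests_per_second * latency_ms / 1000)


def max_rps(workers: int, latency_ms: float) -> float:
    """Upper bound on throughput if each worker serves one request at a time."""
    return workers * 1000 / latency_ms


LATENCY_MS = 300  # upper bound quoted in the text
TARGET_RPS = 50   # hypothetical load, for illustration only

needed = workers_needed(TARGET_RPS, LATENCY_MS)
print(f"{needed} workers for {TARGET_RPS} req/s at {LATENCY_MS} ms each")
# prints "15 workers for 50 req/s at 300 ms each"
```

Real deployments would add headroom for bursts and for longer inputs, since synthesis time grows with text length.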

Deployment Scenarios

Where Kokoro fits best:

High-throughput document-to-audio pipelines (audiobooks, summaries, reports)
Edge and on-device applications where GPU is unavailable
Real-time assistants with strict latency budgets
Privacy-sensitive environments requiring fully on-premise inference
Cost-sensitive production APIs where GPU inference is prohibitive at volume
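For the document-to-audio case, long text is typically split into sentence-aligned chunks before synthesis so that each request stays short. The sketch below is one minimal way to do that with the standard library. The names `split_into_chunks` and `narrate`, the `synthesize` callable, and the 300-character limit are all hypothetical, and the TTS library in use may already perform its own segmentation.

```python
import re


def split_into_chunks(text: str, max_chars: int = 300) -> list[str]:
    """Group whole sentences into chunks of at most `max_chars` characters.

    A single sentence longer than `max_chars` is kept intact as its own chunk.
    """
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars or not current:
            current = candidate      # still fits, or chunk was empty
        else:
            chunks.append(current)   # close the full chunk
            current = sentence       # start a new one
    if current:
        chunks.append(current)
    return chunks


def narrate(text, synthesize, max_chars=300):
    """Run a caller-supplied `synthesize(chunk)` over each chunk in order."""
    return [synthesize(chunk) for chunk in split_into_chunks(text, max_chars)]


demo = split_into_chunks("One. Two! Three? Four.", max_chars=10)
print(demo)
# prints "['One. Two!', 'Three?', 'Four.']"
```

Because `synthesize` is passed in as a plain callable, the same loop can drive a local model, a thread pool, or a stub in tests.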

Comparison with Leading Open-Source Alternatives

Model	Params	Voice Cloning	Latency / Speed	Languages	License
Kokoro	82M	No	<0.3s	English	Apache 2.0
Orpheus TTS	150M–3B	Yes	~real-time	English	Apache 2.0
OmniVoice	n/a	Yes	40× real-time	646	Apache 2.0
F5-TTS	~300M	Yes	~real-time	Multilingual	MIT
