
Kokoro

An open-weight TTS model with just 82M parameters that matches the output quality of models ten times its size and runs in under 0.3 seconds on CPU, making it one of the most efficient high-quality open-source voice models available.

Kokoro is an open-weight text-to-speech model developed by hexgrad and released in late 2024. At just 82 million parameters, it is an outlier in the TTS landscape: most models competing on quality have 1B parameters or more. Kokoro produces natural, clean speech that matches or outperforms much larger models in human evaluations, while fitting in less than 350MB and running on CPU without a GPU.

It became the most-downloaded open-source TTS model on Hugging Face within months of release, largely because it solves a real deployment problem: giving teams high-quality voice synthesis without the cost and infrastructure overhead of GPU inference.

How It Works

Kokoro is built on the StyleTTS 2 architecture with an ISTFTNet vocoder, in a decoder-only configuration with no diffusion stage, rather than the autoregressive or iterative diffusion approaches used by many TTS models. Speech is generated in a single forward pass instead of token by token or over repeated sampling steps, which is what makes it so fast. It was trained on a curated dataset of English speech emphasising naturalness and prosodic variety across multiple speaker styles.

The model outputs audio at 24kHz with natural pacing, appropriate sentence-level intonation, and consistent voice quality across long passages — all common failure points in smaller TTS models.
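To make the 24kHz figure concrete, here is a minimal sketch that stores a buffer of mono samples as a WAV file using only Python's standard library. The sine tone is a stand-in for model output, and 16-bit mono PCM is an assumption about how you might store the audio, not something the model requires. The helper names `tone` and `write_wav` are illustrative.

```python
import math
import struct
import wave

SAMPLE_RATE = 24_000  # output rate quoted in the text above


def tone(duration_s: float, freq_hz: float = 220.0) -> list[float]:
    """Stand-in for model output: a sine tone as floats in [-1, 1]."""
    n = int(duration_s * SAMPLE_RATE)
    return [0.3 * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n)]


def write_wav(path: str, samples: list[float]) -> float:
    """Write float samples as 16-bit mono PCM; return the duration in seconds."""
    # Clip to [-1, 1] before scaling so out-of-range values cannot overflow int16.
    pcm = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)            # mono
        wav.setsampwidth(2)            # 2 bytes per sample = 16-bit
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
    return len(samples) / SAMPLE_RATE
```

Calling `write_wav("demo.wav", tone(1.0))` writes 24,000 frames, which is one second of audio at this sample rate.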

Key Characteristics

  • 82M parameters: runs on CPU, ~350MB footprint, simple to self-host
  • Sub-0.3s latency: single-pass inference with no iterative sampling steps
  • Multiple English voices: several pre-built speaker personas with distinct character
  • High MOS scores: competitive with 1B+ parameter models on naturalness evaluations
  • Apache 2.0 license: permits commercial use and modification
  • No voice cloning: fixed speaker set; not a zero-shot model
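The footprint figure can be sanity-checked with simple arithmetic. The sketch below assumes the weights are stored as 32-bit floats (an assumption, since the text does not state the precision) and ignores any non-weight files, so treat the result as a rough estimate only.

```python
PARAMS = 82_000_000  # parameter count quoted in the text

# Bytes per parameter for common storage precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_size_mb(params: int, dtype: str) -> float:
    """Approximate weight size in megabytes (10**6 bytes) for a given precision."""
    return params * BYTES_PER_PARAM[dtype] / 1e6


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{weight_size_mb(PARAMS, dtype):.0f} MB")
```

Under the fp32 assumption this gives about 328 MB, which is consistent with the "less than 350MB" figure.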

Strengths and Limitations

Kokoro’s primary limitation is the absence of voice cloning. You cannot provide a reference audio clip and clone that speaker — you are limited to the voices baked into the model. For many production use cases (a fixed assistant voice, a branded persona, a narration style), this is not a constraint at all. For applications requiring personalised or arbitrary voice output, Orpheus TTS or OmniVoice are better fits.

Its strength is reliability at scale. Because inference runs on CPU in under 300ms, Kokoro can handle high request volumes on standard compute without GPU queuing, autoscaling complexity, or the cost spikes associated with GPU-based TTS services.
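Latency claims like these are worth measuring on your own hardware. The following sketch times a single synthesis call and derives a rough throughput ceiling. `synthesize` is a hypothetical callable standing in for whatever TTS entry point you use, and `fake_synthesize` is a dummy that only simulates work; neither is part of Kokoro's API.

```python
import time
from typing import Callable


def time_call(synthesize: Callable[[str], float], text: str) -> tuple[float, float]:
    """Return (latency_s, speed_vs_real_time) for one synthesis call.

    `synthesize` is a hypothetical callable that returns the duration, in
    seconds, of the audio it produced.
    """
    start = time.perf_counter()
    audio_s = synthesize(text)
    latency_s = time.perf_counter() - start
    return latency_s, audio_s / latency_s


def max_requests_per_second(latency_s: float, workers: int) -> float:
    """Upper bound on throughput if each worker handles one request at a time."""
    return workers / latency_s


def fake_synthesize(text: str) -> float:
    """Stand-in model: sleeps briefly and pretends 15 characters = 1 s of speech."""
    time.sleep(0.01)
    return len(text) / 15


latency, speed = time_call(fake_synthesize, "Hello from a CPU-only voice pipeline.")
print(f"latency={latency:.3f}s, speed={speed:.1f}x real time")
print(f"8 workers at 0.3 s each: {max_requests_per_second(0.3, 8):.1f} req/s")
```

The throughput figure is an idealised ceiling: it assumes one request per worker at a time and no contention for CPU cores, so real numbers will be lower.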

Deployment Scenarios

Where Kokoro fits best:

  • High-throughput document-to-audio pipelines (audiobooks, summaries, reports)
  • Edge and on-device applications where GPU is unavailable
  • Real-time assistants with strict latency budgets
  • Privacy-sensitive environments requiring fully on-premise inference
  • Cost-sensitive production APIs where GPU inference is prohibitive at volume
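For the document-to-audio scenario, long text is usually split into sentence-sized chunks before synthesis so that each request stays small. Below is a minimal sketch of that pre-processing step. The `chunk_text` helper and its `max_chars` default are illustrative assumptions, not a documented Kokoro limit, and the regular expression is a rough sentence splitter rather than a full tokenizer.

```python
import re


def chunk_text(text: str, max_chars: int = 300) -> list[str]:
    """Group whole sentences into chunks of at most `max_chars` characters.

    A single sentence longer than `max_chars` is kept intact as its own chunk.
    """
    # Split after ., ! or ? followed by whitespace. This heuristic will
    # mis-split abbreviations such as "e.g. this".
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the limit: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


report = "Revenue rose. Costs fell! Was it luck? No."
for i, chunk in enumerate(chunk_text(report, max_chars=25)):
    print(i, chunk)
```

Each chunk can then be passed to the synthesis call in order and the resulting audio segments concatenated, which also makes it easy to spread a long document across several CPU workers.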

Comparison with Leading Open-Source Alternatives

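| Model | Params | Voice Cloning | Latency | Languages | License |
|---|---|---|---|---|---|
| Kokoro | 82M | No | <0.3s | English | Apache 2.0 |
| Orpheus TTS | 150M–3B | Yes | ~real-time | English | Apache 2.0 |
| OmniVoice | (not listed) | Yes | 40× RT | 646 | Apache 2.0 |
| F5-TTS | ~300M | Yes | ~real-time | Multilingual | MIT |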
