
NVIDIA Parakeet TDT

NVIDIA's Parakeet TDT 1.1B is the fastest open-source ASR model available, achieving an RTFx near 2,000× real-time — processing audio 6.5× faster than Canary-Qwen at the cost of some accuracy.

NVIDIA Parakeet TDT is an automatic speech recognition model optimised for maximum throughput and minimum latency. The 1.1B-parameter variant achieves an inverse real-time factor (RTFx) approaching 2,000×, meaning it transcribes audio roughly 2,000 times faster than the audio takes to play back. That makes it the fastest open-source ASR model on record as of 2026.
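RTFx is simply the ratio of audio duration to processing time. A minimal sketch of that ratio follows; the function name `rtfx` and the timing figures are illustrative assumptions, not measurements from any benchmark.

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Seconds of audio transcribed per second of compute (higher is faster)."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


# Hypothetical run: 4,000 seconds of audio transcribed in 2 seconds of compute.
example = rtfx(4000.0, 2.0)
print(f"RTFx = {example:.0f}x")
```

To measure this on a real system, time the transcription call with `time.perf_counter()` and divide the total audio duration by the elapsed time.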

TDT stands for Token-and-Duration Transducer, a streaming-friendly architecture that predicts tokens and their durations jointly in a single pass. This design makes Parakeet particularly well-suited for real-time transcription pipelines where latency is the binding constraint, trading some accuracy for significantly higher throughput than encoder-decoder models like Canary-Qwen.

How It Works

The Transducer (also known as RNN-T or RNNT) architecture is designed for online, streaming recognition — unlike attention-based encoder-decoder models that must process a complete audio segment before producing output. Parakeet TDT extends this with the Token-and-Duration prediction mechanism, which simultaneously predicts the next token and how many audio frames it spans.

This joint prediction lets the decoder jump ahead by the predicted duration instead of evaluating every encoder frame, so it takes far fewer decoding steps than a conventional Transducer while still emitting tokens as audio arrives and keeping first-token latency low. Combined with NVIDIA’s FastConformer encoder (an efficiency-optimised variant of the Conformer architecture), the result is a model that sustains very high throughput on GPUs.
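The control flow of token-and-duration decoding can be sketched with a toy greedy loop. This is not NeMo's implementation: `tdt_greedy_decode`, `table_joint`, and the frame table are hypothetical stand-ins, and the "model" here is just a lookup that returns a pre-written (token, duration) pair for each frame.

```python
from typing import Callable, List, Sequence, Tuple

BLANK = "<blank>"
Prediction = Tuple[str, int]  # (token, duration in frames)


def tdt_greedy_decode(
    frames: Sequence[Prediction],
    joint: Callable[[Prediction, str], Prediction],
    max_symbols_per_frame: int = 10,
) -> Tuple[List[str], int]:
    """Greedy decode that advances by the predicted duration at each step."""
    tokens: List[str] = []
    joint_calls = 0
    t = 0
    prev = BLANK  # predictor state, reduced here to the previous token
    emitted_here = 0
    while t < len(frames):
        token, duration = joint(frames[t], prev)
        joint_calls += 1
        if token != BLANK:
            tokens.append(token)
            prev = token
            emitted_here += 1
        # A non-blank token with duration 0 means "emit again on this frame".
        stay = token != BLANK and duration == 0 and emitted_here < max_symbols_per_frame
        if not stay:
            # Blank (or a capped frame) always advances by at least one frame.
            t += max(duration, 1)
            emitted_here = 0
    return tokens, joint_calls


def table_joint(frame: Prediction, prev_token: str) -> Prediction:
    """Toy joint network: the prediction is stored directly in the frame."""
    return frame


# Six frames; frames 1, 3 and 4 are skipped because of the predicted durations.
toy_frames = [("he", 2), ("x", 1), ("llo", 3), ("x", 1), ("x", 1), (BLANK, 1)]
tokens, calls = tdt_greedy_decode(toy_frames, table_joint)
print(tokens, calls)
```

In this toy run the decoder evaluates three of the six frames. A conventional Transducer would evaluate the joint network at least once per frame, which is where the duration head saves work.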

Key Characteristics

  • RTFx ~2,000× — processes approximately 2,000 seconds of audio per second of compute on modern GPUs
  • 6.5× faster than Canary-Qwen 2.5B — the clear speed leader among accurate open-source ASR models
  • Token-and-Duration Transducer (TDT) — streaming-compatible; emits tokens as audio arrives with minimal latency
  • 1.1B parameters — large enough for strong English accuracy; optimised for batched GPU inference
  • NVIDIA NeMo framework — integrates directly with NVIDIA’s production ASR toolchain and Triton inference server
  • English-only — trained and optimised for English; multilingual workloads are better served by Qwen3-ASR
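For the NeMo integration mentioned above, a minimal loading sketch might look like the following. The checkpoint name `nvidia/parakeet-tdt-1.1b` is an assumption to verify against the model hub, and the return type of `transcribe` has changed across NeMo releases, so the hypothetical helper `hypothesis_text` accepts either plain strings or objects with a `.text` attribute.

```python
from typing import Any, List

# Assumed checkpoint identifier; confirm it against the published model card.
MODEL_NAME = "nvidia/parakeet-tdt-1.1b"


def hypothesis_text(result: Any) -> str:
    """Return the transcript from a plain string or a Hypothesis-like object."""
    return result if isinstance(result, str) else result.text


def transcribe_files(paths: List[str]) -> List[str]:
    """Load the model through NeMo and transcribe a list of audio files."""
    # Imported lazily so the helper above works without NeMo installed.
    import nemo.collections.asr as nemo_asr

    model = nemo_asr.models.ASRModel.from_pretrained(model_name=MODEL_NAME)
    results = model.transcribe(paths)
    # Some NeMo versions return a (best_hypotheses, all_hypotheses) tuple.
    if isinstance(results, tuple):
        results = results[0]
    return [hypothesis_text(r) for r in results]
```

Calling `transcribe_files(["meeting.wav"])` (a hypothetical file name) downloads the checkpoint on first use, so it needs network access and, for realistic speed, a CUDA-capable GPU.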

Strengths and Limitations

Parakeet TDT’s defining advantage is speed. For high-volume pipelines — processing thousands of hours of audio, serving a real-time transcription API at scale, or running live captioning — its throughput advantage over other open models is decisive. At ~2,000× RTFx, a single A100 GPU can process roughly 2,000 hours of audio per hour of wall-clock time.
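The same arithmetic can be turned into a rough capacity estimate. The sketch below is a back-of-the-envelope calculation only; the `utilisation` parameter is an added assumption to account for the fact that real pipelines rarely sustain peak benchmark throughput, and the backlog size is made up.

```python
def gpu_hours_needed(audio_hours: float, rtfx: float, utilisation: float = 1.0) -> float:
    """Estimate GPU-hours to transcribe a backlog at a given RTFx.

    utilisation is the assumed fraction of peak throughput actually achieved
    once I/O, batching gaps, and preprocessing are included (0 < utilisation <= 1).
    """
    if rtfx <= 0:
        raise ValueError("rtfx must be positive")
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in (0, 1]")
    return audio_hours / (rtfx * utilisation)


# Hypothetical 10,000-hour archive at the quoted ~2,000x RTFx.
ideal = gpu_hours_needed(10_000, 2_000)
realistic = gpu_hours_needed(10_000, 2_000, utilisation=0.5)
print(f"ideal: {ideal:.1f} GPU-hours, at 50% utilisation: {realistic:.1f} GPU-hours")
```

Multiplying the result by an hourly GPU price gives a first-order cost figure for comparing models with different RTFx values.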

The accuracy trade-off is real but context-dependent. On the Hugging Face Open ASR Leaderboard, Parakeet TDT ranks approximately 23rd — well below Canary-Qwen’s #1 position. For many production applications, however, this accuracy gap is acceptable: internal tooling, meeting summaries, media indexing, and keyword spotting rarely require sub-6% WER to be useful.
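To judge whether the accuracy gap matters for a given workload, it helps to measure word error rate (WER) on a sample of your own audio. WER is the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below is a plain implementation with whitespace tokenisation and no text normalisation, which is a simplifying assumption; leaderboard scores typically normalise casing and punctuation first.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                          # deletion
                curr[j - 1] + 1,                      # insertion
                prev[j - 1] + (ref_word != hyp_word), # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of five reference words.
score = wer("the quick brown fox jumps", "the quick brown box jumps")
print(f"WER = {score:.0%}")
```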

Parakeet is English-only. Teams needing multilingual transcription should look to Qwen3-ASR instead.

Deployment Scenarios

Where Parakeet TDT fits best:

  • Live captioning and real-time voice agents — low first-token latency and streaming output
  • High-volume call centre analytics — processing thousands of recorded calls per day at minimal GPU cost
  • Media and podcast indexing — rapid transcription of large audio archives for search and metadata extraction
  • On-device or edge-constrained inference — where model size and throughput matter more than peak accuracy
  • Cost-optimised production APIs — maximising transcription throughput per GPU-hour

Comparison with Leading Open-Source ASR Models

| Model | WER (avg) / leaderboard rank | Speed (RTFx) | Multilingual | Best Use Case |
| --- | --- | --- | --- | --- |
| Parakeet TDT 1.1B | ~23rd | ~2,000× | English-only | Real-time, high-throughput |
| Canary-Qwen 2.5B | 5.63% (#1) | Moderate | Limited | Maximum accuracy, batch |
| Qwen3-ASR | Best-in-class | Moderate | Yes | Accuracy + multilingual |
