Canary-Qwen 2.5B is an automatic speech recognition model developed by NVIDIA, released in mid-2025. It is built on a Speech-Augmented Language Model (SALM) architecture — fusing a speech encoder with a large language model decoder — and holds the top position on the Hugging Face Open ASR Leaderboard with an average word error rate (WER) of 5.63% across standard English benchmarks.
The “Qwen” in its name reflects that the language model component is derived from Alibaba’s Qwen series. By pairing a high-quality audio encoder with a capable LLM decoder, Canary-Qwen benefits from the LLM’s strong language priors to produce more contextually coherent transcriptions — particularly on domain-specific vocabulary, proper nouns, and rare words that purely acoustic models get wrong.
How It Works
The SALM architecture consists of two main components. A speech encoder (based on FastConformer) processes raw audio into dense acoustic representations. These representations are then projected into the token embedding space of a Qwen language model, which autoregressively decodes the transcription text.
This design is conceptually similar to how vision-language models (VLMs) fuse image encoders with LLM decoders — except the modality is audio instead of images. The LLM decoder’s language priors act as an implicit language model, improving transcription coherence without requiring a separate rescoring step.
Key Characteristics
- #1 on Open ASR Leaderboard — 5.63% average WER across the English benchmark suite (as of 2025–2026)
- SALM architecture — FastConformer encoder + Qwen LLM decoder for context-aware transcription
- 2.5B parameters — substantial model size enabling high accuracy; GPU inference required
- Strong on broadcast and read speech — trained on diverse English audio including broadcast news, audiobooks, and conversational data
- NVIDIA NeMo integration — first-class support in NVIDIA’s production ASR and TTS toolchain
- English-primary — benchmark results are strongest on English; multilingual coverage is more limited than Qwen3-ASR
Strengths and Limitations
Canary-Qwen 2.5B is the most accurate open-source English ASR model available by standard benchmark metrics. Its leaderboard position reflects genuine capability gains over previous NVIDIA Canary models, driven primarily by the quality of the Qwen LLM decoder.
The main caveat is that its published benchmarks are weighted toward read speech and broadcast audio. Teams working with noisy, multi-speaker, or highly spontaneous conversational audio should validate on representative samples — leaderboard WER does not always predict performance in production acoustic conditions.
Speed is also a consideration. At 2.5B parameters, Canary-Qwen is not optimised for throughput. For latency-sensitive applications, NVIDIA Parakeet TDT (which is 6.5× faster) is a more appropriate choice.
Deployment Scenarios
Where Canary-Qwen 2.5B fits best:
- Applications where accuracy is the primary constraint and latency is flexible
- Offline batch transcription of broadcast media, podcasts, or recorded meetings
- Benchmarking and evaluation pipelines requiring a best-in-class accuracy baseline
- Enterprises on NVIDIA infrastructure already using the NeMo toolchain
- Legal and academic transcription requiring the lowest achievable WER on English audio
Comparison with Leading Open-Source ASR Models
| Model | WER (avg) | Speed (RTFx) | Multilingual | Architecture |
|---|---|---|---|---|
| Canary-Qwen 2.5B | 5.63% (#1) | Moderate | Limited | SALM (FastConformer + Qwen LLM) |
| Qwen3-ASR | Best-in-class (2026) | Moderate | Yes | Encoder-decoder LM |
| Parakeet TDT 1.1B | ~23rd on leaderboard | ~2,000× | English-only | Token-Duration Transducer |
References
- Hugging Face Open ASR Leaderboard
- NVIDIA NeMo ASR documentation
- Best open-source ASR models 2026 – Northflank
Ready to build?
Leverage AI technologies to build your product stack
Superteams can help you build, deploy and launch AI application stacks using open source technologies — from architecture through to production.
Talk to Superteams