Canary-Qwen 2.5B — AI Glossary

Canary-Qwen 2.5B is an automatic speech recognition (ASR) model developed by NVIDIA and released in mid-2025. It is built on a Speech-Augmented Language Model (SALM) architecture, which fuses a speech encoder with a large language model (LLM) decoder. As of 2025–2026 it holds the top position on the Hugging Face Open ASR Leaderboard, with an average word error rate (WER) of 5.63% across standard English benchmarks.

The “Qwen” in its name reflects that the language model component is derived from Alibaba’s Qwen series. By pairing a high-quality audio encoder with a capable LLM decoder, Canary-Qwen draws on the LLM’s strong language priors to produce more contextually coherent transcriptions, particularly for domain-specific vocabulary, proper nouns, and rare words that purely acoustic models often get wrong.

How It Works

The SALM architecture consists of two main components. A speech encoder (based on FastConformer) processes raw audio into dense acoustic representations. These representations are then projected into the token embedding space of a Qwen language model, which autoregressively decodes the transcription text.

This design is conceptually similar to how vision-language models (VLMs) fuse image encoders with LLM decoders — except the modality is audio instead of images. The LLM decoder’s language priors act as an implicit language model, improving transcription coherence without requiring a separate rescoring step.
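The data flow described above can be sketched with toy numbers. This is only an illustration of the projection-and-splice idea, not NVIDIA's implementation: the dimensions, the `<audio>` placeholder token, and all function names are hypothetical, random values stand in for the FastConformer output and the Qwen embeddings, and the autoregressive decoder itself is omitted.

```python
import random

ENC_DIM = 4  # toy width of an encoder frame (hypothetical)
LLM_DIM = 6  # toy width of an LLM token embedding (hypothetical)


def make_matrix(rows, cols, rng):
    """Random matrix used as a stand-in for learned weights or features."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def project(frames, weight):
    """Linear projection: map each ENC_DIM frame to an LLM_DIM vector."""
    return [
        [sum(w * x for w, x in zip(row, frame)) for row in weight]
        for frame in frames
    ]


def splice_audio(prompt_tokens, token_embeddings, audio_embeddings,
                 placeholder="<audio>"):
    """Build the decoder input: text embeddings with the audio frames
    inserted where the placeholder token sits (an assumed convention)."""
    sequence = []
    for token in prompt_tokens:
        if token == placeholder:
            sequence.extend(audio_embeddings)
        else:
            sequence.append(token_embeddings[token])
    return sequence


rng = random.Random(0)
frames = make_matrix(5, ENC_DIM, rng)        # stand-in for 5 encoder frames
weight = make_matrix(LLM_DIM, ENC_DIM, rng)  # stand-in for the projection layer
prompt = ["Transcribe", ":", "<audio>"]
token_embeddings = {
    token: [rng.uniform(-1.0, 1.0) for _ in range(LLM_DIM)]
    for token in prompt
    if token != "<audio>"
}

audio_embeddings = project(frames, weight)
decoder_input = splice_audio(prompt, token_embeddings, audio_embeddings)

# 2 text tokens + 5 audio frames, each now LLM_DIM wide
print(len(decoder_input), len(decoder_input[0]))  # 7 6
```

After projection the audio frames have the same width as text token embeddings, so the decoder can attend over both in a single sequence.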

Key Characteristics

#1 on Open ASR Leaderboard — 5.63% average WER across the English benchmark suite (as of 2025–2026)
SALM architecture — FastConformer encoder + Qwen LLM decoder for context-aware transcription
2.5B parameters — substantial model size enabling high accuracy; GPU inference required
Strong on broadcast and read speech — trained on diverse English audio including broadcast news, audiobooks, and conversational data
NVIDIA NeMo integration — first-class support in NVIDIA’s production ASR and TTS toolchain
English-primary — benchmark results are strongest on English; multilingual coverage is more limited than Qwen3-ASR

Strengths and Limitations

By standard benchmark metrics, Canary-Qwen 2.5B ranks as the most accurate open-source English ASR model on the Open ASR Leaderboard (as of 2025–2026). Its leaderboard position reflects genuine capability gains over previous NVIDIA Canary models, driven primarily by the quality of the Qwen LLM decoder.

The main caveat is that its published benchmarks are weighted toward read speech and broadcast audio. Teams working with noisy, multi-speaker, or highly spontaneous conversational audio should validate on representative samples — leaderboard WER does not always predict performance in production acoustic conditions.
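A minimal sketch of such a validation check follows. It computes WER as word-level edit distance divided by the number of reference words. The function names and sample sentences are illustrative, and the lowercase-and-split normalisation is a simplifying assumption; leaderboard scoring applies its own text normalisation, so numbers from this sketch are not directly comparable with published figures.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance using a single rolling row."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def corpus_wer(pairs):
    """WER over (reference, hypothesis) pairs: total errors / total reference words."""
    errors = 0
    words = 0
    for reference, hypothesis in pairs:
        ref_words = reference.lower().split()
        hyp_words = hypothesis.lower().split()
        errors += edit_distance(ref_words, hyp_words)
        words += len(ref_words)
    if words == 0:
        raise ValueError("references must contain at least one word")
    return errors / words


# Hypothetical samples: human reference transcript vs. model output
samples = [
    ("the quick brown fox jumps", "the quick brown box jumped"),
    ("hello world", "hello world"),
]
print(round(corpus_wer(samples), 3))  # 0.286 (2 errors over 7 reference words)
```

Running this on a few hundred utterances drawn from the target acoustic conditions gives a more relevant estimate than the leaderboard average.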

Speed is also a consideration. At 2.5B parameters, Canary-Qwen is not optimised for throughput. For latency-sensitive applications, NVIDIA Parakeet TDT (which is 6.5× faster) is a more appropriate choice.
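Throughput in ASR is commonly reported as RTFx, the inverse real-time factor: seconds of audio transcribed per second of compute. The sketch below shows how the figure could be measured for any model. The `transcribe` callable and the example numbers are hypothetical and are not measurements of Canary-Qwen or Parakeet.

```python
import time


def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: audio duration divided by processing time."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def measure_rtfx(transcribe, audio_path: str, audio_seconds: float) -> float:
    """Time one call to a transcription function and return its RTFx.

    `transcribe` is any callable that accepts a file path (hypothetical here).
    """
    start = time.perf_counter()
    transcribe(audio_path)
    elapsed = time.perf_counter() - start
    return rtfx(audio_seconds, elapsed)


# Illustrative numbers: one hour of audio processed in nine seconds
print(rtfx(3600.0, 9.0))  # 400.0
```

For a fair comparison between models, the same hardware, batch size, and audio set should be used for every measurement.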

Deployment Scenarios

Where Canary-Qwen 2.5B fits best:

Applications where accuracy is the primary constraint and latency is flexible
Offline batch transcription of broadcast media, podcasts, or recorded meetings
Benchmarking and evaluation pipelines requiring a best-in-class accuracy baseline
Enterprises on NVIDIA infrastructure already using the NeMo toolchain
Legal and academic transcription requiring the lowest achievable WER on English audio

Comparison with Leading Open-Source ASR Models

Model	Accuracy (avg WER or leaderboard rank)	Speed (RTFx)	Multilingual	Architecture
Canary-Qwen 2.5B	5.63% (#1)	Moderate	Limited	SALM (FastConformer + Qwen LLM)
Qwen3-ASR	Best-in-class (2026)	Moderate	Yes	Encoder-decoder LM
Parakeet TDT 1.1B	~23rd on leaderboard	~2,000×	English-only	Token-Duration Transducer
