Nemotron 3.5 ASR vs. Qwen3-ASR: two very different answers to on-device speech

A practical comparison of NVIDIA Nemotron 3.5 ASR and Alibaba Qwen3-ASR/FunASR across model size, mobile package cost, language coverage, streaming, accuracy, and latency.

Publisher: WayDigital

Published: 2026-06-11 00:07 UTC

Language: en

Region: global

Category: Essays

Put the two model cards next to each other and the comparison looks simple: NVIDIA has a new 0.6B streaming ASR model; Alibaba has Qwen3-ASR, also starting at 0.6B, with a larger 1.7B option. Both say multilingual. Both say streaming. Both look like candidates for voice agents, meeting transcription, and mobile speech features.

The product answer is less tidy. Nemotron 3.5 ASR is built like a real-time engine. Qwen3-ASR is built like a stronger transcription brain, especially for Chinese, dialects, and messy audio. FunASR then gives Alibaba’s side a practical deployment toolbox, including older but lighter Paraformer routes that may matter more for mobile apps than the newest model headline.

The file sizes tell the first story

NVIDIA Nemotron 3.5 ASR, upstream checkpoint: 600M parameters. The NeMo `.nemo` file is about 2.21 GiB. The model card lists 40 language-locales, with 32 usable for out-of-the-box transcription and 8 marked adaptation-ready.
Nemotron mobile and lightweight ports: The ONNX INT4 package is about 756 MiB. The Android LiteRT INT8 package is about 687 MiB. A CoreML INT8 package for Apple Silicon is about 668 MiB, with another compiled CoreML INT8 version around 612 MiB. These are not final App Store or Play Store deltas, but they are close to the unavoidable model payload before adding runtime libraries, audio code, decoders, and app logic.
Qwen3-ASR upstream checkpoints: Qwen3-ASR-0.6B is about 1.79 GiB. Qwen3-ASR-1.7B is about 4.38 GiB. Both support 30 languages and 22 Chinese dialects or accents, and both advertise unified offline and streaming inference.
Qwen3-ASR mobile ports: The public LiteRT INT8 file for Qwen3-ASR-0.6B is about 757 MiB. The same repository also carries f32 builds for MediaTek and Qualcomm targets, usually around 1.8–2.1 GiB each, plus a general f32 file near 2.9 GiB. A community CoreML repository includes several encoder forms; a realistic subset lands roughly in the 0.9–1.1 GiB range, while the full repository is larger because it stores multiple alternatives.
The older FunASR mobile reality: Paraformer-zh is a 220M-parameter model. Its PyTorch repository is about 848 MiB, but the official ONNX C++ benchmark says the model is 880MB before quantization and 237MB after INT8 quantization. SenseVoiceSmall is 234M parameters, with a Hugging Face repository around 900MB. These are not Qwen3-ASR, but they are often the more realistic FunASR choices when an app actually needs to ship.
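One rough way to sanity-check these sizes is to divide each file size by the nominal parameter count. The sketch below does only that arithmetic with the figures quoted above. It assumes the headline parameter counts (600M, 0.6B, 1.7B) are close enough to the real counts, which may not hold, and it says nothing about how any given file is actually laid out.

```python
GIB = 1024 ** 3  # bytes in one GiB
MIB = 1024 ** 2  # bytes in one MiB


def bytes_per_param(size_bytes: float, nominal_params: float) -> float:
    """Average stored bytes per nominal parameter (a rough sanity check only)."""
    return size_bytes / nominal_params


# (file size in bytes, nominal parameter count), using the figures quoted above.
# The parameter counts are the rounded headline labels, not exact counts.
packages = {
    "Nemotron 3.5 ASR .nemo": (2.21 * GIB, 600e6),
    "Nemotron ONNX INT4": (756 * MIB, 600e6),
    "Nemotron LiteRT INT8": (687 * MIB, 600e6),
    "Qwen3-ASR-0.6B checkpoint": (1.79 * GIB, 0.6e9),
    "Qwen3-ASR-1.7B checkpoint": (4.38 * GIB, 1.7e9),
    "Qwen3-ASR-0.6B LiteRT INT8": (757 * MIB, 0.6e9),
}

for name, (size, params) in packages.items():
    print(f"{name:28s} {bytes_per_param(size, params):.2f} bytes/param")

# Paraformer: the benchmark quotes 880MB before and 237MB after INT8 quantization.
# The ratio is unit-free, so the MB versus MiB question does not matter here.
paraformer_ratio = 880 / 237
print(f"Paraformer INT8 shrink factor: {paraformer_ratio:.2f}x")
```

A value near 4 bytes per parameter is what 32-bit storage would give, and the Nemotron `.nemo` file lands close to that. The quantized packages come out above 1 byte per nominal parameter, which is why "INT8" or "INT4" in a file name should not be read as a guaranteed 4x or 8x reduction of the download.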

For mobile package size, Nemotron is the cleaner 0.6B story

A 650–750MB speech pack is still heavy. It does not belong silently in the first install of a casual consumer app. But as an optional offline pack, an enterprise app component, an in-car module, a meeting device, a Jetson deployment, or a managed field app, it is at least a concrete number.

Nemotron’s public Android and CoreML ports sit in that range. Qwen3-ASR-0.6B can also reach it with INT8 LiteRT, but its f32 mobile builds are much larger, and the model behaves more like a generative ASR system. That brings extra questions around tokenizer handling, decoding, cache, delegated execution, memory, and first-token behavior.

If the product only needs Mandarin-heavy speech input or meeting transcription, Paraformer INT8 ONNX remains hard to ignore. A 237MB model is not tiny, but it is much easier to defend than a 700MB or 1GB speech pack.

Language coverage: global locales versus Chinese dialect depth

Nemotron 3.5 ASR covers 40 language-locales. Its transcription-ready set includes English, Spanish, French, Italian, Portuguese, Dutch, German, Turkish, Russian, Arabic, Hindi, Japanese, Korean, Vietnamese, and Ukrainian, with additional broad-coverage and adaptation-ready tiers. It also supports language-ID prompting and `target_lang=auto` language detection.

Qwen3-ASR covers 30 languages plus 22 Chinese dialects and accents. The language list includes Chinese, English, Cantonese, Arabic, German, French, Spanish, Portuguese, Indonesian, Italian, Korean, Russian, Thai, Vietnamese, Japanese, Turkish, Hindi, Malay, Dutch, Swedish, Danish, Finnish, Polish, Czech, Filipino, Persian, Greek, Hungarian, Macedonian, and Romanian. The dialect list is where Alibaba’s model feels especially product-shaped for China: Anhui, Dongbei, Fujian, Gansu, Guizhou, Hebei, Henan, Hubei, Hunan, Jiangxi, Ningxia, Shandong, Shaanxi, Shanxi, Sichuan, Tianjin, Yunnan, Zhejiang, Hong Kong Cantonese, Guangdong Cantonese, Wu, and Minnan.

So the language decision is not “40 versus 52.” It is global deployment breadth versus Chinese-market depth.
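The overlap is easy to check against a product's own language list. The sketch below uses only the languages named in this article: the 15 transcription-ready Nemotron languages listed above (the model card has further tiers that are not enumerated here) and the 30 Qwen3-ASR languages. The `coverage_gap` helper and the `required` list are hypothetical and only illustrate the check.

```python
# Languages named in this article. The Nemotron set is partial: the model card
# lists more locales in its broad-coverage and adaptation-ready tiers.
NEMOTRON_NAMED = {
    "English", "Spanish", "French", "Italian", "Portuguese", "Dutch", "German",
    "Turkish", "Russian", "Arabic", "Hindi", "Japanese", "Korean", "Vietnamese",
    "Ukrainian",
}

QWEN3_ASR = {
    "Chinese", "English", "Cantonese", "Arabic", "German", "French", "Spanish",
    "Portuguese", "Indonesian", "Italian", "Korean", "Russian", "Thai",
    "Vietnamese", "Japanese", "Turkish", "Hindi", "Malay", "Dutch", "Swedish",
    "Danish", "Finnish", "Polish", "Czech", "Filipino", "Persian", "Greek",
    "Hungarian", "Macedonian", "Romanian",
}

shared = NEMOTRON_NAMED & QWEN3_ASR
nemotron_only = NEMOTRON_NAMED - QWEN3_ASR
qwen_only = QWEN3_ASR - NEMOTRON_NAMED


def coverage_gap(required: set, supported: set) -> set:
    """Return the required languages that a model does not list (hypothetical helper)."""
    return required - supported


# Hypothetical product requirement, for illustration only.
required = {"English", "Chinese", "Cantonese", "Thai"}

print("shared:", sorted(shared))
print("named for Nemotron only:", sorted(nemotron_only))
print("Qwen3-ASR only:", sorted(qwen_only))
print("missing from the Nemotron named set:", sorted(coverage_gap(required, NEMOTRON_NAMED)))
print("missing from Qwen3-ASR:", sorted(coverage_gap(required, QWEN3_ASR)))
```

Because the Nemotron set here is only the portion named in the article, a gap reported against it is a prompt to read the full model card, not a final answer.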

Streaming: both say yes, but the mechanics differ

Nemotron is a native streaming model. It uses a cache-aware FastConformer-RNNT architecture and exposes chunk settings of 80ms, 160ms, 320ms, 560ms, and 1120ms. The important part is cache reuse: the model processes new audio chunks without recomputing overlapping context. NVIDIA reports that on a single H100, at the 80ms setting, Nemotron sustains about 240 concurrent real-time streams versus 14 for the buffered Parakeet RNNT 1.1B multilingual model. At 1120ms, the comparison is 2400 versus 400 streams.
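To see why cache reuse matters so much at small chunk sizes, a toy cost model helps. The sketch below is not NVIDIA's implementation. It simply counts how many milliseconds of audio an encoder would have to process if every step re-encodes a window of past audio (buffered) versus touching each new chunk once (cache-aware). The `left_context_ms` value is a made-up assumption. The last lines only restate the reported stream counts as ratios.

```python
def audio_ms_processed(total_ms: int, chunk_ms: int, left_context_ms: int,
                       cache_aware: bool) -> int:
    """Toy cost model: milliseconds of audio the encoder processes in total."""
    steps = total_ms // chunk_ms
    if cache_aware:
        # Past context comes from cached state, so only new audio is encoded.
        return steps * chunk_ms
    processed = 0
    for i in range(steps):
        past_available = i * chunk_ms
        # Buffered decoding re-encodes the past window together with the new chunk.
        processed += min(left_context_ms, past_available) + chunk_ms
    return processed


TOTAL_MS = 10_000        # ten seconds of speech
LEFT_CONTEXT_MS = 5_600  # hypothetical context window, not taken from the model card

for chunk_ms in (80, 160, 320, 560, 1120):
    cached = audio_ms_processed(TOTAL_MS, chunk_ms, LEFT_CONTEXT_MS, cache_aware=True)
    buffered = audio_ms_processed(TOTAL_MS, chunk_ms, LEFT_CONTEXT_MS, cache_aware=False)
    print(f"chunk {chunk_ms:4d} ms: buffered does {buffered / cached:.1f}x the work")

# Ratios implied by the reported H100 stream counts quoted above.
ratio_80ms = 240 / 14
ratio_1120ms = 2400 / 400
print(f"reported advantage at 80 ms: {ratio_80ms:.1f}x, at 1120 ms: {ratio_1120ms:.1f}x")
```

The toy model and the reported figures point the same way: the smaller the chunk, the more a buffered design pays for recomputing context, which is where the gap between the two approaches is widest.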

Qwen3-ASR also supports streaming, but the model card says streaming is currently available only through the vLLM backend. Streaming mode does not support batch inference or timestamp return. That makes it a strong server-side path, not the same kind of small-chunk embedded streaming design.

FunASR’s older runtime stack has another practical option: WebSocket streaming with Paraformer online and a 2-pass flow. The system can emit live partial results, then correct the sentence at endpoint with a higher-accuracy offline pass. It is less glamorous than a new model card, but it matches call centers, meeting captions, and private deployments well.
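The 2-pass idea can be sketched in a few lines. This is a generic illustration of the control flow described above, not FunASR's API: `online_decode`, `offline_decode`, and `is_endpoint` are hypothetical callables standing in for a streaming recognizer, a higher-accuracy offline recognizer, and an endpoint detector. The "audio chunks" in the demo are plain strings so the example runs without any model.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Event = Tuple[str, str]  # ("partial" | "final", text)


def two_pass(chunks: Iterable[str],
             online_decode: Callable[[List[str]], str],
             offline_decode: Callable[[List[str]], str],
             is_endpoint: Callable[[str], bool]) -> Iterator[Event]:
    """Emit live partial results, then a corrected final result at each endpoint."""
    segment: List[str] = []
    for chunk in chunks:
        segment.append(chunk)
        # First pass: fast streaming hypothesis over the segment so far.
        yield ("partial", online_decode(segment))
        if is_endpoint(chunk):
            # Second pass: re-decode the whole segment with the slower model.
            yield ("final", offline_decode(segment))
            segment = []
    if segment:
        # Flush whatever is left when the stream ends without an endpoint.
        yield ("final", offline_decode(segment))


# Stand-in recognizers for the demo (purely illustrative).
def demo_online(segment: List[str]) -> str:
    return " ".join(c for c in segment if c != "<sil>")


def demo_offline(segment: List[str]) -> str:
    return demo_online(segment).replace("wrld", "world")


def demo_endpoint(chunk: str) -> bool:
    return chunk == "<sil>"


events = list(two_pass(["hello", "wrld", "<sil>", "bye"],
                       demo_online, demo_offline, demo_endpoint))
for kind, text in events:
    print(kind, "->", text)
```

The design choice worth noticing is that the client sees two kinds of events. A UI can render partials in a lighter style and replace them when the final arrives, which is what makes the correction feel natural instead of jumpy.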

Accuracy: Qwen3-ASR wins the Chinese table; Nemotron wins the latency-shape argument

Nemotron’s model card reports FLEURS numbers at 1.12s chunk size with language input: English WER 7.91, Spanish 4.11, French 9.03, Italian 4.25, Portuguese 5.48, German 8.31, Hindi 6.81, and Korean CER 7.12. Its pitch is not a single best benchmark number. Its pitch is a balance of multilingual accuracy, native streaming, low latency, and high concurrency.

Qwen3-ASR’s published table is stronger on headline accuracy. Qwen3-ASR-1.7B reports Librispeech clean/other at 1.63/3.38, GigaSpeech at 8.45, CV-en at 7.39, Fleurs-en at 3.35, WenetSpeech net/meeting at 4.97/5.88, AISHELL-2-test at 2.71, SpeechIO at 2.88, and Fleurs-zh at 2.41. Qwen3-ASR-0.6B is weaker but still useful: the model card gives an offline average of 3.48 across a small summary set and a streaming average of 4.40.

Those numbers should not be mixed carelessly. The two vendors do not publish one shared evaluation protocol across all languages, streaming settings, and normalization rules. Still, the shape is clear enough: Qwen3-ASR-1.7B is the stronger bet for high-quality Chinese and dialect transcription; Nemotron is the stronger bet when the architecture has to stay genuinely low-latency and stream-efficient.
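The normalization point is easy to underestimate. Below is a minimal sketch of word error rate and character error rate using standard edit distance, with an optional normalizer. The `simple_normalize` function is a deliberately crude assumption (lowercase, strip punctuation). Real evaluation pipelines use more careful, language-specific rules, and neither vendor's exact rules are reproduced here.

```python
import re
from typing import Callable, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def simple_normalize(text: str) -> str:
    """Crude normalizer: lowercase and drop punctuation (illustrative only)."""
    return re.sub(r"[^\w\s]", "", text.lower())


def wer(ref: str, hyp: str, normalize: Callable[[str], str] = lambda s: s) -> float:
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = normalize(ref).split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, normalize(hyp).split()) / len(ref_words)


def cer(ref: str, hyp: str) -> float:
    """Character error rate, ignoring whitespace."""
    ref_chars = [c for c in ref if not c.isspace()]
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, [c for c in hyp if not c.isspace()]) / len(ref_chars)


reference = "Hello, world. It costs 5 dollars"
hypothesis = "hello world it costs five dollars"

raw_wer = wer(reference, hypothesis)
normalized_wer = wer(reference, hypothesis, normalize=simple_normalize)
print(f"WER without normalization: {raw_wer:.3f}")
print(f"WER with normalization:    {normalized_wer:.3f}")
```

The same hypothesis scores very differently depending on whether casing and punctuation count as errors, and the remaining error ("5" versus "five") would also vanish under a number-normalizing rule. That is the practical reason two model cards cannot be compared line by line.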

Where each model actually fits

On-device multilingual streaming with a 600–800MB optional pack: Nemotron 3.5 ASR is the cleaner starting point because public LiteRT and CoreML ports already exist in that size range.
Chinese, Cantonese, dialects, long audio, and server-side quality: Qwen3-ASR-1.7B is more attractive, if the infrastructure can carry the weight.
0.6B server or edge comparison: test both. Nemotron favors low-latency streaming design. Qwen3-ASR favors broader speech understanding and Chinese-market coverage.
Strict mobile install-size pressure: do not start with a 1.7B model. For Mandarin-heavy products, test Paraformer INT8 ONNX or another smaller FunASR route first.
Production decision: build a private test set. Include quiet speech, street noise, far-field meetings, echo, accents, domain words, numbers, code-switching, long silence, and overlapping speakers. Model cards are a map, not a launch checklist.
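For the private test set, the useful output is rarely one average. A small sketch of per-condition aggregation follows. The condition labels mirror the list above, the sample counts are invented purely to exercise the code, and the 8% threshold is an arbitrary placeholder, not a recommendation.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (condition label, edit errors, reference length) for one utterance.
Result = Tuple[str, int, int]


def error_rate_by_condition(results: Iterable[Result]) -> Dict[str, float]:
    """Pool errors and reference lengths per condition, then divide."""
    errors: Dict[str, int] = defaultdict(int)
    lengths: Dict[str, int] = defaultdict(int)
    for condition, n_errors, ref_len in results:
        errors[condition] += n_errors
        lengths[condition] += ref_len
    return {c: errors[c] / lengths[c] for c in errors if lengths[c] > 0}


def worst_conditions(rates: Dict[str, float], threshold: float) -> List[str]:
    """Conditions whose error rate exceeds the threshold, worst first."""
    failing = [c for c, r in rates.items() if r > threshold]
    return sorted(failing, key=lambda c: rates[c], reverse=True)


# Invented sample data for illustration only; not measurements of any model.
sample_results = [
    ("quiet", 2, 100),
    ("street_noise", 15, 100),
    ("street_noise", 5, 100),
    ("far_field", 30, 120),
]

rates = error_rate_by_condition(sample_results)
for condition, rate in sorted(rates.items()):
    print(f"{condition:14s} {rate:.3f}")
print("needs attention:", worst_conditions(rates, threshold=0.08))
```

Pooling errors before dividing keeps long utterances from being underweighted, and reporting by condition shows whether a model that looks fine on average falls apart on far-field or noisy audio.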

The short version: NVIDIA shipped a real-time ASR engine that wants to sit close to the user. Alibaba shipped a stronger general ASR brain, especially convincing for Chinese and dialect-heavy workloads, and FunASR gives it a deployment ecosystem. If the app must feel instant and local, start with Nemotron or a smaller FunASR model. If the transcript must be as accurate as possible and the server budget is there, Qwen3-ASR deserves the first benchmark slot.

Sources

NVIDIA Nemotron 3.5 ASR model card on Hugging Face
Nemotron ONNX INT4, LiteRT INT8, and CoreML INT8 community conversion repositories on Hugging Face
Qwen3-ASR-0.6B and Qwen3-ASR-1.7B model cards on Hugging Face
Qwen3-ASR LiteRT and CoreML community conversion repositories on Hugging Face
FunASR GitHub README, Android runtime README, iOS Paraformer demo README, and ONNX C++ benchmark notes
