Choosing an on-device speech-to-text model: FunASR, Whisper, Vosk, Xiaomi MiMo, and the open ASR field

A practical comparison of open speech-to-text models by streaming support, language coverage, model size, and mobile deployment fit.

Publisher: WayDigital

Published: 2026-05-30 08:47 UTC

Language: en

Region: global

Category: Product Notes

Putting speech-to-text on a phone: the real fight is size, streaming, and heat

I went back through the open ASR landscape with one boring product question in mind: what can actually live inside a mobile app? FunASR is popular. Xiaomi has MiMo-V2.5-ASR. MiniMind and MiniCPM-o make speech demos look tempting. But once the target is an offline feature that ordinary users can download and a low-end Android phone can survive, the filter gets brutal.

The model has to be small enough. It should stream if the use case is voice input. Language coverage has to match the product, not the press release. And the runtime should not turn the app into a hand warmer.

The short answer

Best mobile-first route: sherpa-onnx or sherpa-ncnn with a small streaming model. These projects are runtimes, not just model cards, and they already target Android, iOS, HarmonyOS, and WebAssembly, with APIs for C++, Kotlin, Swift, and more.
Best Chinese ASR family to test first: FunASR, especially Paraformer streaming and SenseVoiceSmall ONNX. Paraformer is the cleaner ASR choice; SenseVoice adds language, emotion, and audio-event tags.
Smallest practical offline option: Vosk small models. Chinese is about 42 MiB. English is about 39 MiB. Not the fanciest model, but very easy to ship.
Whisper on mobile: use whisper.cpp, not the original Python/PyTorch stack. tiny and base are realistic; small is near the upper limit for an app bundle; medium and above are not normal phone models.
Xiaomi MiMo-V2.5-ASR: open, interesting, and huge. The weights are about 32 GB. That is a server/research model, not an app bundle.
MiniMind-3o and MiniCPM-o: treat them as speech/multimodal assistant models, not lightweight ASR engines.

The 500 MB line matters

For an offline consumer app, 500 MB is not a random number. Cross it and you pay everywhere: lower download conversion, more install friction, a slower first launch, memory spikes, heat, and crashes on cheap devices. In a real product, “it runs on my laptop” means almost nothing.

Using that line, the list gets much shorter.
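
As a rough illustration of how that cut works, the sketch below applies a 500 MB budget to a few of the sizes quoted later in this article. The names `BUDGET_BYTES`, `CANDIDATES`, and `shortlist` are made up for this example. The only subtlety is converting MiB and GiB figures to bytes before comparing them with a decimal-megabyte budget.

```python
MIB = 1024 ** 2   # binary mebibyte, in bytes
MB = 1000 ** 2    # decimal megabyte, in bytes

# Assumption for this sketch: the budget is 500 decimal megabytes.
BUDGET_BYTES = 500 * MB

# Approximate sizes as quoted in the per-model sections of this article.
CANDIDATES = {
    "Vosk small Chinese": 41.9 * MIB,
    "whisper.cpp tiny": 74 * MIB,
    "whisper.cpp base": 141 * MIB,
    "Paraformer ONNX quantized": 227 * MIB,
    "SenseVoiceSmall ONNX quantized": 230 * MIB,
    "whisper.cpp small": 465 * MIB,
    "SenseVoiceSmall PyTorch": 893 * MIB,
    "whisper.cpp medium": 1.46 * 1024 * MIB,
    "Xiaomi MiMo-V2.5-ASR": 32 * 1000 * MB,
}


def shortlist(candidates, budget_bytes=BUDGET_BYTES):
    """Return the names that fit the budget, smallest first."""
    fitting = [name for name, size in candidates.items() if size <= budget_bytes]
    return sorted(fitting, key=candidates.get)


if __name__ == "__main__":
    for name in shortlist(CANDIDATES):
        print(f"{name}: {CANDIDATES[name] / MB:.0f} MB")
```

Note that this only checks the download size of the model file. Runtime memory and thermals still have to be measured on a device.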

The candidates, without the brochure language

1. FunASR, SenseVoice, and Paraformer

Open source: yes, under the ModelScope/FunAudioLLM ecosystem.
Streaming: Paraformer has streaming variants; FunASR supports online/offline server modes. SenseVoiceSmall is essentially a non-streaming model, but its input can be split into speech segments with VAD and transcribed segment by segment.
Languages: SenseVoiceSmall covers Chinese, English, Japanese, Korean, and Cantonese. Fun-ASR-Nano is listed for 31 languages. The broader FunASR ecosystem claims 50+ language support.
Size: SenseVoiceSmall PyTorch is about 893 MiB; quantized ONNX is about 230 MiB. Paraformer quantized ONNX is about 227 MiB; Paraformer PyTorch is about 840 MiB. FSMN-VAD ONNX is only about 0.5 MiB, and CAM++ speaker modeling is about 27 MiB.
Mobile verdict: do not ship the raw PyTorch models. The ONNX/int8 route is realistic. For Chinese plus Cantonese/Japanese/Korean/English, SenseVoiceSmall is worth a serious test. For pure Chinese streaming transcription, Paraformer streaming is cleaner.
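
To make the "chunk with VAD" idea concrete, here is a minimal sketch of cutting audio into speech segments before handing each one to a non-streaming recognizer. It uses a plain frame-energy threshold as a stand-in, not FSMN-VAD, and the function names, the threshold, and the `max_gap` parameter are all assumptions chosen for illustration.

```python
def frame_energies(samples, frame_len):
    """Mean squared amplitude of each full, non-overlapping frame."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def speech_segments(energies, threshold, max_gap=2):
    """Group frames at or above `threshold` into (start, end) frame ranges.

    `end` is exclusive. Silent runs of up to `max_gap` frames are bridged,
    so a short pause does not split one utterance into two segments.
    """
    segments = []
    start = None   # first frame of the segment being built, if any
    gap = 0        # number of consecutive silent frames seen so far
    for i, energy in enumerate(energies):
        if energy >= threshold:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap > max_gap:
                # Close the segment at the last frame that held speech.
                segments.append((start, i - gap + 1))
                start = None
                gap = 0
    if start is not None:
        # Audio ended mid-segment; drop any trailing silent frames.
        segments.append((start, len(energies) - gap))
    return segments
```

Each returned range maps back to samples as `start * frame_len` to `end * frame_len`, and each slice would then go to the recognizer as one utterance. A real deployment would replace the energy threshold with a trained VAD such as the FSMN-VAD model mentioned above.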

2. sherpa-onnx and sherpa-ncnn

Open source: yes, from the k2-fsa ecosystem.
Streaming: yes, both streaming and non-streaming ASR.
Languages: model-dependent, with Chinese, English, Japanese, Korean, Cantonese, Whisper, Paraformer, SenseVoice, Zipformer, and more in the ecosystem.
Size: Zipformer small English int8 components are in the 26 MB range. Chinese streaming Zipformer small CTC int8 is about 25 MB. Paraformer int8 ONNX is about 213 MB.
Mobile verdict: this is the first runtime I would test for a mobile product. It removes a lot of the ugly work between a model file and a usable Android/iOS feature.

3. Vosk

Open source: yes, Kaldi-based.
Streaming: yes. The project is built around low-latency streaming recognition.
Languages: more than 20 languages and dialects, including Chinese, English, German, French, Spanish, Russian, Japanese, Hindi, and others.
Size: small Chinese is about 41.9 MiB; small English is about 39.3 MiB. Runtime memory is usually a few hundred MB for small models.
Mobile verdict: if you need a stable offline voice input quickly, Vosk is still a very practical choice. Accuracy will not beat the newer models in every noisy/open-domain case, but the footprint is hard to ignore.

4. Whisper and whisper.cpp

Open source: yes. OpenAI released Whisper; mobile deployments usually use whisper.cpp.
Streaming: not native causal streaming. Most “streaming Whisper” setups are chunked or pseudo-streaming.
Languages: multilingual models cover many languages; English-only variants also exist.
Size: whisper.cpp ggml tiny is about 74 MiB, base about 141 MiB, small about 465 MiB, medium about 1.46 GiB. OpenAI small safetensors is about 922 MiB.
Mobile verdict: tiny/base are realistic. small is already close to the 500 MB line. medium and above belong on desktop or server unless you control very high-end hardware.
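
For readers unfamiliar with what "chunked or pseudo-streaming" means in practice, the sketch below shows the general shape: slide a fixed window over the audio with some overlap and transcribe each window independently. This is an illustration of the pattern, not whisper.cpp's API. The `transcribe` callable is a hypothetical placeholder for whatever recognizer is used, and the 5-second window and 1-second overlap are arbitrary example values.

```python
def window_bounds(num_samples, window, hop):
    """Yield (start, end) sample indices for overlapping windows.

    Consecutive windows start `hop` samples apart, so they overlap by
    `window - hop` samples. The final window is clipped to the audio end.
    """
    if window <= 0 or hop <= 0:
        raise ValueError("window and hop must be positive")
    start = 0
    while start < num_samples:
        end = min(start + window, num_samples)
        yield start, end
        if end == num_samples:
            break
        start += hop


def pseudo_stream(samples, transcribe, sample_rate=16000,
                  window_s=5.0, overlap_s=1.0):
    """Run `transcribe` on each window and return the partial results.

    `transcribe` is a hypothetical callable taking a slice of samples and
    returning text; merging the overlapping partial texts is left out.
    """
    window = int(window_s * sample_rate)
    hop = window - int(overlap_s * sample_rate)
    return [transcribe(samples[a:b])
            for a, b in window_bounds(len(samples), window, hop)]
```

The latency floor of this approach is the window length: nothing is emitted until a full window has been recorded and decoded, which is the practical difference from a causal streaming model.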

5. Distil-Whisper

Open source: yes, from Hugging Face.
Streaming: chunked/sequential inference, not native streaming.
Languages: official released checkpoints are mainly English-focused.
Size: distil-small.en ggml is about 336 MB; distil-large-v3 safetensors is about 1.51 GB.
Mobile verdict: a reasonable English-only compromise. Not a main model for a multilingual app.

6. WeNet

Open source: yes, Apache-2.0.
Streaming: yes, with streaming and non-streaming E2E ASR.
Languages: Chinese, English, and more depending on the pretrained model.
Size: model-dependent; the project has pretrained and runtime packages, but it is less “grab this one small mobile model” than Vosk or sherpa.
Mobile verdict: credible, but engineering-heavy. Good for a team with ASR experience, less ideal for a fast app integration.

7. Baidu PaddleSpeech

Open source: yes, Apache-2.0.
Streaming: yes, including C++ streaming deployment paths.
Languages: Chinese and English models are available.
Size: AISHELL Conformer is about 182 MiB; chunk Conformer AISHELL about 178 MiB; WenetSpeech Conformer about 456 MiB; some DeepSpeech2 models range from about 659 MiB to 1.35 GiB.
Mobile verdict: some model sizes are acceptable, but the app integration path is not as direct as sherpa. I would treat it as a second-round candidate.

8. Xiaomi MiMo-V2.5-ASR

Open source: yes, MIT.
Streaming: I did not find a clear real-time streaming ASR API in the public materials.
Languages: Chinese, English, Cantonese, with claims around dialects, code-switching, songs, noise, and multi-speaker audio.
Size: Hugging Face safetensors total roughly 32 GB.
Mobile verdict: not a phone model. Useful for research and server-side benchmarking, not for an offline app bundle.

9. Tencent-related open pieces

Open source: Tencent ncnn is a strong mobile inference framework. TencentGameMate has Chinese wav2vec2/HuBERT pretraining work.
Streaming: ncnn is a runtime framework, not an ASR model; the pretraining projects are not turnkey streaming ASR engines.
Languages: mostly model-specific, with Chinese pretraining work available.
Size: I did not find a Tencent-branded, ready-to-ship mobile ASR package comparable to Vosk small or SenseVoice ONNX.
Mobile verdict: ncnn is useful infrastructure. But if you want an open ASR model to embed today, FunASR/sherpa/Vosk are clearer paths.

10. Meta MMS

Open source: yes, in the fairseq ecosystem.
Streaming: not a ready mobile streaming product.
Languages: ASR coverage for 1,100+ languages. This is the reason MMS matters.
Size: MMS-1B safetensors is about 3.86 GB; MMS-300M PyTorch is about 1.27 GB. Language adapters are small, about 9 MB each, but the backbone is not.
Mobile verdict: valuable for long-tail language research. Too heavy for a normal Chinese/English mobile app.

11. NVIDIA NeMo and Parakeet

Open source: NeMo is open source, and Parakeet has open weights.
Streaming: the NeMo/Riva ecosystem supports streaming in some setups, but each model needs checking.
Languages: public Parakeet checkpoints are mainly English-focused.
Size: Parakeet TDT 0.6B .nemo is about 2.47 GB.
Mobile verdict: not a normal phone bundle. Think GPU server, edge GPU, or NVIDIA deployment stack.

12. MiniMind-3o and MiniCPM-o

Open source: MiniMind-O and MiniCPM-o have open projects or weights.
Streaming: they are built for speech conversation and multimodal interaction, not as a compact STT engine.
Languages: MiniMind-3o is mainly Chinese/English; MiniCPM-o targets Chinese/English speech conversation plus multimodal input.
Size: MiniMind-3o’s own weight file is about 226 MB, but it relies on frozen components such as SenseVoice, SigLIP2, and Mimi. MiniCPM-o 2.6 weights are about 17 GB.
Mobile verdict: if all you need is speech-to-text, don’t start here. These models solve a different problem.

How I would choose for a mobile app

Chinese offline voice input, latency first: sherpa-onnx with Chinese streaming Zipformer or Paraformer. Start with the 25 MB to 213 MB range.
Chinese/Cantonese/Japanese/Korean/English plus richer audio tags: SenseVoiceSmall ONNX int8, about 230 MiB; around 260 MiB with VAD and CAM++.
Smallest stable offline package: Vosk small, around 40 MB.
English offline transcription: whisper.cpp tiny/base or Distil-Whisper distil-small.en.
Long-tail language coverage: Meta MMS, probably server-side.
Video subtitles and long audio batch jobs: Whisper small, FunASR Paraformer, and SenseVoice are all worth testing, but thermal behavior matters.
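
The combined SenseVoice figure in this list is plain addition over the component sizes quoted in the FunASR section. A quick check, with hypothetical names and the sizes taken as MiB as listed there:

```python
MIB = 1024 ** 2   # binary mebibyte, in bytes
MB = 1000 ** 2    # decimal megabyte, in bytes

# Component sizes in MiB, as quoted in the FunASR section of this article.
SENSEVOICE_BUNDLE_MIB = {
    "SenseVoiceSmall ONNX (quantized)": 230.0,
    "FSMN-VAD ONNX": 0.5,
    "CAM++ speaker model": 27.0,
}


def bundle_total_mib(parts):
    """Sum the component sizes, in MiB."""
    return sum(parts.values())


def mib_to_mb(size_mib):
    """Convert binary MiB to decimal MB."""
    return size_mib * MIB / MB


def headroom_mb(parts, budget_mb=500.0):
    """Decimal megabytes left under the budget after the whole bundle."""
    return budget_mb - mib_to_mb(bundle_total_mib(parts))


if __name__ == "__main__":
    total = bundle_total_mib(SENSEVOICE_BUNDLE_MIB)
    print(f"bundle: {total} MiB, about {mib_to_mb(total):.0f} MB")
    print(f"headroom under 500 MB: {headroom_mb(SENSEVOICE_BUNDLE_MIB):.0f} MB")
```

The VAD model is a rounding error here; the recognizer dominates the bundle, and the speaker model is the only optional piece that meaningfully changes the total.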

My first test matrix

Round one: sherpa-onnx Chinese streaming Zipformer small int8, Vosk small cn/en, SenseVoiceSmall ONNX int8.
Round two: Paraformer streaming ONNX, whisper.cpp base/small quantized.
Not for the app bundle right now: Xiaomi MiMo-V2.5-ASR, MiniCPM-o, Meta MMS, NVIDIA Parakeet, raw Fun-ASR-Nano weights.
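
One way to score each entry in a matrix like this is to log a few numbers per run and apply the same pass/fail rule to every model. The sketch below uses the real-time factor (processing time divided by audio duration, where values under 1.0 mean the model keeps up with live speech). The `RunResult` fields and the default thresholds in `passes` are assumptions for illustration, not measured limits, and heat is not captured here and would need a separate on-device reading.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """Numbers logged from one on-device test run (hypothetical schema)."""
    model: str
    audio_seconds: float        # length of the recording fed to the model
    processing_seconds: float   # wall-clock time spent transcribing it
    peak_memory_mb: float       # highest memory use observed during the run
    crashed: bool = False

    @property
    def real_time_factor(self):
        """Processing time per second of audio; below 1.0 keeps up live."""
        return self.processing_seconds / self.audio_seconds


def passes(run, max_rtf=1.0, max_memory_mb=500.0):
    """Apply one pass/fail rule to a run; thresholds are placeholders."""
    return (
        not run.crashed
        and run.real_time_factor <= max_rtf
        and run.peak_memory_mb <= max_memory_mb
    )


if __name__ == "__main__":
    # Placeholder values for a ten-minute recording, not real measurements.
    example = RunResult("model-a", audio_seconds=600.0,
                        processing_seconds=300.0, peak_memory_mb=200.0)
    print(example.model, example.real_time_factor, passes(example))
```

Keeping the rule identical across models matters more than the exact thresholds, since the point of the matrix is a like-for-like comparison on the same device.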

The real test is not whose model has the loudest launch post. Put the phone offline, use an ordinary Android device, record ten minutes, and see whether it keeps transcribing without burning the user’s hand, crashing, or turning the app into a one-gigabyte download. That is where an ASR model stops being a demo and starts being a product.