Mistral releases Voxtral TTS, an open-weight model for speech generation

On March 26, 2026, Mistral released Voxtral TTS, a 4-billion-parameter text-to-speech model available as open weights under a CC BY-NC 4.0 license. The model supports nine languages (English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, Arabic), produces audio within approximately 90 milliseconds of receiving text, and can clone a voice from three to five seconds of reference audio.

In benchmark comparisons against ElevenLabs Flash v2.5, Voxtral TTS achieved a 68.4% win rate in human preference tests for multilingual voice cloning. On audio quality, it performs at parity with ElevenLabs v3. Self-hosting requires a single GPU with at least 16 GB of VRAM when the weights are run in BF16 format; Mistral has also designed the model for edge deployment on devices such as laptops and smartphones once it is quantized.
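The 16 GB figure is easier to judge with some rough arithmetic. The sketch below estimates only the size of the weights, assuming a nominal 4 billion parameters and 2 bytes per parameter for BF16. The 4-bit case is an illustrative assumption, since the article does not say which quantization scheme is used for edge devices, and the helper name is hypothetical.

```python
# Back-of-the-envelope estimate of weight memory only. Activations, caches,
# and framework overhead are NOT included and need headroom on top.
PARAMS = 4_000_000_000      # nominal 4 billion parameters (from the article)
BYTES_PER_PARAM_BF16 = 2    # BF16 stores each value in 16 bits
BYTES_PER_PARAM_4BIT = 0.5  # assumption: 4-bit quantization, not confirmed


def weight_footprint_gb(params: int, bytes_per_param: float) -> float:
    """Return the weight size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


bf16_gb = weight_footprint_gb(PARAMS, BYTES_PER_PARAM_BF16)
quant_gb = weight_footprint_gb(PARAMS, BYTES_PER_PARAM_4BIT)

print(f"BF16 weights:  {bf16_gb:.1f} GB")   # 8.0 GB
print(f"4-bit weights: {quant_gb:.1f} GB")  # 2.0 GB
```

Under these assumptions the BF16 weights take about half of a 16 GB card, leaving the rest for runtime overhead, while a 4-bit version would fit in the memory budget of many laptops and phones.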

For writers and journalists, the most direct use cases are narrating articles or newsletters as audio, producing podcast episodes from written scripts, dubbing translated interviews for multilingual audiences, and generating accessible audio versions of documents. Transcription itself is a speech-to-text task and is outside what a text-to-speech model does. Voice cloning, which adapts to a specific speaker’s accent and tone from a short reference clip, makes it practical to keep a consistent voice across output, and self-hosting the weights avoids usage-based API costs.

The weights are available on Hugging Face under the identifier mistralai/Voxtral-4B-TTS-2603 at no cost for non-commercial and research use. Commercial use of the weights requires a separate agreement with Mistral. For teams that prefer an API rather than self-hosting, Mistral offers Voxtral TTS through its platform at $0.016 per 1,000 characters.
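To get a feel for the API pricing, the quoted rate can be turned into a per-article figure. This is a minimal sketch: the function name and the 6,000-character article length are hypothetical, and it assumes every character (including spaces) is billed with no minimum charge, which the article does not confirm.

```python
PRICE_PER_1K_CHARS_USD = 0.016  # API price quoted in the article


def estimate_api_cost(num_chars: int) -> float:
    """Estimate the API cost in USD for synthesizing num_chars characters.

    Assumes simple linear billing per character with no minimum fee.
    """
    return num_chars / 1000 * PRICE_PER_1K_CHARS_USD


# Hypothetical example: an article of roughly 1,000 words (~6,000 characters).
article_chars = 6_000
cost = estimate_api_cost(article_chars)
print(f"{article_chars} characters: about ${cost:.3f}")  # about $0.096
```

At that rate, narrating a typical article costs roughly ten cents, so the break-even point against running a 16 GB GPU depends mostly on volume.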

The CC BY-NC license distinguishes Voxtral TTS from Mistral’s Voxtral speech-to-text models, which are released under Apache 2.0. Teams building commercial products should verify which license applies to their use case before deploying.