Mistral just updated its open source Small model from 3.1 to 3.2: here’s why


French AI darling Mistral is keeping the new releases coming this summer.

Just days after announcing its own domestic AI-optimized cloud service Mistral Compute, the well-funded company has released an update to its 24B parameter open source model Mistral Small, jumping from a 3.1 release to 3.2-24B Instruct-2506.

The new version builds directly on Mistral Small 3.1, aiming to improve specific behaviors such as instruction following, output stability, and function calling robustness. While overall architectural details remain unchanged, the update introduces targeted refinements that affect both internal evaluations and public benchmarks.


According to Mistral AI, Small 3.2 is better at adhering to precise instructions and reduces the likelihood of infinite or repetitive generations — a problem occasionally seen in prior versions when handling long or ambiguous prompts.

Similarly, the function calling template has been upgraded to support more reliable tool-use scenarios, particularly in frameworks like vLLM.
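Since vLLM exposes an OpenAI-compatible chat endpoint, tool use with Small 3.2 is typically exercised through a standard `tools` payload. The sketch below only assembles such a request; the model ID, the `get_weather` tool, and the payload shape are illustrative assumptions based on the common OpenAI-style schema, not details confirmed by the article.

```python
# Hedged sketch: building an OpenAI-style tool-calling request for a
# vLLM server hosting Mistral Small 3.2. The model id and the tool
# definition are illustrative assumptions.
import json

MODEL = "mistralai/Mistral-Small-3.2-24B-Instruct-2506"  # assumed HF id


def build_tool_call_request(user_prompt: str) -> dict:
    """Assemble a chat-completion payload that offers the model one tool."""
    weather_tool = {
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool for illustration
            "description": "Fetch current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": user_prompt}],
        "tools": [weather_tool],
        "tool_choice": "auto",  # let the model decide whether to call it
    }


payload = build_tool_call_request("What's the weather in Paris?")
print(json.dumps(payload, indent=2))
```

A payload like this would be POSTed to the server's `/v1/chat/completions` route; the improved function-calling template is what should make the model's decision to emit a `tool_calls` response more reliable.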

At the same time, the model can run on a single Nvidia A100 or H100 80GB GPU, significantly widening the options for businesses with tight compute resources and/or budgets.

An updated model after only 3 months

Mistral Small 3.1 was announced in March 2025 as a flagship open release in the 24B parameter range. It offered full multimodal capabilities, multilingual understanding, and long-context processing of up to 128K tokens.

The model was explicitly positioned against proprietary peers like GPT-4o Mini, Claude 3.5 Haiku, and Gemma 3-it — and, according to Mistral, outperformed them across many tasks.

Small 3.1 also emphasized efficient deployment, with claims of running inference at 150 tokens per second and support for on-device use with 32 GB RAM.

That release came with both base and instruct checkpoints, offering flexibility for fine-tuning across domains such as legal, medical, and technical fields.

In contrast, Small 3.2 focuses on surgical improvements to behavior and reliability. It does not aim to introduce new capabilities or architecture changes. Instead, it acts as a maintenance release: cleaning up edge cases in output generation, tightening instruction compliance, and refining system prompt interactions.

Small 3.2 vs. Small 3.1: what changed?

Instruction-following benchmarks show a small but measurable improvement. Mistral’s internal accuracy rose from 82.75% in Small 3.1 to 84.78% in Small 3.2.

Similarly, performance on external datasets like Wildbench v2 and Arena Hard v2 improved significantly—Wildbench increased by nearly 10 percentage points, while Arena Hard more than doubled, jumping from 19.56% to 43.10%.

Internal metrics also suggest reduced output repetition. The rate of infinite generations dropped from 2.11% in Small 3.1 to 1.29% in Small 3.2, a reduction of roughly 40%. This makes the model more reliable for developers building applications that require consistent, bounded responses.

Performance across text and coding benchmarks presents a more nuanced picture. Small 3.2 showed gains on HumanEval Plus (88.99% to 92.90%), MBPP Pass@5 (74.63% to 78.33%), and SimpleQA. It also modestly improved MMLU Pro and MATH results.

Vision benchmarks remain mostly consistent, with slight fluctuations. ChartQA and DocVQA saw marginal gains, while AI2D and Mathvista dropped by less than two percentage points. Average vision performance decreased slightly from 81.39% in Small 3.1 to 81.00% in Small 3.2.

This aligns with Mistral’s stated intent: Small 3.2 is not a model overhaul, but a refinement. As such, most benchmarks are within expected variance, and some regressions appear to be trade-offs for targeted improvements elsewhere.

However, as AI power user and influencer @chatgpt21 posted on X: “It got worse on MMLU,” referring to the Massive Multitask Language Understanding benchmark, a multidisciplinary test spanning 57 subjects designed to assess broad LLM performance across domains. Indeed, Small 3.2 scored 80.50%, slightly below Small 3.1’s 80.62%.

Open source license will make it more appealing to cost-conscious and customization-focused users

Both Small 3.1 and 3.2 are available under the Apache 2.0 license and can be accessed via the popular AI code-sharing repository Hugging Face (itself a startup based in France and New York City).

Small 3.2 is supported by frameworks like vLLM and Transformers and requires roughly 55 GB of GPU RAM to run in bf16 or fp16 precision.
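The roughly 55 GB figure is consistent with back-of-the-envelope arithmetic: 24 billion parameters at 2 bytes each in bf16/fp16 account for about 48 GB of weights, with the remainder going to KV cache, activations, and framework overhead. The split below is an assumption for illustration, not an official breakdown.

```python
# Back-of-the-envelope bf16/fp16 memory estimate for a 24B-parameter model.
# Only the weight figure is straight arithmetic; the overhead split is an
# assumed illustration, not an official number from Mistral.
PARAMS = 24e9          # 24 billion parameters
BYTES_PER_PARAM = 2    # bf16/fp16 = 16 bits = 2 bytes

weight_gb = PARAMS * BYTES_PER_PARAM / 1e9   # decimal gigabytes
print(f"weights alone: {weight_gb:.0f} GB")  # -> 48 GB

# The remaining portion of the quoted ~55 GB budget would cover the
# KV cache, activations, and runtime overhead (assumed split).
overhead_gb = 55 - weight_gb
print(f"implied overhead: {overhead_gb:.0f} GB")
```

This is also why a single 80GB A100 or H100 suffices: the ~55 GB footprint leaves headroom for longer contexts and batching.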

For developers seeking to build or serve applications, system prompts and inference examples are provided in the model repository.
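For self-hosting, a vLLM deployment would typically look like the commands below. The model ID and flags are assumptions based on common vLLM usage rather than details from the article; the model card's own inference examples should be treated as authoritative.

```shell
# Hedged sketch: serving Small 3.2 via vLLM's OpenAI-compatible server.
# Model id and flags are illustrative; check the Hugging Face model card
# for the recommended invocation.
pip install vllm

vllm serve mistralai/Mistral-Small-3.2-24B-Instruct-2506 \
  --dtype bfloat16 \
  --max-model-len 32768   # context budget; must fit alongside ~55 GB of weights

# Query it like any OpenAI-compatible endpoint, including a system prompt:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mistralai/Mistral-Small-3.2-24B-Instruct-2506",
       "messages": [
         {"role": "system", "content": "You are a concise assistant."},
         {"role": "user", "content": "Summarize bf16 in one sentence."}
       ]}'
```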

While Mistral Small 3.1 is already integrated into platforms like Google Cloud Vertex AI and is scheduled for deployment on NVIDIA NIM and Microsoft Azure, Small 3.2 currently appears limited to self-serve access via Hugging Face and direct deployment.

What enterprises should know when considering Mistral Small 3.2 for their use cases

Mistral Small 3.2 may not shift competitive positioning in the open-weight model space, but it represents Mistral AI’s commitment to iterative model refinement.

With noticeable improvements in reliability and task handling — particularly around instruction precision and tool usage — Small 3.2 offers a cleaner user experience for developers and enterprises building on the Mistral ecosystem.

The fact that it is made by a French startup and compliant with EU rules and regulations such as GDPR and the EU AI Act also makes it appealing for enterprises working in that part of the world.

Still, for those seeking the biggest jumps in benchmark performance, Small 3.1 remains a reference point—especially given that in some cases, such as MMLU, Small 3.2 does not outperform its predecessor. That makes the update more of a stability-focused option than a pure upgrade, depending on the use case.


