Devstral 2 · Local Deployment Guide
This page does one thing: get Devstral running by the shortest path, and answer 'which model should I choose, and what hardware do I need?'
Before You Start (Save Time First)
Suggestion
First check the hardware recommendations to decide whether you want to run 24B or 123B; naming/alias questions are collected in the FAQ.
Hardware Recommendations (So You Don't Waste Time)
Devstral Small 2 (Recommended for Individual Developers)
GPU
≥ 24GB VRAM (RTX 3090 / 4090 / L40)
RAM
≥ 32GB
System
Linux / macOS (Apple Silicon can run quantized builds)
Goal: let you 'use it out of the box' instead of spending a week fighting environment setup
Devstral 2 (123B, More for Teams/Servers)
GPU
Multi-GPU / ≥ 128GB VRAM (inference-server class)
Usage
Team-wide inference service, heavy tasks, and long context
If you just want to experience the workflow, starting with 24B is more cost-effective
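To see roughly why these numbers land where they do, here is a back-of-the-envelope sketch in Python. The function and the overhead figure are illustrative assumptions, not a precise calculator; real usage depends on quantization scheme, context length, and backend.

```python
# Rough VRAM estimate: weights at the chosen quantization, plus a cushion
# for KV cache and runtime overhead. Treat this as a sanity check only.

def estimate_vram_gb(params_b: float, bits_per_weight: float, overhead_gb: float = 4.0) -> float:
    weights_gb = params_b * bits_per_weight / 8  # 1B params at 8 bits ~= 1 GB
    return weights_gb + overhead_gb

# Devstral Small 2 (24B) at ~4-bit quantization: fits a single 24GB card, tightly
print(f"24B  @ 4-bit: ~{estimate_vram_gb(24, 4):.0f} GB")
# Devstral 2 (123B) at ~4-bit: well past a single consumer GPU before KV cache even grows
print(f"123B @ 4-bit: ~{estimate_vram_gb(123, 4):.0f} GB")
```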
Ollama (Simplest)
One command to get it running
ollama run devstral-2
Model library address: https://ollama.com/library/devstral-2
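Once the model is pulled, Ollama also serves a local HTTP API on port 11434. A minimal sketch of calling it from Python with the `requests` library; the model name is taken from the command above, so adjust it if your local tag differs:

```python
import requests

# Ollama's local chat endpoint; the server listens on port 11434 by default.
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "devstral-2",  # same name as in `ollama run devstral-2`
        "messages": [
            {"role": "user", "content": "Write a Python function that reverses a linked list."}
        ],
        "stream": False,  # return one JSON object instead of a token stream
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["message"]["content"])
```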
When is Ollama suitable?
- You want to quickly verify whether it 'feels right'
- You care more about 'one command to start a service' than squeezing out maximum performance
- Your team is already standardizing on an Ollama workflow
GGUF / llama.cpp (Common in the Community)
Recommended Process (copy as-is, then adjust to your setup)
- Download a GGUF quantized model from Hugging Face
- Load it with llama.cpp / LM Studio / text-generation-webui
- Adjust threads / batch size / context window to fit your project (see the sketch after this list)
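If you prefer to script the same flow, here is a minimal sketch using `huggingface_hub` and `llama-cpp-python`. The repo id and filename are hypothetical placeholders; substitute the GGUF you actually downloaded.

```python
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

# Placeholder repo/file names -- point these at the actual GGUF quant you chose.
model_path = hf_hub_download(
    repo_id="your-org/Devstral-Small-2-GGUF",   # hypothetical repo id
    filename="devstral-small-2.Q4_K_M.gguf",    # hypothetical quant file
)

llm = Llama(
    model_path=model_path,
    n_ctx=131072,      # context window; shrink this if you run out of memory
    n_threads=8,       # CPU threads; match your physical core count
    n_batch=512,       # prompt batch size; larger is faster but uses more memory
    n_gpu_layers=-1,   # offload all layers to the GPU if they fit, else set a lower number
)
```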
Recommended Parameters (as a starting point)
- Temperature: 0.15
- Context: 128k–256k
Note: this is not 'the only correct answer', just a set of defaults that tends to be stable
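Applied to the `llm` object from the llama.cpp sketch above, the defaults look like this; the prompt and `max_tokens` value are illustrative:

```python
# Low temperature keeps code edits close to deterministic; the large n_ctx set at
# load time covers the long-context use case.
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Refactor this function to remove the global state."}],
    temperature=0.15,   # the recommended low-temperature default
    max_tokens=1024,
)
print(out["choices"][0]["message"]["content"])
```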