Local Deployment · Engineering Perspective · Avoid Detours

Devstral 2 · Local Deployment Guide

This page does one thing: get Devstral running by the shortest path, and tell you which model to choose and what hardware to prepare.

Before You Start (Save Yourself Time)

Suggestion
Check the hardware recommendations first to decide whether you want to run 24B or 123B; naming and alias questions are collected in the FAQ

Hardware Recommendations (So You Don't Waste Time)

Devstral Small 2 (Recommended for Individual Developers)
  • GPU: ≥ 24 GB VRAM (RTX 3090 / 4090 / L40)
  • RAM: ≥ 32 GB
  • System: Linux / macOS (Apple Silicon works with quantized builds)

Goal: Let you 'use it out of the box' instead of struggling with environment setup for a week
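
Not sure whether your machine clears the bar above? Two quick checks (assuming an NVIDIA GPU on Linux, or Apple Silicon on macOS; both commands ship with the driver/OS):
  nvidia-smi --query-gpu=name,memory.total --format=csv    # GPU model and total VRAM
  sysctl -n hw.memsize                                      # macOS: total unified memory, in bytes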

Devstral 2 (123B, Better Suited to Teams/Servers)
  • GPU: multi-GPU / ≥ 128 GB VRAM (inference-server class)
  • Usage: team inference service, heavy tasks, and long context

If you just want to experience the workflow, starting with 24B is more cost-effective

Ollama (Simplest)

One command gets it running:
ollama run devstral-2

Model library address: https://ollama.com/library/devstral-2
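
Once Ollama is installed, a minimal sanity check from the shell — a sketch assuming the devstral-2 tag shown on the library page above (confirm the exact tag there before copying):
  ollama pull devstral-2                                    # download the weights
  ollama run devstral-2 "Write a Python function that reverses a string."    # one-off prompt, no server setup needed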

When is Ollama suitable?
  • You want to quickly verify 'if it feels right'
  • You care more about 'one command to start a service' than squeezing out maximum performance (see the API sketch after this list)
  • Your team is already standardizing on an Ollama workflow
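
For the 'one command to start a service' case, Ollama also answers HTTP requests on its default local port 11434 — a sketch, again assuming the devstral-2 tag:
  curl http://localhost:11434/api/generate -d '{
    "model": "devstral-2",
    "prompt": "Explain what a GGUF file is in two sentences.",
    "stream": false
  }'
Anything on your team that can POST JSON can reuse the same endpoint.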

GGUF / llama.cpp (Common in the Community)

Recommended Process (copy as-is, then adjust for your setup)
  1. Download a GGUF-quantized model from Hugging Face
  2. Load it with llama.cpp / LM Studio / text-generation-webui
  3. Adjust threads, batch size, and context window to your project (a command sketch follows below)
Recommended Parameters (as a starting point)
  • Temperature: 0.15
  • Context: 128k–256k

Note: this is not 'the only correct answer', just a set of stable defaults to start from
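
Putting steps 1–3 and the starting parameters together, a minimal sketch — the GGUF repository name and quant filename below are placeholders, and it assumes huggingface-cli plus a llama.cpp build with llama-cli on your PATH:
  huggingface-cli download <org>/<devstral-gguf-repo> <quant-file>.gguf --local-dir ./models    # step 1: fetch the quantized weights
  llama-cli -m ./models/<quant-file>.gguf -c 131072 -t 8 --temp 0.15 -p "Refactor this function: ..."    # steps 2–3: 128k context, 8 threads, temperature 0.15
For a long-running service, llama.cpp's llama-server accepts the same -m / -c / -t flags and exposes an HTTP endpoint; sampling settings such as temperature are then passed per request.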
