How can I access Voxtral Small 24B 2507?

It is available through Mistral's official API and platform and through OpenAI-compatible providers such as OpenRouter. Because the weights are open, it can also be self-hosted.

What pricing applies to Voxtral Small 24B 2507?

OpenRouter lists it at $0.10 per 1M input tokens and $0.30 per 1M output tokens. Prices can vary by provider, so check the Mistral platform for its current first-party rates.

What types of inputs does the multimodal capability support?

The model accepts text and audio, including audio supplied as files, processes them jointly, and generates text responses.

Voxtral Small 24B 2507 by Mistral — Specs, Pricing, Benchmarks (2026)

About Voxtral Small 24B 2507

Voxtral Small 24B 2507 is a 24-billion-parameter model built for multimodal input. It accepts text, audio, and file data in a single context of up to 32,000 tokens and handles all three modalities in one unified model.

Its open weights allow customization and local deployment. At 24B parameters, it aims to balance capability against serving cost for combined audio and text workloads.

Developers typically apply it to voice transcription pipelines and audio-enriched document analysis. It also suits research projects needing accessible multimodal foundations. Production use cases include file-based audio workflows and context-aware text generation.

Capabilities

Multimodal audio-text processing

File input analysis

Speech transcription and understanding

Mixed-modality reasoning

Context-aware text generation

Audio file handling

How Voxtral Small 24B 2507 compares

Voxtral Small 24B 2507 compared with other multimodal models on intelligence, speed, and price.

Price

USD per 1M output tokens · Lower is better · Voxtral Small 24B 2507 ranks #9 of 100

Ministral 3 8B 2512: $0.15
Qwen3.5-9B: $0.15
Ministral 3 14B 2512: $0.20
Qwen3.5-Flash: $0.26
MiMo-V2.5: $0.28
Seed 1.6 Flash: $0.30
Voxtral Small 24B 2507 (this model): $0.30
Gemini 2.5 Flash Lite Preview 09-2025: $0.40
Seed-2.0-Mini: $0.40
Qwen3 VL 32B Instruct: $0.42
Qwen3 VL 8B Instruct: $0.50
Qwen3 VL 30B A3B Instruct: $0.52
Mistral Small 4: $0.60

Sources: Artificial Analysis (intelligence, speed) · OpenRouter (price).

Best for

Audio-Text Analysis Tasks

The multimodal design allows the model to process and reason over combined audio and text inputs, such as recordings together with their accompanying written context.

Extended Document Review

Its 32,000-token context window supports reviewing lengthy reports, articles, or transcripts in a single pass, provided the full input fits within the window.
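When a report or transcript is longer than the 32,000-token window, it has to be reviewed in pieces. The sketch below splits text into paragraph-aligned chunks under a token budget. It assumes roughly 4 characters per token, a rough rule of thumb for English that is not Voxtral's actual tokenizer, and a hypothetical 4,000-token reserve for the prompt and the answer; `chunkForContext` and `estimateTokens` are illustrative names, not part of any API.

```javascript
// Context window stated on this page for Voxtral Small 24B 2507.
const CONTEXT_WINDOW_TOKENS = 32000;
// Assumption: ~4 characters per token (rough English heuristic, not the
// model's real tokenizer). Use a real tokenizer when exact counts matter.
const CHARS_PER_TOKEN = 4;

function estimateTokens(text) {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Split text into chunks that each fit the window after reserving
// `reservedTokens` (hypothetical default) for the prompt and the reply.
function chunkForContext(text, reservedTokens = 4000) {
  const maxChars = (CONTEXT_WINDOW_TOKENS - reservedTokens) * CHARS_PER_TOKEN;
  const paragraphs = text.split(/\n\s*\n/);
  const chunks = [];
  let current = "";

  for (const para of paragraphs) {
    const candidate = current ? current + "\n\n" + para : para;
    if (candidate.length <= maxChars) {
      current = candidate; // paragraph still fits in the current chunk
      continue;
    }
    if (current) chunks.push(current); // close the chunk that is full
    if (para.length > maxChars) {
      // A single oversized paragraph is hard-split by characters.
      for (let i = 0; i < para.length; i += maxChars) {
        chunks.push(para.slice(i, i + maxChars));
      }
      current = "";
    } else {
      current = para; // start a new chunk with this paragraph
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Each chunk can then be sent as its own request, with a short summary of earlier chunks carried forward if the review needs continuity.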

Spoken-Content Question Answering

Users can upload audio and ask detailed questions about its content, receiving context-aware responses grounded in both the recording and the text prompt.

Strengths & limitations

Strengths

+Native support for audio and file modalities
+Efficient 24B scale for multimodal tasks
+Integrated handling of text, audio, and files
+Practical 32K-token context window

Limitations

–Moderate context length compared to larger-window models
–Smaller parameter count may limit depth on complex reasoning
–Multimodal inputs can increase processing overhead

Cost calculator

Estimate what Voxtral Small 24B 2507 would cost for your usage.

Inputs: input tokens per request · output tokens per request · requests per month

Example estimate: $0.00025 per request · $2.50 per month

Based on Voxtral Small 24B 2507's $0.10/1M input · $0.30/1M output. Estimate only — actual cost varies by provider and caching.
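The calculator's arithmetic can be sketched as follows, using the $0.10 and $0.30 per-1M-token rates quoted above. The usage profile (1,000 input tokens, 500 output tokens, 10,000 requests per month) is a hypothetical choice that happens to reproduce the example figures; `estimateCost` is an illustrative name, and caching discounts are ignored.

```javascript
// Rates quoted on this page (OpenRouter listing), in USD per 1M tokens.
// They may differ by provider and over time; caching is not modeled.
const RATES = { inputPerMillion: 0.10, outputPerMillion: 0.30 };

// Cost = (input tokens x input rate + output tokens x output rate) / 1M,
// then multiplied by the monthly request count.
function estimateCost({ inputTokens, outputTokens, requestsPerMonth }, rates = RATES) {
  const perRequest =
    (inputTokens * rates.inputPerMillion + outputTokens * rates.outputPerMillion) / 1_000_000;
  return { perRequest, perMonth: perRequest * requestsPerMonth };
}

// Hypothetical usage profile.
const est = estimateCost({ inputTokens: 1000, outputTokens: 500, requestsPerMonth: 10000 });
console.log(est.perRequest.toFixed(5), est.perMonth.toFixed(2)); // prints "0.00025 2.50"
```

Output tokens cost three times as much as input tokens here, so capping response length moves the estimate more than trimming the prompt does.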

Quick start

OpenRouter's API is OpenAI-compatible, so most OpenAI client SDKs work by swapping in its base URL. Only the model slug changes between models.

JavaScript · openai

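// Run as an ES module (e.g. quickstart.mjs) so top-level await is allowed.
import OpenAI from "openai";

// Point the OpenAI SDK at OpenRouter's OpenAI-compatible endpoint.
const client = new OpenAI({
  baseURL: "https://openrouter.ai/api/v1",
  apiKey: process.env.OPENROUTER_API_KEY, // set OPENROUTER_API_KEY in your environment
});

// Send a single text-only message to Voxtral Small 24B 2507.
const completion = await client.chat.completions.create({
  model: "mistralai/voxtral-small-24b-2507",
  messages: [{ role: "user", content: "Hello!" }],
});

// Print the model's reply.
console.log(completion.choices[0].message.content);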

Model slug: mistralai/voxtral-small-24b-2507
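The quick start sends text only. For audio, a minimal sketch, assuming OpenRouter accepts OpenAI-style `input_audio` content parts (base64 data plus a `format` field) for this model, could build the `messages` array like this. `buildAudioMessages` is a hypothetical helper, and it only allows `wav` and `mp3`, the two formats declared by the OpenAI SDK's `input_audio` type; check the provider's documentation for what it actually accepts.

```javascript
// Hypothetical helper: pairs a text prompt with base64-encoded audio in one
// user message. Assumes the provider supports OpenAI-style `input_audio` parts.
function buildAudioMessages(audioBytes, format, prompt) {
  if (!["wav", "mp3"].includes(format)) {
    throw new Error(`Unsupported audio format: ${format}`);
  }
  // Buffer is a Node.js global; it converts the raw bytes to base64.
  const data = Buffer.from(audioBytes).toString("base64");
  return [
    {
      role: "user",
      content: [
        { type: "text", text: prompt },
        { type: "input_audio", input_audio: { data, format } },
      ],
    },
  ];
}
```

To use it with the client above, read a recording with something like `fs.readFileSync("clip.wav")` (hypothetical path), pass `buildAudioMessages(bytes, "wav", "Transcribe this recording.")` as `messages` to `client.chat.completions.create`, and keep the same model slug.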

Editor's verdict

Our take on Voxtral Small 24B 2507

Voxtral Small 24B 2507 is Mistral's open-weight multimodal model with a 32K-token context window.

At $0.30 per 1M output tokens, it ranks #9 of 100 multimodal models on output price, making it one of the cheaper options in its class.

As an open-weight model you can self-host it or call it through a hosted API.

It is best suited to workloads that rely on its native support for audio and file inputs and its efficient 24B scale for multimodal tasks.


Frequently asked questions

What is the context length of Voxtral Small 24B 2507?

The model provides a context window of 32,000 tokens.


Similar models

Other multimodal models worth comparing.

Gemini 3.1 Pro Preview Custom Tools

Google · Multimodal

Google's multimodal preview model with custom tools and massive context handling.

Closed · 1049K ctx · $12.00/1M out

GPT-5.1

OpenAI · Multimodal

OpenAI's multimodal model for large-scale image, text, and file processing.

Closed · 400K ctx · $10.00/1M out

Gemini 3.1 Flash Lite

Google · Multimodal

Google's fast multimodal model for efficient text, image, and video tasks.

Closed · II 33.5 · 1049K ctx · $1.50/1M out
