Qwen2.5 VL 72B Instruct

Verified

Excels at multimodal reasoning that combines images with long text.

Alibaba Qwen · Multimodal · Open · Vision
Updated 2026-06-15

About Qwen2.5 VL 72B Instruct

Built on the Qwen2.5 language series, this model adds visual understanding. It accepts text and images in the same prompt and reasons over them together. The architecture supports high-resolution image inputs alongside long text contexts of up to 131072 tokens.

Strengths include robust performance in visual question answering and document interpretation. It is well suited to applications that need detailed scene description or cross-modal inference. Because the weights are open, developers can fine-tune it on specialized multimodal datasets.

Capabilities

Multimodal text and image understanding
Long-context reasoning
Visual question answering
Document and chart interpretation
Image analysis and description
Multimodal instruction following

How Qwen2.5 VL 72B Instruct compares

Qwen2.5 VL 72B Instruct vs other multimodal models on intelligence, speed and price.

Price

USD per 1M output tokens · Lower is better · Qwen2.5 VL 72B Instruct ranks #36 of 139

  • Saba: $0.60
  • Qwen3 VL 235B A22B Instruct: $0.88
  • Codestral 2508: $0.90
  • GLM 4.6V: $0.90
  • Qwen3.6 35B A3B: $1.0
  • Qwen3.5-35B-A3B: $1.0
  • Qwen2.5 VL 72B Instruct: $1.0
  • Qwen3.6 Flash: $1.1
  • Step 3.7 Flash: $1.1
  • MiniMax M3: $1.2
  • GPT-5.4 Nano: $1.3
  • ERNIE 4.5 VL 424B A47B: $1.3
  • Qwen3.7 Plus: $1.3

Sources: Artificial Analysis (intelligence, speed) · OpenRouter (price).

Best for

Visual Question Answering on Complex Scenes

The model answers detailed questions about images by combining visual recognition with textual reasoning in a single pass.

Document and Chart Analysis

It extracts data, trends, and summaries from charts, tables, and multi-page documents that mix text and visuals.
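
For local documents or charts, one common approach is to embed the image bytes as a base64 data URL in place of a web URL. The following is a minimal sketch, assuming Node.js (for the global `Buffer`) and assuming the chosen provider accepts data URLs for image inputs. `toDataUrl` and the sample bytes are hypothetical names used for illustration only.

```javascript
// Hypothetical helper: wrap raw image bytes in a base64 data URL.
function toDataUrl(bytes, mimeType) {
  const base64 = Buffer.from(bytes).toString("base64");
  return `data:${mimeType};base64,${base64}`;
}

// Stand-in bytes (the 8-byte PNG signature). In practice, read a real file,
// for example a scanned page or an exported chart, with fs.readFileSync.
const pngSignature = [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a];
const dataUrl = toDataUrl(pngSignature, "image/png");

console.log(dataUrl.startsWith("data:image/png;base64,")); // true
```

The resulting string can be used wherever an image URL would otherwise go in the request.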

Long Multimodal Instruction Following

Users can provide extended sequences of images and text up to 131072 tokens for step-by-step visual tasks and reasoning chains.
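
When chaining many images and text segments, it can help to track how much of the 131072-token window is left. The sketch below is illustrative only: the per-part token counts are made-up placeholders (real counts depend on image resolution and the provider's tokenizer), and reserving part of the window for the reply is an assumption rather than something this page specifies. `remainingBudget` is a hypothetical helper name.

```javascript
// Context window stated on this page for Qwen2.5 VL 72B Instruct.
const CONTEXT_LIMIT_TOKENS = 131072;

// Hypothetical helper: tokens left after the given parts and an output reserve.
function remainingBudget(partTokenCounts, reservedForOutput = 0) {
  const used = partTokenCounts.reduce((sum, n) => sum + n, 0);
  return CONTEXT_LIMIT_TOKENS - used - reservedForOutput;
}

// Placeholder counts: one instruction, two images, one follow-up question.
const parts = [1200, 4000, 4000, 800];
const left = remainingBudget(parts, 2048);

console.log(left); // 119024
console.log(left >= 0 ? "fits" : "over budget"); // fits
```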

Strengths & limitations

Strengths

  • Strong integration of vision and language
  • Handles extended contexts reliably
  • Competitive visual reasoning for its size
  • Good multilingual support

Limitations

  • Can hallucinate on complex or ambiguous images
  • High compute demands due to model size
  • Video input may not be available through every hosted API, although the Qwen2.5-VL family is documented as supporting video

Cost calculator

Estimate what Qwen2.5 VL 72B Instruct would cost for your usage.

$0.00130 per request · $13 estimated / month

Based on Qwen2.5 VL 72B Instruct's $0.80/1M input · $1.00/1M output. Estimate only — actual cost varies by provider and caching.
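
The arithmetic behind such an estimate can be sketched as follows. The two rates come from the line above. The token counts and request volume are assumptions: the page does not state the calculator's inputs, and 1,000 input tokens, 500 output tokens and 10,000 requests per month are simply one combination that reproduces the figures shown.

```javascript
// Rates quoted on this page, in USD per 1M tokens.
const INPUT_USD_PER_MILLION = 0.8;
const OUTPUT_USD_PER_MILLION = 1.0;

// Cost of one request given its input and output token counts.
function costPerRequest(inputTokens, outputTokens) {
  return (
    (inputTokens * INPUT_USD_PER_MILLION +
      outputTokens * OUTPUT_USD_PER_MILLION) / 1e6
  );
}

// Monthly cost for a fixed request shape and volume.
function monthlyCost(inputTokens, outputTokens, requestsPerMonth) {
  return costPerRequest(inputTokens, outputTokens) * requestsPerMonth;
}

// Assumed usage: 1,000 input tokens, 500 output tokens, 10,000 requests.
console.log(costPerRequest(1000, 500).toFixed(5)); // "0.00130"
console.log(monthlyCost(1000, 500, 10000).toFixed(2)); // "13.00"
```

Image inputs are typically billed as input tokens too, so vision-heavy requests may cost more than a text-only estimate suggests.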

Quick start

OpenRouter's API is OpenAI-compatible, so most SDKs work once you swap the base URL. Only the model slug changes between models.

JavaScript · openai
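// Requires an ES module context (top-level await) and the `openai` npm package.
import OpenAI from "openai";

// Point the OpenAI SDK at OpenRouter's OpenAI-compatible endpoint.
const client = new OpenAI({
  baseURL: "https://openrouter.ai/api/v1",
  apiKey: process.env.OPENROUTER_API_KEY, // read the key from the environment
});

// Send a single text-only user message to the model.
const completion = await client.chat.completions.create({
  model: "qwen/qwen2.5-vl-72b-instruct",
  messages: [{ role: "user", content: "Hello!" }],
});

// Print the assistant's reply.
console.log(completion.choices[0].message.content);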

Model slug: qwen/qwen2.5-vl-72b-instruct
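
The quick start above sends text only. For a vision request, the message content can carry an image alongside the question. This is a minimal sketch, assuming the endpoint accepts OpenAI-style `image_url` content parts. `buildImageQuestion` is a hypothetical helper and the example.com address is a placeholder, not a real image.

```javascript
// Hypothetical helper: build a user message pairing a question with one image.
function buildImageQuestion(question, imageUrl) {
  return {
    role: "user",
    content: [
      { type: "text", text: question },
      { type: "image_url", image_url: { url: imageUrl } },
    ],
  };
}

// Placeholder URL; replace with a reachable image.
const message = buildImageQuestion(
  "What trend does this chart show?",
  "https://example.com/chart.png"
);

// Pass it to the client from the quick start:
// const completion = await client.chat.completions.create({
//   model: "qwen/qwen2.5-vl-72b-instruct",
//   messages: [message],
// });
console.log(JSON.stringify(message, null, 2));
```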

Editor's verdict

Our take on Qwen2.5 VL 72B Instruct

Qwen2.5 VL 72B Instruct is Alibaba Qwen's open-weight multimodal model with a 131K-token context window.

At $1.00 per 1M output tokens, it is mid-priced for its class.

Because it is open-weight, you can self-host it or call it through a hosted API.

Its main strengths are tight integration of vision and language and reliable handling of extended contexts.

Frequently asked questions

The model handles up to 131072 tokens of combined text and image input.
