Where can I access Qwen2.5 VL 72B Instruct?

It is released by Alibaba Qwen and available on major model hubs such as Hugging Face.

Is pricing information published for this model?

Pricing depends on the hosting provider and deployment option; official channels list current rates.

Does the model accept both images and text in one prompt?

Yes, it is designed for multimodal inputs that interleave images with text instructions.

What tasks is the 72B Instruct variant optimized for?

It focuses on visual question answering, document interpretation, and long-context multimodal reasoning.

Qwen2.5 VL 72B Instruct by Alibaba Qwen — Specs, Pricing, Benchmarks (2026)

About Qwen2.5 VL 72B Instruct

Built on the Qwen2.5 series, this model extends the language model with visual understanding. It accepts text and images in the same prompt and reasons over them jointly. The architecture supports high-resolution image inputs alongside long textual contexts.

Strengths include robust performance in visual question answering and document interpretation. It is well-suited for applications requiring detailed scene description or cross-modal inference. Developers leverage its open weights for fine-tuning on specialized multimodal datasets.

Capabilities

Multimodal text and image understanding

Long-context reasoning

Visual question answering

Document and chart interpretation

Image analysis and description

Multimodal instruction following

How Qwen2.5 VL 72B Instruct compares

Qwen2.5 VL 72B Instruct compared with other multimodal models on intelligence, speed, and price.

Price

USD per 1M output tokens · Lower is better · Qwen2.5 VL 72B Instruct ranks #36 of 139

Saba · $0.60
Qwen3 VL 235B A22B Instruct · $0.88
Codestral 2508 · $0.90
GLM 4.6V · $0.90
Qwen3.6 35B A3B · $1.0
Qwen3.5-35B-A3B · $1.0
Qwen2.5 VL 72B Instruct · $1.0
Qwen3.6 Flash · $1.1
Step 3.7 Flash · $1.1
MiniMax M3 · $1.2
GPT-5.4 Nano · $1.3
ERNIE 4.5 VL 424B A47B · $1.3
Qwen3.7 Plus · $1.3

Sources: Artificial Analysis (intelligence, speed) · OpenRouter (price).

Best for

Visual Question Answering on Complex Scenes

The model answers detailed questions about images by combining visual recognition with textual reasoning in a single pass.

Document and Chart Analysis

It extracts data, trends, and summaries from charts, tables, and multi-page documents that mix text and visuals.
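For multi-page documents stored locally, one common approach with OpenAI-compatible APIs is to embed each page image as a base64 data URL. The sketch below assumes the provider accepts data URLs in `image_url` content parts; the helper names `toDataUrl` and `documentToContent` are hypothetical and not part of any SDK.

```javascript
// Encode raw image bytes (Uint8Array or Buffer) as a base64 data URL.
function toDataUrl(bytes, mimeType = "image/png") {
  const base64 = Buffer.from(bytes).toString("base64");
  return `data:${mimeType};base64,${base64}`;
}

// Build the content array for one user message: the instruction first,
// then one image part per page, in page order.
// Assumption: the hosting provider accepts data URLs as image_url values.
function documentToContent(pages, instruction, mimeType = "image/png") {
  const imageParts = pages.map((bytes) => ({
    type: "image_url",
    image_url: { url: toDataUrl(bytes, mimeType) },
  }));
  return [{ type: "text", text: instruction }, ...imageParts];
}
```

The returned array can be used as the `content` of a single user message. Large page images add to both request size and token usage, so downscaling before encoding may be worthwhile.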

Long Multimodal Instruction Following

Users can provide extended sequences of images and text, up to 131,072 tokens in total, for step-by-step visual tasks and reasoning chains.
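Before sending a long sequence, it can help to check roughly whether it fits in the context window. The sketch below is an estimate only: it assumes about one visual token per 28×28-pixel block of each image, which is an approximation and ignores any resizing a provider applies. The function names are hypothetical.

```javascript
const CONTEXT_LIMIT = 131072; // token limit stated on this page
const BLOCK_SIDE = 28; // assumption: roughly one visual token per 28x28-pixel block

// Rough visual-token estimate for one image of the given pixel size.
function estimateImageTokens(width, height) {
  return Math.ceil(width / BLOCK_SIDE) * Math.ceil(height / BLOCK_SIDE);
}

// Sum text tokens, estimated image tokens, and tokens reserved for the reply,
// then compare the total against the context limit.
function fitsContext(textTokens, images, reservedForOutput = 0) {
  const imageTokens = images.reduce(
    (sum, { width, height }) => sum + estimateImageTokens(width, height),
    0
  );
  const total = textTokens + imageTokens + reservedForOutput;
  return { total, fits: total <= CONTEXT_LIMIT };
}
```

For example, `fitsContext(2000, [{ width: 1120, height: 840 }], 1000)` estimates 1,200 image tokens and a total of 4,200. Actual counts depend on the tokenizer and the provider's image preprocessing.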

Strengths & limitations

Strengths

+Strong integration of vision and language
+Handles extended contexts reliably
+Competitive visual reasoning for its size
+Good multilingual support

Limitations

–Can hallucinate on complex or ambiguous images
–High compute demands due to model size
–Hosted APIs may accept static images only, although the released weights also support video input

Cost calculator

Estimate what Qwen2.5 VL 72B Instruct would cost for your usage.

Input tokens / request · Output tokens / request · Requests / month

$0.00130 per request

$13 estimated / month

Based on Qwen2.5 VL 72B Instruct's $0.80/1M input · $1.00/1M output. Estimate only — actual cost varies by provider and caching.
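The arithmetic behind the estimate can be sketched as follows. The prices come from the line above; the workload of 1,000 input tokens, 500 output tokens, and 10,000 requests per month is an assumption, chosen because it reproduces the $0.00130 and $13 figures shown. The function name `estimateCost` is hypothetical.

```javascript
// Prices in USD per 1M tokens, as listed on this page.
const PRICE_PER_MILLION = { input: 0.8, output: 1.0 };

// Cost of one request, and of a month of identical requests.
function estimateCost({ inputTokens, outputTokens, requestsPerMonth }) {
  const perRequest =
    (inputTokens * PRICE_PER_MILLION.input +
      outputTokens * PRICE_PER_MILLION.output) /
    1_000_000;
  return { perRequest, perMonth: perRequest * requestsPerMonth };
}

// Assumed workload (not stated on the page).
const estimate = estimateCost({
  inputTokens: 1000,
  outputTokens: 500,
  requestsPerMonth: 10000,
});
console.log(estimate.perRequest.toFixed(5)); // 0.00130
console.log(estimate.perMonth.toFixed(2)); // 13.00
```

Image inputs are billed as tokens too, so requests with large images may cost more than a text-only estimate suggests.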

Quick start

OpenRouter's API is OpenAI-compatible — most SDKs work by just swapping the base URL. Only the model slug changes between models.

JavaScript · openai

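// The OpenAI SDK works against OpenRouter once the base URL is swapped.
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://openrouter.ai/api/v1",
  apiKey: process.env.OPENROUTER_API_KEY, // read the key from the environment
});

// Top-level await requires an ES module (for example, a .mjs file).
const completion = await client.chat.completions.create({
  model: "qwen/qwen2.5-vl-72b-instruct",
  messages: [{ role: "user", content: "Hello!" }],
});

// Print the text of the first returned choice.
console.log(completion.choices[0].message.content);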

Model slug: qwen/qwen2.5-vl-72b-instruct
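Since this is a vision model, a request that pairs an image with a question is the more typical use. The sketch below uses the OpenAI-style content-parts format and plain `fetch` (built into Node 18+) so it has no dependencies. The helper names `buildVisionRequest` and `askAboutImage` are hypothetical, and any image URL you pass in is your own.

```javascript
const OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions";

// Build a chat request whose single user message mixes text and one image.
function buildVisionRequest(question, imageUrl) {
  return {
    model: "qwen/qwen2.5-vl-72b-instruct",
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: question },
          { type: "image_url", image_url: { url: imageUrl } },
        ],
      },
    ],
  };
}

// Send the request and return the text of the first choice.
async function askAboutImage(
  question,
  imageUrl,
  apiKey = process.env.OPENROUTER_API_KEY
) {
  const response = await fetch(OPENROUTER_URL, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(buildVisionRequest(question, imageUrl)),
  });
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  const data = await response.json();
  return data.choices[0].message.content;
}
```

A call might look like `askAboutImage("What trend does this chart show?", imageUrl).then(console.log)`. The same `messages` array should also work with the SDK client shown above, passed to `client.chat.completions.create`.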

Editor's verdict

Our take on Qwen2.5 VL 72B Instruct

Qwen2.5 VL 72B Instruct is Alibaba Qwen's open-weight multimodal model with a 131K-token context window.

At $1.00 per 1M output tokens, it is mid-priced for its class, ranking #36 of 139 multimodal models on output price.

Because the weights are open, you can self-host it or call it through a hosted API.

Its main strengths are strong integration of vision and language and reliable handling of extended contexts.

Frequently asked questions

What context length does Qwen2.5 VL 72B Instruct support?

The model handles up to 131,072 tokens of combined text and image input.

Other Qwen models

Sibling versions in the Qwen family from Alibaba Qwen.

Qwen3.7 Max

Alibaba Qwen · Language Models

Verified

Qwen3.7 Max processes up to one million tokens in a single pass.

Open · II 56.6 · 1000K ctx · $3.75/1M out

Qwen3.7 Plus

Alibaba Qwen · Multimodal

Verified

Open-weight multimodal model for million-token text and image tasks.

Open · II 53.3 · 1000K ctx · $1.28/1M out

Qwen3.6 Max Preview

Alibaba Qwen · Language Models

Verified

Open-weight LLM optimized for long-context text reasoning and analysis.

Open · II 51.8 · 262K ctx · $6.24/1M out

Qwen3.6 27B

Alibaba Qwen · Multimodal

Verified

Multimodal model for long-context text, image, and video processing.

Open · II 45.8 · 262K ctx · $3.17/1M out

Qwen3.6 35B A3B

Alibaba Qwen · Multimodal

Verified

Multimodal model for long-context text, image, and video analysis.

Open · II 43.5 · 262K ctx · $1.00/1M out

Qwen3.6 Plus

Alibaba Qwen · Multimodal

Verified

Qwen3.6 Plus handles long multimodal sequences across text, images, and video.

Open · 1000K ctx · $1.95/1M out

Similar models

Other multimodal models worth comparing.

Gemini 2.5 Flash Lite

Google · Multimodal

Verified

Google's fast, lightweight multimodal model for text, image, audio, and video tasks.

Closed · 1049K ctx · $0.40/1M out

GPT-5.1

OpenAI · Multimodal

Verified

OpenAI's multimodal model for large-scale image, text, and file processing.

Closed · 400K ctx · $10.00/1M out

Gemini 2.5 Pro Preview 05-06

Google · Multimodal

Verified

Google's multimodal model processes text, images, audio, video and files over 1M tokens.

Closed · 1049K ctx · $10.00/1M out
