
GLM 4.6V

Verified

Multimodal model for unified image, text, and video processing.

Z.AI · Multimodal · Closed · Vision
Updated 2026-06-14

About GLM 4.6V

GLM 4.6V is a closed-weight multimodal model from Z.AI that processes images, text, and video in a single system. It supports long contexts of up to 131,072 tokens.

Its main strength is cross-modal understanding within one model; the weights are not openly distributed. Z.AI positions it as delivering consistent performance across diverse input types and targets users who need reliable multimodal capabilities.

Common applications include video analysis, image captioning, and text generation from mixed media. Researchers and developers use it for tasks that need extended context, and it fits professional workflows built on proprietary, API-based model access.

Capabilities

Multimodal understanding (image, text, video)
Long-context reasoning
Visual and video content analysis
Cross-modal instruction following
Text generation and reasoning

How GLM 4.6V compares

GLM 4.6V compared with other multimodal models on intelligence, speed, and price.

Price

USD per 1M output tokens · Lower is better · GLM 4.6V ranks #12 of 63

  • Voxtral Small 24B 2507: $0.30
  • Gemini 2.5 Flash Lite Preview 09-2025: $0.40
  • Seed-2.0-Mini: $0.40
  • Qwen3 VL 32B Instruct: $0.42
  • Mistral Small 4: $0.60
  • Qwen3 VL 235B A22B Instruct: $0.88
  • GLM 4.6V: $0.90 (this model)
  • Qwen3.6 35B A3B: $0.97
  • Qwen3.6 Flash: $1.10
  • Step 3.7 Flash: $1.10
  • MiniMax M3: $1.20
  • Qwen3.7 Plus: $1.30
  • Gemini 3.1 Flash Lite: $1.50

Sources: Artificial Analysis (intelligence, speed) · OpenRouter (price).

Best for

Long Video Content Analysis

Processes extended video inputs with multimodal understanding and long-context reasoning to deliver detailed breakdowns and insights from lengthy footage.

Cross-Modal Instruction Tasks

Follows complex instructions that combine images, video, and text to produce accurate analyses and generated responses across modalities.

Visual Document Reasoning

Applies visual and text understanding over large contexts to handle multi-page documents containing charts, images, and supporting text.

Strengths & limitations

Strengths

  • Native support for image, text, and video inputs
  • Large context window for extended documents or conversations
  • Unified multimodal processing in a single model

Limitations

  • Video handling can be computationally intensive
  • Performance varies across languages and domains
  • Multimodal models may inherit vision or language biases

Cost calculator

Estimate what GLM 4.6V would cost for your usage.

Estimated cost: $0.00075 per request, $7.50 per month.

Based on GLM 4.6V's pricing of $0.30 per 1M input tokens and $0.90 per 1M output tokens. This is an estimate only; actual cost varies by provider and caching.
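The page does not show the calculator's inputs. One usage profile that reproduces both figures is 1,000 input tokens and 500 output tokens per request at 10,000 requests per month; those numbers are an assumption chosen for illustration, not the calculator's actual settings. A minimal sketch of the arithmetic, with hypothetical helper names:

```javascript
// Prices from this page, in USD per 1M tokens (they may change; check your provider).
const INPUT_USD_PER_M = 0.30;
const OUTPUT_USD_PER_M = 0.90;

// Cost of one request: input and output tokens are billed separately, per million tokens.
function costPerRequest(inputTokens, outputTokens) {
  return (inputTokens * INPUT_USD_PER_M + outputTokens * OUTPUT_USD_PER_M) / 1e6;
}

// Monthly cost: per-request cost multiplied by the number of requests per month.
function costPerMonth(inputTokens, outputTokens, requestsPerMonth) {
  return costPerRequest(inputTokens, outputTokens) * requestsPerMonth;
}

// Assumed usage profile: 1,000 input and 500 output tokens, 10,000 requests per month.
const perRequest = costPerRequest(1000, 500);
const perMonth = costPerMonth(1000, 500, 10000);

console.log(perRequest.toFixed(5)); // "0.00075"
console.log(perMonth.toFixed(2));   // "7.50"
```

Caching discounts and provider-specific pricing are not modeled, which is why real bills can differ from this estimate.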

Quick start

OpenRouter's API is OpenAI-compatible, so most OpenAI SDKs work after swapping in OpenRouter's base URL and API key. Only the model slug changes between models.

JavaScript · openai
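import OpenAI from "openai";

// Point the official OpenAI SDK at OpenRouter's OpenAI-compatible endpoint.
const client = new OpenAI({
  baseURL: "https://openrouter.ai/api/v1",
  apiKey: process.env.OPENROUTER_API_KEY, // an OpenRouter key, not an OpenAI key
});

// Top-level await requires an ES module (e.g. a .mjs file or "type": "module").
const completion = await client.chat.completions.create({
  model: "z-ai/glm-4.6v",
  messages: [{ role: "user", content: "Hello!" }],
});

// Print the text of the first returned choice.
console.log(completion.choices[0].message.content);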

Model slug: z-ai/glm-4.6v
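Because the endpoint is OpenAI-compatible, images can be sent alongside text as OpenAI-style content parts (`type: "text"` and `type: "image_url"`). The sketch below assumes OpenRouter accepts that shape for this model, which is the usual way OpenAI-compatible vision requests are written; `buildVisionMessages` is a hypothetical helper and the image URLs are placeholders. It builds a request for a two-page document, in the spirit of the visual document reasoning use case. Video input is not covered, since its request format is not described on this page.

```javascript
// Hypothetical helper: builds one OpenAI-style user message that pairs a text
// prompt with one or more images (plain URLs or base64 data URLs).
function buildVisionMessages(prompt, imageUrls) {
  const content = [{ type: "text", text: prompt }];
  for (const url of imageUrls) {
    content.push({ type: "image_url", image_url: { url } });
  }
  return [{ role: "user", content }];
}

// Request body for the same endpoint as the quick start; only `messages` differs.
const request = {
  model: "z-ai/glm-4.6v",
  messages: buildVisionMessages(
    "Summarize the chart on each page.",
    ["https://example.com/page1.png", "https://example.com/page2.png"], // placeholder URLs
  ),
};

console.log(JSON.stringify(request, null, 2));
```

To send it, pass the object to `client.chat.completions.create(request)` using the client configured in the quick start.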

Editor's verdict

Our take on GLM 4.6V

GLM 4.6V is Z.AI's proprietary multimodal model with a 131K-token context window.

At $0.90 per 1M output tokens, it ranks #12 of 63 multimodal models on output price, making it one of the cheaper options in its class.

It is available through Z.AI's API and aggregators like OpenRouter.

Its main strengths are native support for image, text, and video inputs and a large context window for extended documents or conversations.


Frequently asked questions

What context window does GLM 4.6V support?

GLM 4.6V provides a context window of 131,072 tokens.
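A quick pre-flight check can catch requests that would overflow the window. This sketch assumes that prompt tokens and the requested completion length share the 131,072-token window, and that token counts come from a tokenizer or from the usage figures the API reports; images and video also consume tokens, at rates that depend on the provider. `fitsContext` and `remainingOutputBudget` are hypothetical helpers.

```javascript
// Context window stated on this page, in tokens.
const CONTEXT_WINDOW = 131072;

// Assumption: prompt tokens and requested completion tokens share one window.
function fitsContext(promptTokens, maxOutputTokens, contextWindow = CONTEXT_WINDOW) {
  return promptTokens + maxOutputTokens <= contextWindow;
}

// Largest completion budget left after a prompt of the given size (never negative).
function remainingOutputBudget(promptTokens, contextWindow = CONTEXT_WINDOW) {
  return Math.max(0, contextWindow - promptTokens);
}

console.log(fitsContext(120000, 8000));     // true  (128,000 <= 131,072)
console.log(fitsContext(128000, 4096));     // false (132,096 > 131,072)
console.log(remainingOutputBudget(100000)); // 31072
```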


