Is there pricing information for UI-TARS 7B?

OpenRouter lists UI-TARS 7B at $0.10 per 1M input tokens and $0.20 per 1M output tokens. Actual cost varies by provider and caching.

How can users access UI-TARS 7B?

Access is available through ByteDance's API and through aggregators such as OpenRouter, where the model slug is bytedance/ui-tars-1.5-7b. The model is intended for research and application development.

What are the primary use cases for this model?

It is optimized for GUI element recognition, screenshot understanding, and generating text outputs for visual agent planning tasks.

UI-TARS 7B

Verified

ByteDance multimodal model for integrated image and text processing.

ByteDance · Multimodal · Closed

Vision


Updated 2026-06-15

About UI-TARS 7B

UI-TARS 7B uses a multimodal architecture that processes both images and text in a single forward pass. Its 128,000-token context window lets it handle lengthy documents paired with visual elements. The model is closed-source and distributed under ByteDance's control.

Strengths include unified understanding of visual scenes and accompanying text without requiring separate encoders. This design reduces pipeline complexity for developers working on image-text workflows. Typical usage covers document analysis, visual question answering, and content moderation pipelines.

Users integrate the model through ByteDance APIs for production applications that need synchronized image and text reasoning. Its closed nature ensures consistent updates while limiting direct fine-tuning by external parties.

Capabilities

Multimodal image-text understanding

User interface and screenshot analysis

GUI element recognition and interaction

Long-context multimodal reasoning

Visual task planning for agents

Text generation grounded in visual inputs

How UI-TARS 7B compares

UI-TARS 7B vs other multimodal models on intelligence, speed and price.

Price

USD per 1M output tokens · Lower is better · UI-TARS 7B ranks #11 of 139

Qwen3.5-9B: $0.15
Gemma 3 12B: $0.15
Gemma 3 27B: $0.16
Llama Guard 4 12B: $0.18
Ministral 3 14B 2512: $0.20
Mistral Small 3.2 24B: $0.20
UI-TARS 7B: $0.20
Qwen3.5-Flash: $0.26
MiMo-V2.5: $0.28
Llama 4 Scout: $0.30
Seed 1.6 Flash: $0.30
Voxtral Small 24B 2507: $0.30
Gemma 4 26B A4B: $0.33

Sources: Artificial Analysis (intelligence, speed) · OpenRouter (price).

Best for

GUI Automation Scripting

Analyzes screenshots to identify interface elements and generates grounded interaction steps for building reliable automation scripts in desktop or web applications.

Visual Agent Development

Performs long-context multimodal reasoning to create step-by-step task plans that let agents navigate and operate software interfaces from visual input alone.

UI Screenshot Analysis

Provides detailed text descriptions and element recognition for user interface layouts, supporting design reviews or accessibility audits directly from images.
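One way to turn the model's text output into automation steps is to prompt it for a structured reply and validate that reply before acting on it. The sketch below assumes you have asked the model to answer with a JSON array of steps. The step schema (`action`, `target`, `text`), the allowed action names, and the function name `parseActionSteps` are all hypothetical and chosen for illustration. They are not UI-TARS's native output format.

```javascript
// Hypothetical action vocabulary; adjust to whatever your own prompt requests.
const ALLOWED_ACTIONS = new Set(["click", "type", "scroll"]);

// Parse and validate a JSON array of steps returned by the model.
// Throws if the reply is not valid JSON or a step does not match the assumed schema.
function parseActionSteps(replyText) {
  const steps = JSON.parse(replyText);
  if (!Array.isArray(steps)) {
    throw new Error("Expected a JSON array of steps");
  }
  return steps.map((step, index) => {
    if (step === null || typeof step !== "object") {
      throw new Error(`Step ${index}: expected an object`);
    }
    if (!ALLOWED_ACTIONS.has(step.action)) {
      throw new Error(`Step ${index}: unknown action "${step.action}"`);
    }
    if (typeof step.target !== "string" || step.target.length === 0) {
      throw new Error(`Step ${index}: missing target`);
    }
    // `text` is optional and only meaningful for "type" steps.
    return { action: step.action, target: step.target, text: step.text ?? null };
  });
}
```

Validating before execution keeps a malformed or unexpected reply from driving the automation script.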

Strengths & limitations

Strengths

+Specialized for UI/GUI tasks
+Efficient 7B scale with practical deployment
+Strong handling of extended 128k context
+Native support for image + text inputs

Limitations

–Narrow specialization may limit general-purpose use
–Smaller model size constrains complex reasoning depth
–Performance tied to UI-style visual domains

Cost calculator

Estimate what UI-TARS 7B would cost for your usage.

Inputs: input tokens per request, output tokens per request, requests per month.

Outputs: cost per request (shown as $0.00020 for the page's default inputs) and estimated cost per month.

Based on UI-TARS 7B's $0.10/1M input · $0.20/1M output. Estimate only — actual cost varies by provider and caching.
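The arithmetic behind this estimate can be sketched as a small function. The rates are the ones quoted above. The function name `estimateCost` and the example inputs (1,000 input tokens, 500 output tokens, 10,000 requests per month) are assumptions chosen for illustration. The token counts were picked because they happen to reproduce the $0.00020 per-request figure, and the page's actual defaults are not known.

```javascript
// Rates quoted on this page, in USD per 1M tokens. They may differ by provider.
const INPUT_USD_PER_MILLION = 0.10;
const OUTPUT_USD_PER_MILLION = 0.20;

// Estimate the per-request and monthly cost from token counts and request volume.
function estimateCost({ inputTokens, outputTokens, requestsPerMonth }) {
  const perRequest =
    (inputTokens * INPUT_USD_PER_MILLION + outputTokens * OUTPUT_USD_PER_MILLION) /
    1_000_000;
  return { perRequest, perMonth: perRequest * requestsPerMonth };
}

// Assumed example inputs, not values taken from the page.
const example = estimateCost({
  inputTokens: 1000,
  outputTokens: 500,
  requestsPerMonth: 10000,
});
console.log(example.perRequest.toFixed(5)); // "0.00020"
console.log(example.perMonth.toFixed(2)); // "2.00"
```

This ignores caching discounts and any provider-specific image token accounting, so treat the result as a rough upper-level estimate.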

Quick start

OpenRouter's API is OpenAI-compatible — most SDKs work by just swapping the base URL. Only the model slug changes between models.

JavaScript · openai

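import OpenAI from "openai";

// Point the OpenAI SDK at OpenRouter's OpenAI-compatible endpoint.
const client = new OpenAI({
  baseURL: "https://openrouter.ai/api/v1",
  apiKey: process.env.OPENROUTER_API_KEY,
});

// Top-level await requires an ES module (for example, a .mjs file).
const completion = await client.chat.completions.create({
  model: "bytedance/ui-tars-1.5-7b",
  messages: [{ role: "user", content: "Hello!" }],
});

// Print the text of the first returned choice.
console.log(completion.choices[0].message.content);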

Model slug: bytedance/ui-tars-1.5-7b
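Because the model is aimed at screenshot understanding, a request will usually pair an image with an instruction rather than send text alone. The sketch below builds such a message using the OpenAI-style `image_url` content part with a base64 data URL. The helper names `buildScreenshotMessages` and `describeScreenshot` are hypothetical, and it is an assumption that your provider accepts inline base64 PNG images for this model.

```javascript
// Build an OpenAI-style chat message that pairs an instruction with a screenshot.
// `base64Png` is the PNG file's contents encoded as base64 (without a data: prefix).
function buildScreenshotMessages(base64Png, instruction) {
  return [
    {
      role: "user",
      content: [
        { type: "text", text: instruction },
        {
          type: "image_url",
          image_url: { url: `data:image/png;base64,${base64Png}` },
        },
      ],
    },
  ];
}

// `client` is an OpenAI SDK instance configured as in the Quick start snippet.
async function describeScreenshot(client, base64Png, instruction) {
  const completion = await client.chat.completions.create({
    model: "bytedance/ui-tars-1.5-7b",
    messages: buildScreenshotMessages(base64Png, instruction),
  });
  return completion.choices[0].message.content;
}
```

In Node.js, the base64 string could come from something like `fs.readFileSync("screenshot.png").toString("base64")`, where the file name is a placeholder.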

Editor's verdict

Our take on UI-TARS 7B

UI-TARS 7B is ByteDance's proprietary multimodal model with a 128K-token context window.

At $0.20 per 1M output tokens, it is very cost-efficient for its class.

It is available through ByteDance's API and aggregators like OpenRouter.

It is best suited to specialized UI/GUI tasks that benefit from an efficient 7B-scale model with practical deployment.


Frequently asked questions

What is the context window of UI-TARS 7B?

The model handles up to 128,000 tokens of context, which lets it process extended multimodal sequences.
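A simple pre-flight check against that limit might look like the sketch below. It assumes the prompt and the generated output share the same 128,000-token window, and that you already have token counts from elsewhere. How screenshots are counted as tokens is provider-specific and is not covered here. The function name `fitsContextWindow` is hypothetical.

```javascript
// Context window size quoted on this page.
const CONTEXT_WINDOW_TOKENS = 128000;

// Return true when the prompt plus the requested output budget fits in the window.
// Assumes input and output tokens draw from one shared budget.
function fitsContextWindow(promptTokens, maxOutputTokens) {
  return promptTokens + maxOutputTokens <= CONTEXT_WINDOW_TOKENS;
}

console.log(fitsContextWindow(120000, 4000)); // true
console.log(fitsContextWindow(127000, 2000)); // false
```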
