---
title: vision - Image Understanding
description: Analyze image content (recognition, description, OCR, etc.)
---
Analyze local images or image URLs using Vision API. Supports content description, text extraction (OCR), object recognition, and more.
## Model Selection
The vision tool uses a multi-level auto-selection strategy with automatic fallback — no manual configuration required:
1. **Main model** — uses the currently configured main model for image recognition (must be a multimodal model)
2. **Other configured models** — auto-discovers other multimodal models with configured API keys as alternatives
If the current provider fails, the tool automatically tries the next one until it succeeds or all fail.
### Supported Models
| Provider | Vision Model | Notes |
| --- | --- | --- |
| OpenAI / Compatible | Main model | All OpenAI-protocol-compatible multimodal models |
| Qwen (DashScope) | Main model | e.g. qwen3.6-plus, etc. |
| Claude | Main model | Anthropic native image format |
| Gemini | Main model | inlineData format |
| Doubao | Main model | doubao-seed-2-0 series natively supported |
| Kimi (Moonshot) | Main model | kimi-k2.6, kimi-k2.5 natively supported |
| ERNIE | Main model | Defaults to the multimodal main model (e.g. `ernie-5.1`); falls back to `ernie-4.5-turbo-vl` when the main model is not multimodal |
| ZhipuAI | glm-5v-turbo | Always uses the dedicated vision model |
| MiniMax | MiniMax-Text-01 | Always uses the dedicated vision model |
ZhipuAI and MiniMax text models do not support image understanding, so their dedicated vision models are always used automatically.
> When `use_linkai=true`, LinkAI's multimodal model is used by default.
## Custom Configuration
To specify the model used by Vision, configure it in `config.json`, for example:
```json
{
"tools": {
"vision": {
"model": "gpt-4.1"
}
}
}
```
The specified model is **used first**, and the tool automatically routes to the corresponding provider based on the model name; on failure, it falls back to other configured providers.
In most cases no configuration is needed — the tool works automatically as long as the main model supports multimodal input or any vision-capable API key is configured.
## Parameters
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `image` | string | Yes | Local file path or HTTP(S) image URL |
| `question` | string | Yes | Question to ask about the image |
Supported image formats: jpg, jpeg, png, gif, webp
## Use Cases
- Describe image content
- Extract text from images (OCR)
- Identify objects, colors, scenes
- Analyze screenshots and scanned documents
Images larger than 1MB are automatically compressed before upload. All images (including remote URLs) are converted to base64 for transmission to ensure compatibility with all model backends.