---
title: vision - Image Analysis
description: Analyze image content (recognition, description, OCR, etc.)
---

Analyze local images or image URLs using the Vision API. Supports content description, text extraction (OCR), object recognition, and more.

## Model Selection

The vision tool uses a multi-level auto-selection strategy with automatic fallback, so no manual configuration is required:

1. **Main model**: uses the currently configured main model for image recognition (zero extra cost)
2. **Other configured models**: auto-discovers other models with configured API keys as alternatives
3. **OpenAI**: uses `open_ai_api_key` to call gpt-4.1-mini
4. **LinkAI**: uses `linkai_api_key` to call the LinkAI vision service

When `use_linkai=true`, LinkAI is promoted to the highest priority. If the current provider fails, the tool automatically tries the next one until one succeeds or all fail.

### Supported Models

| Vendor | Vision Model | Notes |
| --- | --- | --- |
| OpenAI / Compatible | Main model | All OpenAI-compatible multimodal models |
| Baidu Qianfan | Main model | Multimodal main models (e.g. `ernie-5.1`) handle images directly; falls back to `ernie-4.5-turbo-vl` for text-only main models |
| Qwen (DashScope) | Main model | Via MultiModalConversation API |
| Claude | Main model | Anthropic native image format |
| Gemini | Main model | inlineData format |
| Doubao | Main model | doubao-seed-2-0 series natively supported |
| Kimi (Moonshot) | Main model | kimi-k2.6, kimi-k2.5 natively supported |
| ZhipuAI | glm-5v-turbo | Always uses dedicated vision model |
| MiniMax | MiniMax-Text-01 | Always uses dedicated vision model |

ZhipuAI and MiniMax text models do not support image understanding, so their dedicated vision models are always used automatically.
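
The provider order and fallback behavior described under Model Selection can be illustrated with a minimal sketch. This is not the tool's actual implementation: the `Provider` type, the function names `order_providers` and `analyze_with_fallback`, and the provider name strings are all hypothetical, chosen only to show the "try in priority order, promote LinkAI when `use_linkai` is set" idea.

```python
from typing import Callable, List, Tuple

# Hypothetical provider type: a name plus a callable that either returns
# an answer string or raises an exception on failure.
Provider = Tuple[str, Callable[[str, str], str]]


def order_providers(providers: List[Provider], use_linkai: bool = False) -> List[Provider]:
    """Return providers in priority order, moving LinkAI first when use_linkai is set."""
    if not use_linkai:
        return list(providers)
    linkai = [p for p in providers if p[0] == "linkai"]
    others = [p for p in providers if p[0] != "linkai"]
    return linkai + others


def analyze_with_fallback(providers: List[Provider], image: str, question: str) -> Tuple[str, str]:
    """Try each provider in turn; return (provider_name, answer) from the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(image, question)
        except Exception as exc:  # any failure moves on to the next provider
            errors.append(f"{name}: {exc}")
    # Every provider failed: surface all collected errors at once.
    raise RuntimeError("all vision providers failed: " + "; ".join(errors))
```

In this sketch a failure from one provider is recorded and the next one is tried, so the caller only sees an error when every provider in the chain has failed.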

## Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `image` | string | Yes | Local file path or HTTP(S) image URL |
| `question` | string | Yes | Question to ask about the image |

Supported image formats: jpg, jpeg, png, gif, webp

## Custom Configuration

To specify a particular model for the vision tool, add to `config.json`:

```json
{
  "tools": {
    "vision": {
      "model": "ernie-4.5-turbo-vl"
    }
  }
}
```

In most cases no configuration is needed. The tool works automatically as long as the main model supports multimodal input or any vision-capable API key is configured.

## Use Cases

- Describe image content
- Extract text from images (OCR)
- Identify objects, colors, scenes
- Analyze screenshots and scanned documents

Images larger than 1MB are automatically compressed (max edge 1536px). All images (including remote URLs) are converted to base64 for transmission to ensure compatibility with all model backends.
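
The size threshold, edge limit, and base64 conversion mentioned above can be sketched with the Python standard library. This is only an illustration under stated assumptions: the function names are hypothetical, 1MB is assumed to mean 1024 × 1024 bytes, and the data-URL output format is an assumption rather than the tool's documented wire format. Actual pixel resizing would need an imaging library and is left out; the sketch only computes the target dimensions.

```python
import base64
import mimetypes
from typing import Tuple

MAX_BYTES = 1 * 1024 * 1024  # assumed meaning of the 1MB threshold
MAX_EDGE = 1536              # longest edge allowed after compression


def needs_compression(data: bytes) -> bool:
    """Return True when the raw image exceeds the size threshold."""
    return len(data) > MAX_BYTES


def scaled_size(width: int, height: int, max_edge: int = MAX_EDGE) -> Tuple[int, int]:
    """Compute dimensions that keep the aspect ratio with the longest edge <= max_edge."""
    longest = max(width, height)
    if longest <= max_edge:
        return width, height
    ratio = max_edge / longest
    return max(1, round(width * ratio)), max(1, round(height * ratio))


def to_data_url(data: bytes, filename: str) -> str:
    """Encode image bytes as a base64 data URL (assumed format; falls back to image/jpeg)."""
    mime, _ = mimetypes.guess_type(filename)
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime or 'image/jpeg'};base64,{encoded}"
```

For example, a 3072×2048 image would be scaled to 1536×1024 under these assumptions, while an 800×600 image would keep its original dimensions.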