mirror of
https://github.com/zhayujie/chatgpt-on-wechat.git
synced 2026-06-02 00:57:41 +08:00
---
title: vision - Image Understanding
description: Analyze image content (recognition, description, OCR, etc.)
---

Analyze local images or image URLs using the Vision API. The tool supports content description, text extraction (OCR), object recognition, and more.

## Model Selection

The vision tool picks a model through a multi-level auto-selection strategy with automatic fallback, so no manual configuration is required:

1. **Main model**: the currently configured main model handles image recognition (it must be a multimodal model)
2. **Other configured models**: other multimodal models with configured API keys are auto-discovered and used as alternatives

If the current provider fails, the tool automatically tries the next one until one succeeds or all have failed.

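The fallback behaviour described above can be pictured roughly as follows. This is a minimal sketch rather than the tool's actual code: the `Provider` type and the `analyze_with_fallback` function are hypothetical names introduced for illustration, and each provider is assumed to be a plain callable that raises an exception on failure.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical type: a provider is a (name, callable) pair. The callable takes
# an image reference and a question and returns the model's answer as text.
Provider = Tuple[str, Callable[[str, str], str]]


def analyze_with_fallback(providers: Sequence[Provider], image: str, question: str) -> str:
    """Try each provider in order and return the first successful answer."""
    errors = []
    for name, call in providers:
        try:
            return call(image, question)
        except Exception as exc:  # a real tool would likely catch narrower errors
            errors.append(f"{name}: {exc}")
    # Reached only when every provider raised (or the list was empty).
    raise RuntimeError("all vision providers failed: " + "; ".join(errors))
```
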
### Supported Models

| Provider | Vision Model | Notes |
| --- | --- | --- |
| OpenAI / Compatible | Main model | All OpenAI-protocol-compatible multimodal models |
| Qwen (DashScope) | Main model | e.g. qwen3.6-plus |
| Claude | Main model | Anthropic native image format |
| Gemini | Main model | inlineData format |
| Doubao | Main model | doubao-seed-2-0 series natively supported |
| Kimi (Moonshot) | Main model | kimi-k2.6, kimi-k2.5 natively supported |
| ERNIE | Main model | Defaults to the multimodal main model (e.g. `ernie-5.1`); falls back to `ernie-4.5-turbo-vl` when the main model is not multimodal |
| ZhipuAI | glm-5v-turbo | Always uses the dedicated vision model |
| MiniMax | MiniMax-Text-01 | Always uses the dedicated vision model |

<Note>
ZhipuAI and MiniMax text models do not support image understanding, so their dedicated vision models are always used automatically.
</Note>

> When `use_linkai=true`, LinkAI's multimodal model is used by default.

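The per-provider rules in the Supported Models table can be summarised as a small lookup. The sketch below is illustrative only: the provider keys (`"zhipuai"`, `"minimax"`, `"ernie"`), the `main_is_multimodal` flag, and the function name are assumptions made for the example, not identifiers taken from the project.

```python
# Providers that, per the Supported Models table, always use a dedicated vision model.
# The lowercase keys are hypothetical identifiers chosen for this sketch.
DEDICATED_VISION_MODELS = {
    "zhipuai": "glm-5v-turbo",
    "minimax": "MiniMax-Text-01",
}

# ERNIE falls back to this model when the main model is not multimodal.
ERNIE_FALLBACK = "ernie-4.5-turbo-vl"


def resolve_vision_model(provider: str, main_model: str, main_is_multimodal: bool) -> str:
    """Pick the model name a provider would use for an image request."""
    key = provider.lower()
    if key in DEDICATED_VISION_MODELS:
        return DEDICATED_VISION_MODELS[key]
    if key == "ernie" and not main_is_multimodal:
        return ERNIE_FALLBACK
    # Every other provider in the table uses the main model directly.
    return main_model
```
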
## Custom Configuration

To pin the model used by the vision tool, set it in `config.json`, for example:

```json
{
  "tools": {
    "vision": {
      "model": "gpt-4.1"
    }
  }
}
```

The specified model is **tried first**, and the tool routes the request to the matching provider based on the model name; on failure, it falls back to the other configured providers.

In most cases no configuration is needed: the tool works automatically as long as the main model supports multimodal input or any vision-capable API key is configured.

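One way to read the "used first" rule is that the configured model is simply moved to the front of the candidate list. The sketch below assumes the `tools.vision.model` key shown in the `config.json` example; the function names and the idea of a flat list of candidate model names are hypothetical simplifications.

```python
import json
from typing import List, Optional


def configured_vision_model(config_text: str) -> Optional[str]:
    """Read tools.vision.model from a config.json document, if present."""
    config = json.loads(config_text)
    tools = config.get("tools") or {}
    vision = tools.get("vision") or {}
    return vision.get("model")


def candidate_order(config_text: str, auto_candidates: List[str]) -> List[str]:
    """Put the configured model first; keep auto-discovered models as fallbacks."""
    preferred = configured_vision_model(config_text)
    if not preferred:
        return list(auto_candidates)
    return [preferred] + [m for m in auto_candidates if m != preferred]
```

With no `tools.vision.model` entry, the auto-discovered order is returned unchanged, which matches the statement that most setups need no configuration.
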
## Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `image` | string | Yes | Local file path or HTTP(S) image URL |
| `question` | string | Yes | Question to ask about the image |

Supported image formats: jpg, jpeg, png, gif, webp

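As a rough illustration of the `image` parameter, the sketch below tells URLs apart from local paths and checks the extension against the supported formats. Whether the tool itself validates by file extension is an assumption (it may inspect the content instead), and `classify_image_argument` is a hypothetical helper name.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Formats listed under "Supported image formats".
SUPPORTED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".webp"}


def classify_image_argument(image: str) -> str:
    """Return "url" or "path"; raise ValueError for an unsupported extension."""
    parsed = urlparse(image)
    is_url = parsed.scheme in ("http", "https")
    # For URLs, look only at the path component so query strings are ignored.
    name = parsed.path if is_url else image
    suffix = PurePosixPath(name).suffix.lower()
    if suffix not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"unsupported image format: {suffix or '(none)'}")
    return "url" if is_url else "path"
```
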
## Use Cases

- Describe image content
- Extract text from images (OCR)
- Identify objects, colors, scenes
- Analyze screenshots and scanned documents

<Note>
Images larger than 1MB are automatically compressed before upload. All images (including remote URLs) are converted to base64 for transmission to ensure compatibility with all model backends.
</Note>

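The size check and base64 conversion mentioned in the note might look roughly like this. It is a sketch under stated assumptions: "1MB" is read as 1024 × 1024 bytes, the compression step itself is left out, and the data-URL form is only one possible encoding (some backends take the raw base64 string and the media type as separate fields). The helper names are hypothetical.

```python
import base64

# Assumption: the "1MB" threshold is interpreted as 1 MiB.
COMPRESSION_THRESHOLD_BYTES = 1024 * 1024


def needs_compression(image_bytes: bytes) -> bool:
    """True when the image exceeds the size threshold and should be compressed."""
    return len(image_bytes) > COMPRESSION_THRESHOLD_BYTES


def to_data_url(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"
```
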