feat(vision): prioritize main model for image recognition with multi-provider fallback

- Add call_vision method to all bot implementations (DashScope, Claude, Gemini, ZhipuAI, MiniMax, Doubao, Moonshot, OpenAICompatibleBot) using each vendor's native multimodal API format
- Remove call_with_tools/call_vision from Bot base class to fix MRO shadowing issue with OpenAICompatibleBot mixin
- Refactor vision tool provider resolution: MainModel → other configured models (auto-discovered) → OpenAI → LinkAI, with automatic fallback
- Return actual model name used in call_vision responses
- Sync config.json API keys to .env bidirectionally on startup
- Fix bot instance cache to detect bot_type/use_linkai config changes
- Add SSE reconnection support for web console
- Preserve image path hints in Gemini text for correct vision tool calls
- Update docs/tools/vision.mdx
2026-07-18 20:17:09 +08:00 · 2026-04-11 19:46:11 +08:00
parent 3cd92ccda3
commit 26693acc3f
17 changed files with 1173 additions and 359 deletions
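
The provider resolution order named in the commit message (main model, then other configured models, then OpenAI, then LinkAI, with LinkAI promoted when `use_linkai` is set) could be sketched roughly as follows. This is an illustration only, not the code from this commit: `build_provider_order`, `call_with_fallback`, the provider callables, and the shape of the returned dict are all hypothetical names chosen for the example.

```python
from typing import Callable, Dict, List

# Hypothetical provider signature: (image, question) -> answer text.
VisionProvider = Callable[[str, str], str]


def build_provider_order(main: str, others: List[str], use_linkai: bool) -> List[str]:
    """Return provider names in the order they would be tried."""
    order = [main, *others, "openai", "linkai"]
    if use_linkai:
        # Assumption: use_linkai=true moves LinkAI to the front of the chain.
        order = ["linkai", *order]
    # Drop duplicates while keeping the first occurrence of each name.
    return list(dict.fromkeys(order))


def call_with_fallback(
    order: List[str],
    providers: Dict[str, VisionProvider],
    image: str,
    question: str,
) -> dict:
    """Try each configured provider in turn until one succeeds."""
    errors = []
    for name in order:
        provider = providers.get(name)
        if provider is None:
            continue  # no API key configured for this provider
        try:
            # Report which provider actually answered.
            return {"model": name, "content": provider(image, question)}
        except Exception as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all vision providers failed: " + "; ".join(errors))
```

In this sketch a provider that raises is simply skipped and its error message is collected, so the final exception lists every failure when the whole chain is exhausted.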
--- a/docs/en/tools/vision.mdx
+++ b/docs/en/tools/vision.mdx
@@ -0,0 +1,72 @@
+---
+title: vision - Image Analysis
+description: Analyze image content (recognition, description, OCR, etc.)
+---
+
+Analyze local images or image URLs using a Vision API. Supports content description, text extraction (OCR), object recognition, and more.
+
+## Model Selection
+
+The vision tool uses a multi-level auto-selection strategy with automatic fallback — no manual configuration required:
+
+1. **Main model** — uses the currently configured main model for image recognition (zero extra cost)
+2. **Other configured models** — auto-discovers other models with configured API keys as alternatives
+3. **OpenAI** — uses `open_ai_api_key` to call gpt-4.1-mini
+4. **LinkAI** — uses `linkai_api_key` to call LinkAI vision service
+
+When `use_linkai=true`, LinkAI is promoted to the highest priority.
+
+If the current provider fails, the tool automatically tries the next one until it succeeds or all fail.
+
+### Supported Models
+
+| Vendor | Vision Model | Notes |
+| --- | --- | --- |
+| OpenAI / Compatible | Main model | All OpenAI-compatible multimodal models |
+| Qwen (DashScope) | Main model | Via MultiModalConversation API |
+| Claude | Main model | Anthropic native image format |
+| Gemini | Main model | inlineData format |
+| Doubao | Main model | doubao-seed-2-0 series natively supported |
+| Kimi (Moonshot) | Main model | kimi-k2.5 natively supported |
+| ZhipuAI | glm-5v-turbo | Always uses dedicated vision model |
+| MiniMax | MiniMax-Text-01 | Always uses dedicated vision model |
+
+<Note>
+  ZhipuAI and MiniMax text models do not support image understanding, so their dedicated vision models are always used automatically.
+</Note>
+
+## Parameters
+
+| Parameter | Type | Required | Description |
+| --- | --- | --- | --- |
+| `image` | string | Yes | Local file path or HTTP(S) image URL |
+| `question` | string | Yes | Question to ask about the image |
+
+Supported image formats: jpg, jpeg, png, gif, webp
+
+## Custom Configuration
+
+To specify a particular model for the vision tool, add to `config.json`:
+
+```json
+{
+    "tool": {
+        "vision": {
+            "model": "gpt-4o"
+        }
+    }
+}
+```
+
+In most cases no configuration is needed. The tool works automatically as long as the main model supports multimodal input or any vision-capable API key is configured.
+
+## Use Cases
+
+- Describe image content
+- Extract text from images (OCR)
+- Identify objects, colors, scenes
+- Analyze screenshots and scanned documents
+
+<Note>
+  Images larger than 1MB are automatically compressed (max edge 1536px). All images (including remote URLs) are converted to base64 for transmission to ensure compatibility with all model backends.
+</Note>
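
The base64 conversion mentioned in the note above might look something like the sketch below. It only covers the size check and the encoding step; the actual compression (max edge 1536px) is left out because it needs an imaging library. The helper names `needs_compression` and `to_data_url`, and the choice of a `data:` URL as the output, are assumptions made for illustration rather than details taken from this commit.

```python
import base64
import os

# 1MB threshold from the note above (assumed here to mean 1024 * 1024 bytes).
MAX_BYTES = 1024 * 1024

# MIME types for the image formats listed as supported in the docs.
MIME_TYPES = {
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".gif": "image/gif",
    ".webp": "image/webp",
}


def needs_compression(data: bytes) -> bool:
    """Return True when the raw image exceeds the size threshold."""
    return len(data) > MAX_BYTES


def to_data_url(data: bytes, filename: str) -> str:
    """Encode raw image bytes as a base64 data URL."""
    ext = os.path.splitext(filename)[1].lower()
    mime = MIME_TYPES.get(ext)
    if mime is None:
        raise ValueError(f"unsupported image format: {ext or filename}")
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

A remote image would first be downloaded to bytes and then passed through the same two helpers, so local paths and URLs end up in one format.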
--- a/docs/ja/tools/vision.mdx
+++ b/docs/ja/tools/vision.mdx
@@ -0,0 +1,72 @@
+---
+title: vision - 画像分析
+description: 画像コンテンツの分析（認識、説明、OCR など）
+---
+
+Vision API を使用してローカル画像や画像 URL を分析します。コンテンツの説明、テキスト抽出（OCR）、オブジェクト認識などに対応しています。
+
+## モデル選択
+
+Vision ツールは多段階の自動選択＋自動フォールバック戦略を採用しており、手動設定なしで利用可能です：
+
+1. **メインモデル** — 現在設定されているメインモデルで画像認識を実行（追加コストなし）
+2. **その他の設定済みモデル** — API キーが設定されている他のマルチモーダルモデルを自動検出
+3. **OpenAI** — `open_ai_api_key` を使用して gpt-4.1-mini を呼び出し
+4. **LinkAI** — `linkai_api_key` を使用して LinkAI ビジョンサービスを呼び出し
+
+`use_linkai=true` の場合、LinkAI が最優先になります。
+
+現在のプロバイダーが失敗した場合、成功するかすべて失敗するまで自動的に次のプロバイダーを試行します。
+
+### 対応モデル
+
+| ベンダー | ビジョンモデル | 説明 |
+| --- | --- | --- |
+| OpenAI / 互換プロトコル | メインモデル | すべての OpenAI 互換マルチモーダルモデルに対応 |
+| 通義千問 (DashScope) | メインモデル | MultiModalConversation API 経由 |
+| Claude | メインモデル | Anthropic ネイティブ画像形式 |
+| Gemini | メインモデル | inlineData 形式 |
+| 豆包 (Doubao) | メインモデル | doubao-seed-2-0 シリーズがネイティブ対応 |
+| Kimi (Moonshot) | メインモデル | kimi-k2.5 がネイティブ対応 |
+| 智谱 AI | glm-5v-turbo | 常にビジョン専用モデルを使用 |
+| MiniMax | MiniMax-Text-01 | 常にビジョン専用モデルを使用 |
+
+<Note>
+  智谱 AI と MiniMax のテキストモデルは画像理解に対応していないため、対応するビジョン専用モデルが自動的に使用されます。
+</Note>
+
+## パラメータ
+
+| パラメータ | 型 | 必須 | 説明 |
+| --- | --- | --- | --- |
+| `image` | string | はい | ローカルファイルパスまたは HTTP(S) 画像 URL |
+| `question` | string | はい | 画像に対する質問 |
+
+対応画像形式：jpg、jpeg、png、gif、webp
+
+## カスタム設定
+
+Vision ツールで使用するモデルを指定するには、`config.json` に以下を追加します：
+
+```json
+{
+    "tool": {
+        "vision": {
+            "model": "gpt-4o"
+        }
+    }
+}
+```
+
+ほとんどの場合、設定は不要です。メインモデルがマルチモーダルに対応しているか、ビジョン対応の API キーが設定されていれば自動的に動作します。
+
+## ユースケース
+
+- 画像コンテンツの説明
+- 画像からのテキスト抽出（OCR）
+- オブジェクト、色、シーンの識別
+- スクリーンショットやスキャン文書の分析
+
+<Note>
+  1MB を超える画像は自動的に圧縮されます（最大辺 1536px）。すべての画像（リモート URL を含む）は base64 に変換して送信され、すべてのモデルバックエンドとの互換性を確保します。
+</Note>
--- a/docs/tools/vision.mdx
+++ b/docs/tools/vision.mdx
@@ -5,14 +5,49 @@ description: 分析图片内容（识别、描述、OCR 等）

 使用 Vision API 分析本地图片或图片 URL，支持内容描述、文字提取（OCR）、物体识别等。

-## 依赖
+## 模型选择

-需要配置至少一个 API Key（通过 `env_config` 工具或工作空间 `.env` 文件配置）：
+Vision 工具采用多级自动选择 + 自动兜底策略，无需手动配置即可使用：

-| 后端 | 环境变量 | 优先级 |
+1. **主模型** — 优先使用当前配置的主模型进行图像识别（需要是多模态模型）
+2. **其他已配置模型** — 自动发现已配置 API Key 的其他多模态模型作为备选
+
+如果当前 provider 调用失败，会自动尝试下一个，直到成功或全部失败。
+
+### 支持的模型
+
+| 厂商 | 视觉模型 | 说明 |
 | --- | --- | --- |
-| OpenAI | `OPENAI_API_KEY` | 优先使用 |
-| LinkAI | `LINKAI_API_KEY` | 备选 |
+| OpenAI / 兼容协议 | 使用主模型 | 支持所有 OpenAI 协议兼容的多模态模型 |
+| 通义千问 (DashScope) | 使用主模型 | 例如 qwen3.6-plus 等 |
+| Claude | 使用主模型 | Anthropic 原生图像格式 |
+| Gemini | 使用主模型 | inlineData 格式 |
+| 豆包 (Doubao) | 使用主模型 | doubao-seed-2-0 系列原生支持 |
+| Kimi (Moonshot) | 使用主模型 | kimi-k2.5 原生支持 |
+| 智谱 AI | glm-5v-turbo | 固定使用视觉专用模型 |
+| MiniMax | MiniMax-Text-01 | 固定使用视觉专用模型 |
+
+<Note>
+  智谱和 MiniMax 的文本模型不支持图像理解，因此始终使用对应的视觉专用模型，无需手动指定。
+</Note>
+
+> 当 `use_linkai=true` 时，默认使用 LinkAI 的多模态模型进行图像识别。
+
+## 自定义配置
+
+如果希望指定 Vision 使用的模型，可在 `config.json` 中配置，例如：
+
+```json
+{
+    "tool": {
+        "vision": {
+            "model": "gpt-4o"
+        }
+    }
+}
+```
+
+大多数情况下无需配置，主模型支持多模态或配置任意一个支持视觉的 API Key 即可自动工作。

 ## 参数

@@ -20,17 +55,18 @@ description: 分析图片内容（识别、描述、OCR 等）
 | --- | --- | --- | --- |
 | `image` | string | 是 | 本地文件路径或 HTTP(S) 图片 URL |
 | `question` | string | 是 | 对图片提出的问题 |
-| `model` | string | 否 | 模型名称（默认 gpt-4.1-mini） |

 支持的图片格式：jpg、jpeg、png、gif、webp

+
+
 ## 使用场景

 - 描述图片中的内容
 - 提取图片中的文字（OCR）
 - 识别物体、颜色、场景
-- 分析截图、文档扫描件
+- 分析截图、文档扫描图片等

 <Note>
-  超过 1MB 的图片会自动压缩后上传。如果未配置任何 Vision API Key，该工具不会被加载。
+  超过 1MB 的图片会自动压缩后上传，所有图片（包括远程 URL）会统一转为 base64 传输，确保兼容所有模型后端。
 </Note>
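
As a rough illustration of the `tool.vision.model` override documented in the diffs above, a lookup along these lines would return the configured model or `None` when nothing is set, leaving the automatic selection in charge. The function name `vision_model_override` is hypothetical, and how the project actually reads `config.json` is not shown in this commit excerpt.

```python
import json
from typing import Optional


def vision_model_override(config: dict) -> Optional[str]:
    """Return the model set under tool.vision.model, or None if absent."""
    tool_cfg = config.get("tool") or {}
    vision_cfg = tool_cfg.get("vision") or {}
    model = vision_cfg.get("model")
    # Treat an empty string the same as "not configured".
    return model or None


# Example using the JSON fragment from the docs.
config = json.loads('{"tool": {"vision": {"model": "gpt-4o"}}}')
override = vision_model_override(config)
```

Returning `None` rather than a default model name keeps the override separate from the fallback chain, which only needs to know whether the user pinned a model.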