feat(vision): prioritize main model for image recognition with multi-provider fallback

- Add call_vision method to all bot implementations (DashScope, Claude, Gemini, ZhipuAI, MiniMax, Doubao, Moonshot, OpenAICompatibleBot) using each vendor's native multimodal API format
- Remove call_with_tools/call_vision from Bot base class to fix MRO shadowing issue with OpenAICompatibleBot mixin
- Refactor vision tool provider resolution: MainModel → other configured models (auto-discovered) → OpenAI → LinkAI, with automatic fallback
- Return actual model name used in call_vision responses
- Sync config.json API keys to .env bidirectionally on startup
- Fix bot instance cache to detect bot_type/use_linkai config changes
- Add SSE reconnection support for web console
- Preserve image path hints in Gemini text for correct vision tool calls
- Update docs/tools/vision.mdx
2026-07-18 20:17:09 +08:00 · 2026-04-11 19:46:11 +08:00
parent 3cd92ccda3
commit 26693acc3f
17 changed files with 1173 additions and 359 deletions
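
The provider resolution order named in the commit message (MainModel → other configured models → OpenAI → LinkAI, with automatic fallback) can be sketched roughly as below. This is a minimal illustration, not the project's actual code: `Provider`, `order_providers`, and `analyze_with_fallback` are hypothetical names, and each provider is modelled as a plain callable taking `(image, question)`. The `use_linkai` handling follows the docs added in this commit, which state that LinkAI moves to the front when `use_linkai=true`.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical shapes for illustration only; the real bots expose call_vision.
VisionFn = Callable[[str, str], str]
Provider = Tuple[str, VisionFn]  # (model or provider name, callable)


def order_providers(
    main: Optional[Provider],
    others: List[Provider],
    openai: Optional[Provider],
    linkai: Optional[Provider],
    use_linkai: bool = False,
) -> List[Provider]:
    """Build the chain: main model, other configured models, OpenAI, LinkAI."""
    chain = [p for p in [main, *others, openai, linkai] if p is not None]
    if use_linkai and linkai is not None:
        # Assumed behaviour: LinkAI is tried first when use_linkai is enabled.
        chain = [linkai] + [p for p in chain if p is not linkai]
    return chain


def analyze_with_fallback(
    chain: List[Provider], image: str, question: str
) -> Dict[str, str]:
    """Try each provider in order; return the first success with its name."""
    errors = []
    for name, call in chain:
        try:
            # Report which provider actually answered, as the commit does
            # for call_vision responses.
            return {"model": name, "content": call(image, question)}
        except Exception as exc:  # any provider failure triggers fallback
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all vision providers failed: " + "; ".join(errors))
```

Returning the provider name alongside the content keeps the caller informed about which model produced the answer once a fallback has happened.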
--- a/docs/ja/tools/vision.mdx
+++ b/docs/ja/tools/vision.mdx
@@ -0,0 +1,72 @@
+---
+title: vision - Image Analysis
+description: Analyze image content (recognition, description, OCR, and more)
+---
+
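+Analyzes local images or image URLs using a Vision API. It supports content description, text extraction (OCR), object recognition, and more.
+
+## Model Selection
+
+The Vision tool uses a multi-stage automatic selection strategy with automatic fallback, so it works without manual configuration:
+
+1. **Main model**: runs image recognition with the currently configured main model (no extra cost)
+2. **Other configured models**: auto-detects other multimodal models that have an API key configured
+3. **OpenAI**: calls gpt-4.1-mini using `open_ai_api_key`
+4. **LinkAI**: calls the LinkAI vision service using `linkai_api_key`
+
+When `use_linkai=true`, LinkAI takes top priority.
+
+If the current provider fails, the next provider is tried automatically until one succeeds or all have failed.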
+
+### Supported Models
+
+| Vendor | Vision model | Notes |
+| --- | --- | --- |
+| OpenAI / compatible protocols | Main model | Supports all OpenAI-compatible multimodal models |
+| Tongyi Qianwen (DashScope) | Main model | Via the MultiModalConversation API |
+| Claude | Main model | Anthropic native image format |
+| Gemini | Main model | inlineData format |
+| Doubao | Main model | The doubao-seed-2-0 series supports vision natively |
+| Kimi (Moonshot) | Main model | kimi-k2.5 supports vision natively |
+| Zhipu AI | glm-5v-turbo | Always uses a dedicated vision model |
+| MiniMax | MiniMax-Text-01 | Always uses a dedicated vision model |
+
+<Note>
+  The text models from Zhipu AI and MiniMax do not support image understanding, so the corresponding dedicated vision model is used automatically.
+</Note>
+
+## Parameters
+
+| Parameter | Type | Required | Description |
+| --- | --- | --- | --- |
+| `image` | string | Yes | Local file path or HTTP(S) image URL |
+| `question` | string | Yes | Question to ask about the image |
+
+Supported image formats: jpg, jpeg, png, gif, webp
+
+## Custom Configuration
+
+To specify the model used by the Vision tool, add the following to `config.json`:
+
+```json
+{
+    "tool": {
+        "vision": {
+            "model": "gpt-4o"
+        }
+    }
+}
+```
+
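+In most cases no configuration is needed. The tool works automatically as long as the main model supports multimodal input or an API key for a vision-capable provider is configured.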
+
+## Use Cases
+
+- Describing image content
+- Extracting text from images (OCR)
+- Identifying objects, colors, and scenes
+- Analyzing screenshots and scanned documents
+
+<Note>
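+  Images larger than 1 MB are automatically compressed (longest side capped at 1536px). All images, including remote URLs, are converted to base64 before being sent, which ensures compatibility with every model backend.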
+</Note>
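
The image handling described in the closing note of the added doc (compression above 1 MB, a 1536px cap on the longest side, base64 for every image) might look something like the following standard-library sketch. All helper names here are hypothetical, treating "1 MB" as 1024 × 1024 bytes is an assumption, and packaging the base64 payload as a `data:` URL is only one possible wire format, since each vendor API has its own. Actual pixel resizing is left out because it would need an imaging library.

```python
import base64
import os

MAX_BYTES = 1024 * 1024  # assumption: "1 MB" is taken as 1 MiB
MAX_SIDE = 1536          # longest side after compression, per the doc note

# Formats listed as supported in the doc: jpg, jpeg, png, gif, webp.
EXT_TO_MIME = {
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".gif": "image/gif",
    ".webp": "image/webp",
}


def needs_compression(size_bytes: int) -> bool:
    """Only images strictly larger than the threshold are compressed."""
    return size_bytes > MAX_BYTES


def scaled_size(width: int, height: int, max_side: int = MAX_SIDE):
    """Return target dimensions with the longest side capped at max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    # Keep the aspect ratio; never let a side collapse to zero.
    return max(1, round(width * scale)), max(1, round(height * scale))


def to_data_url(data: bytes, filename: str) -> str:
    """Encode raw image bytes as a base64 data URL (hypothetical format)."""
    ext = os.path.splitext(filename)[1].lower()
    mime = EXT_TO_MIME.get(ext)
    if mime is None:
        raise ValueError(f"unsupported image format: {ext or filename}")
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

Sending base64 even for remote URLs means the model backend never has to fetch the image itself, which is presumably what makes the approach work across all providers.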