docs: update media auto-detect
@@ -129,7 +129,7 @@ Save to `~/.clawdbot/clawdbot.json` and you can DM the bot from that number.
   enabled: true,
   maxBytes: 20971520,
   models: [
-    { provider: "openai", model: "whisper-1" },
+    { provider: "openai", model: "gpt-4o-mini-transcribe" },
     // Optional CLI fallback (Whisper binary):
     // { type: "cli", command: "whisper", args: ["--model", "base", "{{MediaPath}}"] }
   ],
@@ -1865,7 +1865,7 @@ Note: `applyPatch` is only under `tools.exec`.
 - Each `models[]` entry:
   - Provider entry (`type: "provider"` or omitted):
     - `provider`: API provider id (`openai`, `anthropic`, `google`/`gemini`, `groq`, etc.).
-    - `model`: model id override (required for image; defaults to `whisper-1`/`whisper-large-v3-turbo` for audio providers, and `gemini-3-flash-preview` for video).
+    - `model`: model id override (required for image; defaults to `gpt-4o-mini-transcribe`/`whisper-large-v3-turbo` for audio providers, and `gemini-3-flash-preview` for video).
     - `profile` / `preferredProfile`: auth profile selection.
   - CLI entry (`type: "cli"`):
     - `command`: executable to run.
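For illustration, a minimal `models[]` sketch combining both entry shapes (the `profile` value here is a placeholder, not a shipped default):

```json5
{
  tools: {
    media: {
      audio: {
        models: [
          // Provider entry (`type: "provider"` may be omitted):
          { provider: "openai", model: "gpt-4o-mini-transcribe", profile: "default" },
          // CLI entry ({{MediaPath}} expands to the downloaded media file):
          { type: "cli", command: "whisper", args: ["--model", "base", "{{MediaPath}}"] }
        ]
      }
    }
  }
}
```

Entries are tried in order, so put the preferred provider first and the CLI fallback last.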
@@ -1890,7 +1890,7 @@ Example:
     rules: [{ action: "allow", match: { chatType: "direct" } }]
   },
   models: [
-    { provider: "openai", model: "whisper-1" },
+    { provider: "openai", model: "gpt-4o-mini-transcribe" },
     { type: "cli", command: "whisper", args: ["--model", "base", "{{MediaPath}}"] }
   ]
 },
@@ -6,7 +6,7 @@ read_when:
 # Audio / Voice Notes — 2026-01-17
 
 ## What works
-- **Media understanding (audio)**: If `tools.media.audio` is enabled (or a shared `tools.media.models` entry supports audio), Clawdbot:
+- **Media understanding (audio)**: If audio understanding is enabled (or auto‑detected), Clawdbot:
   1) Locates the first audio attachment (local path or URL) and downloads it if needed.
   2) Enforces `maxBytes` before sending to each model entry.
   3) Runs the first eligible model entry in order (provider or CLI).
@@ -15,6 +15,21 @@ read_when:
 - **Command parsing**: When transcription succeeds, `CommandBody`/`RawBody` are set to the transcript so slash commands still work.
 - **Verbose logging**: In `--verbose`, we log when transcription runs and when it replaces the body.
 
+## Auto-detection (default)
+If you **don’t configure models** and `tools.media.audio.enabled` is **not** set to `false`,
+Clawdbot auto-detects in this order and stops at the first working option:
+
+1) **Local CLIs** (if installed)
+   - `sherpa-onnx-offline` (requires `SHERPA_ONNX_MODEL_DIR` with encoder/decoder/joiner/tokens)
+   - `whisper-cli` (from `whisper-cpp`; uses `WHISPER_CPP_MODEL` or the bundled tiny model)
+   - `whisper` (Python CLI; downloads models automatically)
+2) **Gemini CLI** (`gemini`) using `read_many_files`
+3) **Provider keys** (OpenAI → Groq → Deepgram → Google)
+
+To disable auto-detection, set `tools.media.audio.enabled: false`.
+To customize, set `tools.media.audio.models`.
+Note: Binary detection is best-effort across macOS/Linux/Windows; ensure the CLI is on `PATH` (we expand `~`), or set an explicit CLI model with a full command path.
+
 ## Config examples
 
 ### Provider + CLI fallback (OpenAI + Whisper CLI)
@@ -26,7 +41,7 @@ read_when:
   enabled: true,
   maxBytes: 20971520,
   models: [
-    { provider: "openai", model: "whisper-1" },
+    { provider: "openai", model: "gpt-4o-mini-transcribe" },
     {
       type: "cli",
       command: "whisper",
@@ -54,7 +69,7 @@ read_when:
       ]
     },
     models: [
-      { provider: "openai", model: "whisper-1" }
+      { provider: "openai", model: "gpt-4o-mini-transcribe" }
     ]
   }
 }
@@ -83,6 +98,7 @@ read_when:
 - Audio providers can override `baseUrl`, `headers`, and `providerOptions` via `tools.media.audio`.
 - Default size cap is 20MB (`tools.media.audio.maxBytes`). Oversize audio is skipped for that model and the next entry is tried.
 - Default `maxChars` for audio is **unset** (full transcript). Set `tools.media.audio.maxChars` or per-entry `maxChars` to trim output.
+- OpenAI auto default is `gpt-4o-mini-transcribe`; set `model: "gpt-4o-transcribe"` for higher accuracy.
 - Use `tools.media.audio.attachments` to process multiple voice notes (`mode: "all"` + `maxAttachments`); see the sketch after this list.
 - Transcript is available to templates as `{{Transcript}}`.
 - CLI stdout is capped (5MB); keep CLI output concise.
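The `maxChars` and `attachments` knobs compose; a minimal sketch, assuming `attachments` takes the `mode` and `maxAttachments` keys named above (the values `500` and `3` are illustrative, not defaults):

```json5
{
  tools: {
    media: {
      audio: {
        // Illustrative cap; default is unset (full transcript).
        maxChars: 500,
        // Process every voice note, up to 3 per message.
        attachments: { mode: "all", maxAttachments: 3 }
      }
    }
  }
}
```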
@@ -6,7 +6,7 @@ read_when:
 ---
 # Media Understanding (Inbound) — 2026-01-17
 
-Clawdbot can optionally **summarize inbound media** (image/audio/video) before the reply pipeline runs. This is **opt-in** and separate from the base attachment flow—if understanding is off, models still receive the original files/URLs as usual.
+Clawdbot can **summarize inbound media** (image/audio/video) before the reply pipeline runs. It auto‑detects when local tools or provider keys are available, and can be disabled or customized. If understanding is off, models still receive the original files/URLs as usual.
 
 ## Goals
 - Optional: pre‑digest inbound media into short text for faster routing + better command parsing.
@@ -88,6 +88,11 @@ Each `models[]` entry can be **provider** or **CLI**:
   }
 ```
 
+CLI templates can also use:
+- `{{MediaDir}}` (directory containing the media file)
+- `{{OutputDir}}` (scratch dir created for this run)
+- `{{OutputBase}}` (scratch file base path, no extension)
+
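As a sketch (assuming the Python `whisper` CLI, whose `--output_dir`/`--output_format` flags are standard; adjust for your binary), a CLI entry that writes its transcript into the per-run scratch dir:

```json5
{
  type: "cli",
  command: "whisper",
  // {{OutputDir}} is the scratch dir created for this run.
  args: ["--model", "base", "--output_dir", "{{OutputDir}}", "--output_format", "txt", "{{MediaPath}}"]
}
```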
 ## Defaults and limits
 Recommended defaults:
 - `maxChars`: **500** for image/video (short, command‑friendly)
@@ -104,17 +109,22 @@ Rules:
 - If `<capability>.enabled: true` but no models are configured, Clawdbot tries the
   **active reply model** when its provider supports the capability.
 
-### Auto-enable audio (when keys exist)
-If `tools.media.audio.enabled` is **not** set to `false` and you have any supported
-audio provider keys configured, Clawdbot will **auto-enable audio transcription**
-even when you haven’t listed models explicitly.
+### Auto-detect media understanding (default)
+If `tools.media.<capability>.enabled` is **not** set to `false` and you haven’t
+configured models, Clawdbot auto-detects in this order and **stops at the first
+working option**:
 
-Providers checked (in order):
-1) OpenAI
-2) Groq
-3) Deepgram
+1) **Local CLIs** (audio only; if installed)
+   - `sherpa-onnx-offline` (requires `SHERPA_ONNX_MODEL_DIR` with encoder/decoder/joiner/tokens)
+   - `whisper-cli` (`whisper-cpp`; uses `WHISPER_CPP_MODEL` or the bundled tiny model)
+   - `whisper` (Python CLI; downloads models automatically)
+2) **Gemini CLI** (`gemini`) using `read_many_files`
+3) **Provider keys**
+   - Audio: OpenAI → Groq → Deepgram → Google
+   - Image: OpenAI → Anthropic → Google → MiniMax
+   - Video: Google
 
-To disable this behavior, set:
+To disable auto-detection, set:
 ```json5
 {
   tools:
@@ -126,6 +136,7 @@ To disable this behavior, set:
   }
 }
 ```
+Note: Binary detection is best-effort across macOS/Linux/Windows; ensure the CLI is on `PATH` (we expand `~`), or set an explicit CLI model with a full command path.
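For example, a sketch pinning `whisper-cli` by absolute path (both paths are illustrative; `-m`/`-f` are the whisper-cpp model and input-file flags):

```json5
{
  tools: {
    media: {
      audio: {
        models: [
          {
            type: "cli",
            // Absolute command path sidesteps PATH lookup.
            command: "/opt/homebrew/bin/whisper-cli",
            args: ["-m", "/usr/local/share/whisper/ggml-tiny.bin", "-f", "{{MediaPath}}"]
          }
        ]
      }
    }
  }
}
```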
 
 ## Capabilities (optional)
 If you set `capabilities`, the entry only runs for those media types. For shared
@@ -142,7 +153,7 @@ If you omit `capabilities`, the entry is eligible for the list it appears in.
 | Capability | Provider integration | Notes |
 |------------|----------------------|-------|
 | Image | OpenAI / Anthropic / Google / others via `pi-ai` | Any image-capable model in the registry works. |
-| Audio | OpenAI, Groq, Deepgram | Provider transcription (Whisper/Deepgram). |
+| Audio | OpenAI, Groq, Deepgram, Google | Provider transcription (Whisper/Deepgram/Gemini). |
 | Video | Google (Gemini API) | Provider video understanding. |
 
 ## Recommended providers
@@ -151,8 +162,8 @@ If you omit `capabilities`, the entry is eligible for the list it appears in.
 - Good defaults: `openai/gpt-5.2`, `anthropic/claude-opus-4-5`, `google/gemini-3-pro-preview`.
 
 **Audio**
-- `openai/whisper-1`, `groq/whisper-large-v3-turbo`, or `deepgram/nova-3`.
-- CLI fallback: `whisper` binary.
+- `openai/gpt-4o-mini-transcribe`, `groq/whisper-large-v3-turbo`, or `deepgram/nova-3`.
+- CLI fallback: `whisper-cli` (whisper-cpp) or `whisper`.
 - Deepgram setup: [Deepgram (audio transcription)](/providers/deepgram).
 
 **Video**
@@ -209,7 +220,7 @@ When `mode: "all"`, outputs are labeled `[Image 1/2]`, `[Audio 2/2]`, etc.
     audio: {
       enabled: true,
       models: [
-        { provider: "openai", model: "whisper-1" },
+        { provider: "openai", model: "gpt-4o-mini-transcribe" },
         {
           type: "cli",
           command: "whisper",