docs: update media auto-detect

Peter Steinberger
2026-01-23 05:47:16 +00:00
parent 93bef830ce
commit bd7443b39b
4 changed files with 47 additions and 20 deletions


@@ -6,7 +6,7 @@ read_when:
# Audio / Voice Notes — 2026-01-17
## What works
-- **Media understanding (audio)**: If `tools.media.audio` is enabled (or a shared `tools.media.models` entry supports audio), Clawdbot:
+- **Media understanding (audio)**: If audio understanding is enabled (or auto-detected), Clawdbot:
1) Locates the first audio attachment (local path or URL) and downloads it if needed.
2) Enforces `maxBytes` before sending to each model entry.
3) Runs the first eligible model entry in order (provider or CLI).
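If you only want this default flow, you don't need to configure models at all. A minimal sketch (the nesting is assumed from the dotted `tools.media.audio.*` keys used throughout this page) that enables audio and leaves model selection to the auto-detection described in the next section:

```json5
// Minimal sketch, assuming `tools.media.audio.*` nests as the dotted keys suggest:
// enable audio, configure no `models`, and let auto-detection pick the backend.
{
  tools: {
    media: {
      audio: {
        enabled: true
      }
    }
  }
}
```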
@@ -15,6 +15,21 @@ read_when:
- **Command parsing**: When transcription succeeds, `CommandBody`/`RawBody` are set to the transcript so slash commands still work.
- **Verbose logging**: In `--verbose`, we log when transcription runs and when it replaces the body.
+## Auto-detection (default)
+If you **don't configure models** and `tools.media.audio.enabled` is **not** set to `false`,
+Clawdbot auto-detects in this order and stops at the first working option:
+1) **Local CLIs** (if installed)
+   - `sherpa-onnx-offline` (requires `SHERPA_ONNX_MODEL_DIR` with encoder/decoder/joiner/tokens)
+   - `whisper-cli` (from `whisper-cpp`; uses `WHISPER_CPP_MODEL` or the bundled tiny model)
+   - `whisper` (Python CLI; downloads models automatically)
+2) **Gemini CLI** (`gemini`) using `read_many_files`
+3) **Provider keys** (OpenAI → Groq → Deepgram → Google)
+To disable auto-detection, set `tools.media.audio.enabled: false`.
+To customize, set `tools.media.audio.models`.
+Note: Binary detection is best-effort across macOS/Linux/Windows; ensure the CLI is on `PATH` (we expand `~`), or set an explicit CLI model with a full command path.
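If auto-detection can't find your binary, one option is to pin a CLI entry with an absolute path. A minimal sketch reusing the `type: "cli"` entry shape from the examples below; the path is illustrative, and real entries will likely need the additional per-entry options shown there:

```json5
// Sketch: bypass PATH detection by configuring an explicit CLI model.
// The command path is illustrative — point it at your own install.
{
  tools: {
    media: {
      audio: {
        enabled: true,
        models: [
          { type: "cli", command: "/opt/homebrew/bin/whisper-cli" }
        ]
      }
    }
  }
}
```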
## Config examples
### Provider + CLI fallback (OpenAI + Whisper CLI)
@@ -26,7 +41,7 @@ read_when:
enabled: true,
maxBytes: 20971520,
models: [
{ provider: "openai", model: "whisper-1" },
{ provider: "openai", model: "gpt-4o-mini-transcribe" },
{
type: "cli",
command: "whisper",
@@ -54,7 +69,7 @@ read_when:
]
},
models: [
{ provider: "openai", model: "whisper-1" }
{ provider: "openai", model: "gpt-4o-mini-transcribe" }
]
}
}
@@ -83,6 +98,7 @@ read_when:
- Audio providers can override `baseUrl`, `headers`, and `providerOptions` via `tools.media.audio`.
- Default size cap is 20MB (`tools.media.audio.maxBytes`). Oversize audio is skipped for that model and the next entry is tried.
- Default `maxChars` for audio is **unset** (full transcript). Set `tools.media.audio.maxChars` or per-entry `maxChars` to trim output (see the sketch after this list).
+- The OpenAI auto-detection default is `gpt-4o-mini-transcribe`; set `model: "gpt-4o-transcribe"` for higher accuracy.
- Use `tools.media.audio.attachments` to process multiple voice notes (`mode: "all"` + `maxAttachments`).
- Transcript is available to templates as `{{Transcript}}`.
- CLI stdout is capped (5MB); keep CLI output concise.
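Pulling several of these knobs together into one hedged sketch (the `maxChars` and `maxAttachments` values are illustrative, not defaults):

```json5
{
  tools: {
    media: {
      audio: {
        enabled: true,
        maxBytes: 20971520,  // 20MB default cap; oversize audio falls through to the next entry
        maxChars: 8000,      // illustrative — the default is unset (full transcript)
        attachments: {
          mode: "all",       // process every voice note in the message
          maxAttachments: 3  // illustrative cap
        },
        models: [
          // higher-accuracy pick instead of the gpt-4o-mini-transcribe auto default
          { provider: "openai", model: "gpt-4o-transcribe" }
        ]
      }
    }
  }
}
```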