docs: update media auto-detect
@@ -129,7 +129,7 @@ Save to `~/.clawdbot/clawdbot.json` and you can DM the bot from that number.
 enabled: true,
 maxBytes: 20971520,
 models: [
-{ provider: "openai", model: "whisper-1" },
+{ provider: "openai", model: "gpt-4o-mini-transcribe" },
 // Optional CLI fallback (Whisper binary):
 // { type: "cli", command: "whisper", args: ["--model", "base", "{{MediaPath}}"] }
 ],
@@ -1865,7 +1865,7 @@ Note: `applyPatch` is only under `tools.exec`.
 - Each `models[]` entry:
 - Provider entry (`type: "provider"` or omitted):
 - `provider`: API provider id (`openai`, `anthropic`, `google`/`gemini`, `groq`, etc).
-- `model`: model id override (required for image; defaults to `whisper-1`/`whisper-large-v3-turbo` for audio providers, and `gemini-3-flash-preview` for video).
+- `model`: model id override (required for image; defaults to `gpt-4o-mini-transcribe`/`whisper-large-v3-turbo` for audio providers, and `gemini-3-flash-preview` for video).
 - `profile` / `preferredProfile`: auth profile selection.
 - CLI entry (`type: "cli"`):
 - `command`: executable to run.
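The two `models[]` entry shapes described in the hunk above can be sketched as follows. This is an illustrative fragment, not from the diff itself: the field names (`provider`, `model`, `profile`, `type`, `command`, `args`) come from the docs above, but the specific values (including the `"default"` profile name) are assumptions.

```json5
{
  tools: {
    media: {
      audio: {
        models: [
          // Provider entry (`type: "provider"` or omitted):
          { provider: "openai", model: "gpt-4o-mini-transcribe", profile: "default" },
          // CLI entry (`type: "cli"`):
          { type: "cli", command: "whisper", args: ["--model", "base", "{{MediaPath}}"] },
        ],
      },
    },
  },
}
```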
@@ -1890,7 +1890,7 @@ Example:
 rules: [{ action: "allow", match: { chatType: "direct" } }]
 },
 models: [
-{ provider: "openai", model: "whisper-1" },
+{ provider: "openai", model: "gpt-4o-mini-transcribe" },
 { type: "cli", command: "whisper", args: ["--model", "base", "{{MediaPath}}"] }
 ]
 },
@@ -6,7 +6,7 @@ read_when:
 # Audio / Voice Notes — 2026-01-17
 
 ## What works
-- **Media understanding (audio)**: If `tools.media.audio` is enabled (or a shared `tools.media.models` entry supports audio), Clawdbot:
+- **Media understanding (audio)**: If audio understanding is enabled (or auto‑detected), Clawdbot:
 1) Locates the first audio attachment (local path or URL) and downloads it if needed.
 2) Enforces `maxBytes` before sending to each model entry.
 3) Runs the first eligible model entry in order (provider or CLI).
@@ -15,6 +15,21 @@ read_when:
 - **Command parsing**: When transcription succeeds, `CommandBody`/`RawBody` are set to the transcript so slash commands still work.
 - **Verbose logging**: In `--verbose`, we log when transcription runs and when it replaces the body.
 
+## Auto-detection (default)
+If you **don’t configure models** and `tools.media.audio.enabled` is **not** set to `false`,
+Clawdbot auto-detects in this order and stops at the first working option:
+
+1) **Local CLIs** (if installed)
+   - `sherpa-onnx-offline` (requires `SHERPA_ONNX_MODEL_DIR` with encoder/decoder/joiner/tokens)
+   - `whisper-cli` (from `whisper-cpp`; uses `WHISPER_CPP_MODEL` or the bundled tiny model)
+   - `whisper` (Python CLI; downloads models automatically)
+2) **Gemini CLI** (`gemini`) using `read_many_files`
+3) **Provider keys** (OpenAI → Groq → Deepgram → Google)
+
+To disable auto-detection, set `tools.media.audio.enabled: false`.
+To customize, set `tools.media.audio.models`.
+Note: Binary detection is best-effort across macOS/Linux/Windows; ensure the CLI is on `PATH` (we expand `~`), or set an explicit CLI model with a full command path.
+
 ## Config examples
 
 ### Provider + CLI fallback (OpenAI + Whisper CLI)
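The opt-out and customization knobs named in the hunk above might look like this in `~/.clawdbot/clawdbot.json`. A sketch, not from the diff: only the keys mentioned above are used, and the pinned model order and CLI path are illustrative.

```json5
{
  tools: {
    media: {
      audio: {
        // Opt out of auto-detection entirely:
        enabled: false,
        // ...or keep it enabled and pin an explicit order instead:
        // models: [
        //   { provider: "groq", model: "whisper-large-v3-turbo" },
        //   { type: "cli", command: "/usr/local/bin/whisper-cli", args: ["{{MediaPath}}"] },
        // ],
      },
    },
  },
}
```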
@@ -26,7 +41,7 @@ read_when:
 enabled: true,
 maxBytes: 20971520,
 models: [
-{ provider: "openai", model: "whisper-1" },
+{ provider: "openai", model: "gpt-4o-mini-transcribe" },
 {
 type: "cli",
 command: "whisper",
@@ -54,7 +69,7 @@ read_when:
 ]
 },
 models: [
-{ provider: "openai", model: "whisper-1" }
+{ provider: "openai", model: "gpt-4o-mini-transcribe" }
 ]
 }
 }
@@ -83,6 +98,7 @@ read_when:
 - Audio providers can override `baseUrl`, `headers`, and `providerOptions` via `tools.media.audio`.
 - Default size cap is 20MB (`tools.media.audio.maxBytes`). Oversize audio is skipped for that model and the next entry is tried.
 - Default `maxChars` for audio is **unset** (full transcript). Set `tools.media.audio.maxChars` or per-entry `maxChars` to trim output.
+- OpenAI auto default is `gpt-4o-mini-transcribe`; set `model: "gpt-4o-transcribe"` for higher accuracy.
 - Use `tools.media.audio.attachments` to process multiple voice notes (`mode: "all"` + `maxAttachments`).
 - Transcript is available to templates as `{{Transcript}}`.
 - CLI stdout is capped (5MB); keep CLI output concise.
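The size and length limits from the bullets above can be combined in one config. A sketch under assumptions: the key names (`maxBytes`, `maxChars`) are from the bullets, while the specific values are illustrative.

```json5
{
  tools: {
    media: {
      audio: {
        enabled: true,
        maxBytes: 10485760, // 10MB cap; oversize audio skips to the next model entry
        maxChars: 2000,     // unset by default (full transcript); set to trim output
        models: [
          { provider: "openai", model: "gpt-4o-transcribe" }, // higher accuracy per the note above
        ],
      },
    },
  },
}
```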
@@ -6,7 +6,7 @@ read_when:
 ---
 # Media Understanding (Inbound) — 2026-01-17
 
-Clawdbot can optionally **summarize inbound media** (image/audio/video) before the reply pipeline runs. This is **opt-in** and separate from the base attachment flow—if understanding is off, models still receive the original files/URLs as usual.
+Clawdbot can **summarize inbound media** (image/audio/video) before the reply pipeline runs. It auto‑detects when local tools or provider keys are available, and can be disabled or customized. If understanding is off, models still receive the original files/URLs as usual.
 
 ## Goals
 - Optional: pre‑digest inbound media into short text for faster routing + better command parsing.
@@ -88,6 +88,11 @@ Each `models[]` entry can be **provider** or **CLI**:
 }
 ```
 
+CLI templates can also use:
+- `{{MediaDir}}` (directory containing the media file)
+- `{{OutputDir}}` (scratch dir created for this run)
+- `{{OutputBase}}` (scratch file base path, no extension)
+
 ## Defaults and limits
 Recommended defaults:
 - `maxChars`: **500** for image/video (short, command‑friendly)
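A CLI entry using the extra placeholders listed above might look like this. This is a hypothetical sketch: the placeholders are from the docs, but the `--output_dir` flag and whether a given tool writes usable output into the scratch dir depend on that tool.

```json5
{
  type: "cli",
  command: "whisper",
  // {{OutputDir}} is the scratch dir Clawdbot creates for this run;
  // {{MediaPath}} is the downloaded media file.
  args: ["--model", "base", "--output_dir", "{{OutputDir}}", "{{MediaPath}}"],
}
```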
@@ -104,17 +109,22 @@ Rules:
 - If `<capability>.enabled: true` but no models are configured, Clawdbot tries the
 **active reply model** when its provider supports the capability.
 
-### Auto-enable audio (when keys exist)
-If `tools.media.audio.enabled` is **not** set to `false` and you have any supported
-audio provider keys configured, Clawdbot will **auto-enable audio transcription**
-even when you haven’t listed models explicitly.
+### Auto-detect media understanding (default)
+If `tools.media.<capability>.enabled` is **not** set to `false` and you haven’t
+configured models, Clawdbot auto-detects in this order and **stops at the first
+working option**:
 
-Providers checked (in order):
-1) OpenAI
-2) Groq
-3) Deepgram
+1) **Local CLIs** (audio only; if installed)
+   - `sherpa-onnx-offline` (requires `SHERPA_ONNX_MODEL_DIR` with encoder/decoder/joiner/tokens)
+   - `whisper-cli` (`whisper-cpp`; uses `WHISPER_CPP_MODEL` or the bundled tiny model)
+   - `whisper` (Python CLI; downloads models automatically)
+2) **Gemini CLI** (`gemini`) using `read_many_files`
+3) **Provider keys**
+   - Audio: OpenAI → Groq → Deepgram → Google
+   - Image: OpenAI → Anthropic → Google → MiniMax
+   - Video: Google
 
-To disable this behavior, set:
+To disable auto-detection, set:
 ```json5
 {
 tools: {
@@ -126,6 +136,7 @@ To disable this behavior, set:
 }
 }
 ```
+Note: Binary detection is best-effort across macOS/Linux/Windows; ensure the CLI is on `PATH` (we expand `~`), or set an explicit CLI model with a full command path.
 
 ## Capabilities (optional)
 If you set `capabilities`, the entry only runs for those media types. For shared
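Since auto-detection is keyed per capability, disabling all of it means setting each capability off. A sketch assuming the same `tools.media.<capability>` shape used elsewhere in these docs:

```json5
{
  tools: {
    media: {
      audio: { enabled: false },
      image: { enabled: false },
      video: { enabled: false },
    },
  },
}
```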
@@ -142,7 +153,7 @@ If you omit `capabilities`, the entry is eligible for the list it appears in.
 | Capability | Provider integration | Notes |
 |------------|----------------------|-------|
 | Image | OpenAI / Anthropic / Google / others via `pi-ai` | Any image-capable model in the registry works. |
-| Audio | OpenAI, Groq, Deepgram | Provider transcription (Whisper/Deepgram). |
+| Audio | OpenAI, Groq, Deepgram, Google | Provider transcription (Whisper/Deepgram/Gemini). |
 | Video | Google (Gemini API) | Provider video understanding. |
 
 ## Recommended providers
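A shared entry restricted to specific media types via `capabilities` might look like this. An illustrative sketch: the shared `tools.media.models` list and the `capabilities` field are from the docs above, but this exact combination is an assumption.

```json5
{
  tools: {
    media: {
      models: [
        // Only runs for audio and video attachments, never images:
        { provider: "google", model: "gemini-3-flash-preview", capabilities: ["audio", "video"] },
      ],
    },
  },
}
```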
@@ -151,8 +162,8 @@ If you omit `capabilities`, the entry is eligible for the list it appears in.
 - Good defaults: `openai/gpt-5.2`, `anthropic/claude-opus-4-5`, `google/gemini-3-pro-preview`.
 
 **Audio**
-- `openai/whisper-1`, `groq/whisper-large-v3-turbo`, or `deepgram/nova-3`.
-- CLI fallback: `whisper` binary.
+- `openai/gpt-4o-mini-transcribe`, `groq/whisper-large-v3-turbo`, or `deepgram/nova-3`.
+- CLI fallback: `whisper-cli` (whisper-cpp) or `whisper`.
 - Deepgram setup: [Deepgram (audio transcription)](/providers/deepgram).
 
 **Video**
@@ -209,7 +220,7 @@ When `mode: "all"`, outputs are labeled `[Image 1/2]`, `[Audio 2/2]`, etc.
 audio: {
 enabled: true,
 models: [
-{ provider: "openai", model: "whisper-1" },
+{ provider: "openai", model: "gpt-4o-mini-transcribe" },
 {
 type: "cli",
 command: "whisper",