feat: add inbound media understanding

Co-authored-by: Tristan Manchester <tmanchester96@gmail.com>
This commit is contained in:
Peter Steinberger
2026-01-17 03:52:37 +00:00
parent 4b749f1b8f
commit 1b973f7506
42 changed files with 2547 additions and 101 deletions

View File

@@ -124,10 +124,21 @@ Save to `~/.clawdbot/clawdbot.json` and you can DM the bot from that number.
// Tooling
tools: {
audio: {
transcription: {
args: ["--model", "base", "{{MediaPath}}"],
media: {
audio: {
enabled: true,
maxBytes: 20971520,
models: [
{ provider: "openai", model: "whisper-1" },
// Optional CLI fallback (Whisper binary):
// { type: "cli", command: "whisper", args: ["--model", "base", "{{MediaPath}}"] }
],
timeoutSeconds: 120
},
video: {
enabled: true,
maxBytes: 52428800,
models: [{ provider: "google", model: "gemini-3-flash-preview" }]
}
}
},

View File

@@ -1769,6 +1769,58 @@ Legacy: `tools.bash` is still accepted as an alias.
- `tools.web.fetch.firecrawl.maxAgeMs` (optional)
- `tools.web.fetch.firecrawl.timeoutSeconds` (optional)
`tools.media` configures inbound media understanding (image/audio/video):
- `tools.media.image` / `tools.media.audio` / `tools.media.video`:
- `enabled`: opt-out switch (default true).
- `prompt`: optional prompt override (image/video append a `maxChars` hint automatically).
- `maxChars`: max output characters (default 500 for image/video; unset for audio).
- `maxBytes`: max media size to send (defaults: image 10MB, audio 20MB, video 50MB).
- `timeoutSeconds`: request timeout (defaults: image 60s, audio 60s, video 120s).
- `language`: optional audio hint.
- `scope`: optional gating (first match wins) with `match.channel`, `match.chatType`, or `match.keyPrefix`.
- `models`: ordered list of model entries; failures or oversize media fall back to the next entry.
- Each `models[]` entry:
- Provider entry (`type: "provider"` or omitted):
- `provider`: API provider id (`openai`, `anthropic`, `google`/`gemini`, `groq`, etc).
- `model`: model id override (required for image; defaults to `whisper-1`/`whisper-large-v3-turbo` for audio providers, and `gemini-3-flash-preview` for video).
- `profile` / `preferredProfile`: auth profile selection.
- CLI entry (`type: "cli"`):
- `command`: executable to run.
- `args`: templated args (supports `{{MediaPath}}`, `{{Prompt}}`, `{{MaxChars}}`, etc).
- `capabilities`: optional list (`image`, `audio`, `video`) to gate a shared entry.
- `prompt`, `maxChars`, `maxBytes`, `timeoutSeconds`, `language` can be overridden per entry.
If no models are configured (or `enabled: false`), understanding is skipped; the model still receives the original attachments.
Provider auth follows the standard model auth order (auth profiles, env vars like `OPENAI_API_KEY`/`GROQ_API_KEY`/`GEMINI_API_KEY`, or `models.providers.*.apiKey`).
Example:
```json5
{
tools: {
media: {
audio: {
enabled: true,
maxBytes: 20971520,
scope: {
default: "deny",
rules: [{ action: "allow", match: { chatType: "direct" } }]
},
models: [
{ provider: "openai", model: "whisper-1" },
{ type: "cli", command: "whisper", args: ["--model", "base", "{{MediaPath}}"] }
]
},
video: {
enabled: true,
maxBytes: 52428800,
models: [{ provider: "google", model: "gemini-3-flash-preview" }]
}
}
}
}
```
`agents.defaults.subagents` configures sub-agent defaults:
- `model`: default model for spawned sub-agents (string or `{ primary, fallbacks }`). If omitted, sub-agents inherit the callers model unless overridden per agent or per call.
- `maxConcurrent`: max concurrent sub-agent runs (default 1)
@@ -2848,7 +2900,7 @@ clawdbot dns setup --apply
## Template variables
Template placeholders are expanded in `tools.audio.transcription.args` (and any future templated argument fields).
Template placeholders are expanded in `tools.media.*.models[].args` (and any future templated argument fields).
| Variable | Description |
|----------|-------------|
@@ -2864,6 +2916,8 @@ Template placeholders are expanded in `tools.audio.transcription.args` (and any
| `{{MediaPath}}` | Local media path (if downloaded) |
| `{{MediaType}}` | Media type (image/audio/document/…) |
| `{{Transcript}}` | Audio transcript (when enabled) |
| `{{Prompt}}` | Resolved media prompt for CLI entries |
| `{{MaxChars}}` | Resolved max output chars for CLI entries |
| `{{ChatType}}` | `"direct"` or `"group"` |
| `{{GroupSubject}}` | Group subject (best effort) |
| `{{GroupMembers}}` | Group members preview (best effort) |

View File

@@ -111,7 +111,7 @@ Current migrations:
- `routing.bindings` → top-level `bindings`
- `routing.agents`/`routing.defaultAgentId``agents.list` + `agents.list[].default`
- `routing.agentToAgent``tools.agentToAgent`
- `routing.transcribeAudio``tools.audio.transcription`
- `routing.transcribeAudio``tools.media.audio.models`
- `bindings[].match.accountID``bindings[].match.accountId`
- `identity``agents.list[].identity`
- `agent.*``agents.defaults` + `tools.*` (tools/elevated/exec/sandbox/subagents)