feat: add inbound media understanding

Co-authored-by: Tristan Manchester <tmanchester96@gmail.com>
Peter Steinberger
2026-01-17 03:52:37 +00:00
parent 4b749f1b8f
commit 1b973f7506
42 changed files with 2547 additions and 101 deletions


@@ -124,10 +124,21 @@ Save to `~/.clawdbot/clawdbot.json` and you can DM the bot from that number.
// Tooling
tools: {
audio: {
transcription: {
args: ["--model", "base", "{{MediaPath}}"],
media: {
audio: {
enabled: true,
maxBytes: 20971520,
models: [
{ provider: "openai", model: "whisper-1" },
// Optional CLI fallback (Whisper binary):
// { type: "cli", command: "whisper", args: ["--model", "base", "{{MediaPath}}"] }
],
timeoutSeconds: 120
},
video: {
enabled: true,
maxBytes: 52428800,
models: [{ provider: "google", model: "gemini-3-flash-preview" }]
}
}
},


@@ -1769,6 +1769,58 @@ Legacy: `tools.bash` is still accepted as an alias.
- `tools.web.fetch.firecrawl.maxAgeMs` (optional)
- `tools.web.fetch.firecrawl.timeoutSeconds` (optional)
`tools.media` configures inbound media understanding (image/audio/video):
- `tools.media.image` / `tools.media.audio` / `tools.media.video`:
  - `enabled`: opt-out switch (default true).
  - `prompt`: optional prompt override (image/video append a `maxChars` hint automatically).
  - `maxChars`: max output characters (default 500 for image/video; unset for audio).
  - `maxBytes`: max media size to send (defaults: image 10MB, audio 20MB, video 50MB).
  - `timeoutSeconds`: request timeout (defaults: image 60s, audio 60s, video 120s).
  - `language`: optional audio hint.
  - `scope`: optional gating (first match wins) with `match.channel`, `match.chatType`, or `match.keyPrefix`.
  - `models`: ordered list of model entries; failures or oversize media fall back to the next entry.
- Each `models[]` entry:
  - Provider entry (`type: "provider"` or omitted):
    - `provider`: API provider id (`openai`, `anthropic`, `google`/`gemini`, `groq`, etc).
    - `model`: model id override (required for image; defaults to `whisper-1`/`whisper-large-v3-turbo` for audio providers, and `gemini-3-flash-preview` for video).
    - `profile` / `preferredProfile`: auth profile selection.
  - CLI entry (`type: "cli"`):
    - `command`: executable to run.
    - `args`: templated args (supports `{{MediaPath}}`, `{{Prompt}}`, `{{MaxChars}}`, etc).
  - `capabilities`: optional list (`image`, `audio`, `video`) to gate a shared entry.
  - `prompt`, `maxChars`, `maxBytes`, `timeoutSeconds`, `language` can be overridden per entry.
If no models are configured (or `enabled: false`), understanding is skipped; the model still receives the original attachments.
Provider auth follows the standard model auth order (auth profiles, env vars like `OPENAI_API_KEY`/`GROQ_API_KEY`/`GEMINI_API_KEY`, or `models.providers.*.apiKey`).
Example:
```json5
{
  tools: {
    media: {
      audio: {
        enabled: true,
        maxBytes: 20971520,
        scope: {
          default: "deny",
          rules: [{ action: "allow", match: { chatType: "direct" } }]
        },
        models: [
          { provider: "openai", model: "whisper-1" },
          { type: "cli", command: "whisper", args: ["--model", "base", "{{MediaPath}}"] }
        ]
      },
      video: {
        enabled: true,
        maxBytes: 52428800,
        models: [{ provider: "google", model: "gemini-3-flash-preview" }]
      }
    }
  }
}
```
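The `scope` gating in the example above (first match wins, then `default`) could be evaluated roughly as follows. This is a minimal sketch, not Clawdbot's implementation: the type names, `resolveScope`, and `ruleMatches` are hypothetical, `keyPrefix` is assumed to be compared against the session key, and an unset `default` is assumed to mean `"allow"`.

```typescript
// Hypothetical types; field names mirror the documented config keys.
type ScopeAction = "allow" | "deny";

interface ScopeMatch {
  channel?: string;
  chatType?: string;
  keyPrefix?: string;
}

interface ScopeRule {
  action: ScopeAction;
  match?: ScopeMatch;
}

interface ScopeConfig {
  default?: ScopeAction;
  rules?: ScopeRule[];
}

interface ScopeContext {
  channel?: string;
  chatType?: string;
  sessionKey?: string; // assumption: keyPrefix is tested against this value
}

// A rule matches when every field it specifies agrees with the context.
function ruleMatches(match: ScopeMatch | undefined, ctx: ScopeContext): boolean {
  if (!match) return true;
  if (match.channel !== undefined && match.channel !== ctx.channel) return false;
  if (match.chatType !== undefined && match.chatType !== ctx.chatType) return false;
  if (match.keyPrefix !== undefined && !(ctx.sessionKey ?? "").startsWith(match.keyPrefix)) {
    return false;
  }
  return true;
}

// The first matching rule decides; otherwise fall back to `default`
// (assumed to be "allow" when unset).
function resolveScope(scope: ScopeConfig | undefined, ctx: ScopeContext): ScopeAction {
  for (const rule of scope?.rules ?? []) {
    if (ruleMatches(rule.match, ctx)) return rule.action;
  }
  return scope?.default ?? "allow";
}
```

With the example config, a context of `{ chatType: "direct" }` resolves to `"allow"`, while `{ chatType: "group" }` matches no rule and falls through to the `"deny"` default.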
`agents.defaults.subagents` configures sub-agent defaults:
- `model`: default model for spawned sub-agents (string or `{ primary, fallbacks }`). If omitted, sub-agents inherit the caller's model unless overridden per agent or per call.
- `maxConcurrent`: max concurrent sub-agent runs (default 1)
@@ -2848,7 +2900,7 @@ clawdbot dns setup --apply
## Template variables
Template placeholders are expanded in `tools.audio.transcription.args` (and any future templated argument fields).
Template placeholders are expanded in `tools.media.*.models[].args` (and any future templated argument fields).
| Variable | Description |
|----------|-------------|
@@ -2864,6 +2916,8 @@ Template placeholders are expanded in `tools.audio.transcription.args` (and any
| `{{MediaPath}}` | Local media path (if downloaded) |
| `{{MediaType}}` | Media type (image/audio/document/…) |
| `{{Transcript}}` | Audio transcript (when enabled) |
| `{{Prompt}}` | Resolved media prompt for CLI entries |
| `{{MaxChars}}` | Resolved max output chars for CLI entries |
| `{{ChatType}}` | `"direct"` or `"group"` |
| `{{GroupSubject}}` | Group subject (best effort) |
| `{{GroupMembers}}` | Group members preview (best effort) |
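
To illustrate how `{{Name}}` placeholders in `args` might be expanded, here is a minimal sketch. `expandArgs` and `TemplateVars` are hypothetical names, and replacing unknown or unset variables with an empty string is an assumption rather than documented behavior.

```typescript
// Hypothetical map of template variable names to resolved values.
type TemplateVars = Record<string, string | undefined>;

// Replace every {{Name}} placeholder in each argument.
// Assumption: unknown or unset variables expand to an empty string.
function expandArgs(args: string[], vars: TemplateVars): string[] {
  return args.map((arg) =>
    arg.replace(/\{\{\s*(\w+)\s*\}\}/g, (_whole, name: string) => vars[name] ?? ""),
  );
}
```

Each argument is expanded separately, so a media path containing spaces stays a single argv entry and never passes through shell word splitting.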


@@ -111,7 +111,7 @@ Current migrations:
- `routing.bindings` → top-level `bindings`
- `routing.agents`/`routing.defaultAgentId` → `agents.list` + `agents.list[].default`
- `routing.agentToAgent` → `tools.agentToAgent`
- `routing.transcribeAudio` → `tools.audio.transcription`
- `routing.transcribeAudio` → `tools.media.audio.models`
- `bindings[].match.accountID` → `bindings[].match.accountId`
- `identity` → `agents.list[].identity`
- `agent.*` → `agents.defaults` + `tools.*` (tools/elevated/exec/sandbox/subagents)


@@ -3,25 +3,59 @@ summary: "How inbound audio/voice notes are downloaded, transcribed, and injecte
read_when:
- Changing audio transcription or media handling
---
# Audio / Voice Notes — 2025-12-05
# Audio / Voice Notes — 2026-01-17
## What works
- **Optional transcription**: If `tools.audio.transcription` is set in `~/.clawdbot/clawdbot.json`, Clawdbot will:
1) Download inbound audio to a temp path when WhatsApp only provides a URL.
2) Run the configured CLI args (templated with `{{MediaPath}}`), expecting transcript on stdout.
3) Replace `Body` with the transcript, set `{{Transcript}}`, and prepend the original media path plus a `Transcript:` section in the command prompt so models see both.
4) Continue through the normal auto-reply pipeline (templating, sessions, Pi command).
- **Verbose logging**: In `--verbose`, we log when transcription runs and when the transcript replaces the body.
- **Media understanding (audio)**: If `tools.media.audio` is enabled and has `models`, Clawdbot:
  1) Locates the first audio attachment (local path or URL) and downloads it if needed.
  2) Enforces `maxBytes` before sending to each model entry.
  3) Runs the first eligible model entry in order (provider or CLI).
  4) If that entry fails or is skipped (size/timeout), tries the next entry.
  5) On success, replaces `Body` with an `[Audio]` block and sets `{{Transcript}}`.
- **Command parsing**: When transcription succeeds, `CommandBody`/`RawBody` are set to the transcript so slash commands still work.
- **Verbose logging**: In `--verbose`, we log when transcription runs and when it replaces the body.
## Config example (Whisper CLI)
Requires `whisper` CLI installed:
## Config examples
### Provider + CLI fallback (OpenAI + Whisper CLI)
```json5
{
tools: {
audio: {
transcription: {
args: ["--model", "base", "{{MediaPath}}"],
timeoutSeconds: 45
media: {
audio: {
enabled: true,
maxBytes: 20971520,
models: [
{ provider: "openai", model: "whisper-1" },
{
type: "cli",
command: "whisper",
args: ["--model", "base", "{{MediaPath}}"],
timeoutSeconds: 45
}
]
}
}
}
}
```
### Provider-only with scope gating
```json5
{
tools: {
media: {
audio: {
enabled: true,
scope: {
default: "allow",
rules: [
{ action: "deny", match: { chatType: "group" } }
]
},
models: [
{ provider: "openai", model: "whisper-1" }
]
}
}
}
@@ -29,12 +63,13 @@ Requires `whisper` CLI installed:
```
## Notes & limits
- We don't ship a transcriber; you opt in with the Whisper CLI on your PATH.
- Size guard: inbound audio must be ≤5MB (matches the temp media store and transcript pipeline).
- Outbound caps: web send supports audio/voice up to 16MB (sent as a voice note with `ptt: true`).
- If transcription fails, we fall back to the original body/media note; replies still go through.
- Transcript is available to templates as `{{Transcript}}`; models get both the media path and a `Transcript:` block in the prompt when using command mode.
- Provider auth follows the standard model auth order (auth profiles, env vars, `models.providers.*.apiKey`).
- Default size cap is 20MB (`tools.media.audio.maxBytes`). Oversize audio is skipped for that model and the next entry is tried.
- Default `maxChars` for audio is **unset** (full transcript). Set `tools.media.audio.maxChars` or per-entry `maxChars` to trim output.
- Transcript is available to templates as `{{Transcript}}`.
- CLI stdout is capped (5MB); keep CLI output concise.
## Gotchas
- Scope rules are evaluated in order and the first match wins. `chatType` is normalized to `direct`, `group`, or `room`.
- Ensure your CLI exits 0 and prints plain text; JSON needs to be massaged via `jq -r .text`.
- Keep timeouts reasonable (`timeoutSeconds`, default 45s) to avoid blocking the reply queue.
- Keep timeouts reasonable (`timeoutSeconds`, default 60s) to avoid blocking the reply queue.


@@ -38,13 +38,23 @@ The WhatsApp channel runs via **Baileys Web**. This document captures the curren
- `{{MediaUrl}}` pseudo-URL for the inbound media.
- `{{MediaPath}}` local temp path written before running the command.
- When a per-session Docker sandbox is enabled, inbound media is copied into the sandbox workspace and `MediaPath`/`MediaUrl` are rewritten to a relative path like `media/inbound/<filename>`.
- Audio transcription (if configured via `tools.audio.transcription`) runs before templating and can replace `Body` with the transcript.
- Media understanding (if configured via `tools.media.*`) runs before templating and can insert `[Image]`, `[Audio]`, and `[Video]` blocks into `Body`.
- Audio sets `{{Transcript}}` and uses the transcript for command parsing so slash commands still work.
- Video and image descriptions preserve any caption text for command parsing.
- Only the first matching image/audio/video attachment is processed; remaining attachments are left untouched.
## Limits & Errors
**Outbound send caps (WhatsApp web send)**
- Images: ~6MB cap after recompression.
- Audio/voice/video: 16MB cap; documents: 100MB cap.
- Oversize or unreadable media → clear error in logs and the reply is skipped.
**Media understanding caps (transcription/description)**
- Image default: 10MB (`tools.media.image.maxBytes`).
- Audio default: 20MB (`tools.media.audio.maxBytes`).
- Video default: 50MB (`tools.media.video.maxBytes`).
- Oversize media skips understanding, but replies still go through with the original body.
## Notes for Tests
- Cover send + reply flows for image/audio/document cases.
- Validate recompression for images (size bound) and voice-note flag for audio.


@@ -0,0 +1,217 @@
---
summary: "Inbound image/audio/video understanding (optional) with provider + CLI fallbacks"
read_when:
- Designing or refactoring media understanding
- Tuning inbound audio/video/image preprocessing
---
# Media Understanding (Inbound) — 2026-01-17
Clawdbot can optionally **summarize inbound media** (image/audio/video) before the reply pipeline runs. This is **opt-in** and separate from the base attachment flow—if understanding is off, models still receive the original files/URLs as usual.
## Goals
- Optional: predigest inbound media into short text for faster routing + better command parsing.
- Preserve original media delivery to the model (always).
- Support **provider APIs** and **CLI fallbacks**.
- Allow multiple models with ordered fallback (error/size/timeout).
## High-level behavior
1) Collect inbound attachments (`MediaPaths`, `MediaUrls`, `MediaTypes`).
2) For each enabled capability (image/audio/video), pick the **first matching attachment**.
3) Choose the first eligible model entry (size + capability + auth).
4) If a model fails or the media is too large, **fall back to the next entry**.
5) On success:
   - `Body` becomes an `[Image]`, `[Audio]`, or `[Video]` block.
   - Audio sets `{{Transcript}}` and `CommandBody`/`RawBody` for command parsing.
   - Captions are preserved as `User text:` inside the block.
If understanding fails or is disabled, **the reply flow continues** with the original body + attachments.
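
The ordered fallback in steps 3 and 4 can be sketched as below. This is an illustration under stated assumptions, not the actual implementation: `ModelEntry`, its `run` callback, and `runWithFallback` are hypothetical, and treating empty output as a failure is an assumption.

```typescript
// Hypothetical shape of a resolved model entry.
interface ModelEntry {
  id: string; // hypothetical label, e.g. "openai/whisper-1"
  maxBytes?: number; // per-entry override of the capability default
  run: (mediaPath: string) => Promise<string>; // provider call or CLI invocation
}

interface UnderstandingResult {
  text: string;
  entryId: string;
}

// Try entries in order. Skip entries whose size limit is exceeded and move on
// when an entry throws (error or timeout). Returns undefined when nothing
// succeeds, so the caller can continue with the original body and attachments.
async function runWithFallback(
  entries: ModelEntry[],
  mediaPath: string,
  mediaBytes: number,
  defaultMaxBytes: number,
): Promise<UnderstandingResult | undefined> {
  for (const entry of entries) {
    const limit = entry.maxBytes ?? defaultMaxBytes;
    if (mediaBytes > limit) continue; // oversize for this entry: try the next one
    try {
      const text = (await entry.run(mediaPath)).trim();
      // Assumption: empty output counts as a failure.
      if (text) return { text, entryId: entry.id };
    } catch {
      // Failure or timeout: fall through to the next entry.
    }
  }
  return undefined;
}
```

Returning `undefined` instead of throwing keeps understanding best-effort: the reply flow proceeds whether or not any entry produced text.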
## Config overview
Use **per-capability configs** under `tools.media`. Each capability can define:
- defaults (`prompt`, `maxChars`, `maxBytes`, `timeoutSeconds`, `language`)
- **ordered `models` list** (fallback order)
- `scope` (optional gating by channel/chatType/session key)
```json5
{
  tools: {
    media: {
      image: { /* config */ },
      audio: { /* config */ },
      video: { /* config */ }
    }
  }
}
```
### Model entries
Each `models[]` entry can be **provider** or **CLI**:
```json5
{
  type: "provider", // default if omitted
  provider: "openai",
  model: "gpt-5.2",
  prompt: "Describe the image in <= 500 chars.",
  maxChars: 500,
  maxBytes: 10485760,
  timeoutSeconds: 60,
  capabilities: ["image"], // optional, used for multimodal entries
  profile: "vision-profile",
  preferredProfile: "vision-fallback"
}
```
```json5
{
  type: "cli",
  command: "gemini",
  args: [
    "-m",
    "gemini-3-flash",
    "--allowed-tools",
    "read_file",
    "Read the media at {{MediaPath}} and describe it in <= {{MaxChars}} characters."
  ],
  maxChars: 500,
  maxBytes: 52428800,
  timeoutSeconds: 120,
  capabilities: ["video", "image"]
}
```
## Defaults and limits
Recommended defaults:
- `maxChars`: **500** for image/video (short, command-friendly)
- `maxChars`: **unset** for audio (full transcript unless you set a limit)
- `maxBytes`:
  - image: **10MB**
  - audio: **20MB**
  - video: **50MB**
Rules:
- If media exceeds `maxBytes`, that model is skipped and the **next model is tried**.
- If the model returns more than `maxChars`, output is trimmed.
- `prompt` defaults to a simple “Describe the {media}.” plus the `maxChars` guidance (image/video only).
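
As a rough illustration of the `maxChars` and `prompt` rules above, the sketch below builds a default prompt and trims model output. `defaultPrompt` and `trimOutput` are hypothetical helpers, and the exact wording of the length hint is an assumption; only the “Describe the {media}.” base comes from this document.

```typescript
type MediaKind = "image" | "audio" | "video";

// Build the default prompt. The length hint is only added for image/video,
// and its wording here is an assumption.
function defaultPrompt(kind: MediaKind, maxChars?: number): string {
  const base = `Describe the ${kind}.`;
  if (kind === "audio" || maxChars === undefined) return base;
  return `${base} Respond in at most ${maxChars} characters.`;
}

// Trim model output to maxChars. When maxChars is unset (the audio default),
// the full text is returned.
function trimOutput(text: string, maxChars?: number): string {
  const trimmed = text.trim();
  if (maxChars === undefined || trimmed.length <= maxChars) return trimmed;
  return trimmed.slice(0, maxChars).trimEnd();
}
```

Trimming after the call means a model that ignores the prompt hint still cannot exceed the configured limit.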
## Capabilities (optional)
If you set `capabilities`, the entry only runs for those media types. Suggested
defaults when you opt in:
- `openai`, `anthropic`: **image**
- `google` (Gemini API): **image + audio + video**
- CLI entries: declare the exact capabilities you support.
If you omit `capabilities`, the entry is eligible for the capability whose `models` list it appears in.
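
The capability gate described above amounts to a small predicate. In this sketch `isEligible` and `EntryLike` are hypothetical names, and treating an empty `capabilities` list the same as an omitted one is an assumption.

```typescript
type Capability = "image" | "audio" | "video";

// Hypothetical minimal view of a models[] entry.
interface EntryLike {
  capabilities?: Capability[];
}

// An entry without `capabilities` is eligible wherever it is listed;
// otherwise it only runs for the media types it declares.
// Assumption: an empty list behaves like an omitted one.
function isEligible(entry: EntryLike, capability: Capability): boolean {
  if (!entry.capabilities || entry.capabilities.length === 0) return true;
  return entry.capabilities.includes(capability);
}
```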
## Provider support matrix (Clawdbot integrations)
| Capability | Provider integration | Notes |
|------------|----------------------|-------|
| Image | OpenAI / Anthropic / Google / others via `pi-ai` | Any image-capable model in the registry works. |
| Audio | OpenAI, Groq | Provider transcription (Whisper). |
| Video | Google (Gemini API) | Provider video understanding. |
## Recommended providers
**Image**
- Prefer your active model if it supports images.
- Good defaults: `openai/gpt-5.2`, `anthropic/claude-opus-4-5`, `google/gemini-3-pro-preview`.
**Audio**
- `openai/whisper-1` or `groq/whisper-large-v3-turbo`.
- CLI fallback: `whisper` binary.
**Video**
- `google/gemini-3-flash-preview` (fast), `google/gemini-3-pro-preview` (richer).
- CLI fallback: `gemini` CLI (supports `read_file` on video/audio).
## Config examples
### 1) Audio + Video only (image off)
```json5
{
  tools: {
    media: {
      audio: {
        enabled: true,
        models: [
          { provider: "openai", model: "whisper-1" },
          {
            type: "cli",
            command: "whisper",
            args: ["--model", "base", "{{MediaPath}}"]
          }
        ]
      },
      video: {
        enabled: true,
        maxChars: 500,
        models: [
          { provider: "google", model: "gemini-3-flash-preview" },
          {
            type: "cli",
            command: "gemini",
            args: [
              "-m",
              "gemini-3-flash",
              "--allowed-tools",
              "read_file",
              "Read the media at {{MediaPath}} and describe it in <= {{MaxChars}} characters."
            ]
          }
        ]
      }
    }
  }
}
```
### 2) Optional image understanding
```json5
{
  tools: {
    media: {
      image: {
        enabled: true,
        maxBytes: 10485760,
        maxChars: 500,
        models: [
          { provider: "openai", model: "gpt-5.2" },
          { provider: "anthropic", model: "claude-opus-4-5" },
          {
            type: "cli",
            command: "gemini",
            args: [
              "-m",
              "gemini-3-flash",
              "--allowed-tools",
              "read_file",
              "Read the media at {{MediaPath}} and describe it in <= {{MaxChars}} characters."
            ]
          }
        ]
      }
    }
  }
}
```
### 3) Multimodal single entry (explicit capabilities)
```json5
{
  tools: {
    media: {
      image: { models: [{ provider: "google", model: "gemini-3-pro-preview", capabilities: ["image", "video", "audio"] }] },
      audio: { models: [{ provider: "google", model: "gemini-3-pro-preview", capabilities: ["image", "video", "audio"] }] },
      video: { models: [{ provider: "google", model: "gemini-3-pro-preview", capabilities: ["image", "video", "audio"] }] }
    }
  }
}
```
## Notes
- Understanding is **best-effort**. Errors do not block replies.
- Attachments are still passed to models even when understanding is disabled.
- Use `scope` to limit where understanding runs (e.g. only DMs).
## Related docs
- [Configuration](/gateway/configuration)
- [Image & Media Support](/nodes/images)