---
summary: "Inbound image/audio/video understanding (optional) with provider + CLI fallbacks"
read_when:
  - Designing or refactoring media understanding
  - Tuning inbound audio/video/image preprocessing
---

# Media Understanding (Inbound) — 2026-01-17

Clawdbot can optionally **summarize inbound media** (image/audio/video) before the reply pipeline runs. This is **opt-in** and separate from the base attachment flow—if understanding is off, models still receive the original files/URLs as usual.

## Goals

- Optional: pre‑digest inbound media into short text for faster routing + better command parsing.
- Preserve original media delivery to the model (always).
- Support **provider APIs** and **CLI fallbacks**.
- Allow multiple models with ordered fallback (error/size/timeout).

## High‑level behavior

1) Collect inbound attachments (`MediaPaths`, `MediaUrls`, `MediaTypes`).
2) For each enabled capability (image/audio/video), select attachments per policy (default: **first**).
3) Choose the first eligible model entry (size + capability + auth).
4) If a model fails or the media is too large, **fall back to the next entry**.
5) On success:
   - `Body` becomes an `[Image]`, `[Audio]`, or `[Video]` block.
   - Audio sets `{{Transcript}}`; command parsing uses the caption text when present, otherwise the transcript.
   - Captions are preserved as `User text:` inside the block.

If understanding fails or is disabled, **the reply flow continues** with the original body + attachments.

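For example, an inbound voice note with a caption could produce a rewritten Body like this (illustrative; the exact block layout is an assumption, but the `[Audio]` label and `User text:` line follow the rules above):

```
[Audio]
Don't forget the 3pm standup.
User text: transcribe this
```

Since a caption is present, command parsing would use `transcribe this` rather than the transcript.
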
## Config overview

`tools.media` supports **shared models** plus per‑capability overrides:

- `tools.media.models`: shared model list (use `capabilities` to gate).
- `tools.media.image` / `tools.media.audio` / `tools.media.video`:
  - defaults (`prompt`, `maxChars`, `maxBytes`, `timeoutSeconds`, `language`)
  - provider overrides (`baseUrl`, `headers`, `providerOptions`)
  - Deepgram audio options via `tools.media.audio.providerOptions.deepgram`
  - optional **per‑capability `models` list** (tried before the shared list)
  - `attachments` policy (`mode`, `maxAttachments`, `prefer`)
  - `scope` (optional gating by channel/chatType/session key)
- `tools.media.concurrency`: max concurrent capability runs (default **2**).

```json5
{
  tools: {
    media: {
      models: [ /* shared list */ ],
      image: { /* optional overrides */ },
      audio: { /* optional overrides */ },
      video: { /* optional overrides */ }
    }
  }
}
```

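`tools.media.concurrency` sits at the same level as the capability blocks. For example, to run capability digests one at a time instead of the default two:

```json5
{
  tools: {
    media: {
      concurrency: 1 // max concurrent capability runs (default: 2)
    }
  }
}
```
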
### Model entries

Each `models[]` entry can be **provider** or **CLI**:

```json5
{
  type: "provider", // default if omitted
  provider: "openai",
  model: "gpt-5.2",
  prompt: "Describe the image in <= 500 chars.",
  maxChars: 500,
  maxBytes: 10485760,
  timeoutSeconds: 60,
  capabilities: ["image"], // optional, used for multi‑modal entries
  profile: "vision-profile",
  preferredProfile: "vision-fallback"
}
```

```json5
{
  type: "cli",
  command: "gemini",
  args: [
    "-m",
    "gemini-3-flash",
    "--allowed-tools",
    "read_file",
    "Read the media at {{MediaPath}} and describe it in <= {{MaxChars}} characters."
  ],
  maxChars: 500,
  maxBytes: 52428800,
  timeoutSeconds: 120,
  capabilities: ["video", "image"]
}
```

## Defaults and limits

Recommended defaults:

- `maxChars`: **500** for image/video (short, command‑friendly)
- `maxChars`: **unset** for audio (full transcript unless you set a limit)
- `maxBytes`:
  - image: **10MB**
  - audio: **20MB**
  - video: **50MB**

Rules:

- If media exceeds `maxBytes`, that model is skipped and the **next model is tried**.
- If the model returns more than `maxChars`, the output is trimmed.
- `prompt` defaults to a simple “Describe the {media}.” plus the `maxChars` guidance (image/video only).
- If `<capability>.enabled: true` but no models are configured, Clawdbot tries the **active reply model** when its provider supports the capability.

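Written out explicitly, those recommended defaults correspond to (10MB = 10485760, 20MB = 20971520, 50MB = 52428800 bytes):

```json5
{
  tools: {
    media: {
      image: { maxChars: 500, maxBytes: 10485760 }, // 10MB
      audio: { maxBytes: 20971520 },                // 20MB; no maxChars: keep the full transcript
      video: { maxChars: 500, maxBytes: 52428800 }  // 50MB
    }
  }
}
```
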
### Auto-enable audio (when keys exist)

If `tools.media.audio.enabled` is **not** set to `false` and you have any supported audio provider keys configured, Clawdbot will **auto-enable audio transcription** even when you haven’t listed models explicitly.

Providers checked (in order):

1) OpenAI
2) Groq
3) Deepgram

To disable this behavior, set:

```json5
{
  tools: {
    media: {
      audio: {
        enabled: false
      }
    }
  }
}
```

## Capabilities (optional)

If you set `capabilities`, the entry only runs for those media types. For shared lists, Clawdbot can infer defaults:

- `openai`, `anthropic`, `minimax`: **image**
- `google` (Gemini API): **image + audio + video**
- `groq`: **audio**
- `deepgram`: **audio**

For CLI entries, **set `capabilities` explicitly** to avoid surprising matches. If you omit `capabilities`, the entry is eligible for the list it appears in.

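For instance, in a shared `tools.media.models` list, the `groq` entry below is inferred as audio-only, while the CLI entry must be gated by hand (a sketch; the entries mirror examples elsewhere on this page):

```json5
models: [
  // Inferred: groq defaults to audio (see the list above).
  { provider: "groq", model: "whisper-large-v3-turbo" },
  // CLI entries get no inference; declare capabilities explicitly.
  {
    type: "cli",
    command: "whisper",
    args: ["--model", "base", "{{MediaPath}}"],
    capabilities: ["audio"]
  }
]
```
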
## Provider support matrix (Clawdbot integrations)

| Capability | Provider integration | Notes |
|------------|----------------------|-------|
| Image | OpenAI / Anthropic / Google / others via `pi-ai` | Any image-capable model in the registry works. |
| Audio | OpenAI, Groq, Deepgram | Provider transcription (Whisper/Deepgram). |
| Video | Google (Gemini API) | Provider video understanding. |

## Recommended providers

**Image**

- Prefer your active model if it supports images.
- Good defaults: `openai/gpt-5.2`, `anthropic/claude-opus-4-5`, `google/gemini-3-pro-preview`.

**Audio**

- `openai/whisper-1`, `groq/whisper-large-v3-turbo`, or `deepgram/nova-3`.
- CLI fallback: `whisper` binary.
- Deepgram setup: [Deepgram (audio transcription)](/providers/deepgram).

**Video**

- `google/gemini-3-flash-preview` (fast), `google/gemini-3-pro-preview` (richer).
- CLI fallback: `gemini` CLI (supports `read_file` on video/audio).

## Attachment policy

Per‑capability `attachments` controls which attachments are processed:

- `mode`: `first` (default) or `all`
- `maxAttachments`: cap the number processed (default **1**)
- `prefer`: `first`, `last`, `path`, `url`

When `mode: "all"`, outputs are labeled `[Image 1/2]`, `[Audio 2/2]`, etc.

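For example, to caption up to three inbound images per message, preferring local paths over URLs (fields as documented above):

```json5
{
  tools: {
    media: {
      image: {
        attachments: { mode: "all", maxAttachments: 3, prefer: "path" }
      }
    }
  }
}
```
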
## Config examples

### 1) Shared models list + overrides

```json5
{
  tools: {
    media: {
      models: [
        { provider: "openai", model: "gpt-5.2", capabilities: ["image"] },
        { provider: "google", model: "gemini-3-flash-preview", capabilities: ["image", "audio", "video"] },
        {
          type: "cli",
          command: "gemini",
          args: [
            "-m",
            "gemini-3-flash",
            "--allowed-tools",
            "read_file",
            "Read the media at {{MediaPath}} and describe it in <= {{MaxChars}} characters."
          ],
          capabilities: ["image", "video"]
        }
      ],
      audio: {
        attachments: { mode: "all", maxAttachments: 2 }
      },
      video: {
        maxChars: 500
      }
    }
  }
}
```

### 2) Audio + Video only (image off)

```json5
{
  tools: {
    media: {
      audio: {
        enabled: true,
        models: [
          { provider: "openai", model: "whisper-1" },
          {
            type: "cli",
            command: "whisper",
            args: ["--model", "base", "{{MediaPath}}"]
          }
        ]
      },
      video: {
        enabled: true,
        maxChars: 500,
        models: [
          { provider: "google", model: "gemini-3-flash-preview" },
          {
            type: "cli",
            command: "gemini",
            args: [
              "-m",
              "gemini-3-flash",
              "--allowed-tools",
              "read_file",
              "Read the media at {{MediaPath}} and describe it in <= {{MaxChars}} characters."
            ]
          }
        ]
      }
    }
  }
}
```

### 3) Optional image understanding

```json5
{
  tools: {
    media: {
      image: {
        enabled: true,
        maxBytes: 10485760,
        maxChars: 500,
        models: [
          { provider: "openai", model: "gpt-5.2" },
          { provider: "anthropic", model: "claude-opus-4-5" },
          {
            type: "cli",
            command: "gemini",
            args: [
              "-m",
              "gemini-3-flash",
              "--allowed-tools",
              "read_file",
              "Read the media at {{MediaPath}} and describe it in <= {{MaxChars}} characters."
            ]
          }
        ]
      }
    }
  }
}
```

### 4) Multi‑modal single entry (explicit capabilities)

```json5
{
  tools: {
    media: {
      image: { models: [{ provider: "google", model: "gemini-3-pro-preview", capabilities: ["image", "video", "audio"] }] },
      audio: { models: [{ provider: "google", model: "gemini-3-pro-preview", capabilities: ["image", "video", "audio"] }] },
      video: { models: [{ provider: "google", model: "gemini-3-pro-preview", capabilities: ["image", "video", "audio"] }] }
    }
  }
}
```

## Status output

When media understanding runs, `/status` includes a short summary line:

```
📎 Media: image ok (openai/gpt-5.2) · audio skipped (maxBytes)
```

This shows per‑capability outcomes and the chosen provider/model when applicable.

## Notes

- Understanding is **best‑effort**. Errors do not block replies.
- Attachments are still passed to models even when understanding is disabled.
- Use `scope` to limit where understanding runs (e.g. only DMs); see the sketch below.

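A minimal sketch of such gating; the exact `scope` field names are an assumption based on the channel/chatType/session-key knobs mentioned in the config overview, so check your config schema:

```json5
{
  tools: {
    media: {
      image: {
        // Hypothetical field name ("chatType"); verify against the actual schema.
        scope: { chatType: "dm" }
      }
    }
  }
}
```
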
## Related docs

- [Configuration](/gateway/configuration)
- [Image & Media Support](/nodes/images)