| summary | read_when |
|---|---|
| Inbound image/audio/video understanding (optional) with provider + CLI fallbacks | |
# Media Understanding (Inbound) — 2026-01-17
Clawdbot can optionally summarize inbound media (image/audio/video) before the reply pipeline runs. This is opt-in and separate from the base attachment flow—if understanding is off, models still receive the original files/URLs as usual.
## Goals
- Optional: pre‑digest inbound media into short text for faster routing + better command parsing.
- Preserve original media delivery to the model (always).
- Support provider APIs and CLI fallbacks.
- Allow multiple models with ordered fallback (error/size/timeout).
## High‑level behavior
- Collect inbound attachments (`MediaPaths`, `MediaUrls`, `MediaTypes`).
- For each enabled capability (image/audio/video), select attachments per policy (default: first).
- Choose the first eligible model entry (size + capability + auth).
- If a model fails or the media is too large, fall back to the next entry.
- On success:
  - `Body` becomes an `[Image]`, `[Audio]`, or `[Video]` block.
  - Audio sets `{{Transcript}}`; command parsing uses caption text when present, otherwise the transcript.
  - Captions are preserved as `User text:` inside the block.
If understanding fails or is disabled, the reply flow continues with the original body + attachments.
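Illustratively, a successful image run with a caption might produce a body like this (the block layout is a sketch of the rules above, and the description text is invented for the example):

```
[Image]
User text: what’s on this whiteboard?
A whiteboard covered in sequence diagrams and a rough deployment sketch.
```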
## Config overview

`tools.media` supports shared models plus per‑capability overrides:

- `tools.media.models`: shared model list (use `capabilities` to gate).
- `tools.media.image` / `tools.media.audio` / `tools.media.video`:
  - defaults (`prompt`, `maxChars`, `maxBytes`, `timeoutSeconds`, `language`)
  - provider overrides (`baseUrl`, `headers`, `providerOptions`)
  - Deepgram audio options via `tools.media.audio.providerOptions.deepgram`
  - optional per‑capability `models` list (preferred before shared models)
  - `attachments` policy (`mode`, `maxAttachments`, `prefer`)
  - `scope` (optional gating by channel/chatType/session key)
- `tools.media.concurrency`: max concurrent capability runs (default 2).
```
{
  tools: {
    media: {
      models: [ /* shared list */ ],
      image: { /* optional overrides */ },
      audio: { /* optional overrides */ },
      video: { /* optional overrides */ }
    }
  }
}
```
## Model entries

Each `models[]` entry is either a provider entry or a CLI entry:
```
{
  type: "provider",          // default if omitted
  provider: "openai",
  model: "gpt-5.2",
  prompt: "Describe the image in <= 500 chars.",
  maxChars: 500,
  maxBytes: 10485760,
  timeoutSeconds: 60,
  capabilities: ["image"],   // optional, used for multi‑modal entries
  profile: "vision-profile",
  preferredProfile: "vision-fallback"
}
```
```
{
  type: "cli",
  command: "gemini",
  args: [
    "-m",
    "gemini-3-flash",
    "--allowed-tools",
    "read_file",
    "Read the media at {{MediaPath}} and describe it in <= {{MaxChars}} characters."
  ],
  maxChars: 500,
  maxBytes: 52428800,
  timeoutSeconds: 120,
  capabilities: ["video", "image"]
}
```

For CLI entries, `{{MediaPath}}` and `{{MaxChars}}` are substituted into `args` before the command runs.
## Defaults and limits

Recommended defaults:

- `maxChars: 500` for image/video (short, command‑friendly)
- `maxChars`: unset for audio (full transcript unless you set a limit)
- `maxBytes`:
  - image: 10MB
  - audio: 20MB
  - video: 50MB
Rules:
- If media exceeds `maxBytes`, that model is skipped and the next model is tried.
- If the model returns more than `maxChars`, output is trimmed.
- `prompt` defaults to a simple “Describe the {media}.” plus the `maxChars` guidance (image/video only).
- If `<capability>.enabled: true` but no models are configured, Clawdbot tries the active reply model when its provider supports the capability.
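Expressed as config, the recommended defaults look like this (a sketch, not required config; the byte values spell out the MB limits above):

```
{
  tools: {
    media: {
      image: { maxChars: 500, maxBytes: 10485760 },  // 10MB
      audio: { maxBytes: 20971520 },                 // 20MB; no maxChars → full transcript
      video: { maxChars: 500, maxBytes: 52428800 }   // 50MB
    }
  }
}
```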
## Auto-enable audio (when keys exist)
If `tools.media.audio.enabled` is not set to `false` and you have any supported audio provider keys configured, Clawdbot will auto-enable audio transcription even when you haven’t listed `models` explicitly.
Providers checked (in order):
- OpenAI
- Groq
- Deepgram
To disable this behavior, set:
```
{
  tools: {
    media: {
      audio: {
        enabled: false
      }
    }
  }
}
```
## Capabilities (optional)

If you set `capabilities`, the entry only runs for those media types. For shared lists, Clawdbot can infer defaults:

- `openai`, `anthropic`, `minimax`: image
- `google` (Gemini API): image + audio + video
- `groq`: audio
- `deepgram`: audio
For CLI entries, set `capabilities` explicitly to avoid surprising matches. If you omit `capabilities`, the entry is eligible for the list it appears in.
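For example, these shared-list entries need no explicit `capabilities` (a sketch relying on the inference table above):

```
{
  tools: {
    media: {
      models: [
        // No capabilities given: inferred as audio (deepgram).
        { provider: "deepgram", model: "nova-3" },
        // Inferred as image + audio + video (Gemini API).
        { provider: "google", model: "gemini-3-pro-preview" }
      ]
    }
  }
}
```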
## Provider support matrix (Clawdbot integrations)

| Capability | Provider integration | Notes |
|---|---|---|
| Image | OpenAI / Anthropic / Google / others via pi-ai | Any image-capable model in the registry works. |
| Audio | OpenAI, Groq, Deepgram | Provider transcription (Whisper/Deepgram). |
| Video | Google (Gemini API) | Provider video understanding. |
## Recommended providers

### Image

- Prefer your active model if it supports images.
- Good defaults: `openai/gpt-5.2`, `anthropic/claude-opus-4-5`, `google/gemini-3-pro-preview`.
### Audio

- `openai/whisper-1`, `groq/whisper-large-v3-turbo`, or `deepgram/nova-3`.
- CLI fallback: `whisper` binary.
- Deepgram setup: Deepgram (audio transcription).
### Video

- `google/gemini-3-flash-preview` (fast), `google/gemini-3-pro-preview` (richer).
- CLI fallback: `gemini` CLI (supports `read_file` on video/audio).
## Attachment policy

Per‑capability `attachments` controls which attachments are processed:

- `mode`: `first` (default) or `all`
- `maxAttachments`: cap the number processed (default 1)
- `prefer`: `first`, `last`, `path`, `url`

When `mode: "all"`, outputs are labeled `[Image 1/2]`, `[Audio 2/2]`, etc.
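For example, to caption up to two images per message (a sketch using the fields above):

```
{
  tools: {
    media: {
      image: {
        attachments: { mode: "all", maxAttachments: 2, prefer: "first" }
      }
    }
  }
}
```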
## Config examples

### 1) Shared models list + overrides
```
{
  tools: {
    media: {
      models: [
        { provider: "openai", model: "gpt-5.2", capabilities: ["image"] },
        { provider: "google", model: "gemini-3-flash-preview", capabilities: ["image", "audio", "video"] },
        {
          type: "cli",
          command: "gemini",
          args: [
            "-m",
            "gemini-3-flash",
            "--allowed-tools",
            "read_file",
            "Read the media at {{MediaPath}} and describe it in <= {{MaxChars}} characters."
          ],
          capabilities: ["image", "video"]
        }
      ],
      audio: {
        attachments: { mode: "all", maxAttachments: 2 }
      },
      video: {
        maxChars: 500
      }
    }
  }
}
```
### 2) Audio + Video only (image off)
```
{
  tools: {
    media: {
      audio: {
        enabled: true,
        models: [
          { provider: "openai", model: "whisper-1" },
          {
            type: "cli",
            command: "whisper",
            args: ["--model", "base", "{{MediaPath}}"]
          }
        ]
      },
      video: {
        enabled: true,
        maxChars: 500,
        models: [
          { provider: "google", model: "gemini-3-flash-preview" },
          {
            type: "cli",
            command: "gemini",
            args: [
              "-m",
              "gemini-3-flash",
              "--allowed-tools",
              "read_file",
              "Read the media at {{MediaPath}} and describe it in <= {{MaxChars}} characters."
            ]
          }
        ]
      }
    }
  }
}
```
### 3) Optional image understanding
```
{
  tools: {
    media: {
      image: {
        enabled: true,
        maxBytes: 10485760,
        maxChars: 500,
        models: [
          { provider: "openai", model: "gpt-5.2" },
          { provider: "anthropic", model: "claude-opus-4-5" },
          {
            type: "cli",
            command: "gemini",
            args: [
              "-m",
              "gemini-3-flash",
              "--allowed-tools",
              "read_file",
              "Read the media at {{MediaPath}} and describe it in <= {{MaxChars}} characters."
            ]
          }
        ]
      }
    }
  }
}
```
### 4) Multi‑modal single entry (explicit capabilities)
```
{
  tools: {
    media: {
      image: { models: [{ provider: "google", model: "gemini-3-pro-preview", capabilities: ["image", "video", "audio"] }] },
      audio: { models: [{ provider: "google", model: "gemini-3-pro-preview", capabilities: ["image", "video", "audio"] }] },
      video: { models: [{ provider: "google", model: "gemini-3-pro-preview", capabilities: ["image", "video", "audio"] }] }
    }
  }
}
```
## Status output

When media understanding runs, `/status` includes a short summary line:

```
📎 Media: image ok (openai/gpt-5.2) · audio skipped (maxBytes)
```
This shows per‑capability outcomes and the chosen provider/model when applicable.
## Notes
- Understanding is best‑effort. Errors do not block replies.
- Attachments are still passed to models even when understanding is disabled.
- Use `scope` to limit where understanding runs (e.g. only DMs), as in the sketch below.
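A hypothetical DM-only gate might look like this (the field names under `scope` are illustrative assumptions; the config overview only names channel/chatType/session key as the gating dimensions):

```
{
  tools: {
    media: {
      image: {
        // Hypothetical field name: gate image understanding to direct messages.
        scope: { chatType: ["dm"] }
      }
    }
  }
}
```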