| summary | read_when |
|---|---|
| Inbound image/audio/video understanding (optional) with provider + CLI fallbacks | |
# Media Understanding (Inbound) — 2026-01-17
Clawdbot can optionally summarize inbound media (image/audio/video) before the reply pipeline runs. This is opt-in and separate from the base attachment flow—if understanding is off, models still receive the original files/URLs as usual.
## Goals
- Optional: pre‑digest inbound media into short text for faster routing + better command parsing.
- Preserve original media delivery to the model (always).
- Support provider APIs and CLI fallbacks.
- Allow multiple models with ordered fallback (error/size/timeout).
## High-level behavior

- Collect inbound attachments (`MediaPaths`, `MediaUrls`, `MediaTypes`).
- For each enabled capability (image/audio/video), select attachments per policy (default: first).
- Choose the first eligible model entry (size + capability + auth).
- If a model fails or the media is too large, fall back to the next entry.
- On success:
  - `Body` becomes an `[Image]`, `[Audio]`, or `[Video]` block.
  - Audio sets `{{Transcript}}`; command parsing uses caption text when present, otherwise the transcript.
  - Captions are preserved as `User text:` inside the block.
If understanding fails or is disabled, the reply flow continues with the original body + attachments.
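For illustration only (the exact block layout is an internal detail and may differ), a successful image run on a captioned photo could rewrite the body roughly like this:

```
[Image]
User text: is this the right venue?
Photo of a brick building entrance with a sign reading "Harbor Hall" and a short queue outside.
```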
## Config overview

`tools.media` supports shared models plus per-capability overrides:

- `tools.media.models`: shared model list (use `capabilities` to gate).
- `tools.media.image` / `tools.media.audio` / `tools.media.video`:
  - defaults (`prompt`, `maxChars`, `maxBytes`, `timeoutSeconds`, `language`)
  - optional per-capability `models` list (preferred before shared models)
  - `attachments` policy (`mode`, `maxAttachments`, `prefer`)
  - `scope` (optional gating by channel/chatType/session key)
- `tools.media.concurrency`: max concurrent capability runs (default 2).
```json5
{
  tools: {
    media: {
      models: [ /* shared list */ ],
      image: { /* optional overrides */ },
      audio: { /* optional overrides */ },
      video: { /* optional overrides */ }
    }
  }
}
```
## Model entries

Each `models[]` entry is either a provider entry or a CLI entry:
```json5
{
  type: "provider",        // default if omitted
  provider: "openai",
  model: "gpt-5.2",
  prompt: "Describe the image in <= 500 chars.",
  maxChars: 500,
  maxBytes: 10485760,
  timeoutSeconds: 60,
  capabilities: ["image"], // optional, used for multi-modal entries
  profile: "vision-profile",
  preferredProfile: "vision-fallback"
}
```
```json5
{
  type: "cli",
  command: "gemini",
  args: [
    "-m",
    "gemini-3-flash",
    "--allowed-tools",
    "read_file",
    "Read the media at {{MediaPath}} and describe it in <= {{MaxChars}} characters."
  ],
  maxChars: 500,
  maxBytes: 52428800,
  timeoutSeconds: 120,
  capabilities: ["video", "image"]
}
```
## Defaults and limits

Recommended defaults:

- `maxChars`: 500 for image/video (short, command-friendly)
- `maxChars`: unset for audio (full transcript unless you set a limit)
- `maxBytes`:
  - image: 10MB
  - audio: 20MB
  - video: 50MB

Rules:

- If media exceeds `maxBytes`, that model is skipped and the next model is tried.
- If the model returns more than `maxChars`, the output is trimmed.
- `prompt` defaults to a simple "Describe the {media}." plus the `maxChars` guidance (image/video only).
- If `<capability>.enabled: true` but no models are configured, Clawdbot tries the active reply model when its provider supports the capability.
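As a sketch, the recommended defaults above translate into per-capability overrides like this (the byte values are the 10/20/50MB suggestions, not built-in limits):

```json5
{
  tools: {
    media: {
      image: { maxChars: 500, maxBytes: 10485760 },  // 10MB
      audio: { maxBytes: 20971520 },                 // 20MB; no maxChars => full transcript
      video: { maxChars: 500, maxBytes: 52428800 },  // 50MB
      concurrency: 2                                 // default, shown for completeness
    }
  }
}
```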
## Capabilities (optional)

If you set `capabilities`, the entry only runs for those media types. For shared lists, Clawdbot can infer defaults:

- `openai`, `anthropic`, `minimax`: image
- `google` (Gemini API): image + audio + video
- `groq`: audio

For CLI entries, set `capabilities` explicitly to avoid surprising matches. If you omit `capabilities`, the entry is eligible for the list it appears in.
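For example, a shared list can lean on inferred capabilities for provider entries while pinning them explicitly on a CLI entry (a sketch; swap in the models you actually run):

```json5
{
  tools: {
    media: {
      models: [
        { provider: "openai", model: "gpt-5.2" },              // inferred: image
        { provider: "groq", model: "whisper-large-v3-turbo" }, // inferred: audio
        {
          type: "cli",
          command: "gemini",
          args: [
            "-m", "gemini-3-flash", "--allowed-tools", "read_file",
            "Read the media at {{MediaPath}} and describe it in <= {{MaxChars}} characters."
          ],
          capabilities: ["image", "video"]                     // explicit for CLI entries
        }
      ]
    }
  }
}
```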
## Provider support matrix (Clawdbot integrations)

| Capability | Provider integration | Notes |
|---|---|---|
| Image | OpenAI / Anthropic / Google / others via pi-ai | Any image-capable model in the registry works. |
| Audio | OpenAI, Groq | Provider transcription (Whisper). |
| Video | Google (Gemini API) | Provider video understanding. |
## Recommended providers

### Image

- Prefer your active model if it supports images.
- Good defaults: `openai/gpt-5.2`, `anthropic/claude-opus-4-5`, `google/gemini-3-pro-preview`.

### Audio

- `openai/whisper-1` or `groq/whisper-large-v3-turbo`.
- CLI fallback: `whisper` binary.

### Video

- `google/gemini-3-flash-preview` (fast), `google/gemini-3-pro-preview` (richer).
- CLI fallback: `gemini` CLI (supports `read_file` on video/audio).
## Attachment policy

The per-capability `attachments` object controls which attachments are processed:

- `mode`: `first` (default) or `all`
- `maxAttachments`: cap on the number processed (default 1)
- `prefer`: `first`, `last`, `path`, `url`

When `mode: "all"`, outputs are labeled `[Image 1/2]`, `[Audio 2/2]`, etc.
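A minimal sketch of the policy in context, using the fields listed above:

```json5
{
  tools: {
    media: {
      image: {
        attachments: {
          mode: "all",       // process every inbound image, not just the first
          maxAttachments: 3, // cap the number of images handled per message
          prefer: "last"     // selection preference: first, last, path, or url
        }
      }
    }
  }
}
```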
## Config examples

### 1) Shared models list + overrides
```json5
{
  tools: {
    media: {
      models: [
        { provider: "openai", model: "gpt-5.2", capabilities: ["image"] },
        { provider: "google", model: "gemini-3-flash-preview", capabilities: ["image", "audio", "video"] },
        {
          type: "cli",
          command: "gemini",
          args: [
            "-m",
            "gemini-3-flash",
            "--allowed-tools",
            "read_file",
            "Read the media at {{MediaPath}} and describe it in <= {{MaxChars}} characters."
          ],
          capabilities: ["image", "video"]
        }
      ],
      audio: {
        attachments: { mode: "all", maxAttachments: 2 }
      },
      video: {
        maxChars: 500
      }
    }
  }
}
```
### 2) Audio + Video only (image off)
```json5
{
  tools: {
    media: {
      audio: {
        enabled: true,
        models: [
          { provider: "openai", model: "whisper-1" },
          {
            type: "cli",
            command: "whisper",
            args: ["--model", "base", "{{MediaPath}}"]
          }
        ]
      },
      video: {
        enabled: true,
        maxChars: 500,
        models: [
          { provider: "google", model: "gemini-3-flash-preview" },
          {
            type: "cli",
            command: "gemini",
            args: [
              "-m",
              "gemini-3-flash",
              "--allowed-tools",
              "read_file",
              "Read the media at {{MediaPath}} and describe it in <= {{MaxChars}} characters."
            ]
          }
        ]
      }
    }
  }
}
```
### 3) Optional image understanding
```json5
{
  tools: {
    media: {
      image: {
        enabled: true,
        maxBytes: 10485760,
        maxChars: 500,
        models: [
          { provider: "openai", model: "gpt-5.2" },
          { provider: "anthropic", model: "claude-opus-4-5" },
          {
            type: "cli",
            command: "gemini",
            args: [
              "-m",
              "gemini-3-flash",
              "--allowed-tools",
              "read_file",
              "Read the media at {{MediaPath}} and describe it in <= {{MaxChars}} characters."
            ]
          }
        ]
      }
    }
  }
}
```
### 4) Multi-modal single entry (explicit capabilities)
```json5
{
  tools: {
    media: {
      image: { models: [{ provider: "google", model: "gemini-3-pro-preview", capabilities: ["image", "video", "audio"] }] },
      audio: { models: [{ provider: "google", model: "gemini-3-pro-preview", capabilities: ["image", "video", "audio"] }] },
      video: { models: [{ provider: "google", model: "gemini-3-pro-preview", capabilities: ["image", "video", "audio"] }] }
    }
  }
}
```
## Notes
- Understanding is best‑effort. Errors do not block replies.
- Attachments are still passed to models even when understanding is disabled.
- Use `scope` to limit where understanding runs (e.g. only DMs).
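The `scope` schema itself isn't documented on this page; purely as a hypothetical sketch (the field name below is an assumption based on the channel/chatType gating mentioned in the config overview), a DM-only gate might look roughly like:

```json5
{
  tools: {
    media: {
      audio: {
        enabled: true,
        // Hypothetical shape: restrict understanding to direct messages.
        // Check the scope/config reference for the actual field names.
        scope: { chatType: "dm" }
      }
    }
  }
}
```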