refactor: unify media understanding pipeline

2026-01-17 04:38:20 +00:00
parent 49ecbd8fea
commit fcb7c9ff65
24 changed files with 1250 additions and 643 deletions
--- a/docs/nodes/audio.md
+++ b/docs/nodes/audio.md
@@ -6,7 +6,7 @@ read_when:
 # Audio / Voice Notes — 2026-01-17

 ## What works
- **Media understanding (audio)**: If `tools.media.audio` is enabled and has `models`, Clawdbot:
+- **Media understanding (audio)**: If `tools.media.audio` is enabled (or a shared `tools.media.models` entry supports audio), Clawdbot:
  1) Locates the first audio attachment (local path or URL) and downloads it if needed.
  2) Enforces `maxBytes` before sending to each model entry.
  3) Runs the first eligible model entry in order (provider or CLI).
@@ -66,6 +66,7 @@ read_when:
 - Provider auth follows the standard model auth order (auth profiles, env vars, `models.providers.*.apiKey`).
 - Default size cap is 20MB (`tools.media.audio.maxBytes`). Oversize audio is skipped for that model and the next entry is tried.
 - Default `maxChars` for audio is **unset** (full transcript). Set `tools.media.audio.maxChars` or per-entry `maxChars` to trim output.
+- Use `tools.media.audio.attachments` to process multiple voice notes (`mode: "all"` + `maxAttachments`).
 - Transcript is available to templates as `{{Transcript}}`.
 - CLI stdout is capped (5MB); keep CLI output concise.

--- a/docs/nodes/images.md
+++ b/docs/nodes/images.md
@@ -38,10 +38,10 @@ The WhatsApp channel runs via **Baileys Web**. This document captures the curren
  - `{{MediaUrl}}` pseudo-URL for the inbound media.
  - `{{MediaPath}}` local temp path written before running the command.
 - When a per-session Docker sandbox is enabled, inbound media is copied into the sandbox workspace and `MediaPath`/`MediaUrl` are rewritten to a relative path like `media/inbound/<filename>`.
- Media understanding (if configured via `tools.media.*`) runs before templating and can insert `[Image]`, `[Audio]`, and `[Video]` blocks into `Body`.
+- Media understanding (if configured via `tools.media.*` or shared `tools.media.models`) runs before templating and can insert `[Image]`, `[Audio]`, and `[Video]` blocks into `Body`.
  - Audio sets `{{Transcript}}` and uses the transcript for command parsing so slash commands still work.
  - Video and image descriptions preserve any caption text for command parsing.
- Only the first matching image/audio/video attachment is processed; remaining attachments are left untouched.
+- By default, only the first matching image/audio/video attachment is processed; set `tools.media.<cap>.attachments` to process multiple attachments.

 ## Limits & Errors
 **Outbound send caps (WhatsApp web send)**
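To illustrate the `tools.media.<cap>.attachments` setting introduced in the `images.md` change above, here is a minimal config sketch for processing several images. The key names (`mode`, `maxAttachments`) come from the attachment policy added to `media-understanding.md` in this commit. The value `3` is an arbitrary example, and the sketch assumes image understanding is otherwise enabled with at least one eligible model entry.

```json5
{
  tools: {
    media: {
      image: {
        // Process every matching image instead of only the first
        // (the documented default mode is "first").
        // maxAttachments caps how many are processed; 3 is illustrative only.
        attachments: { mode: "all", maxAttachments: 3 }
      }
    }
  }
}
```

Per the attachment policy text in this commit, outputs under `mode: "all"` are labeled in the form `[Image 1/2]`.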
--- a/docs/nodes/media-understanding.md
+++ b/docs/nodes/media-understanding.md
@@ -16,7 +16,7 @@ Clawdbot can optionally **summarize inbound media** (image/audio/video) before t

 ## High‑level behavior
 1) Collect inbound attachments (`MediaPaths`, `MediaUrls`, `MediaTypes`).
-2) For each enabled capability (image/audio/video), pick the **first matching attachment**.
+2) For each enabled capability (image/audio/video), select attachments per policy (default: **first**).
 3) Choose the first eligible model entry (size + capability + auth).  
 4) If a model fails or the media is too large, **fall back to the next entry**.
 5) On success:
@@ -27,18 +27,23 @@ Clawdbot can optionally **summarize inbound media** (image/audio/video) before t
 If understanding fails or is disabled, **the reply flow continues** with the original body + attachments.

 ## Config overview
-Use **per‑capability configs** under `tools.media`. Each capability can define:
- defaults (`prompt`, `maxChars`, `maxBytes`, `timeoutSeconds`, `language`)
- **ordered `models` list** (fallback order)
- `scope` (optional gating by channel/chatType/session key)
+`tools.media` supports **shared models** plus per‑capability overrides:
+- `tools.media.models`: shared model list (set `capabilities` on an entry to restrict it to specific media types).
+- `tools.media.image` / `tools.media.audio` / `tools.media.video`:
+  - defaults (`prompt`, `maxChars`, `maxBytes`, `timeoutSeconds`, `language`)
+  - optional **per‑capability `models` list** (tried before the shared models)
+  - `attachments` policy (`mode`, `maxAttachments`, `prefer`)
+  - `scope` (optional gating by channel/chatType/session key)
+- `tools.media.concurrency`: max concurrent capability runs (default **2**).

 ```json5
 {
  tools: {
    media: {
-      image: { /* config */ },
-      audio: { /* config */ },
-      video: { /* config */ }
+      models: [ /* shared list */ ],
+      image: { /* optional overrides */ },
+      audio: { /* optional overrides */ },
+      video: { /* optional overrides */ }
    }
  }
 }
@@ -95,12 +100,13 @@ Rules:
 - `prompt` defaults to simple “Describe the {media}.” plus the `maxChars` guidance (image/video only).

 ## Capabilities (optional)
-If you set `capabilities`, the entry only runs for those media types. Suggested
-defaults when you opt in:
- `openai`, `anthropic`: **image**
+If you set `capabilities`, the entry only runs for those media types. For shared
+lists, Clawdbot can infer defaults:
+- `openai`, `anthropic`, `minimax`: **image**
 - `google` (Gemini API): **image + audio + video**
- CLI entries: declare the exact capabilities you support.
+- `groq`: **audio**

+For CLI entries, **set `capabilities` explicitly** to avoid surprising matches.
 If you omit `capabilities`, the entry is eligible for the list it appears in.

 ## Provider support matrix (Clawdbot integrations)
@@ -123,9 +129,49 @@ If you omit `capabilities`, the entry is eligible for the list it appears in.
 - `google/gemini-3-flash-preview` (fast), `google/gemini-3-pro-preview` (richer).
 - CLI fallback: `gemini` CLI (supports `read_file` on video/audio).

+## Attachment policy
+Per‑capability `attachments` controls which attachments are processed:
+- `mode`: `first` (default) or `all`
+- `maxAttachments`: cap the number processed (default **1**)
+- `prefer`: `first`, `last`, `path`, `url`
+
+When `mode: "all"`, outputs are labeled `[Image 1/2]`, `[Audio 2/2]`, etc.
+
 ## Config examples

-### 1) Audio + Video only (image off)
+### 1) Shared models list + overrides
+```json5
+{
+  tools: {
+    media: {
+      models: [
+        { provider: "openai", model: "gpt-5.2", capabilities: ["image"] },
+        { provider: "google", model: "gemini-3-flash-preview", capabilities: ["image", "audio", "video"] },
+        {
+          type: "cli",
+          command: "gemini",
+          args: [
+            "-m",
+            "gemini-3-flash",
+            "--allowed-tools",
+            "read_file",
+            "Read the media at {{MediaPath}} and describe it in <= {{MaxChars}} characters."
+          ],
+          capabilities: ["image", "video"]
+        }
+      ],
+      audio: {
+        attachments: { mode: "all", maxAttachments: 2 }
+      },
+      video: {
+        maxChars: 500
+      }
+    }
+  }
+}
+```
+
+### 2) Audio + Video only (image off)
 ```json5
 {
  tools: {
@@ -164,7 +210,7 @@ If you omit `capabilities`, the entry is eligible for the list it appears in.
 }
 ```

-### 2) Optional image understanding
+### 3) Optional image understanding
 ```json5
 {
  tools: {
@@ -194,7 +240,7 @@ If you omit `capabilities`, the entry is eligible for the list it appears in.
 }
 ```

-### 3) Multi‑modal single entry (explicit capabilities)
+### 4) Multi‑modal single entry (explicit capabilities)
 ```json5
 {
  tools: {