Peter Steinberger fcb7c9ff65 refactor: unify media understanding pipeline

2026-01-17 04:39:00 +00:00

summary: How inbound audio/voice notes are downloaded, transcribed, and injected into replies
read_when: Changing audio transcription or media handling

Audio / Voice Notes — 2026-01-17

What works

Media understanding (audio): If tools.media.audio is enabled (or a shared tools.media.models entry supports audio), Clawdbot:
1. Locates the first audio attachment (local path or URL) and downloads it if needed.
2. Enforces maxBytes before sending to each model entry.
3. Runs the first eligible model entry in order (provider or CLI).
4. If that entry fails or is skipped (size limit or timeout), tries the next entry.
5. On success, it replaces Body with an [Audio] block and sets {{Transcript}}.
Command parsing: When transcription succeeds, CommandBody/RawBody are set to the transcript so slash commands still work.
Verbose logging: With --verbose, Clawdbot logs when transcription runs and when the transcript replaces the body.
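
The ordered fallback described in the steps above can be sketched as follows. This is a minimal illustration in Python, not Clawdbot's actual implementation: `ModelEntry`, `transcribe_with_fallback`, and the `transcribe` callable are hypothetical names, and the sketch assumes a failing entry surfaces as an exception.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

DEFAULT_MAX_BYTES = 20 * 1024 * 1024  # 20971520, the value used in the config example


@dataclass
class ModelEntry:
    """Hypothetical stand-in for one entry in tools.media.audio.models."""
    name: str
    transcribe: Callable[[bytes], str]  # assumed to raise on failure or timeout
    max_bytes: int = DEFAULT_MAX_BYTES


def transcribe_with_fallback(audio: bytes, entries: Sequence[ModelEntry]) -> Optional[str]:
    """Try entries in order and return the first transcript, or None if all fail."""
    for entry in entries:
        if len(audio) > entry.max_bytes:
            continue  # oversize for this entry: skip it and try the next one
        try:
            return entry.transcribe(audio)
        except Exception:
            continue  # provider/CLI failure or timeout: fall through to the next entry
    return None
```

Returning `None` when every entry is skipped or fails leaves it to the caller to decide what to do with the original message body.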

Config examples

Provider + CLI fallback (OpenAI + Whisper CLI)

{
  tools: {
    media: {
      audio: {
        enabled: true,
        maxBytes: 20971520,
        models: [
          { provider: "openai", model: "whisper-1" },
          {
            type: "cli",
            command: "whisper",
            args: ["--model", "base", "{{MediaPath}}"],
            timeoutSeconds: 45
          }
        ]
      }
    }
  }
}
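
To make the CLI entry above more concrete, here is a rough Python sketch of how such an entry might be executed. It is an assumption-laden illustration, not Clawdbot's code: `run_cli_entry` is a hypothetical helper, the 60-second default and 5 MB stdout cap are taken from the notes in this document, and truncating (rather than rejecting) oversized output is a guess.

```python
import subprocess
from typing import Any, Mapping, Optional

MAX_STDOUT_BYTES = 5 * 1024 * 1024  # stdout cap mentioned in this document


def run_cli_entry(entry: Mapping[str, Any], media_path: str) -> Optional[str]:
    """Run a CLI model entry and return its stdout as a transcript, or None on failure."""
    # Substitute the {{MediaPath}} placeholder in each argument.
    args = [arg.replace("{{MediaPath}}", media_path) for arg in entry.get("args", [])]
    try:
        proc = subprocess.run(
            [entry["command"], *args],
            capture_output=True,
            timeout=entry.get("timeoutSeconds", 60),
        )
    except (subprocess.TimeoutExpired, OSError):
        return None  # timed out or the command could not be started
    if proc.returncode != 0:
        return None  # a non-zero exit is treated as a failed entry
    # Assumption: output past the cap is simply dropped after capture.
    text = proc.stdout[:MAX_STDOUT_BYTES].decode("utf-8", errors="replace").strip()
    return text or None
```

Passing the command and arguments as a list avoids shell interpretation, so a media path containing spaces or quotes is handed to the tool unchanged.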

Provider-only with scope gating

{
  tools: {
    media: {
      audio: {
        enabled: true,
        scope: {
          default: "allow",
          rules: [
            { action: "deny", match: { chatType: "group" } }
          ]
        },
        models: [
          { provider: "openai", model: "whisper-1" }
        ]
      }
    }
  }
}

Notes & limits

Provider auth follows the standard model auth order (auth profiles, env vars, models.providers.*.apiKey).
Default size cap is 20MB (tools.media.audio.maxBytes). Oversize audio is skipped for that model and the next entry is tried.
Default maxChars for audio is unset (full transcript). Set tools.media.audio.maxChars or per-entry maxChars to trim output.
Use tools.media.audio.attachments to process multiple voice notes (mode: "all" + maxAttachments).
Transcript is available to templates as {{Transcript}}.
CLI stdout is capped (5MB); keep CLI output concise.

Gotchas

Scope rules are evaluated in order and the first matching rule wins. chatType is normalized to direct, group, or room.
Ensure your CLI exits 0 and prints plain text; if it emits JSON, extract the text first (for example by piping through jq -r .text).
Keep timeouts reasonable (timeoutSeconds, default 60s) to avoid blocking the reply queue.
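
The first-match behavior of scope rules can be illustrated with a short Python sketch. It mirrors the shape of the scope block in the config example but is not Clawdbot's implementation: `scope_allows` is a hypothetical name, only the chatType match key is modeled, and treating a rule without chatType as matching everything (and a missing default as "allow") are assumptions.

```python
from typing import Any, Mapping


def scope_allows(scope: Mapping[str, Any], chat_type: str) -> bool:
    """Return True if audio transcription is allowed for the given chat type."""
    for rule in scope.get("rules", []):
        match = rule.get("match", {})
        # Assumption: a rule with no chatType constraint matches any chat.
        if "chatType" not in match or match["chatType"] == chat_type:
            return rule["action"] == "allow"  # first matching rule decides
    # No rule matched: fall back to the scope default.
    return scope.get("default", "allow") == "allow"


scope = {
    "default": "allow",
    "rules": [{"action": "deny", "match": {"chatType": "group"}}],
}
print(scope_allows(scope, "group"))   # False
print(scope_allows(scope, "direct"))  # True
```

Because evaluation stops at the first match, rule order matters: a broad rule placed early will shadow any more specific rule listed after it.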
