feat: transcribe audio and surface transcript to prompts

2025-11-25 23:13:22 +01:00
parent 7d0ae151e8
commit e642f128ae
6 changed files with 169 additions and 100 deletions
--- a/docs/audio.md
+++ b/docs/audio.md
@@ -0,0 +1,47 @@
+# Audio / Voice Notes — 2025-11-25
+
+## What works
+- **Optional transcription**: If `inbound.transcribeAudio.command` is set in `~/.warelay/warelay.json`, warelay will:
+  1) Download inbound audio (Web or Twilio) to a temp path if only a URL is present.
+  2) Run the configured CLI (templated with `{{MediaPath}}`), expecting the transcript on stdout.
+  3) Replace `Body` with the transcript, set `{{Transcript}}`, and prepend the original media path plus a `Transcript:` section to the command prompt so models see both.
+  4) Continue through the normal auto-reply pipeline (templating, sessions, Claude/command).
+- **Verbose logging**: With `--verbose`, warelay logs when transcription runs and when the transcript replaces the body.
+
+## Config example (OpenAI Whisper CLI)
+Requires `OPENAI_API_KEY` in the environment and the `openai` CLI installed:
+```json5
+{
+  inbound: {
+    transcribeAudio: {
+      command: [
+        "openai",
+        "api",
+        "audio.transcriptions.create",
+        "-m",
+        "whisper-1",
+        "-f",
+        "{{MediaPath}}",
+        "--response-format",
+        "text"
+      ],
+      timeoutSeconds: 45
+    },
+    reply: {
+      mode: "command",
+      command: ["claude", "{{Body}}"]
+    }
+  }
+}
+```
+
+## Notes & limits
+- We don’t ship a transcriber; you opt in with any CLI that prints text to stdout (Whisper cloud, whisper.cpp, vosk, Deepgram, etc.).
+- Size guard: inbound audio must be ≤5 MB (same as other media).
+- If transcription fails, we fall back to the original body/media note; replies still go through.
+- Transcript is available to templates as `{{Transcript}}`; models get both the media path and a `Transcript:` block in the prompt when using command mode.
+
+## Gotchas
+- Ensure your CLI exits 0 and prints plain text; if it emits JSON, extract the transcript field first (for example with `jq -r .text`).
+- Keep timeouts reasonable (`timeoutSeconds`, default 45s) to avoid blocking the reply queue.
+- Twilio media paths are hosted URLs, while Web media paths are local: the temp download fetches Twilio audio over HTTPS, and Web-only media uses a local temp file.
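+
+If your transcriber only emits JSON, one way to meet the plain-text requirement is to wrap it in a small shell pipeline. The fragment below is a sketch, not a tested configuration: `my-transcriber` and its `--json` flag are hypothetical placeholders, the `.text` field depends on your tool's output shape, and it assumes warelay runs `command` as an argv array without an intermediate shell.
+```json5
+{
+  inbound: {
+    transcribeAudio: {
+      // "sh -c" runs the pipeline; the media path is passed as "$1" so the
+      // templated value is never spliced into the shell script itself.
+      command: [
+        "sh",
+        "-c",
+        "my-transcriber --json \"$1\" | jq -r .text", // hypothetical CLI
+        "sh",
+        "{{MediaPath}}"
+      ],
+      timeoutSeconds: 45
+    }
+  }
+}
+```
+Note that a plain `sh` pipeline exits with the status of its last command, so a failure in the transcriber may only show up as empty or `null` output from `jq`.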
--- a/docs/images.md
+++ b/docs/images.md
@@ -57,7 +57,7 @@ This document defines how `warelay` should handle sending and replying with imag
  - `{{MediaUrl}}` original URL (Twilio) or pseudo-URL (web).
  - `{{MediaPath}}` local temp path written before running the command.
 - Size guard: only download if ≤5 MB; else skip and log.
- Audio/voice notes: if you set `inbound.transcribeAudio.command`, warelay will run that CLI (templated with `{{MediaPath}}`) and replace `Body` with the transcript before continuing the reply flow; verbose logs indicate when transcription runs.
+- Audio/voice notes: if you set `inbound.transcribeAudio.command`, warelay will run that CLI (templated with `{{MediaPath}}`) and replace `Body` with the transcript before continuing the reply flow; verbose logs indicate when transcription runs. The command prompt includes the original media path plus a `Transcript:` section so the model sees both.

 ## Errors & Messaging
 - Local path with twilio + Funnel disabled → error: “Twilio media needs a public URL; start `warelay webhook --ingress tailscale` or pass an https:// URL.”