feat: transcribe audio and surface transcript to prompts

This commit is contained in:
Peter Steinberger
2025-11-25 23:13:22 +01:00
parent 7d0ae151e8
commit e642f128ae
6 changed files with 169 additions and 100 deletions

47
docs/audio.md Normal file
View File

@@ -0,0 +1,47 @@
# Audio / Voice Notes — 2025-11-25
## What works
- **Optional transcription**: If `inbound.transcribeAudio.command` is set in `~/.warelay/warelay.json`, warelay will:
1) Download inbound audio (Web or Twilio) to a temp path if only a URL is present.
2) Run the configured CLI (templated with `{{MediaPath}}`), expecting transcript on stdout.
3) Replace `Body` with the transcript, set `{{Transcript}}`, and prepend the original media path plus a `Transcript:` section in the command prompt so models see both.
4) Continue through the normal auto-reply pipeline (templating, sessions, Claude/command).
- **Verbose logging**: In `--verbose`, we log when transcription runs and when the transcript replaces the body.
## Config example (OpenAI Whisper CLI)
Requires `OPENAI_API_KEY` in env and `openai` CLI installed:
```json5
{
inbound: {
transcribeAudio: {
command: [
"openai",
"api",
"audio.transcriptions.create",
"-m",
"whisper-1",
"-f",
"{{MediaPath}}",
"--response-format",
"text"
],
timeoutSeconds: 45
},
reply: {
mode: "command",
command: ["claude", "{{Body}}"]
}
}
}
```
## Notes & limits
- We dont ship a transcriber; you opt in with any CLI that prints text to stdout (Whisper cloud, whisper.cpp, vosk, Deepgram, etc.).
- Size guard: inbound audio must be ≤5MB (same as other media).
- If transcription fails, we fall back to the original body/media note; replies still go through.
- Transcript is available to templates as `{{Transcript}}`; models get both the media path and a `Transcript:` block in the prompt when using command mode.
## Gotchas
- Ensure your CLI exits 0 and prints plain text; JSON needs to be massaged via `jq -r .text`.
- Keep timeouts reasonable (`timeoutSeconds`, default 45s) to avoid blocking the reply queue.
- Twilio paths are hosted URLs; Web paths are local. The temp download uses HTTPS for Twilio and a temp file for Web-only media.

View File

@@ -57,7 +57,7 @@ This document defines how `warelay` should handle sending and replying with imag
- `{{MediaUrl}}` original URL (Twilio) or pseudo-URL (web).
- `{{MediaPath}}` local temp path written before running the command.
- Size guard: only download if ≤5MB; else skip and log.
- Audio/voice notes: if you set `inbound.transcribeAudio.command`, warelay will run that CLI (templated with `{{MediaPath}}`) and replace `Body` with the transcript before continuing the reply flow; verbose logs indicate when transcription runs.
- Audio/voice notes: if you set `inbound.transcribeAudio.command`, warelay will run that CLI (templated with `{{MediaPath}}`) and replace `Body` with the transcript before continuing the reply flow; verbose logs indicate when transcription runs. The command prompt includes the original media path plus a `Transcript:` section so the model sees both.
## Errors & Messaging
- Local path with twilio + Funnel disabled → error: “Twilio media needs a public URL; start `warelay webhook --ingress tailscale` or pass an https:// URL.”