feat: add Deepgram audio transcription
Co-authored-by: Safzan Pirani <safzanpirani@users.noreply.github.com>
This commit is contained in:
@@ -62,8 +62,24 @@ read_when:
|
||||
}
|
||||
```
|
||||
|
||||
### Provider-only (Deepgram)
|
||||
```json5
|
||||
{
|
||||
tools: {
|
||||
media: {
|
||||
audio: {
|
||||
enabled: true,
|
||||
models: [{ provider: "deepgram", model: "nova-3" }]
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## Notes & limits
|
||||
- Provider auth follows the standard model auth order (auth profiles, env vars, `models.providers.*.apiKey`).
|
||||
- Deepgram picks up `DEEPGRAM_API_KEY` when `provider: "deepgram"` is used.
|
||||
- Deepgram setup details: [Deepgram (audio transcription)](/providers/deepgram).
|
||||
- Default size cap is 20MB (`tools.media.audio.maxBytes`). Oversize audio is skipped for that model and the next entry is tried.
|
||||
- Default `maxChars` for audio is **unset** (full transcript). Set `tools.media.audio.maxChars` or per-entry `maxChars` to trim output.
|
||||
- Use `tools.media.audio.attachments` to process multiple voice notes (`mode: "all"` + `maxAttachments`).
|
||||
|
||||
@@ -108,6 +108,7 @@ lists, Clawdbot can infer defaults:
|
||||
- `openai`, `anthropic`, `minimax`: **image**
|
||||
- `google` (Gemini API): **image + audio + video**
|
||||
- `groq`: **audio**
|
||||
- `deepgram`: **audio**
|
||||
|
||||
For CLI entries, **set `capabilities` explicitly** to avoid surprising matches.
|
||||
If you omit `capabilities`, the entry is eligible for the list it appears in.
|
||||
@@ -116,7 +117,7 @@ If you omit `capabilities`, the entry is eligible for the list it appears in.
|
||||
| Capability | Provider integration | Notes |
|
||||
|------------|----------------------|-------|
|
||||
| Image | OpenAI / Anthropic / Google / others via `pi-ai` | Any image-capable model in the registry works. |
|
||||
| Audio | OpenAI, Groq | Provider transcription (Whisper). |
|
||||
| Audio | OpenAI, Groq, Deepgram | Provider transcription (Whisper/Deepgram). |
|
||||
| Video | Google (Gemini API) | Provider video understanding. |
|
||||
|
||||
## Recommended providers
|
||||
@@ -125,8 +126,9 @@ If you omit `capabilities`, the entry is eligible for the list it appears in.
|
||||
- Good defaults: `openai/gpt-5.2`, `anthropic/claude-opus-4-5`, `google/gemini-3-pro-preview`.
|
||||
|
||||
**Audio**
|
||||
- `openai/whisper-1` or `groq/whisper-large-v3-turbo`.
|
||||
- `openai/whisper-1`, `groq/whisper-large-v3-turbo`, or `deepgram/nova-3`.
|
||||
- CLI fallback: `whisper` binary.
|
||||
- Deepgram setup: [Deepgram (audio transcription)](/providers/deepgram).
|
||||
|
||||
**Video**
|
||||
- `google/gemini-3-flash-preview` (fast), `google/gemini-3-pro-preview` (richer).
|
||||
|
||||
64
docs/providers/deepgram.md
Normal file
64
docs/providers/deepgram.md
Normal file
@@ -0,0 +1,64 @@
|
||||
---
|
||||
summary: "Deepgram transcription for inbound voice notes"
|
||||
read_when:
|
||||
- You want Deepgram speech-to-text for audio attachments
|
||||
- You need a quick Deepgram config example
|
||||
---
|
||||
# Deepgram (Audio Transcription)
|
||||
|
||||
Deepgram is a speech-to-text API. In Clawdbot it is used for **inbound audio/voice note
|
||||
transcription** via `tools.media.audio`.
|
||||
|
||||
When enabled, Clawdbot uploads the audio file to Deepgram and injects the transcript
|
||||
into the reply pipeline (`{{Transcript}}` + `[Audio]` block). This is **not streaming**;
|
||||
it uses the pre-recorded transcription endpoint.
|
||||
|
||||
Website: https://deepgram.com
|
||||
Docs: https://developers.deepgram.com
|
||||
|
||||
## Quick start
|
||||
|
||||
1) Set your API key:
|
||||
```
|
||||
DEEPGRAM_API_KEY=dg_...
|
||||
```
|
||||
|
||||
2) Enable the provider:
|
||||
```json5
|
||||
{
|
||||
tools: {
|
||||
media: {
|
||||
audio: {
|
||||
enabled: true,
|
||||
models: [{ provider: "deepgram", model: "nova-3" }]
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## Options
|
||||
|
||||
- `model`: Deepgram model id (default: `nova-3`)
|
||||
- `language`: language hint (optional)
|
||||
|
||||
Example with language:
|
||||
```json5
|
||||
{
|
||||
tools: {
|
||||
media: {
|
||||
audio: {
|
||||
enabled: true,
|
||||
models: [
|
||||
{ provider: "deepgram", model: "nova-3", language: "en" }
|
||||
]
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## Notes
|
||||
|
||||
- Authentication follows the standard provider auth order; `DEEPGRAM_API_KEY` is the simplest path.
|
||||
- Output follows the same audio rules as other providers (size caps, timeouts, transcript injection).
|
||||
@@ -34,5 +34,9 @@ Looking for chat channel docs (WhatsApp/Telegram/Discord/Slack/etc.)? See [Chann
|
||||
- [GLM models](/providers/glm)
|
||||
- [MiniMax](/providers/minimax)
|
||||
|
||||
## Transcription providers
|
||||
|
||||
- [Deepgram (audio transcription)](/providers/deepgram)
|
||||
|
||||
For the full provider catalog (xAI, Groq, Mistral, etc.) and advanced configuration,
|
||||
see [Model providers](/concepts/model-providers).
|
||||
|
||||
@@ -290,6 +290,11 @@ Live tests discover credentials the same way the CLI does. Practical implication
|
||||
|
||||
If you want to rely on env keys (e.g. exported in your `~/.profile`), run local tests after `source ~/.profile`, or use the Docker runners below (they can mount `~/.profile` into the container).
|
||||
|
||||
## Deepgram live (audio transcription)
|
||||
|
||||
- Test: `src/media-understanding/providers/deepgram/audio.live.test.ts`
|
||||
- Enable: `DEEPGRAM_API_KEY=... DEEPGRAM_LIVE_TEST=1 pnpm test:live src/media-understanding/providers/deepgram/audio.live.test.ts`
|
||||
|
||||
## Docker runners (optional “works in Linux” checks)
|
||||
|
||||
These run `pnpm test:live` inside the repo Docker image, mounting your local config dir and workspace (and sourcing `~/.profile` if mounted):
|
||||
|
||||
Reference in New Issue
Block a user