Files
clawdbot/docs/tts.md
2026-01-24 09:58:06 +00:00

297 lines
7.7 KiB
Markdown

---
summary: "Text-to-speech (TTS) for outbound replies"
read_when:
- Enabling text-to-speech for replies
- Configuring TTS providers or limits
- Using /tts commands
---
# Text-to-speech (TTS)
Clawdbot can convert outbound replies into audio using ElevenLabs or OpenAI.
It works anywhere Clawdbot can send audio; Telegram gets a round voice-note bubble.
## Supported services
- **ElevenLabs** (primary or fallback provider)
- **OpenAI** (primary or fallback provider; also used for summaries)
## Required keys
At least one of:
- `ELEVENLABS_API_KEY` (or `XI_API_KEY`)
- `OPENAI_API_KEY`
If both are configured, the selected provider is used first and the other is a fallback.
Auto-summary uses the configured `summaryModel` (or `agents.defaults.model.primary`),
so that provider must also be authenticated if you enable summaries.
## Service links
- [OpenAI Text-to-Speech guide](https://platform.openai.com/docs/guides/text-to-speech)
- [OpenAI Audio API reference](https://platform.openai.com/docs/api-reference/audio)
- [ElevenLabs Text to Speech](https://elevenlabs.io/docs/api-reference/text-to-speech)
- [ElevenLabs Authentication](https://elevenlabs.io/docs/api-reference/authentication)
## Is it enabled by default?
No. TTS is **disabled** by default. Enable it in config or with `/tts on`,
which writes a local preference override.
## Config
TTS config lives under `messages.tts` in `clawdbot.json`.
Full schema is in [Gateway configuration](/gateway/configuration).
### Minimal config (enable + provider)
```json5
{
messages: {
tts: {
enabled: true,
provider: "elevenlabs"
}
}
}
```
### OpenAI primary with ElevenLabs fallback
```json5
{
messages: {
tts: {
enabled: true,
provider: "openai",
summaryModel: "openai/gpt-4.1-mini",
modelOverrides: {
enabled: true
},
openai: {
apiKey: "openai_api_key",
model: "gpt-4o-mini-tts",
voice: "alloy"
},
elevenlabs: {
apiKey: "elevenlabs_api_key",
baseUrl: "https://api.elevenlabs.io",
voiceId: "voice_id",
modelId: "eleven_multilingual_v2",
seed: 42,
applyTextNormalization: "auto",
languageCode: "en",
voiceSettings: {
stability: 0.5,
similarityBoost: 0.75,
style: 0.0,
useSpeakerBoost: true,
speed: 1.0
}
}
}
}
}
```
### Custom limits + prefs path
```json5
{
messages: {
tts: {
enabled: true,
maxTextLength: 4000,
timeoutMs: 30000,
prefsPath: "~/.clawdbot/settings/tts.json"
}
}
}
```
### Disable auto-summary for long replies
```json5
{
messages: {
tts: {
enabled: true
}
}
}
```
Then run:
```
/tts summary off
```
### Notes on fields
- `enabled`: master toggle (default `false`; local prefs can override).
- `mode`: `"final"` (default) or `"all"` (includes tool/block replies).
- `provider`: `"elevenlabs"` or `"openai"` (fallback is automatic).
- `summaryModel`: optional cheap model for auto-summary; defaults to `agents.defaults.model.primary`.
- Accepts `provider/model` or a configured model alias.
- `modelOverrides`: allow the model to emit TTS directives (on by default).
- `maxTextLength`: hard cap for TTS input (chars). `/tts audio` fails if exceeded.
- `timeoutMs`: request timeout (ms).
- `prefsPath`: override the local prefs JSON path.
- `apiKey` values fall back to env vars (`ELEVENLABS_API_KEY`/`XI_API_KEY`, `OPENAI_API_KEY`).
- `elevenlabs.baseUrl`: override ElevenLabs API base URL.
- `elevenlabs.voiceSettings`:
- `stability`, `similarityBoost`, `style`: `0..1`
- `useSpeakerBoost`: `true|false`
- `speed`: `0.5..2.0` (1.0 = normal)
- `elevenlabs.applyTextNormalization`: `auto|on|off`
- `elevenlabs.languageCode`: 2-letter ISO 639-1 (e.g. `en`, `de`)
- `elevenlabs.seed`: integer `0..4294967295` (best-effort determinism)
## Model-driven overrides (default on)
By default, the model **can** emit TTS directives for a single reply.
When enabled, the model can emit `[[tts:...]]` directives to override the voice
for a single reply, plus an optional `[[tts:text]]...[[/tts:text]]` block to
provide expressive tags (laughter, singing cues, etc) that should only appear in
the audio.
Example reply payload:
```
Here you go.
[[tts:provider=elevenlabs voiceId=pMsXgVXv3BLzUgSXRplE model=eleven_v3 speed=1.1]]
[[tts:text]](laughs) Read the song once more.[[/tts:text]]
```
Available directive keys (when enabled):
- `provider` (`openai` | `elevenlabs`)
- `voice` (OpenAI voice) or `voiceId` (ElevenLabs)
- `model` (OpenAI TTS model or ElevenLabs model id)
- `stability`, `similarityBoost`, `style`, `speed`, `useSpeakerBoost`
- `applyTextNormalization` (`auto|on|off`)
- `languageCode` (ISO 639-1)
- `seed`
Disable all model overrides:
```json5
{
messages: {
tts: {
modelOverrides: {
enabled: false
}
}
}
}
```
Optional allowlist (disable specific overrides while keeping tags enabled):
```json5
{
messages: {
tts: {
modelOverrides: {
enabled: true,
allowProvider: false,
allowSeed: false
}
}
}
}
```
## Per-user preferences
Slash commands write local overrides to `prefsPath` (default:
`~/.clawdbot/settings/tts.json`, override with `CLAWDBOT_TTS_PREFS` or
`messages.tts.prefsPath`).
Stored fields:
- `enabled`
- `provider`
- `maxLength` (summary threshold; default 1500 chars)
- `summarize` (default `true`)
These override `messages.tts.*` for that host.
## Output formats (fixed)
- **Telegram**: Opus voice note (`opus_48000_64` from ElevenLabs, `opus` from OpenAI).
- 48kHz / 64kbps is a good voice-note tradeoff and required for the round bubble.
- **Other channels**: MP3 (`mp3_44100_128` from ElevenLabs, `mp3` from OpenAI).
- 44.1kHz / 128kbps is the default balance for speech clarity.
This is not configurable; Telegram expects Opus for voice-note UX.
## Auto-TTS behavior
When enabled, Clawdbot:
- skips TTS if the reply already contains media or a `MEDIA:` directive.
- skips very short replies (< 10 chars).
- summarizes long replies when enabled using `agents.defaults.model.primary` (or `summaryModel`).
- attaches the generated audio to the reply.
If the reply exceeds `maxLength` and summary is off (or no API key for the
summary model), audio
is skipped and the normal text reply is sent.
## Flow diagram
```
Reply -> TTS enabled?
no -> send text
yes -> has media / MEDIA: / short?
yes -> send text
no -> length > limit?
no -> TTS -> attach audio
yes -> summary enabled?
no -> send text
yes -> summarize (summaryModel or agents.defaults.model.primary)
-> TTS -> attach audio
```
## Slash command usage
There is a single command: `/tts`.
See [Slash commands](/tools/slash-commands) for enablement details.
Discord note: `/tts` is a built-in Discord command, so Clawdbot registers
`/voice` as the native command there. Text `/tts ...` still works.
```
/tts on
/tts off
/tts status
/tts provider openai
/tts limit 2000
/tts summary off
/tts audio Hello from Clawdbot
```
Notes:
- Commands require an authorized sender (allowlist/owner rules still apply).
- `commands.text` or native command registration must be enabled.
- `limit` and `summary` are stored in local prefs, not the main config.
- `/tts audio` generates a one-off audio reply (does not toggle TTS on).
## Agent tool
The `tts` tool converts text to speech and returns a `MEDIA:` path. When the
result is Telegram-compatible, the tool includes `[[audio_as_voice]]` so
Telegram sends a voice bubble.
## Gateway RPC
Gateway methods:
- `tts.status`
- `tts.enable`
- `tts.disable`
- `tts.convert`
- `tts.setProvider`
- `tts.providers`