feat: default TTS model overrides on (#1559) (thanks @Glucksberg)
Co-authored-by: Glucksberg <80581902+Glucksberg@users.noreply.github.com>
This commit is contained in:
@@ -1459,13 +1459,28 @@ voice notes; other channels send MP3 audio.
|
||||
enabled: true,
|
||||
mode: "final", // final | all (include tool/block replies)
|
||||
provider: "elevenlabs",
|
||||
summaryModel: "openai/gpt-4.1-mini",
|
||||
modelOverrides: {
|
||||
enabled: true
|
||||
},
|
||||
maxTextLength: 4000,
|
||||
timeoutMs: 30000,
|
||||
prefsPath: "~/.clawdbot/settings/tts.json",
|
||||
elevenlabs: {
|
||||
apiKey: "elevenlabs_api_key",
|
||||
baseUrl: "https://api.elevenlabs.io",
|
||||
voiceId: "voice_id",
|
||||
modelId: "eleven_multilingual_v2"
|
||||
modelId: "eleven_multilingual_v2",
|
||||
seed: 42,
|
||||
applyTextNormalization: "auto",
|
||||
languageCode: "en",
|
||||
voiceSettings: {
|
||||
stability: 0.5,
|
||||
similarityBoost: 0.75,
|
||||
style: 0.0,
|
||||
useSpeakerBoost: true,
|
||||
speed: 1.0
|
||||
}
|
||||
},
|
||||
openai: {
|
||||
apiKey: "openai_api_key",
|
||||
@@ -1478,11 +1493,17 @@ voice notes; other channels send MP3 audio.
|
||||
```
|
||||
|
||||
Notes:
|
||||
- `messages.tts.enabled` can be overridden by local user prefs (see `/tts_on`, `/tts_off`).
|
||||
- `messages.tts.enabled` can be overridden by local user prefs (see `/tts on`, `/tts off`).
|
||||
- `prefsPath` stores local overrides (enabled/provider/limit/summarize).
|
||||
- `maxTextLength` is a hard cap for TTS input; summaries are truncated to fit.
|
||||
- `/tts_limit` and `/tts_summary` control per-user summarization settings.
|
||||
- `summaryModel` overrides `agents.defaults.model.primary` for auto-summary.
|
||||
- Accepts `provider/model` or an alias from `agents.defaults.models`.
|
||||
- `modelOverrides` enables model-driven overrides like `[[tts:...]]` tags (on by default).
|
||||
- `/tts limit` and `/tts summary` control per-user summarization settings.
|
||||
- `apiKey` values fall back to `ELEVENLABS_API_KEY`/`XI_API_KEY` and `OPENAI_API_KEY`.
|
||||
- `elevenlabs.baseUrl` overrides the ElevenLabs API base URL.
|
||||
- `elevenlabs.voiceSettings` supports `stability`/`similarityBoost`/`style` (0..1),
|
||||
`useSpeakerBoost`, and `speed` (0.5..2.0).
|
||||
|
||||
### `talk`
|
||||
|
||||
|
||||
@@ -67,13 +67,7 @@ Text + native (when enabled):
|
||||
- `/config show|get|set|unset` (persist config to disk, owner-only; requires `commands.config: true`)
|
||||
- `/debug show|set|unset|reset` (runtime overrides, owner-only; requires `commands.debug: true`)
|
||||
- `/usage off|tokens|full|cost` (per-response usage footer or local cost summary)
|
||||
- `/tts_on` (enable TTS replies)
|
||||
- `/tts_off` (disable TTS replies)
|
||||
- `/tts_provider [openai|elevenlabs]` (set or show TTS provider)
|
||||
- `/tts_limit <chars>` (max chars before TTS summarization)
|
||||
- `/tts_summary on|off` (toggle TTS auto-summary)
|
||||
- `/tts_status` (show TTS status)
|
||||
- `/audio <text>` (convert text to a TTS audio reply)
|
||||
- `/tts on|off|status|provider|limit|summary|audio` (control TTS; see [/tts](/tts))
|
||||
- `/stop`
|
||||
- `/restart`
|
||||
- `/dock-telegram` (alias: `/dock_telegram`) (switch replies to Telegram)
|
||||
|
||||
293
docs/tts.md
Normal file
293
docs/tts.md
Normal file
@@ -0,0 +1,293 @@
|
||||
---
|
||||
summary: "Text-to-speech (TTS) for outbound replies"
|
||||
read_when:
|
||||
- Enabling text-to-speech for replies
|
||||
- Configuring TTS providers or limits
|
||||
- Using /tts commands
|
||||
---
|
||||
|
||||
# Text-to-speech (TTS)
|
||||
|
||||
Clawdbot can convert outbound replies into audio using ElevenLabs or OpenAI.
|
||||
It works anywhere Clawdbot can send audio; Telegram gets a round voice-note bubble.
|
||||
|
||||
## Supported services
|
||||
|
||||
- **ElevenLabs** (primary or fallback provider)
|
||||
- **OpenAI** (primary or fallback provider; also used for summaries)
|
||||
|
||||
## Required keys
|
||||
|
||||
At least one of:
|
||||
- `ELEVENLABS_API_KEY` (or `XI_API_KEY`)
|
||||
- `OPENAI_API_KEY`
|
||||
|
||||
If both are configured, the selected provider is used first and the other is a fallback.
|
||||
Auto-summary uses the configured `summaryModel` (or `agents.defaults.model.primary`),
|
||||
so that provider must also be authenticated if you enable summaries.
|
||||
|
||||
## Service links
|
||||
|
||||
- [OpenAI Text-to-Speech guide](https://platform.openai.com/docs/guides/text-to-speech)
|
||||
- [OpenAI Audio API reference](https://platform.openai.com/docs/api-reference/audio)
|
||||
- [ElevenLabs Text to Speech](https://elevenlabs.io/docs/api-reference/text-to-speech)
|
||||
- [ElevenLabs Authentication](https://elevenlabs.io/docs/api-reference/authentication)
|
||||
|
||||
## Is it enabled by default?
|
||||
|
||||
No. TTS is **disabled** by default. Enable it in config or with `/tts on`,
|
||||
which writes a local preference override.
|
||||
|
||||
## Config
|
||||
|
||||
TTS config lives under `messages.tts` in `clawdbot.json`.
|
||||
Full schema is in [Gateway configuration](/gateway/configuration).
|
||||
|
||||
### Minimal config (enable + provider)
|
||||
|
||||
```json5
|
||||
{
|
||||
messages: {
|
||||
tts: {
|
||||
enabled: true,
|
||||
provider: "elevenlabs"
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### OpenAI primary with ElevenLabs fallback
|
||||
|
||||
```json5
|
||||
{
|
||||
messages: {
|
||||
tts: {
|
||||
enabled: true,
|
||||
provider: "openai",
|
||||
summaryModel: "openai/gpt-4.1-mini",
|
||||
modelOverrides: {
|
||||
enabled: true
|
||||
},
|
||||
openai: {
|
||||
apiKey: "openai_api_key",
|
||||
model: "gpt-4o-mini-tts",
|
||||
voice: "alloy"
|
||||
},
|
||||
elevenlabs: {
|
||||
apiKey: "elevenlabs_api_key",
|
||||
baseUrl: "https://api.elevenlabs.io",
|
||||
voiceId: "voice_id",
|
||||
modelId: "eleven_multilingual_v2",
|
||||
seed: 42,
|
||||
applyTextNormalization: "auto",
|
||||
languageCode: "en",
|
||||
voiceSettings: {
|
||||
stability: 0.5,
|
||||
similarityBoost: 0.75,
|
||||
style: 0.0,
|
||||
useSpeakerBoost: true,
|
||||
speed: 1.0
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### Custom limits + prefs path
|
||||
|
||||
```json5
|
||||
{
|
||||
messages: {
|
||||
tts: {
|
||||
enabled: true,
|
||||
maxTextLength: 4000,
|
||||
timeoutMs: 30000,
|
||||
prefsPath: "~/.clawdbot/settings/tts.json"
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### Disable auto-summary for long replies
|
||||
|
||||
```json5
|
||||
{
|
||||
messages: {
|
||||
tts: {
|
||||
enabled: true
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
Then run:
|
||||
|
||||
```
|
||||
/tts summary off
|
||||
```
|
||||
|
||||
### Notes on fields
|
||||
|
||||
- `enabled`: master toggle (default `false`; local prefs can override).
|
||||
- `mode`: `"final"` (default) or `"all"` (includes tool/block replies).
|
||||
- `provider`: `"elevenlabs"` or `"openai"` (fallback is automatic).
|
||||
- `summaryModel`: optional cheap model for auto-summary; defaults to `agents.defaults.model.primary`.
|
||||
- Accepts `provider/model` or a configured model alias.
|
||||
- `modelOverrides`: allow the model to emit TTS directives (on by default).
|
||||
- `maxTextLength`: hard cap for TTS input (chars). `/tts audio` fails if exceeded.
|
||||
- `timeoutMs`: request timeout (ms).
|
||||
- `prefsPath`: override the local prefs JSON path.
|
||||
- `apiKey` values fall back to env vars (`ELEVENLABS_API_KEY`/`XI_API_KEY`, `OPENAI_API_KEY`).
|
||||
- `elevenlabs.baseUrl`: override ElevenLabs API base URL.
|
||||
- `elevenlabs.voiceSettings`:
|
||||
- `stability`, `similarityBoost`, `style`: `0..1`
|
||||
- `useSpeakerBoost`: `true|false`
|
||||
- `speed`: `0.5..2.0` (1.0 = normal)
|
||||
- `elevenlabs.applyTextNormalization`: `auto|on|off`
|
||||
- `elevenlabs.languageCode`: 2-letter ISO 639-1 (e.g. `en`, `de`)
|
||||
- `elevenlabs.seed`: integer `0..4294967295` (best-effort determinism)
|
||||
|
||||
## Model-driven overrides (default on)
|
||||
|
||||
By default, the model **can** emit TTS directives for a single reply.
|
||||
|
||||
When enabled, the model can emit `[[tts:...]]` directives to override the voice
|
||||
for a single reply, plus an optional `[[tts:text]]...[[/tts:text]]` block to
|
||||
provide expressive tags (laughter, singing cues, etc) that should only appear in
|
||||
the audio.
|
||||
|
||||
Example reply payload:
|
||||
|
||||
```
|
||||
Here you go.
|
||||
|
||||
[[tts:provider=elevenlabs voiceId=pMsXgVXv3BLzUgSXRplE model=eleven_v3 speed=1.1]]
|
||||
[[tts:text]](laughs) Read the song once more.[[/tts:text]]
|
||||
```
|
||||
|
||||
Available directive keys (when enabled):
|
||||
- `provider` (`openai` | `elevenlabs`)
|
||||
- `voice` (OpenAI voice) or `voiceId` (ElevenLabs)
|
||||
- `model` (OpenAI TTS model or ElevenLabs model id)
|
||||
- `stability`, `similarityBoost`, `style`, `speed`, `useSpeakerBoost`
|
||||
- `applyTextNormalization` (`auto|on|off`)
|
||||
- `languageCode` (ISO 639-1)
|
||||
- `seed`
|
||||
|
||||
Disable all model overrides:
|
||||
|
||||
```json5
|
||||
{
|
||||
messages: {
|
||||
tts: {
|
||||
modelOverrides: {
|
||||
enabled: false
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
Optional allowlist (disable specific overrides while keeping tags enabled):
|
||||
|
||||
```json5
|
||||
{
|
||||
messages: {
|
||||
tts: {
|
||||
modelOverrides: {
|
||||
enabled: true,
|
||||
allowProvider: false,
|
||||
allowSeed: false
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## Per-user preferences
|
||||
|
||||
Slash commands write local overrides to `prefsPath` (default:
|
||||
`~/.clawdbot/settings/tts.json`, override with `CLAWDBOT_TTS_PREFS` or
|
||||
`messages.tts.prefsPath`).
|
||||
|
||||
Stored fields:
|
||||
- `enabled`
|
||||
- `provider`
|
||||
- `maxLength` (summary threshold; default 1500 chars)
|
||||
- `summarize` (default `true`)
|
||||
|
||||
These override `messages.tts.*` for that host.
|
||||
|
||||
## Output formats (fixed)
|
||||
|
||||
- **Telegram**: Opus voice note (`opus_48000_64` from ElevenLabs, `opus` from OpenAI).
|
||||
- 48kHz / 64kbps is a good voice-note tradeoff and required for the round bubble.
|
||||
- **Other channels**: MP3 (`mp3_44100_128` from ElevenLabs, `mp3` from OpenAI).
|
||||
- 44.1kHz / 128kbps is the default balance for speech clarity.
|
||||
|
||||
This is not configurable; Telegram expects Opus for voice-note UX.
|
||||
|
||||
## Auto-TTS behavior
|
||||
|
||||
When enabled, Clawdbot:
|
||||
- skips TTS if the reply already contains media or a `MEDIA:` directive.
|
||||
- skips very short replies (< 10 chars).
|
||||
- summarizes long replies when enabled using `agents.defaults.model.primary` (or `summaryModel`).
|
||||
- attaches the generated audio to the reply.
|
||||
|
||||
If the reply exceeds `maxLength` and summary is off (or no API key for the
|
||||
summary model), audio
|
||||
is skipped and the normal text reply is sent.
|
||||
|
||||
## Flow diagram
|
||||
|
||||
```
|
||||
Reply -> TTS enabled?
|
||||
no -> send text
|
||||
yes -> has media / MEDIA: / short?
|
||||
yes -> send text
|
||||
no -> length > limit?
|
||||
no -> TTS -> attach audio
|
||||
yes -> summary enabled?
|
||||
no -> send text
|
||||
yes -> summarize (summaryModel or agents.defaults.model.primary)
|
||||
-> TTS -> attach audio
|
||||
```
|
||||
|
||||
## Slash command usage
|
||||
|
||||
There is a single command: `/tts`.
|
||||
See [Slash commands](/tools/slash-commands) for enablement details.
|
||||
|
||||
```
|
||||
/tts on
|
||||
/tts off
|
||||
/tts status
|
||||
/tts provider openai
|
||||
/tts limit 2000
|
||||
/tts summary off
|
||||
/tts audio Hello from Clawdbot
|
||||
```
|
||||
|
||||
Notes:
|
||||
- Commands require an authorized sender (allowlist/owner rules still apply).
|
||||
- `commands.text` or native command registration must be enabled.
|
||||
- `limit` and `summary` are stored in local prefs, not the main config.
|
||||
- `/tts audio` generates a one-off audio reply (does not toggle TTS on).
|
||||
|
||||
## Agent tool
|
||||
|
||||
The `tts` tool converts text to speech and returns a `MEDIA:` path. When the
|
||||
result is Telegram-compatible, the tool includes `[[audio_as_voice]]` so
|
||||
Telegram sends a voice bubble.
|
||||
|
||||
## Gateway RPC
|
||||
|
||||
Gateway methods:
|
||||
- `tts.status`
|
||||
- `tts.enable`
|
||||
- `tts.disable`
|
||||
- `tts.convert`
|
||||
- `tts.setProvider`
|
||||
- `tts.providers`
|
||||
Reference in New Issue
Block a user