| summary | read_when |
|---|---|
| Text-to-speech (TTS) for outbound replies | |
Text-to-speech (TTS)
Clawdbot can convert outbound replies into audio using ElevenLabs or OpenAI. It works anywhere Clawdbot can send audio; Telegram gets a round voice-note bubble.
Supported services
- ElevenLabs (primary or fallback provider)
- OpenAI (primary or fallback provider; also used for summaries)
Required keys
At least one of:
- ELEVENLABS_API_KEY (or XI_API_KEY)
- OPENAI_API_KEY
If both are configured, the selected provider is used first and the other is a fallback.
Auto-summary uses the configured summaryModel (or agents.defaults.model.primary),
so that provider must also be authenticated if you enable summaries.
Service links
- OpenAI Text-to-Speech guide
- OpenAI Audio API reference
- ElevenLabs Text to Speech
- ElevenLabs Authentication
Is it enabled by default?
No. TTS is disabled by default. Enable it in config or with /tts on,
which writes a local preference override.
Config
TTS config lives under messages.tts in clawdbot.json.
Full schema is in Gateway configuration.
Minimal config (enable + provider)
{
messages: {
tts: {
enabled: true,
provider: "elevenlabs"
}
}
}
OpenAI primary with ElevenLabs fallback
{
messages: {
tts: {
enabled: true,
provider: "openai",
summaryModel: "openai/gpt-4.1-mini",
modelOverrides: {
enabled: true
},
openai: {
apiKey: "openai_api_key",
model: "gpt-4o-mini-tts",
voice: "alloy"
},
elevenlabs: {
apiKey: "elevenlabs_api_key",
baseUrl: "https://api.elevenlabs.io",
voiceId: "voice_id",
modelId: "eleven_multilingual_v2",
seed: 42,
applyTextNormalization: "auto",
languageCode: "en",
voiceSettings: {
stability: 0.5,
similarityBoost: 0.75,
style: 0.0,
useSpeakerBoost: true,
speed: 1.0
}
}
}
}
}
Custom limits + prefs path
{
messages: {
tts: {
enabled: true,
maxTextLength: 4000,
timeoutMs: 30000,
prefsPath: "~/.clawdbot/settings/tts.json"
}
}
}
Disable auto-summary for long replies
{
messages: {
tts: {
enabled: true
}
}
}
Then run:
/tts summary off
Notes on fields
- enabled: master toggle (default false; local prefs can override).
- mode: "final" (default) or "all" (includes tool/block replies).
- provider: "elevenlabs" or "openai" (fallback is automatic).
- summaryModel: optional cheap model for auto-summary; defaults to agents.defaults.model.primary. Accepts provider/model or a configured model alias.
- modelOverrides: allow the model to emit TTS directives (on by default).
- maxTextLength: hard cap for TTS input (chars). /tts audio fails if exceeded.
- timeoutMs: request timeout (ms).
- prefsPath: override the local prefs JSON path.
- apiKey values fall back to env vars (ELEVENLABS_API_KEY/XI_API_KEY, OPENAI_API_KEY).
- elevenlabs.baseUrl: override the ElevenLabs API base URL.
- elevenlabs.voiceSettings:
  - stability, similarityBoost, style: 0..1
  - useSpeakerBoost: true|false
  - speed: 0.5..2.0 (1.0 = normal)
- elevenlabs.applyTextNormalization: auto|on|off
- elevenlabs.languageCode: 2-letter ISO 639-1 code (e.g. en, de)
- elevenlabs.seed: integer 0..4294967295 (best-effort determinism)
Model-driven overrides (default on)
By default, the model can emit [[tts:...]] directives to override the voice
for a single reply, plus an optional [[tts:text]]...[[/tts:text]] block to
provide expressive tags (laughter, singing cues, etc.) that should only appear in
the audio.
Example reply payload:
Here you go.
[[tts:provider=elevenlabs voiceId=pMsXgVXv3BLzUgSXRplE model=eleven_v3 speed=1.1]]
[[tts:text]](laughs) Read the song once more.[[/tts:text]]
Available directive keys (when enabled):
- provider (openai | elevenlabs)
- voice (OpenAI voice) or voiceId (ElevenLabs)
- model (OpenAI TTS model or ElevenLabs model id)
- stability, similarityBoost, style, speed, useSpeakerBoost
- applyTextNormalization (auto | on | off)
- languageCode (ISO 639-1)
- seed
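For example, a single-reply override that switches to OpenAI could look like this (the voice and model values are illustrative, reusing the openai settings from the config example above):
[[tts:provider=openai voice=alloy model=gpt-4o-mini-tts]]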
Disable all model overrides:
{
messages: {
tts: {
modelOverrides: {
enabled: false
}
}
}
}
Optional allowlist (disable specific overrides while keeping tags enabled):
{
messages: {
tts: {
modelOverrides: {
enabled: true,
allowProvider: false,
allowSeed: false
}
}
}
}
Per-user preferences
Slash commands write local overrides to prefsPath (default:
~/.clawdbot/settings/tts.json, override with CLAWDBOT_TTS_PREFS or
messages.tts.prefsPath).
Stored fields:
- enabled
- provider
- maxLength (summary threshold; default 1500 chars)
- summarize (default true)
These override messages.tts.* for that host.
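As a rough sketch (the exact on-disk shape may differ; values are illustrative), a prefs file written by the slash commands could look like:
{
  "enabled": true,
  "provider": "openai",
  "maxLength": 2000,
  "summarize": false
}
These are roughly the values /tts provider openai, /tts limit 2000, and /tts summary off would persist.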
Output formats (fixed)
- Telegram: Opus voice note (opus_48000_64 from ElevenLabs, opus from OpenAI). 48 kHz / 64 kbps is a good voice-note tradeoff and required for the round bubble.
- Other channels: MP3 (mp3_44100_128 from ElevenLabs, mp3 from OpenAI). 44.1 kHz / 128 kbps is the default balance for speech clarity.
This is not configurable; Telegram expects Opus for voice-note UX.
Auto-TTS behavior
When enabled, Clawdbot:
- skips TTS if the reply already contains media or a MEDIA: directive.
- skips very short replies (< 10 chars).
- summarizes long replies (when enabled) using agents.defaults.model.primary (or summaryModel).
- attaches the generated audio to the reply.
If the reply exceeds maxLength and summary is off (or there is no API key for the
summary model), audio is skipped and the normal text reply is sent.
Flow diagram
Reply -> TTS enabled?
no -> send text
yes -> has media / MEDIA: / short?
yes -> send text
no -> length > limit?
no -> TTS -> attach audio
yes -> summary enabled?
no -> send text
yes -> summarize (summaryModel or agents.defaults.model.primary)
-> TTS -> attach audio
Slash command usage
There is a single command: /tts.
See Slash commands for enablement details.
Discord note: /tts is a built-in Discord command, so Clawdbot registers
/voice as the native command there. Text /tts ... still works.
/tts on
/tts off
/tts status
/tts provider openai
/tts limit 2000
/tts summary off
/tts audio Hello from Clawdbot
Notes:
- Commands require an authorized sender (allowlist/owner rules still apply).
- commands.text or native command registration must be enabled.
- limit and summary are stored in local prefs, not the main config.
- /tts audio generates a one-off audio reply (does not toggle TTS on).
Agent tool
The tts tool converts text to speech and returns a MEDIA: path. When the
result is Telegram-compatible, the tool includes [[audio_as_voice]] so
Telegram sends a voice bubble.
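For illustration only (the file path here is hypothetical and the exact result layout may differ), a tool result could look like:
MEDIA:/tmp/clawdbot/tts/reply.opus
[[audio_as_voice]]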
Gateway RPC
Gateway methods:
- tts.status
- tts.enable
- tts.disable
- tts.convert
- tts.setProvider
- tts.providers
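As a sketch, assuming a JSON request envelope with method and params (the text parameter name is an assumption; see the Gateway configuration docs for the exact wire format), a tts.convert call might carry a payload like:
{
  "method": "tts.convert",
  "params": {
    "text": "Hello from Clawdbot"
  }
}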