feat: default TTS model overrides on (#1559) (thanks @Glucksberg)

Co-authored-by: Glucksberg <80581902+Glucksberg@users.noreply.github.com>
2026-01-24 09:40:18 +00:00
parent 4074fa0471
commit 6765fd15eb
11 changed files with 1187 additions and 290 deletions
--- a/docs/gateway/configuration.md
+++ b/docs/gateway/configuration.md
@@ -1459,13 +1459,28 @@ voice notes; other channels send MP3 audio.
      enabled: true,
      mode: "final", // final | all (include tool/block replies)
      provider: "elevenlabs",
+      summaryModel: "openai/gpt-4.1-mini",
+      modelOverrides: {
+        enabled: true
+      },
      maxTextLength: 4000,
      timeoutMs: 30000,
      prefsPath: "~/.clawdbot/settings/tts.json",
      elevenlabs: {
        apiKey: "elevenlabs_api_key",
+        baseUrl: "https://api.elevenlabs.io",
        voiceId: "voice_id",
-        modelId: "eleven_multilingual_v2"
+        modelId: "eleven_multilingual_v2",
+        seed: 42,
+        applyTextNormalization: "auto",
+        languageCode: "en",
+        voiceSettings: {
+          stability: 0.5,
+          similarityBoost: 0.75,
+          style: 0.0,
+          useSpeakerBoost: true,
+          speed: 1.0
+        }
      },
      openai: {
        apiKey: "openai_api_key",
@@ -1478,11 +1493,17 @@ voice notes; other channels send MP3 audio.
 ```

 Notes:
- `messages.tts.enabled` can be overridden by local user prefs (see `/tts_on`, `/tts_off`).
+- `messages.tts.enabled` can be overridden by local user prefs (see `/tts on`, `/tts off`).
 - `prefsPath` stores local overrides (enabled/provider/limit/summarize).
 - `maxTextLength` is a hard cap for TTS input; summaries are truncated to fit.
- `/tts_limit` and `/tts_summary` control per-user summarization settings.
+- `summaryModel` overrides `agents.defaults.model.primary` for auto-summary.
+  - Accepts `provider/model` or an alias from `agents.defaults.models`.
+- `modelOverrides` enables model-driven overrides like `[[tts:...]]` tags (on by default).
+- `/tts limit` and `/tts summary` control per-user summarization settings.
 - `apiKey` values fall back to `ELEVENLABS_API_KEY`/`XI_API_KEY` and `OPENAI_API_KEY`.
+- `elevenlabs.baseUrl` overrides the ElevenLabs API base URL.
+- `elevenlabs.voiceSettings` supports `stability`/`similarityBoost`/`style` (0..1),
+  `useSpeakerBoost`, and `speed` (0.5..2.0).

 ### `talk`

--- a/docs/tools/slash-commands.md
+++ b/docs/tools/slash-commands.md
@@ -67,13 +67,7 @@ Text + native (when enabled):
 - `/config show|get|set|unset` (persist config to disk, owner-only; requires `commands.config: true`)
 - `/debug show|set|unset|reset` (runtime overrides, owner-only; requires `commands.debug: true`)
 - `/usage off|tokens|full|cost` (per-response usage footer or local cost summary)
- `/tts_on` (enable TTS replies)
- `/tts_off` (disable TTS replies)
- `/tts_provider [openai|elevenlabs]` (set or show TTS provider)
- `/tts_limit <chars>` (max chars before TTS summarization)
- `/tts_summary on|off` (toggle TTS auto-summary)
- `/tts_status` (show TTS status)
- `/audio <text>` (convert text to a TTS audio reply)
+- `/tts on|off|status|provider|limit|summary|audio` (control TTS; see [/tts](/tts))
 - `/stop`
 - `/restart`
 - `/dock-telegram` (alias: `/dock_telegram`) (switch replies to Telegram)
--- a/docs/tts.md
+++ b/docs/tts.md
@@ -0,0 +1,293 @@
+---
+summary: "Text-to-speech (TTS) for outbound replies"
+read_when:
+  - Enabling text-to-speech for replies
+  - Configuring TTS providers or limits
+  - Using /tts commands
+---
+
+# Text-to-speech (TTS)
+
+Clawdbot can convert outbound replies into audio using ElevenLabs or OpenAI.
+It works anywhere Clawdbot can send audio; Telegram gets a round voice-note bubble.
+
+## Supported services
+
+- **ElevenLabs** (primary or fallback provider)
+- **OpenAI** (primary or fallback provider; also used for summaries)
+
+## Required keys
+
+At least one of:
+- `ELEVENLABS_API_KEY` (or `XI_API_KEY`)
+- `OPENAI_API_KEY`
+
+If both are configured, the selected provider is used first and the other is a fallback.
+Auto-summary uses the configured `summaryModel` (or `agents.defaults.model.primary`),
+so that provider must also be authenticated if you enable summaries.
+
+## Service links
+
+- [OpenAI Text-to-Speech guide](https://platform.openai.com/docs/guides/text-to-speech)
+- [OpenAI Audio API reference](https://platform.openai.com/docs/api-reference/audio)
+- [ElevenLabs Text to Speech](https://elevenlabs.io/docs/api-reference/text-to-speech)
+- [ElevenLabs Authentication](https://elevenlabs.io/docs/api-reference/authentication)
+
+## Is it enabled by default?
+
+No. TTS is **disabled** by default. Enable it in config or with `/tts on`,
+which writes a local preference override.
+
+## Config
+
+TTS config lives under `messages.tts` in `clawdbot.json`.
+Full schema is in [Gateway configuration](/gateway/configuration).
+
+### Minimal config (enable + provider)
+
+```json5
+{
+  messages: {
+    tts: {
+      enabled: true,
+      provider: "elevenlabs"
+    }
+  }
+}
+```
+
+### OpenAI primary with ElevenLabs fallback
+
+```json5
+{
+  messages: {
+    tts: {
+      enabled: true,
+      provider: "openai",
+      summaryModel: "openai/gpt-4.1-mini",
+      modelOverrides: {
+        enabled: true
+      },
+      openai: {
+        apiKey: "openai_api_key",
+        model: "gpt-4o-mini-tts",
+        voice: "alloy"
+      },
+      elevenlabs: {
+        apiKey: "elevenlabs_api_key",
+        baseUrl: "https://api.elevenlabs.io",
+        voiceId: "voice_id",
+        modelId: "eleven_multilingual_v2",
+        seed: 42,
+        applyTextNormalization: "auto",
+        languageCode: "en",
+        voiceSettings: {
+          stability: 0.5,
+          similarityBoost: 0.75,
+          style: 0.0,
+          useSpeakerBoost: true,
+          speed: 1.0
+        }
+      }
+    }
+  }
+}
+```
+
+### Custom limits + prefs path
+
+```json5
+{
+  messages: {
+    tts: {
+      enabled: true,
+      maxTextLength: 4000,
+      timeoutMs: 30000,
+      prefsPath: "~/.clawdbot/settings/tts.json"
+    }
+  }
+}
+```
+
+### Disable auto-summary for long replies
+
+```json5
+{
+  messages: {
+    tts: {
+      enabled: true
+    }
+  }
+}
+```
+
+Then run:
+
+```
+/tts summary off
+```
+
+### Notes on fields
+
+- `enabled`: master toggle (default `false`; local prefs can override).
+- `mode`: `"final"` (default) or `"all"` (includes tool/block replies).
+- `provider`: `"elevenlabs"` or `"openai"` (fallback is automatic).
+- `summaryModel`: optional cheap model for auto-summary; defaults to `agents.defaults.model.primary`.
+  - Accepts `provider/model` or a configured model alias.
+- `modelOverrides`: allow the model to emit TTS directives (on by default).
+- `maxTextLength`: hard cap for TTS input (chars). `/tts audio` fails if exceeded.
+- `timeoutMs`: request timeout (ms).
+- `prefsPath`: override the local prefs JSON path.
+- `apiKey` values fall back to env vars (`ELEVENLABS_API_KEY`/`XI_API_KEY`, `OPENAI_API_KEY`).
+- `elevenlabs.baseUrl`: override ElevenLabs API base URL.
+- `elevenlabs.voiceSettings`:
+  - `stability`, `similarityBoost`, `style`: `0..1`
+  - `useSpeakerBoost`: `true|false`
+  - `speed`: `0.5..2.0` (1.0 = normal)
+- `elevenlabs.applyTextNormalization`: `auto|on|off`
+- `elevenlabs.languageCode`: 2-letter ISO 639-1 (e.g. `en`, `de`)
+- `elevenlabs.seed`: integer `0..4294967295` (best-effort determinism)
+
+## Model-driven overrides (default on)
+
+By default, the model **can** emit TTS directives for a single reply.
+
+When enabled, the model can emit `[[tts:...]]` directives to override the voice
+for a single reply, plus an optional `[[tts:text]]...[[/tts:text]]` block to
+provide expressive tags (laughter, singing cues, etc) that should only appear in
+the audio.
+
+Example reply payload:
+
+```
+Here you go.
+
+[[tts:provider=elevenlabs voiceId=pMsXgVXv3BLzUgSXRplE model=eleven_v3 speed=1.1]]
+[[tts:text]](laughs) Read the song once more.[[/tts:text]]
+```
+
+Available directive keys (when enabled):
+- `provider` (`openai` | `elevenlabs`)
+- `voice` (OpenAI voice) or `voiceId` (ElevenLabs)
+- `model` (OpenAI TTS model or ElevenLabs model id)
+- `stability`, `similarityBoost`, `style`, `speed`, `useSpeakerBoost`
+- `applyTextNormalization` (`auto|on|off`)
+- `languageCode` (ISO 639-1)
+- `seed`
+
+Disable all model overrides:
+
+```json5
+{
+  messages: {
+    tts: {
+      modelOverrides: {
+        enabled: false
+      }
+    }
+  }
+}
+```
+
+Optional allowlist (disable specific overrides while keeping tags enabled):
+
+```json5
+{
+  messages: {
+    tts: {
+      modelOverrides: {
+        enabled: true,
+        allowProvider: false,
+        allowSeed: false
+      }
+    }
+  }
+}
+```
+
+## Per-user preferences
+
+Slash commands write local overrides to `prefsPath` (default:
+`~/.clawdbot/settings/tts.json`, override with `CLAWDBOT_TTS_PREFS` or
+`messages.tts.prefsPath`).
+
+Stored fields:
+- `enabled`
+- `provider`
+- `maxLength` (summary threshold; default 1500 chars)
+- `summarize` (default `true`)
+
+These override `messages.tts.*` for that host.
+
+## Output formats (fixed)
+
+- **Telegram**: Opus voice note (`opus_48000_64` from ElevenLabs, `opus` from OpenAI).
+  - 48kHz / 64kbps is a good voice-note tradeoff and required for the round bubble.
+- **Other channels**: MP3 (`mp3_44100_128` from ElevenLabs, `mp3` from OpenAI).
+  - 44.1kHz / 128kbps is the default balance for speech clarity.
+
+This is not configurable; Telegram expects Opus for voice-note UX.
+
+## Auto-TTS behavior
+
+When enabled, Clawdbot:
+- skips TTS if the reply already contains media or a `MEDIA:` directive.
+- skips very short replies (< 10 chars).
+- summarizes long replies when enabled using `agents.defaults.model.primary` (or `summaryModel`).
+- attaches the generated audio to the reply.
+
+If the reply exceeds `maxLength` and summary is off (or no API key for the
+summary model), audio
+is skipped and the normal text reply is sent.
+
+## Flow diagram
+
+```
+Reply -> TTS enabled?
+  no  -> send text
+  yes -> has media / MEDIA: / short?
+          yes -> send text
+          no  -> length > limit?
+                   no  -> TTS -> attach audio
+                   yes -> summary enabled?
+                            no  -> send text
+                            yes -> summarize (summaryModel or agents.defaults.model.primary)
+                                      -> TTS -> attach audio
+```
+
+## Slash command usage
+
+There is a single command: `/tts`.
+See [Slash commands](/tools/slash-commands) for enablement details.
+
+```
+/tts on
+/tts off
+/tts status
+/tts provider openai
+/tts limit 2000
+/tts summary off
+/tts audio Hello from Clawdbot
+```
+
+Notes:
+- Commands require an authorized sender (allowlist/owner rules still apply).
+- `commands.text` or native command registration must be enabled.
+- `limit` and `summary` are stored in local prefs, not the main config.
+- `/tts audio` generates a one-off audio reply (does not toggle TTS on).
+
+## Agent tool
+
+The `tts` tool converts text to speech and returns a `MEDIA:` path. When the
+result is Telegram-compatible, the tool includes `[[audio_as_voice]]` so
+Telegram sends a voice bubble.
+
+## Gateway RPC
+
+Gateway methods:
+- `tts.status`
+- `tts.enable`
+- `tts.disable`
+- `tts.convert`
+- `tts.setProvider`
+- `tts.providers`