let5see/clawdbot

Peter Steinberger fc0e303e05 feat: add edge tts fallback provider

2026-01-25 01:05:43 +00:00

summary: Text-to-speech (TTS) for outbound replies
read_when:
- Enabling text-to-speech for replies
- Configuring TTS providers or limits
- Using /tts commands

Text-to-speech (TTS)

Clawdbot can convert outbound replies into audio using ElevenLabs, OpenAI, or Edge TTS. It works anywhere Clawdbot can send audio; Telegram gets a round voice-note bubble.

Supported services

ElevenLabs (primary or fallback provider)
OpenAI (primary or fallback provider; also used for summaries)
Edge TTS (primary or fallback provider; uses node-edge-tts, default when no API keys)

Edge TTS notes

Edge TTS uses Microsoft Edge's online neural TTS service via the node-edge-tts library. It's a hosted service (not local), uses Microsoft's endpoints, and does not require an API key. node-edge-tts exposes speech configuration options and output formats, but not all options are supported by the Edge service.

Because Edge TTS is a public web service without a published SLA or quota, treat it as best-effort. If you need guaranteed limits and support, use OpenAI or ElevenLabs. Microsoft's Speech REST API documents a 10-minute audio limit per request; Edge TTS does not publish limits, so assume similar or lower limits.

Optional keys

If you want OpenAI or ElevenLabs:

ELEVENLABS_API_KEY (or XI_API_KEY)
OPENAI_API_KEY

Edge TTS does not require an API key. If no API keys are found, Clawdbot defaults to Edge TTS (unless disabled via messages.tts.edge.enabled=false).

If multiple providers are configured, the selected provider is used first and the others are fallback options. Auto-summary uses the configured summaryModel (or agents.defaults.model.primary), so that provider must also be authenticated if you enable summaries.
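
The ordering described here can be sketched as a small pure function. This is only an illustration under stated assumptions: `providerOrder` and `ProviderInputs` are hypothetical names, not Clawdbot's API. The openai, then elevenlabs, then edge preference for an unset provider follows the field notes in this document, and skipping a selected provider that has no key is an assumption.

```typescript
type Provider = "openai" | "elevenlabs" | "edge";

// Hypothetical input shape; not Clawdbot's real config type.
interface ProviderInputs {
  provider?: Provider;       // messages.tts.provider, if set
  hasOpenAiKey: boolean;     // OPENAI_API_KEY or openai.apiKey present
  hasElevenLabsKey: boolean; // ELEVENLABS_API_KEY / XI_API_KEY or elevenlabs.apiKey present
  edgeEnabled: boolean;      // messages.tts.edge.enabled (default true)
}

// Returns the providers to try, in order: the selected one first, the rest as fallbacks.
function providerOrder(cfg: ProviderInputs): Provider[] {
  const usable: Provider[] = [];
  if (cfg.hasOpenAiKey) usable.push("openai");
  if (cfg.hasElevenLabsKey) usable.push("elevenlabs");
  if (cfg.edgeEnabled) usable.push("edge");

  const selected = cfg.provider;
  // Assumption: an unset or unusable selection falls back to the default preference order.
  if (selected === undefined || usable.indexOf(selected) === -1) return usable;
  return [selected, ...usable.filter((p) => p !== selected)];
}
```

With no keys and Edge enabled, the sketch yields only `edge`; with both keys and `provider: "elevenlabs"`, it yields `elevenlabs`, `openai`, `edge`.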

Is it enabled by default?

No. TTS is disabled by default. Enable it in config or with /tts on, which writes a local preference override.

Edge TTS is enabled by default once TTS is on, and is used automatically when no OpenAI or ElevenLabs API keys are available.

Config

TTS config lives under messages.tts in clawdbot.json. Full schema is in Gateway configuration.

Minimal config (enable + provider)

{
  messages: {
    tts: {
      enabled: true,
      provider: "elevenlabs"
    }
  }
}

OpenAI primary with ElevenLabs fallback

{
  messages: {
    tts: {
      enabled: true,
      provider: "openai",
      summaryModel: "openai/gpt-4.1-mini",
      modelOverrides: {
        enabled: true
      },
      openai: {
        apiKey: "openai_api_key",
        model: "gpt-4o-mini-tts",
        voice: "alloy"
      },
      elevenlabs: {
        apiKey: "elevenlabs_api_key",
        baseUrl: "https://api.elevenlabs.io",
        voiceId: "voice_id",
        modelId: "eleven_multilingual_v2",
        seed: 42,
        applyTextNormalization: "auto",
        languageCode: "en",
        voiceSettings: {
          stability: 0.5,
          similarityBoost: 0.75,
          style: 0.0,
          useSpeakerBoost: true,
          speed: 1.0
        }
      }
    }
  }
}

Edge TTS primary (no API key)

{
  messages: {
    tts: {
      enabled: true,
      provider: "edge",
      edge: {
        enabled: true,
        voice: "en-US-MichelleNeural",
        lang: "en-US",
        outputFormat: "audio-24khz-48kbitrate-mono-mp3",
        rate: "+10%",
        pitch: "-5%"
      }
    }
  }
}

Disable Edge TTS

{
  messages: {
    tts: {
      edge: {
        enabled: false
      }
    }
  }
}

Custom limits + prefs path

{
  messages: {
    tts: {
      enabled: true,
      maxTextLength: 4000,
      timeoutMs: 30000,
      prefsPath: "~/.clawdbot/settings/tts.json"
    }
  }
}

Disable auto-summary for long replies

{
  messages: {
    tts: {
      enabled: true
    }
  }
}

Then run:

/tts summary off

Notes on fields

enabled: master toggle (default false; local prefs can override).
mode: "final" (default) or "all" (includes tool/block replies).
provider: "elevenlabs", "openai", or "edge" (fallback is automatic).
- If provider is unset, Clawdbot prefers openai (if a key is available), then elevenlabs (if a key is available), and otherwise edge.
summaryModel: optional cheap model for auto-summary; defaults to agents.defaults.model.primary.
- Accepts provider/model or a configured model alias.
modelOverrides: allow the model to emit TTS directives (on by default).
maxTextLength: hard cap for TTS input (chars). /tts audio fails if exceeded.
timeoutMs: request timeout (ms).
prefsPath: override the local prefs JSON path.
apiKey values fall back to env vars (ELEVENLABS_API_KEY/XI_API_KEY, OPENAI_API_KEY).
elevenlabs.baseUrl: override ElevenLabs API base URL.
elevenlabs.voiceSettings:
- stability, similarityBoost, style: 0..1
- useSpeakerBoost: true|false
- speed: 0.5..2.0 (1.0 = normal)
elevenlabs.applyTextNormalization: auto|on|off
elevenlabs.languageCode: 2-letter ISO 639-1 (e.g. en, de)
elevenlabs.seed: integer 0..4294967295 (best-effort determinism)
edge.enabled: allow Edge TTS usage (default true; no API key).
edge.voice: Edge neural voice name (e.g. en-US-MichelleNeural).
edge.lang: language code (e.g. en-US).
edge.outputFormat: Edge output format (e.g. audio-24khz-48kbitrate-mono-mp3).
- See Microsoft Speech output formats for valid values; not all formats are supported by Edge.
edge.rate / edge.pitch / edge.volume: percent strings (e.g. +10%, -5%).
edge.saveSubtitles: write JSON subtitles alongside the audio file.
edge.proxy: proxy URL for Edge TTS requests.
edge.timeoutMs: request timeout override (ms).
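
The ElevenLabs ranges listed above can be checked before a request is sent. The following is a minimal sketch, not Clawdbot's own validation: `validateElevenLabs` and the interfaces are hypothetical names, and treating the language code as exactly two lowercase letters is an assumption.

```typescript
// Hypothetical shapes mirroring the elevenlabs.* fields described in the notes.
interface VoiceSettings {
  stability?: number;
  similarityBoost?: number;
  style?: number;
  useSpeakerBoost?: boolean;
  speed?: number;
}

interface ElevenLabsOptions {
  voiceSettings?: VoiceSettings;
  applyTextNormalization?: string;
  languageCode?: string;
  seed?: number;
}

// Returns a list of human-readable problems; an empty list means the options look valid.
function validateElevenLabs(opts: ElevenLabsOptions): string[] {
  const errors: string[] = [];
  const inRange = (name: string, value: number | undefined, min: number, max: number): void => {
    // The negated form also rejects NaN.
    if (value !== undefined && !(value >= min && value <= max)) {
      errors.push(`${name} must be within ${min}..${max}`);
    }
  };

  const vs: VoiceSettings = opts.voiceSettings ?? {};
  inRange("stability", vs.stability, 0, 1);
  inRange("similarityBoost", vs.similarityBoost, 0, 1);
  inRange("style", vs.style, 0, 1);
  inRange("speed", vs.speed, 0.5, 2.0);

  const norm = opts.applyTextNormalization;
  if (norm !== undefined && ["auto", "on", "off"].indexOf(norm) === -1) {
    errors.push("applyTextNormalization must be auto, on, or off");
  }
  // Assumption: two lowercase letters, e.g. "en" or "de".
  if (opts.languageCode !== undefined && !/^[a-z]{2}$/.test(opts.languageCode)) {
    errors.push("languageCode must be a 2-letter ISO 639-1 code");
  }
  const seed = opts.seed;
  if (seed !== undefined && !(Number.isInteger(seed) && seed >= 0 && seed <= 4294967295)) {
    errors.push("seed must be an integer within 0..4294967295");
  }
  return errors;
}
```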

Model-driven overrides (default on)

By default, the model can emit [[tts:...]] directives to override the voice for a single reply, plus an optional [[tts:text]]...[[/tts:text]] block that provides expressive tags (laughter, singing cues, etc.) that should appear only in the audio.

Example reply payload:

Here you go.

[[tts:provider=elevenlabs voiceId=pMsXgVXv3BLzUgSXRplE model=eleven_v3 speed=1.1]]
[[tts:text]](laughs) Read the song once more.[[/tts:text]]

Available directive keys (when enabled):

provider (openai | elevenlabs | edge)
voice (OpenAI voice) or voiceId (ElevenLabs)
model (OpenAI TTS model or ElevenLabs model id)
stability, similarityBoost, style, speed, useSpeakerBoost
applyTextNormalization (auto|on|off)
languageCode (ISO 639-1)
seed
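
To make the directive syntax concrete, here is a minimal parsing sketch. It is an illustration, not Clawdbot's parser: `parseTtsDirectives` and `ParsedDirectives` are hypothetical names, values are kept as raw strings, and the override allowlist is not modeled.

```typescript
interface ParsedDirectives {
  visibleText: string;               // reply text with all directives removed
  overrides: Record<string, string>; // key=value pairs from [[tts:...]]
  spokenText?: string;               // content of [[tts:text]]...[[/tts:text]], if any
}

function parseTtsDirectives(reply: string): ParsedDirectives {
  const result: ParsedDirectives = { visibleText: "", overrides: {} };

  // Pull out the audio-only block first so its body is never scanned for key=value pairs.
  let rest = reply.replace(
    /\[\[tts:text\]\]([\s\S]*?)\[\[\/tts:text\]\]/g,
    (_match: string, body: string) => {
      result.spokenText = body.trim();
      return "";
    }
  );

  // Then collect whitespace-separated key=value pairs from each remaining [[tts:...]] tag.
  rest = rest.replace(/\[\[tts:([^\]]*)\]\]/g, (_match: string, args: string) => {
    for (const pair of args.trim().split(/\s+/)) {
      const eq = pair.indexOf("=");
      if (eq > 0) result.overrides[pair.slice(0, eq)] = pair.slice(eq + 1);
    }
    return "";
  });

  result.visibleText = rest.trim();
  return result;
}
```

Applied to the example reply payload, this sketch leaves "Here you go." as the visible text, collects provider, voiceId, model, and speed as overrides, and keeps the "(laughs) ..." line for the audio only.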

Disable all model overrides:

{
  messages: {
    tts: {
      modelOverrides: {
        enabled: false
      }
    }
  }
}

Optional allowlist (disable specific overrides while keeping tags enabled):

{
  messages: {
    tts: {
      modelOverrides: {
        enabled: true,
        allowProvider: false,
        allowSeed: false
      }
    }
  }
}

Per-user preferences

Slash commands write local overrides to prefsPath (default: ~/.clawdbot/settings/tts.json, override with CLAWDBOT_TTS_PREFS or messages.tts.prefsPath).

Stored fields:

enabled
provider
maxLength (summary threshold; default 1500 chars)
summarize (default true)

These override messages.tts.* for that host.
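
The precedence between local prefs and the main config can be sketched as follows. This is a hedged illustration: `effectiveSettings` and both interfaces are hypothetical names, reading the prefs file from prefsPath is left out, and the defaults are the ones stated in this document (enabled false, maxLength 1500, summarize true).

```typescript
// Hypothetical shapes; only the fields named in this section are modeled.
interface TtsPrefs {
  enabled?: boolean;
  provider?: string;
  maxLength?: number;
  summarize?: boolean;
}

interface EffectiveTts {
  enabled: boolean;
  provider?: string;
  maxLength: number;
  summarize: boolean;
}

// Local prefs win over messages.tts.*; documented defaults fill the remaining gaps.
function effectiveSettings(
  config: { enabled?: boolean; provider?: string },
  prefs: TtsPrefs
): EffectiveTts {
  return {
    // `??` (not `||`) so that an explicit `false` in prefs still overrides `true` in config.
    enabled: prefs.enabled ?? config.enabled ?? false,
    provider: prefs.provider ?? config.provider,
    maxLength: prefs.maxLength ?? 1500,
    summarize: prefs.summarize ?? true,
  };
}
```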

Output formats (fixed)

Telegram: Opus voice note (opus_48000_64 from ElevenLabs, opus from OpenAI).
- 48kHz / 64kbps is a good voice-note tradeoff and required for the round bubble.
Other channels: MP3 (mp3_44100_128 from ElevenLabs, mp3 from OpenAI).
- 44.1kHz / 128kbps is the default balance for speech clarity.
Edge TTS: uses edge.outputFormat (default audio-24khz-48kbitrate-mono-mp3).
- node-edge-tts accepts an outputFormat, but not all formats are available from the Edge service.
- Output format values follow Microsoft Speech output formats (including Ogg/WebM Opus).
- Telegram sendVoice accepts OGG/MP3/M4A; use OpenAI/ElevenLabs if you need guaranteed Opus voice notes.
- If the configured Edge output format fails, Clawdbot retries with MP3.

OpenAI/ElevenLabs formats are fixed; Telegram expects Opus for voice-note UX.
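
The format table above amounts to a lookup on provider and channel. A minimal sketch, assuming a channel identifier of "telegram" and a hypothetical helper name `pickOutputFormat` (neither is confirmed by this document); the MP3 retry for a failing Edge format is not modeled:

```typescript
type TtsProvider = "openai" | "elevenlabs" | "edge";

// Default Edge format as documented for edge.outputFormat.
const DEFAULT_EDGE_FORMAT = "audio-24khz-48kbitrate-mono-mp3";

// Returns the provider-specific format string for a channel.
function pickOutputFormat(provider: TtsProvider, channel: string, edgeOutputFormat?: string): string {
  const wantsVoiceNote = channel === "telegram"; // assumed channel id
  switch (provider) {
    case "elevenlabs":
      return wantsVoiceNote ? "opus_48000_64" : "mp3_44100_128";
    case "openai":
      return wantsVoiceNote ? "opus" : "mp3";
    case "edge":
      // Edge ignores the channel and uses the configured (or default) format.
      return edgeOutputFormat ?? DEFAULT_EDGE_FORMAT;
  }
}
```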

Auto-TTS behavior

When enabled, Clawdbot:

skips TTS if the reply already contains media or a MEDIA: directive.
skips very short replies (< 10 chars).
summarizes long replies, when summaries are enabled, using summaryModel (or agents.defaults.model.primary if it is unset).
attaches the generated audio to the reply.

If the reply exceeds maxLength and summary is off (or no API key for the summary model), audio is skipped and the normal text reply is sent.

Flow diagram

Reply -> TTS enabled?
  no  -> send text
  yes -> has media / MEDIA: / short?
          yes -> send text
          no  -> length > limit?
                   no  -> TTS -> attach audio
                   yes -> summary enabled?
                            no  -> send text
                            yes -> summarize (summaryModel or agents.defaults.model.primary)
                                      -> TTS -> attach audio
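
The flow diagram can also be read as a single decision function. This is a sketch under assumptions, not Clawdbot's code: `decideAutoTts` and `AutoTtsInput` are hypothetical names, the MEDIA: directive is detected by a plain substring check, and `summaryModelAuthenticated` stands in for "an API key for the summary model is available".

```typescript
type TtsDecision = "send-text" | "speak" | "summarize-then-speak";

// Hypothetical input shape covering each branch of the flow.
interface AutoTtsInput {
  ttsEnabled: boolean;
  text: string;
  hasMedia: boolean;                  // reply already carries media
  maxLength: number;                  // summary threshold (default 1500 chars)
  summarize: boolean;                 // auto-summary preference
  summaryModelAuthenticated: boolean; // summary model has usable credentials
}

function decideAutoTts(input: AutoTtsInput): TtsDecision {
  if (!input.ttsEnabled) return "send-text";

  // Skip replies that already contain media, a MEDIA: directive, or fewer than 10 chars.
  const hasMediaDirective = input.text.indexOf("MEDIA:") !== -1;
  if (input.hasMedia || hasMediaDirective || input.text.length < 10) return "send-text";

  if (input.text.length <= input.maxLength) return "speak";

  // Too long: speak a summary if possible, otherwise fall back to the plain text reply.
  if (!input.summarize || !input.summaryModelAuthenticated) return "send-text";
  return "summarize-then-speak";
}
```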

Slash command usage

There is a single command: /tts. See Slash commands for enablement details.

Discord note: /tts is a built-in Discord command, so Clawdbot registers /voice as the native command there. Text /tts ... still works.

/tts on
/tts off
/tts status
/tts provider openai
/tts limit 2000
/tts summary off
/tts audio Hello from Clawdbot

Notes:

Commands require an authorized sender (allowlist/owner rules still apply).
commands.text or native command registration must be enabled.
limit and summary are stored in local prefs, not the main config.
/tts audio generates a one-off audio reply (does not toggle TTS on).
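
The command forms listed above follow a simple `/tts <subcommand> [argument]` shape, which the sketch below parses into a tagged result. It is illustrative only: `parseTtsCommand` and `TtsCommand` are hypothetical names, `summary on` is assumed to be the counterpart of the documented `summary off`, and sender authorization is not modeled.

```typescript
type TtsCommand =
  | { kind: "on" }
  | { kind: "off" }
  | { kind: "status" }
  | { kind: "provider"; provider: string }
  | { kind: "limit"; chars: number }
  | { kind: "summary"; enabled: boolean }
  | { kind: "audio"; text: string }
  | { kind: "invalid"; reason: string };

function parseTtsCommand(input: string): TtsCommand {
  // Group 1: subcommand; group 2: the rest of the line (may contain spaces).
  const match = /^\/tts(?:\s+(\S+))?(?:\s+([\s\S]+))?$/.exec(input.trim());
  if (!match) return { kind: "invalid", reason: "not a /tts command" };

  const sub: string | undefined = match[1];
  const arg = (match[2] ?? "").trim();

  switch (sub) {
    case "on":
      return { kind: "on" };
    case "off":
      return { kind: "off" };
    case "status":
      return { kind: "status" };
    case "provider":
      return ["openai", "elevenlabs", "edge"].indexOf(arg) !== -1
        ? { kind: "provider", provider: arg }
        : { kind: "invalid", reason: "unknown provider" };
    case "limit":
      return /^\d+$/.test(arg)
        ? { kind: "limit", chars: Number(arg) }
        : { kind: "invalid", reason: "limit needs a whole number of chars" };
    case "summary":
      // Assumption: "on" is accepted alongside the documented "off".
      if (arg === "on" || arg === "off") return { kind: "summary", enabled: arg === "on" };
      return { kind: "invalid", reason: "summary expects on or off" };
    case "audio":
      return arg.length > 0
        ? { kind: "audio", text: arg }
        : { kind: "invalid", reason: "audio needs text" };
    default:
      return { kind: "invalid", reason: "unknown subcommand" };
  }
}
```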

Agent tool

The tts tool converts text to speech and returns a MEDIA: path. When the result is Telegram-compatible, the tool includes [[audio_as_voice]] so Telegram sends a voice bubble.

Gateway RPC

Gateway methods:

tts.status
tts.enable
tts.disable
tts.convert
tts.setProvider
tts.providers
