feat: add edge tts fallback provider

2026-01-25 01:05:23 +00:00
parent 6a7a1d7085
commit fc0e303e05
11 changed files with 466 additions and 32 deletions
--- a/docs/channels/googlechat.md
+++ b/docs/channels/googlechat.md
@@ -32,7 +32,7 @@ Status: ready for DMs + spaces via Google Chat API webhooks (HTTP only).
   - Under **Connection settings**, select **HTTP endpoint URL**.
   - Under **Triggers**, select **Use a common HTTP endpoint URL for all triggers** and set it to your gateway's public URL followed by `/googlechat`.
     - *Tip: Run `clawdbot status` to find your gateway's public URL.*
-   - Under **Visibility**, check **Make this Chat app available to specific people and groups in <Your Domain>**.
+   - Under **Visibility**, check **Make this Chat app available to specific people and groups in &lt;Your Domain&gt;**.
   - Enter your email address (e.g. `user@example.com`) in the text box.
   - Click **Save** at the bottom.
 6) **Enable the app status**:
--- a/docs/tts.md
+++ b/docs/tts.md
@@ -8,21 +8,37 @@ read_when:

 # Text-to-speech (TTS)

-Clawdbot can convert outbound replies into audio using ElevenLabs or OpenAI.
+Clawdbot can convert outbound replies into audio using ElevenLabs, OpenAI, or Edge TTS.
 It works anywhere Clawdbot can send audio; Telegram gets a round voice-note bubble.

 ## Supported services

 - **ElevenLabs** (primary or fallback provider)
 - **OpenAI** (primary or fallback provider; also used for summaries)
+- **Edge TTS** (primary or fallback provider; uses `node-edge-tts`, default when no API keys)

-## Required keys
+### Edge TTS notes

-At least one of:
+Edge TTS uses Microsoft Edge's online neural TTS service via the `node-edge-tts`
+library. It's a hosted service (not local), uses Microsoft’s endpoints, and does
+not require an API key. `node-edge-tts` exposes speech configuration options and
+output formats, but not all options are supported by the Edge service. citeturn2search0
+
+Because Edge TTS is a public web service without a published SLA or quota, treat it
+as best-effort. If you need guaranteed limits and support, use OpenAI or ElevenLabs.
+Microsoft's Speech REST API documents a 10‑minute audio limit per request; Edge TTS
+does not publish limits, so assume similar or lower limits. citeturn0search3
+
+## Optional keys
+
+If you want OpenAI or ElevenLabs:
 - `ELEVENLABS_API_KEY` (or `XI_API_KEY`)
 - `OPENAI_API_KEY`

-If both are configured, the selected provider is used first and the other is a fallback.
+Edge TTS does **not** require an API key. If no API keys are found, Clawdbot defaults
+to Edge TTS (unless disabled via `messages.tts.edge.enabled=false`).
+
+If multiple providers are configured, the selected provider is used first and the others are fallback options.
 Auto-summary uses the configured `summaryModel` (or `agents.defaults.model.primary`),
 so that provider must also be authenticated if you enable summaries.

@@ -32,12 +48,17 @@ so that provider must also be authenticated if you enable summaries.
 - [OpenAI Audio API reference](https://platform.openai.com/docs/api-reference/audio)
 - [ElevenLabs Text to Speech](https://elevenlabs.io/docs/api-reference/text-to-speech)
 - [ElevenLabs Authentication](https://elevenlabs.io/docs/api-reference/authentication)
+- [node-edge-tts](https://github.com/SchneeHertz/node-edge-tts)
+- [Microsoft Speech output formats](https://learn.microsoft.com/azure/ai-services/speech-service/rest-text-to-speech#audio-outputs)

 ## Is it enabled by default?

 No. TTS is **disabled** by default. Enable it in config or with `/tts on`,
 which writes a local preference override.

+Edge TTS **is** enabled by default once TTS is on, and is used automatically
+when no OpenAI or ElevenLabs API keys are available.
+
 ## Config

 TTS config lives under `messages.tts` in `clawdbot.json`.
@@ -94,6 +115,41 @@ Full schema is in [Gateway configuration](/gateway/configuration).
 }
 ```

+### Edge TTS primary (no API key)
+
+```json5
+{
+  messages: {
+    tts: {
+      enabled: true,
+      provider: "edge",
+      edge: {
+        enabled: true,
+        voice: "en-US-MichelleNeural",
+        lang: "en-US",
+        outputFormat: "audio-24khz-48kbitrate-mono-mp3",
+        rate: "+10%",
+        pitch: "-5%"
+      }
+    }
+  }
+}
+```
+
+### Disable Edge TTS
+
+```json5
+{
+  messages: {
+    tts: {
+      edge: {
+        enabled: false
+      }
+    }
+  }
+}
+```
+
 ### Custom limits + prefs path

 ```json5
@@ -131,7 +187,9 @@ Then run:

 - `enabled`: master toggle (default `false`; local prefs can override).
 - `mode`: `"final"` (default) or `"all"` (includes tool/block replies).
- `provider`: `"elevenlabs"` or `"openai"` (fallback is automatic).
+- `provider`: `"elevenlabs"`, `"openai"`, or `"edge"` (fallback is automatic).
+- If `provider` is **unset**, Clawdbot prefers `openai` (if key), then `elevenlabs` (if key),
+  otherwise `edge`.
 - `summaryModel`: optional cheap model for auto-summary; defaults to `agents.defaults.model.primary`.
  - Accepts `provider/model` or a configured model alias.
 - `modelOverrides`: allow the model to emit TTS directives (on by default).
@@ -147,6 +205,15 @@ Then run:
 - `elevenlabs.applyTextNormalization`: `auto|on|off`
 - `elevenlabs.languageCode`: 2-letter ISO 639-1 (e.g. `en`, `de`)
 - `elevenlabs.seed`: integer `0..4294967295` (best-effort determinism)
+- `edge.enabled`: allow Edge TTS usage (default `true`; no API key).
+- `edge.voice`: Edge neural voice name (e.g. `en-US-MichelleNeural`).
+- `edge.lang`: language code (e.g. `en-US`).
+- `edge.outputFormat`: Edge output format (e.g. `audio-24khz-48kbitrate-mono-mp3`).
+  - See Microsoft Speech output formats for valid values; not all formats are supported by Edge.
+- `edge.rate` / `edge.pitch` / `edge.volume`: percent strings (e.g. `+10%`, `-5%`).
+- `edge.saveSubtitles`: write JSON subtitles alongside the audio file.
+- `edge.proxy`: proxy URL for Edge TTS requests.
+- `edge.timeoutMs`: request timeout override (ms).

 ## Model-driven overrides (default on)

@@ -167,7 +234,7 @@ Here you go.
 ```

 Available directive keys (when enabled):
- `provider` (`openai` | `elevenlabs`)
+- `provider` (`openai` | `elevenlabs` | `edge`)
 - `voice` (OpenAI voice) or `voiceId` (ElevenLabs)
 - `model` (OpenAI TTS model or ElevenLabs model id)
 - `stability`, `similarityBoost`, `style`, `speed`, `useSpeakerBoost`
@@ -225,8 +292,15 @@ These override `messages.tts.*` for that host.
  - 48kHz / 64kbps is a good voice-note tradeoff and required for the round bubble.
 - **Other channels**: MP3 (`mp3_44100_128` from ElevenLabs, `mp3` from OpenAI).
  - 44.1kHz / 128kbps is the default balance for speech clarity.
+- **Edge TTS**: uses `edge.outputFormat` (default `audio-24khz-48kbitrate-mono-mp3`).
+  - `node-edge-tts` accepts an `outputFormat`, but not all formats are available
+    from the Edge service. citeturn2search0
+  - Output format values follow Microsoft Speech output formats (including Ogg/WebM Opus). citeturn1search0
+  - Telegram `sendVoice` accepts OGG/MP3/M4A; use OpenAI/ElevenLabs if you need
+    guaranteed Opus voice notes. citeturn1search1
+  - If the configured Edge output format fails, Clawdbot retries with MP3.

-This is not configurable; Telegram expects Opus for voice-note UX.
+OpenAI/ElevenLabs formats are fixed; Telegram expects Opus for voice-note UX.

 ## Auto-TTS behavior