Merge remote-tracking branch 'origin/main' into upstream-preview-nix-2025-12-20

2026-01-01 09:15:28 +01:00
parent 0f7029583c 14e9077584
commit ad9a9d8d35
163 changed files with 10867 additions and 1712 deletions
--- a/docs/browser.md
+++ b/docs/browser.md
@@ -193,6 +193,7 @@ Notes:
 - The arm default timeout is **2 minutes** (clamped to max 2 minutes); pass `timeoutMs` if you need shorter.
 - `snapshot` defaults to `ai`; `aria` returns an accessibility tree for debugging.
 - `click`/`type` require `ref` from `snapshot --format ai`; use `evaluate` for rare CSS selector one-offs.
+- Avoid `wait` by default; use it only in exceptional cases when there is no reliable UI state to wait on.

 ## Security & privacy notes

--- a/docs/camera.md
+++ b/docs/camera.md
@@ -35,6 +35,7 @@ All camera access is gated behind **user-controlled settings**.
    - `format: "jpg"`
    - `base64: "<...>"`
    - `width`, `height`
+  - Payload guard: photos are recompressed to keep the base64 payload under 5 MB.

 - `camera.clip`
  - Params:
@@ -90,6 +91,10 @@ If permissions are missing, the app will prompt when possible; if denied, `camer

 Like `canvas.*`, the Android node only allows `camera.*` commands in the **foreground**. Background invocations return `NODE_BACKGROUND_UNAVAILABLE`.

+### Payload guard
+
+Photos are recompressed to keep the base64 payload under 5 MB.
+
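Reviewer sketch for the payload guard added above: a minimal TypeScript illustration of keeping a base64 payload under a size cap by retrying at lower quality. The `encode` callback, the quality ladder, and reading "5 MB" as 5 MiB are all assumptions made for illustration; the diff does not show how the node actually recompresses.

```typescript
// Assumption: "5 MB" is read as 5 MiB; the docs do not say which.
const MAX_BASE64_BYTES = 5 * 1024 * 1024;

// Base64 emits 4 output characters for every 3 input bytes (with padding).
function base64Length(byteLength: number): number {
  return 4 * Math.ceil(byteLength / 3);
}

// Hypothetical guard: try each quality in order and return the first
// encoding whose base64 form fits under the cap, or null if none fits.
function recompressToFit(
  encode: (quality: number) => Uint8Array,
  qualities: number[] = [0.9, 0.8, 0.7, 0.6, 0.5],
): { jpeg: Uint8Array; quality: number } | null {
  for (const quality of qualities) {
    const jpeg = encode(quality);
    if (base64Length(jpeg.byteLength) < MAX_BASE64_BYTES) {
      return { jpeg, quality };
    }
  }
  return null;
}
```

Checking the encoded length rather than the raw byte count matters because base64 inflates the payload by roughly one third.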
 ## macOS app

 ### User setting (default off)
@@ -116,6 +121,7 @@ clawdis nodes camera clip --node <id> --no-audio

 Notes:
 - `clawdis nodes camera snap` defaults to `maxWidth=1600` unless overridden.
+- Photo payloads are recompressed to keep base64 under 5 MB.

 ## Safety + practical limits

--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -195,6 +195,28 @@ Controls inbound/outbound prefixes and timestamps.
 }
 ```

+### `talk`
+
+Defaults for Talk mode (macOS/iOS/Android). Voice IDs fall back to `ELEVENLABS_VOICE_ID` or `SAG_VOICE_ID` when unset.
+`apiKey` falls back to `ELEVENLABS_API_KEY` (or the gateway’s shell profile) when unset.
+`voiceAliases` lets Talk directives use friendly names (e.g. `"voice":"Clawd"`).
+
+```json5
+{
+  talk: {
+    voiceId: "elevenlabs_voice_id",
+    voiceAliases: {
+      Clawd: "EXAVITQu4vr4xnSDxMaL",
+      Roger: "CwhRBWXzGAHq8TQ4Fs17"
+    },
+    modelId: "eleven_v3",
+    outputFormat: "mp3_44100_128",
+    apiKey: "elevenlabs_api_key",
+    interruptOnSpeech: true
+  }
+}
+```
+
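Reviewer sketch of the voice fallback described in the `talk` section above, assuming aliases are matched case-sensitively and that `ELEVENLABS_VOICE_ID` is checked before `SAG_VOICE_ID` (the docs list them in that order but do not state a precedence). `resolveVoiceId` and `TalkConfig` are hypothetical names, not part of the codebase shown in this diff.

```typescript
// Hypothetical shape covering only the fields this sketch needs.
interface TalkConfig {
  voiceId?: string;
  voiceAliases?: Record<string, string>;
}

// Resolve a voice: an explicitly requested name goes through the alias
// table first; otherwise fall back to config, then environment variables.
function resolveVoiceId(
  talk: TalkConfig,
  env: Record<string, string | undefined>,
  requested?: string,
): string | undefined {
  if (requested) {
    // Unknown names are passed through as raw voice IDs (assumption).
    return talk.voiceAliases?.[requested] ?? requested;
  }
  return talk.voiceId || env.ELEVENLABS_VOICE_ID || env.SAG_VOICE_ID || undefined;
}
```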
 ### `agent`

 Controls the embedded agent runtime (model/thinking/verbose/timeouts).
@@ -237,6 +259,8 @@ Controls the embedded agent runtime (model/thinking/verbose/timeouts).
 If `modelAliases` is configured, you may also use the alias key (e.g. `Opus`).
 If you omit the provider, CLAWDIS currently assumes `anthropic` as a temporary
 deprecation fallback.
+Z.AI models are available as `zai/<model>` (e.g. `zai/glm-4.7`) and require
+`ZAI_API_KEY` (or legacy `Z_AI_API_KEY`) in the environment.

 `agent.heartbeat` configures periodic heartbeat runs:
 - `every`: duration string (`ms`, `s`, `m`, `h`); default unit minutes. Omit or set
@@ -445,6 +469,20 @@ Defaults:
 }
 ```

+### `ui` (Appearance)
+
+Optional accent color used by the native apps for UI chrome (e.g. Talk Mode bubble tint).
+
+If unset, clients fall back to a muted light-blue.
+
+```json5
+{
+  ui: {
+    seamColor: "#FF4500" // hex (RRGGBB or #RRGGBB)
+  }
+}
+```
+
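Reviewer sketch of how a client might accept both `RRGGBB` and `#RRGGBB` for `seamColor`. The function name, the trimming, the case-insensitive match, and the uppercase output are assumptions; the diff only states the two accepted forms and that clients fall back to a default when the value is unset.

```typescript
// Hypothetical normalizer: returns "#RRGGBB" in uppercase, or null when
// the input is not a six-digit hex color, so the caller can use its default.
function normalizeSeamColor(value: string): string | null {
  const match = /^#?([0-9a-fA-F]{6})$/.exec(value.trim());
  return match ? `#${match[1].toUpperCase()}` : null;
}
```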
 ### `gateway` (Gateway server mode + bind)

 Use `gateway.mode` to explicitly declare whether this machine should run the Gateway.
--- a/docs/mac/logging.md
+++ b/docs/mac/logging.md
@@ -7,11 +7,12 @@ read_when:
 # Logging (macOS)

 ## Rolling diagnostics file log (Debug pane)
-Clawdis can write a local, rotating diagnostics log to disk (useful when macOS unified logging is impractical during iterative repros).
+Clawdis routes macOS app logs through swift-log (unified logging by default) and can write a local, rotating file log to disk when you need a durable capture.

-- Enable: **Debug pane → Diagnostics log → “Write rolling diagnostics log (JSONL)”**
+- Verbosity: **Debug pane → Logs → App logging → Verbosity**
+- Enable: **Debug pane → Logs → App logging → “Write rolling diagnostics log (JSONL)”**
 - Location: `~/Library/Logs/Clawdis/diagnostics.jsonl` (rotates automatically; old files are suffixed with `.1`, `.2`, …)
-- Clear: **Debug pane → Diagnostics log → “Clear”**
+- Clear: **Debug pane → Logs → App logging → “Clear”**

 Notes:
 - This is **off by default**. Enable only while actively debugging.
--- a/docs/mac/menu-bar.md
+++ b/docs/mac/menu-bar.md
@@ -8,6 +8,7 @@ read_when:
 ## What is shown
 - We surface the current agent work state in the menu bar icon and in the first status row of the menu.
 - Health status is hidden while work is active; it returns when all sessions are idle.
+- The “Nodes” block in the menu lists **devices** only (gateway bridge nodes via `node.list`), not client/presence entries.

 ## State model
 - Sessions: events arrive with `runId` (session key). The “main” session is the key `main`; if absent, we fall back to the most recently updated session.
--- a/docs/talk.md
+++ b/docs/talk.md
@@ -0,0 +1,79 @@
+---
+summary: "Talk mode: continuous speech conversations with ElevenLabs TTS"
+read_when:
+  - Implementing Talk mode on macOS/iOS/Android
+  - Changing voice/TTS/interrupt behavior
+---
+# Talk Mode
+
+Talk mode is a continuous voice conversation loop:
+1) Listen for speech
+2) Send transcript to the model (main session, chat.send)
+3) Wait for the response
+4) Speak it via ElevenLabs (streaming playback)
+
+## Behavior (macOS)
+- **Always-on overlay** while Talk mode is enabled.
+- **Listening → Thinking → Speaking** phase transitions.
+- On a **short pause** (silence window), the current transcript is sent.
+- Replies are **written to WebChat** (same as typing).
+- **Interrupt on speech** (default on): if the user starts talking while the assistant is speaking, we stop playback and note the interruption timestamp for the next prompt.
+
+## Voice directives in replies
+The assistant may prefix its reply with a **single JSON line** to control voice:
+
+```json
+{"voice":"<voice-id>","once":true}
+```
+
+Rules:
+- First non-empty line only.
+- Unknown keys are ignored.
+- `once: true` applies to the current reply only.
+- Without `once`, the voice becomes the new default for Talk mode.
+- The JSON line is stripped before TTS playback.
+
+Supported keys:
+- `voice` / `voice_id` / `voiceId`
+- `model` / `model_id` / `modelId`
+- `speed`, `rate` (WPM), `stability`, `similarity`, `style`, `speakerBoost`
+- `seed`, `normalize`, `lang`, `output_format`, `latency_tier`
+- `once`
+
+## Config (clawdis.json)
+```json5
+{
+  "talk": {
+    "voiceId": "elevenlabs_voice_id",
+    "modelId": "eleven_v3",
+    "outputFormat": "mp3_44100_128",
+    "apiKey": "elevenlabs_api_key",
+    "interruptOnSpeech": true
+  }
+}
+```
+
+Defaults:
+- `interruptOnSpeech`: true
+- `voiceId`: falls back to `ELEVENLABS_VOICE_ID` / `SAG_VOICE_ID` (or first ElevenLabs voice when API key is available)
+- `modelId`: defaults to `eleven_v3` when unset
+- `apiKey`: falls back to `ELEVENLABS_API_KEY` (or gateway shell profile if available)
+- `outputFormat`: defaults to `pcm_44100` on macOS/iOS and `pcm_24000` on Android (set `mp3_*` to force MP3 streaming)
+
+## macOS UI
+- Menu bar toggle: **Talk**
+- Config tab: **Talk Mode** group (voice id + interrupt toggle)
+- Overlay:
+  - **Listening**: cloud pulses with mic level
+  - **Thinking**: sinking animation
+  - **Speaking**: radiating rings
+  - Click cloud: stop speaking
+  - Click X: exit Talk mode
+
+## Notes
+- Requires Speech + Microphone permissions.
+- Uses `chat.send` against session key `main`.
+- TTS uses ElevenLabs streaming API with `ELEVENLABS_API_KEY` and incremental playback on macOS/iOS/Android for lower latency.
+- `stability` for `eleven_v3` is validated to `0.0`, `0.5`, or `1.0`; other models accept `0..1`.
+- `latency_tier` is validated to `0..4` when set.
+- Android supports `pcm_16000`, `pcm_22050`, `pcm_24000`, and `pcm_44100` output formats for low-latency AudioTrack streaming.
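Reviewer sketch of the voice directive rules from `docs/talk.md` above: only the first non-empty line is considered, unknown keys are ignored, key aliases are honored, and the JSON line is stripped before playback. `parseVoiceDirective` and its types are hypothetical names. Treating any first-line JSON object as a directive, even one with no known keys, is an assumption, and only `voice`, `model`, and `once` are handled here.

```typescript
interface VoiceDirective {
  voice: string | undefined;
  model: string | undefined;
  once: boolean;
}

interface ParsedReply {
  directive: VoiceDirective | null;
  text: string; // reply text with the directive line removed
}

function parseVoiceDirective(reply: string): ParsedReply {
  const lines = reply.split("\n");
  // Rule: first non-empty line only.
  const index = lines.findIndex((line) => line.trim() !== "");
  if (index === -1) return { directive: null, text: reply };

  const candidate = lines[index].trim();
  if (!candidate.startsWith("{")) return { directive: null, text: reply };

  let raw: unknown;
  try {
    raw = JSON.parse(candidate);
  } catch {
    return { directive: null, text: reply }; // not JSON: leave the reply as-is
  }
  if (typeof raw !== "object" || raw === null || Array.isArray(raw)) {
    return { directive: null, text: reply };
  }

  const obj = raw as Record<string, unknown>;
  // Return the first alias that holds a non-empty string; other keys are ignored.
  const pick = (...keys: string[]): string | undefined => {
    for (const key of keys) {
      const value = obj[key];
      if (typeof value === "string" && value !== "") return value;
    }
    return undefined;
  };

  return {
    directive: {
      voice: pick("voice", "voice_id", "voiceId"),
      model: pick("model", "model_id", "modelId"),
      once: obj.once === true,
    },
    // Rule: the JSON line is stripped before TTS playback.
    text: lines.slice(index + 1).join("\n"),
  };
}
```

A caller would apply `directive.voice` to the current reply only when `once` is true and otherwise store it as the new Talk mode default, per the rules above.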
--- a/docs/test.md
+++ b/docs/test.md
@@ -7,3 +7,16 @@ read_when:

 - `pnpm test:force`: Kills any lingering gateway process holding the default control port, then runs the full Vitest suite with an isolated gateway port so server tests don’t collide with a running instance. Use this when a prior gateway run left port 18789 occupied.
 - `pnpm test:coverage`: Runs Vitest with V8 coverage. Global thresholds are 70% lines/branches/functions/statements. Coverage excludes integration-heavy entrypoints (CLI wiring, gateway/telegram bridges, webchat static server) to keep the target focused on unit-testable logic.
+
+## Model latency bench (local keys)
+
+Script: `scripts/bench-model.ts`
+
+Usage:
+- `source ~/.profile && pnpm tsx scripts/bench-model.ts --runs 10`
+- Optional env: `MINIMAX_API_KEY`, `MINIMAX_BASE_URL`, `MINIMAX_MODEL`, `ANTHROPIC_API_KEY`
+- Default prompt: “Reply with a single word: ok. No punctuation or extra text.”
+
+Last run (2025-12-31, 20 runs):
+- minimax median 1279ms (min 1114, max 2431)
+- opus median 2454ms (min 1224, max 3170)
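Reviewer sketch of how median/min/max figures like the ones above can be derived from per-run latencies. This is not taken from `scripts/bench-model.ts`, whose contents are not part of this diff; `summarizeLatencies` is a hypothetical helper, and averaging the two middle samples for even run counts is an assumption.

```typescript
interface LatencyStats {
  median: number;
  min: number;
  max: number;
}

// Summarize a list of per-run latencies in milliseconds.
function summarizeLatencies(samplesMs: number[]): LatencyStats {
  if (samplesMs.length === 0) throw new Error("need at least one sample");
  // Copy before sorting so the caller's array is left untouched.
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  const median =
    sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
  return { median, min: sorted[0], max: sorted[sorted.length - 1] };
}
```

Reporting the median rather than the mean keeps a single slow run from skewing the headline number.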
--- a/docs/tools.md
+++ b/docs/tools.md
@@ -51,6 +51,7 @@ Notes:
 - Uses `browser.controlUrl` unless `controlUrl` is passed explicitly.
 - `snapshot` defaults to `ai`; use `aria` for the accessibility tree.
 - `act` requires `ref` from `snapshot --format ai`; use `evaluate` for rare CSS selector needs.
+- Avoid `act` → `wait` by default; use it only in exceptional cases (no reliable UI state to wait on).

 ### `clawdis_canvas`
 Drive the node Canvas (present, eval, snapshot, A2UI).