Gateway: finalize WS control plane

Peter Steinberger
2025-12-09 14:41:41 +01:00
parent 9ef1545d06
commit b2e7fb01a9
23 changed files with 5209 additions and 2495 deletions

docs/architecture.md Normal file

@@ -0,0 +1,82 @@
# Gateway Architecture (target state)
Last updated: 2025-12-09
## Overview
- A single long-lived **Gateway** process owns all messaging surfaces (WhatsApp via Baileys, Telegram when enabled) and the control/event plane.
- All clients (macOS app, CLI, web UI, automations) connect to the Gateway over one transport: **WebSocket on 127.0.0.1:18789** (tunnel or VPN for remote).
- One Gateway per host; it is the only place that is allowed to open a WhatsApp session. All sends/agent runs go through it.
## Components and flows
- **Gateway (daemon)**
  - Maintains Baileys/Telegram connections.
  - Exposes a typed WS API (req/resp + server push events).
  - Validates every inbound frame against JSON Schema; rejects anything before a mandatory `hello`.
- **Clients (mac app / CLI / web admin)**
  - One WS connection per client.
  - Send requests (`health`, `status`, `send`, `agent`, `system-presence`, toggles) and subscribe to events (`tick`, `agent`, `presence`, `shutdown`).
- **Agent runner (Tau/Pi process)**
  - Spawned by the Gateway on demand for `agent` calls; streams events back over the same WS connection.
- **WebChat**
  - Serves static assets locally.
  - Holds a single WS connection to the Gateway for control/data; all sends/agent runs go through the Gateway WS.
  - Remote use goes through the same SSH/Tailscale tunnel as other clients.
## Connection lifecycle (single client)
```
Client                    Gateway
|                          |
|------- hello ----------->|
|<------ hello-ok ---------| (or hello-error + close)
|   (hello-ok carries snapshot: presence + health)
|                          |
|<------ event:presence ---| (deltas)
|<------ event:tick -------| (keepalive/no-op)
|                          |
|------- req:agent ------->|
|<------ res:agent --------| (ack: {runId,status:"accepted"})
|<------ event:agent ------| (streaming)
|<------ res:agent --------| (final: {runId,status,summary})
|                          |
```
## Wire protocol (summary)
- Transport: WebSocket, text frames with JSON payloads.
- First frame must be `hello {type:"hello", minProtocol, maxProtocol, client:{name,version,platform,mode,instanceId}, caps, auth?, locale?, userAgent? }`.
- Server replies `hello-ok {type:"hello-ok", protocol:<chosen>, server:{version,commit,host,connId}, features:{methods,events}, snapshot:{presence:[...], health:{...}, stateVersion:{presence,health}, uptimeMs}, policy:{maxPayload,maxBufferedBytes,tickIntervalMs} }`
or `hello-error {type:"hello-error", reason, expectedProtocol, minClient }` then closes.
- After handshake:
  - Requests: `{type:"req", id, method, params}` → `{type:"res", id, ok, payload|error}`
  - Events: `{type:"event", event:"agent"|"presence"|"tick"|"shutdown", payload, seq?, stateVersion?}`
- If `CLAWDIS_GATEWAY_TOKEN` (or `--token`) is set, `hello.auth.token` must match; otherwise the socket closes with policy violation.
- Presence payload is structured, not free text: `{host, ip, version, mode, lastInputSeconds?, ts, reason?, tags?[], instanceId? }`.
- Agent runs are acked `{runId,status:"accepted"}` then complete with a final res `{runId,status,summary}`; streamed output arrives as `event:"agent"`.
- Protocol versions are bumped on breaking changes; clients must satisfy `minClient`; the Gateway chooses a version within the client's `minProtocol`/`maxProtocol` range.
- Idempotency keys are required for side-effecting methods (`send`, `agent`) to safely retry; server keeps a short-lived dedupe cache.
- Policy in `hello-ok` communicates payload/queue limits and tick interval.
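
As a rough illustration of the handshake rule above, the following sketch gates the first text frame on a new socket. It is not the Gateway's implementation: `gateFirstFrame`, `GateResult`, the reason strings, and the field types inside `HelloFrame` (for example `caps` as a string array) are assumptions made for this example, and the real server validates the full frame with AJV rather than by hand.

```typescript
// Hello frame shape transcribed from the summary above; field types are assumptions.
type HelloFrame = {
  type: "hello";
  minProtocol: number;
  maxProtocol: number;
  client: { name: string; version: string; platform: string; mode: string; instanceId: string };
  caps: string[];
  auth?: { token?: string };
  locale?: string;
  userAgent?: string;
};

// Hypothetical result type for this sketch.
type GateResult =
  | { ok: true; hello: HelloFrame }
  | { ok: false; reason: string };

// Decide whether the first text frame may proceed.
// `expectedToken` stands in for CLAWDIS_GATEWAY_TOKEN / --token when one is set.
function gateFirstFrame(raw: string, expectedToken?: string): GateResult {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return { ok: false, reason: "non-json" };
  }
  if (typeof parsed !== "object" || parsed === null) {
    return { ok: false, reason: "not-an-object" };
  }
  const frame = parsed as Partial<HelloFrame>;
  if (frame.type !== "hello") {
    return { ok: false, reason: "first-frame-not-hello" };
  }
  // Minimal structural check only; a schema validator would cover every field.
  if (typeof frame.minProtocol !== "number" || typeof frame.maxProtocol !== "number") {
    return { ok: false, reason: "invalid-hello" };
  }
  if (expectedToken !== undefined && frame.auth?.token !== expectedToken) {
    return { ok: false, reason: "unauthorized" };
  }
  return { ok: true, hello: frame as HelloFrame };
}
```

A caller would close the socket on any `ok: false` result and otherwise continue with protocol negotiation and `hello-ok`.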
## Type system and codegen
- Source of truth: TypeBox (or ArkType) definitions in `protocol/` on the server.
- Build step emits JSON Schema.
- Clients:
  - TypeScript: uses the same TypeBox types directly.
  - Swift: generated `Codable` models via quicktype from the JSON Schema.
- Validation: AJV on the server for every inbound frame; optional client-side validation for defensive programming.
## Invariants
- Exactly one Gateway controls a single Baileys session per host. No fallbacks to ad-hoc direct Baileys sends.
- Handshake is mandatory; any non-JSON or non-hello first frame is a hard close.
- All methods and events are versioned; new fields are additive; breaking changes increment `protocol`.
- No event replay: on seq gaps, clients must refresh (`health` + `system-presence`) and continue; presence is bounded via TTL/max entries.
## Remote access
- Preferred: Tailscale or VPN; alternate: SSH tunnel `ssh -N -L 18789:127.0.0.1:18789 user@host`.
- Same protocol over the tunnel; same handshake. If a shared token is configured, clients must send it in `hello.auth.token` even over the tunnel.
## Operations snapshot
- Start: `clawdis gateway` (foreground, logs to stdout). Supervise with launchd/systemd for restarts.
- Health: request `health` over WS; also surfaced in `hello-ok.snapshot.health`.
- Metrics/logging: keep outside this spec; gateway should expose Prometheus text or structured logs separately.
## Migration notes
- This architecture supersedes the legacy stdin RPC and the ad-hoc TCP control port. New clients should speak only the WS protocol. Legacy compatibility is intentionally dropped.

docs/gateway.md Normal file

@@ -0,0 +1,126 @@
# Gateway (daemon) runbook
Last updated: 2025-12-09
## What it is
- The always-on process that owns the single Baileys/Telegram connection and the control/event plane.
- Replaces the legacy `relay` command. CLI entry point: `clawdis gateway`.
- Runs until stopped; exits non-zero on fatal errors so the supervisor restarts it.
## How to run (local)
```bash
clawdis gateway --port 18789
```
- Binds WebSocket control plane to `127.0.0.1:<port>` (default 18789).
- Logs to stdout; use launchd/systemd to keep it alive and rotate logs.
- Optional shared secret: pass `--token <value>` or set `CLAWDIS_GATEWAY_TOKEN` to require clients to send `hello.auth.token`.
## Remote access
- Tailscale/VPN preferred; otherwise SSH tunnel:
```bash
ssh -N -L 18789:127.0.0.1:18789 user@host
```
- Clients then connect to `ws://127.0.0.1:18789` through the tunnel.
- If a token is configured, clients must include it in `hello.auth.token` even over the tunnel.
## Protocol (operator view)
- Mandatory first frame from client: `hello {type:"hello", minProtocol, maxProtocol, client:{name,version,platform,mode,instanceId}, caps, auth?, locale?, userAgent? }`.
- Gateway replies `hello-ok {type:"hello-ok", protocol:<chosen>, server:{version,commit,host,connId}, features:{methods,events}, snapshot:{presence[], health, stateVersion, uptimeMs}, policy:{maxPayload,maxBufferedBytes,tickIntervalMs} }` or `hello-error`.
- After handshake:
  - Requests: `{type:"req", id, method, params}` → `{type:"res", id, ok, payload|error}`
  - Events: `{type:"event", event, payload, seq?, stateVersion?}`
- Structured presence entries: `{host, ip, version, mode, lastInputSeconds?, ts, reason?, tags?[], instanceId? }`.
- `agent` responses are two-stage: first `res` ack `{runId,status:"accepted"}`, then a final `res` `{runId,status:"ok"|"error",summary}` after the run finishes; streamed output arrives as `event:"agent"`.
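
The two-stage `agent` response can be tracked on the client roughly as follows. This is a sketch under assumptions: it assumes both `res` frames reuse the `id` of the original request, and `AgentCallTracker`, `track`, and `handle` are hypothetical names, not part of the Gateway client.

```typescript
// Payload and frame shapes follow the operator view above; types are assumptions.
type AgentPayload = { runId: string; status: string; summary?: string };
type AgentRes = {
  type: "res";
  id: string;
  ok: boolean;
  payload?: AgentPayload;
  error?: { code: string; message: string };
};

type PendingCall = {
  acked: boolean;
  onAck: (runId: string) => void;
  onFinal: (res: AgentRes) => void;
};

// Hypothetical helper: routes the ack and the final `res` of one agent request.
class AgentCallTracker {
  private pending = new Map<string, PendingCall>();

  track(id: string, onAck: (runId: string) => void, onFinal: (res: AgentRes) => void): void {
    this.pending.set(id, { acked: false, onAck, onFinal });
  }

  handle(res: AgentRes): "ack" | "final" | "ignored" {
    const entry = this.pending.get(res.id);
    if (!entry) return "ignored"; // unknown or already completed request
    const payload = res.payload;
    // First stage: {runId, status: "accepted"} keeps the request open.
    if (res.ok && !entry.acked && payload && payload.status === "accepted") {
      entry.acked = true;
      entry.onAck(payload.runId);
      return "ack";
    }
    // Anything else (final ok/error payload, or an error res) completes the request.
    this.pending.delete(res.id);
    entry.onFinal(res);
    return "final";
  }
}
```

Streamed `event:"agent"` frames would be matched to the run by the `runId` delivered in the ack.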
## Methods (initial set)
- `health` — full health snapshot (same shape as `clawdis health --json`).
- `status` — short summary.
- `system-presence` — current presence list.
- `system-event` — post a presence/system note (structured).
- `send` — send a message via the active provider(s).
- `agent` — run an agent turn (streams events back on same connection).
## Events
- `agent` — streamed tool/output events from the agent run (seq-tagged).
- `presence` — presence updates (deltas with stateVersion) pushed to all connected clients.
- `tick` — periodic keepalive/no-op to confirm liveness.
- `shutdown` — Gateway is exiting; payload includes `reason` and optional `restartExpectedMs`. Clients should reconnect.
## WebChat integration
- WebChat serves static assets locally (default port 18788, configurable).
- The WebChat backend keeps a single WS connection to the Gateway for control/data; all sends and agent runs flow through that connection.
- Remote use goes through the same SSH/Tailscale tunnel; if a gateway token is configured, WebChat must include it during hello.
- macOS app also connects via this WS (one socket); it hydrates presence from the initial snapshot and listens for `presence` events to update the UI.
## Typing and validation
- Server validates every inbound frame with AJV against JSON Schema emitted from the protocol definitions.
- Clients (TS/Swift) consume generated types (TS directly; Swift via quicktype from the JSON Schema).
- Types live in `src/gateway/protocol/*.ts`; regenerate schemas/models with `pnpm protocol:gen` (writes `dist/protocol.schema.json` and `apps/macos/Sources/ClawdisProtocol/Protocol.swift`).
## Connection snapshot
- `hello-ok` includes a `snapshot` with `presence`, `health`, `stateVersion`, and `uptimeMs` plus `policy {maxPayload,maxBufferedBytes,tickIntervalMs}` so clients can render immediately without extra requests.
- `health`/`system-presence` remain available for manual refresh, but are not required at connect time.
## Error codes (res.error shape)
- Errors use `{ code, message, details?, retryable?, retryAfterMs? }`.
- Standard codes:
- `NOT_LINKED` — WhatsApp not authenticated.
- `AGENT_TIMEOUT` — agent did not respond within the configured deadline.
- `INVALID_REQUEST` — schema/param validation failed.
- `UNAVAILABLE` — Gateway is shutting down or a dependency is unavailable.
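
A client can base its retry decision on the error shape alone. The sketch below assumes nothing about which codes the Gateway marks as retryable; `retryDelayMs` and its fallback delay are hypothetical.

```typescript
// Error shape as documented above.
type GatewayError = {
  code: string;
  message: string;
  details?: unknown;
  retryable?: boolean;
  retryAfterMs?: number;
};

// Returns the delay before retrying, or null when the error should not be retried.
// `fallbackMs` is an assumption used when the server gives no `retryAfterMs` hint.
function retryDelayMs(err: GatewayError, fallbackMs: number = 1000): number | null {
  if (err.retryable !== true) return null;
  return err.retryAfterMs !== undefined ? err.retryAfterMs : fallbackMs;
}
```

For example, an error carrying `retryable: true` and `retryAfterMs: 2500` yields a 2500 ms delay, while an error without `retryable` yields `null`.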
## Keepalive behavior
- `tick` events (or WS ping/pong) are emitted periodically so clients know the Gateway is alive even when no traffic occurs.
- Send/agent acknowledgements remain separate responses; do not overload ticks for sends.
## Replay / gaps
- Events are not replayed. Clients detect seq gaps and should refresh (`health` + `system-presence`) before continuing. WebChat and macOS clients now auto-refresh on gap.
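
Gap detection on the client side might look like the sketch below. It assumes `seq` increases by exactly one per event on a connection, which the text does not state explicitly; `SeqTracker` is a hypothetical name.

```typescript
// Tracks the last seen `seq` for one connection (assumption: seq increments by 1).
class SeqTracker {
  private last: number | null = null;

  // "gap" means at least one event was missed and the client should refresh
  // (`health` + `system-presence`); "stale" means an old or duplicate event.
  observe(seq: number | undefined): "ok" | "gap" | "stale" {
    if (seq === undefined) return "ok"; // `seq` is optional on events
    if (this.last === null || seq === this.last + 1) {
      this.last = seq;
      return "ok";
    }
    if (seq <= this.last) return "stale";
    this.last = seq;
    return "gap";
  }

  // Call after a reconnect, since a new connection starts a new sequence.
  reset(): void {
    this.last = null;
  }
}
```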
## Supervision (macOS example)
- Use launchd to keep the daemon alive:
  - Program: path to `clawdis`
  - Arguments: `gateway`
  - KeepAlive: true
  - StandardOut/Err: file paths or `syslog`
- On failure, launchd restarts; fatal misconfig should keep exiting so the operator notices.
## Supervision (systemd example)
```
[Unit]
Description=Clawdis Gateway
After=network-online.target
Wants=network-online.target
[Service]
ExecStart=/usr/local/bin/clawdis gateway --port 18789
Restart=on-failure
RestartSec=5
User=clawdis
Environment=CLAWDIS_GATEWAY_TOKEN=
WorkingDirectory=/home/clawdis
[Install]
WantedBy=multi-user.target
```
Enable with `systemctl enable --now clawdis-gateway.service`.
## Operational checks
- Liveness: open WS and send `hello` → expect `hello-ok` (with snapshot).
- Readiness: call `health` → expect `ok: true` and `web.linked=true`.
- Debug: subscribe to `tick` and `presence` events; ensure `status` shows linked/auth age; presence entries show Gateway host and connected clients.
## Safety guarantees
- Only one Gateway per host; all sends/agent calls must go through it.
- No fallback to direct Baileys connections; if the Gateway is down, sends fail fast.
- Non-hello first frames or malformed JSON are rejected and the socket is closed.
- Graceful shutdown: emit `shutdown` event before closing; clients must handle close + reconnect.
## CLI helpers
- `clawdis gw:health` / `gw:status` — request health/status over the Gateway WS.
- `clawdis gw:send --to <num> --message "hi" [--media-url ...]` — send via Gateway (idempotent).
- `clawdis gw:agent --message "hi" [--to ...]` — run an agent turn (waits for final by default).
- `clawdis gw:call <method> --params '{"k":"v"}'` — raw method invoker for debugging.
## Migration guidance
- Retire uses of `clawdis relay` and the legacy TCP control port.
- Update clients to speak the WS protocol with mandatory hello and structured presence.

docs/refactor/new-arch.md Normal file

@@ -0,0 +1,167 @@
# New Gateway Architecture Implementation Plan (detailed)
Last updated: 2025-12-09
Goal: replace legacy relay/stdin/TCP control with a single WebSocket Gateway, typed protocol, and first-frame snapshot. No backward compatibility.
---
## Phase 0 — Foundations
- **Naming**: CLI subcommand `clawdis gateway`; internal namespace `Gateway`.
- **Protocol folder**: create `protocol/` for schemas and build artifacts. ✅ `src/gateway/protocol`.
- **Schema tooling**:
  - Prefer **TypeBox** (or ArkType) as source-of-truth types. ✅ TypeBox in `schema.ts`.
  - `pnpm protocol:gen`:
    1) emits JSON Schema (`dist/protocol.schema.json`),
    2) runs quicktype → Swift `Codable` models (`apps/macos/Sources/ClawdisProtocol/Protocol.swift`). ✅
  - AJV compile step for server validators. ✅
- **CI**: add a job that fails if schema or generated Swift is stale. ✅ `pnpm protocol:check` (runs gen + git diff).
## Phase 1 — Protocol specification
- Frames (WS text JSON, all with explicit `type`):
  - `hello {type:"hello", minProtocol, maxProtocol, client:{name,version,platform,mode,instanceId}, caps, auth:{token?}, locale?, userAgent?}`
  - `hello-ok {type:"hello-ok", protocol:<chosen>, server:{version,commit,host,connId}, features:{methods,events}, snapshot:{presence[], health, stateVersion:{presence,health}, uptimeMs}, policy:{maxPayload, maxBufferedBytes, tickIntervalMs}}`
  - `hello-error {type:"hello-error", reason, expectedProtocol, minClient}`
  - `req {type:"req", id, method, params?}`
  - `res {type:"res", id, ok, payload?, error?}` where `error` = `{code,message,details?,retryable?,retryAfterMs?}`
  - `event {type:"event", event, payload, seq?, stateVersion?}` (presence/tick/shutdown/agent)
  - `close` (standard WS close codes; policy uses 1008 for slow consumer/unauthorized, 1012/1001 for restart)
- Payload types:
  - `PresenceEntry {host, ip, version, mode, lastInputSeconds?, ts, reason?, tags?[], instanceId?}`
  - `HealthSnapshot` (match existing `clawdis health --json` fields)
  - `AgentEvent` (streamed tool/output; `{runId, seq, stream, data, ts}`)
  - `TickEvent {ts}`
  - `ShutdownEvent {reason, restartExpectedMs?}`
- Error codes: `NOT_LINKED`, `AGENT_TIMEOUT`, `INVALID_REQUEST`, `UNAVAILABLE`.
- Error shape: `{code, message, details?, retryable?, retryAfterMs?}`
- Rules:
- First frame must be `type:"hello"`; otherwise close. Add handshake timeout (e.g., 3s) for silent clients.
- Negotiate protocol: server picks within `[minProtocol,maxProtocol]`; if none, send `hello-error`.
- Protocol version bump on breaking changes; `hello-error` must include `minClient` when needed.
- `stateVersion` increments for presence/health to drop stale deltas.
- Stable IDs: client sends `instanceId`; server issues per-connection `connId` in `hello-ok`; presence entries may include `instanceId` to dedupe reconnects.
- Token-based auth: bearer token in `auth.token`; required except for loopback development.
- Presence is primarily connection-derived; client may add hints (e.g., lastInputSeconds); entries expire via TTL to keep the map bounded (e.g., 5m TTL, max 200 entries).
- Idempotency keys: required for `send` and `agent` to safely retry after disconnects.
- Size limits: bound first-frame size by `maxPayload`; reject early if exceeded.
- Close on any non-JSON or wrong `type` before hello.
- Per-op idempotency keys: client SHOULD supply an explicit key per `send`/`agent`; if omitted, server may derive a scoped key from `instanceId+connId`, but explicit keys are safer across reconnects.
- Locale/userAgent are informational; server may log them for analytics but must not rely on them for access control.
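
The negotiation rule (server picks within `[minProtocol,maxProtocol]`, else `hello-error`) can be sketched as a range intersection. Picking the highest mutually supported version is an assumption of this example, and `negotiateProtocol` is a hypothetical name.

```typescript
// Returns the chosen protocol version, or null when the ranges do not overlap
// (in which case the server would send `hello-error` and close).
function negotiateProtocol(
  clientMin: number,
  clientMax: number,
  serverMin: number,
  serverMax: number,
): number | null {
  if (clientMin > clientMax) return null; // malformed client range
  const lowest = Math.max(clientMin, serverMin);
  const highest = Math.min(clientMax, serverMax);
  // Assumption: prefer the newest version both sides support.
  return lowest <= highest ? highest : null;
}
```

With a server supporting versions 2 through 5, a client advertising 1 through 3 would be assigned version 3, and a client pinned to version 1 would be rejected.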
## Phase 2 — Gateway WebSocket server
- New module `src/gateway/server.ts`:
  - Bind 127.0.0.1:18789 (configurable).
  - On connect: validate `hello`, send `hello-ok` with snapshot, start event pump.
  - Per-connection queues with backpressure (bounded; drop oldest non-critical).
  - WS-level caps: set `maxPayload` to cap frame size before JSON parse.
  - Emit `tick` every N seconds when idle (or WS ping/pong if adequate).
  - Emit `shutdown` before exit; then close sockets.
- Methods implemented:
  - `health`, `status`, `system-presence`, `system-event`, `send`, `agent`.
  - Optional: `set-heartbeats` removed/renamed if heartbeat concept is retired.
- Events implemented:
  - `agent`, `presence` (deltas, with `stateVersion`), `tick`, `shutdown`.
  - All events include `seq` for loss/out-of-order detection.
- Logging: structured logs on connect/close/error; include client fingerprint.
- Slow consumer policy:
  - Per-connection outbound queue limit (bytes/messages). If exceeded, drop non-critical events (presence/tick) or close with a policy violation / retryable code; clients reconnect with backoff.
- Handshake edge cases:
  - Close on handshake timeout.
  - Close on over-limit first frame (maxPayload).
  - Close immediately on non-JSON or wrong `type` before hello.
- Default guardrails: `maxPayload` ~512KB, handshake timeout ~3s, outbound buffered amount cap ~1.5MB (tune as you implement).
- Dedupe cache: bound TTL (~5m) and max size (~1000 entries); evict oldest first (LRU) to prevent memory growth.
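
The slow consumer policy above could be approximated with a byte-bounded queue that sheds the oldest non-critical events first. This is a sketch, not the server's queue: `OutboundQueue`, the `Outbound` shape, and the choice to measure size via a precomputed `bytes` field are assumptions.

```typescript
// One queued outbound frame; `bytes` is the serialized size (assumed precomputed).
type Outbound = { event: string; bytes: number };

// Events the text names as droppable under pressure.
const NON_CRITICAL = new Set<string>(["presence", "tick"]);

class OutboundQueue {
  private items: Outbound[] = [];
  private total = 0;
  private maxBufferedBytes: number;

  constructor(maxBufferedBytes: number) {
    this.maxBufferedBytes = maxBufferedBytes;
  }

  // Returns false when the limit is exceeded and nothing droppable remains;
  // the caller would then close the socket (e.g. with 1008).
  push(item: Outbound): boolean {
    this.items.push(item);
    this.total += item.bytes;
    while (this.total > this.maxBufferedBytes) {
      // Oldest non-critical event goes first.
      const idx = this.items.findIndex((queued) => NON_CRITICAL.has(queued.event));
      if (idx === -1) return false;
      this.total -= this.items[idx].bytes;
      this.items.splice(idx, 1);
    }
    return true;
  }

  shift(): Outbound | undefined {
    const next = this.items.shift();
    if (next) this.total -= next.bytes;
    return next;
  }

  get bufferedBytes(): number {
    return this.total;
  }

  get length(): number {
    return this.items.length;
  }
}
```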
## Phase 3 — Gateway CLI entrypoint
- Add `clawdis gateway` command in CLI program:
  - Reads config (port, WS options).
  - Foreground process; exit non-zero on fatal errors.
  - Flags: `--port`, `--no-tick` (optional), `--log-json` (optional).
- System supervision docs for launchd/systemd (see `gateway.md`).
## Phase 4 — Presence/health snapshot & stateVersion
- `hello-ok.snapshot` includes:
  - `presence[]` (current list)
  - `health` (full snapshot)
  - `stateVersion {presence:int, health:int}`
  - `uptimeMs`
- `hello-ok` also carries `policy {maxPayload, maxBufferedBytes, tickIntervalMs}` alongside the snapshot.
- Emit `presence` deltas with updated `stateVersion.presence`.
- Emit `tick` to indicate liveness when no other events occur.
- Keep `health` method for manual refresh; not required after connect.
- Presence expiry: prune entries older than TTL; enforce a max map size; include `stateVersion` in presence events.
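
A bounded presence map with TTL pruning and a `stateVersion` counter might be sketched as follows. `PresenceStore`, keying by `instanceId` with a fallback to `host`, and bumping the version once per prune pass are assumptions of this example.

```typescript
// Subset of the PresenceEntry fields; `ts` is a millisecond timestamp (assumption).
type PresenceEntry = {
  host: string;
  ip: string;
  version: string;
  mode: string;
  ts: number;
  instanceId?: string;
};

class PresenceStore {
  private entries = new Map<string, PresenceEntry>();
  private version = 0;
  private ttlMs: number;
  private maxEntries: number;

  constructor(ttlMs: number, maxEntries: number) {
    this.ttlMs = ttlMs;
    this.maxEntries = maxEntries;
  }

  // Insert or refresh an entry; returns the new stateVersion.presence.
  upsert(entry: PresenceEntry): number {
    const key = entry.instanceId !== undefined ? entry.instanceId : entry.host;
    this.entries.delete(key); // re-insert so the entry becomes the newest
    this.entries.set(key, entry);
    while (this.entries.size > this.maxEntries) {
      const oldest = this.entries.keys().next().value; // Map keeps insertion order
      if (oldest === undefined) break;
      this.entries.delete(oldest);
    }
    this.version += 1;
    return this.version;
  }

  // Drop entries older than the TTL; the version changes only if something was removed.
  prune(nowMs: number): number {
    const expired: string[] = [];
    this.entries.forEach((entry, key) => {
      if (nowMs - entry.ts > this.ttlMs) expired.push(key);
    });
    expired.forEach((key) => {
      this.entries.delete(key);
    });
    if (expired.length > 0) this.version += 1;
    return this.version;
  }

  snapshot(): { presence: PresenceEntry[]; stateVersion: number } {
    return { presence: Array.from(this.entries.values()), stateVersion: this.version };
  }
}
```

Clients can drop any presence delta whose `stateVersion` is not newer than the one they already hold.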
## Phase 5 — Clients migration
- **macOS app**:
  - Replace stdio/SSH RPC with WS client (tunneled via SSH/Tailscale for remote). ✅ AgentRPC/ControlChannel now use Gateway WS.
  - Implement handshake, snapshot hydration, subscriptions to `presence`, `tick`, `agent`, `shutdown`. ✅ snapshot + presence events broadcast to InstancesStore; agent events still to wire to UI if desired.
  - Remove immediate `health/system-presence` fetch on connect. ✅ presence hydrated from snapshot; periodic refresh kept as fallback.
  - Handle `hello-error` and retry with backoff if version/token mismatched. ✅ macOS GatewayChannel reconnects with exponential backoff.
- **CLI**:
  - Add lightweight WS client helper for `status/health/send/agent` when Gateway is up. ✅ `gw:*` commands use the Gateway over WS.
  - Consider a “local only” flag to avoid accidental remote connects. (optional; not needed with tunnel-first model.)
- **WebChat backend**:
  - Single WS to Gateway; seed UI from snapshot; forward `presence/tick/agent` to browser. ✅ implemented via `GatewayClient` in `webchat/server.ts`.
  - Fail fast if handshake fails; no fallback transports. ✅ (webchat returns gateway unavailable)
## Phase 6 — Send/agent path hardening
- Ensure only the Gateway can open Baileys; no IPC fallback.
- `send` executes in-process; respond with explicit result/error, not via heartbeat.
- `agent` spawns Tau/Pi; respond quickly with `{runId,status:"accepted"}` (ack); stream `event:agent {runId, seq, stream, data, ts}`; final `res:agent {runId, status:"ok"|"error", summary}` completes request (idempotent via key).
- Idempotency: side-effecting methods (`send`, `agent`) accept an idempotency key; keep a short-lived dedupe cache to avoid double-send on client retries. Client retry flow: on timeout/close, retry with same key; Gateway returns cached result when available; cache TTL ~5m and bounded.
- Agent stream ordering: enforce monotonic `seq` per runId; if gap detected by server, terminate stream with error; if detected by client, issue a retry with same idempotency key.
- Send response shape: `{messageId?, toJid?, error?}` and always include `runId` when available for traceability.
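
The idempotency flow described above (retry with the same key, Gateway returns the cached result) can be illustrated with a small TTL-bounded cache. `DedupeCache` and `runOnce` are hypothetical names, and eviction here is by insertion order (oldest first), which is a simplification.

```typescript
// Short-lived result cache keyed by idempotency key.
class DedupeCache<T> {
  private entries = new Map<string, { value: T; storedAtMs: number }>();
  private ttlMs: number;
  private maxEntries: number;

  constructor(ttlMs: number, maxEntries: number) {
    this.ttlMs = ttlMs;
    this.maxEntries = maxEntries;
  }

  get(key: string, nowMs: number): T | undefined {
    const hit = this.entries.get(key);
    if (hit === undefined) return undefined;
    if (nowMs - hit.storedAtMs > this.ttlMs) {
      this.entries.delete(key); // expired
      return undefined;
    }
    return hit.value;
  }

  set(key: string, value: T, nowMs: number): void {
    this.entries.delete(key);
    this.entries.set(key, { value, storedAtMs: nowMs });
    while (this.entries.size > this.maxEntries) {
      const oldest = this.entries.keys().next().value; // Map keeps insertion order
      if (oldest === undefined) break;
      this.entries.delete(oldest);
    }
  }
}

// Execute `work` at most once per key while the cached result is still live.
function runOnce<T>(cache: DedupeCache<T>, key: string, nowMs: number, work: () => T): T {
  const cached = cache.get(key, nowMs);
  if (cached !== undefined) return cached;
  const result = work();
  cache.set(key, result, nowMs);
  return result;
}
```

A retried `send` with the same key inside the TTL returns the first result without sending again; after the TTL it executes normally.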
## Phase 7 — Keepalive and shutdown semantics
- Keepalive: `tick` events (or WS ping/pong) at fixed interval; clients treat missing ticks as disconnect and reconnect.
- Shutdown: send `event:shutdown {reason, restartExpectedMs?}` then close sockets; clients auto-reconnect.
- Restart semantics: close sockets with a standard retryable close code; on reconnect, `hello-ok` snapshot must be sufficient to rebuild UI without event replay.
- Use a standard close code (e.g., 1012 service restart or 1001 going away) for planned restart; 1008 policy violation for slow consumers.
- Include `policy` in `hello-ok` so clients know the tick interval and buffer limits to tune their expectations.
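
On the client, the keepalive and close-code rules above reduce to two small checks. The sketch assumes a tolerance of two missed ticks before declaring the connection dead; `isTickOverdue`, `classifyClose`, and `CloseKind` are hypothetical names.

```typescript
// True when no tick has arrived for longer than the allowed number of intervals.
// `tickIntervalMs` comes from `hello-ok.policy`; `missedTicksAllowed` is an assumption.
function isTickOverdue(
  lastTickAtMs: number,
  nowMs: number,
  tickIntervalMs: number,
  missedTicksAllowed: number = 2,
): boolean {
  return nowMs - lastTickAtMs > tickIntervalMs * missedTicksAllowed;
}

type CloseKind = "planned-restart" | "policy-violation" | "other";

// Maps the close codes named above to a coarse category for reconnect logic.
function classifyClose(code: number): CloseKind {
  if (code === 1012 || code === 1001) return "planned-restart";
  if (code === 1008) return "policy-violation";
  return "other";
}
```

A client might reconnect promptly after a planned restart and back off more conservatively after a policy violation.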
## Phase 8 — Cleanup and deprecation
- Retire `clawdis rpc` as default path; keep only if explicitly requested (documented as legacy).
- Remove reliance on `src/infra/control-channel.ts` for new clients; mark as legacy or delete after migration. ✅ file removed; mac app now uses Gateway WS.
- Update README, docs (`architecture.md`, `gateway.md`, `webchat.md`) to final shapes; remove `control-api.md` references if obsolete.
- Presence hygiene:
  - Presence derived primarily from connection (server-fills host/ip/version/connId/instanceId); allow client hints (e.g., lastInputSeconds).
  - Add TTL/expiry; prune to keep map bounded (e.g., 5m TTL, max 200 entries).
## Edge cases and ordering
- Event ordering: all events carry `seq`; clients detect gaps and should re-fetch snapshot (or targeted refresh) on gap.
- Partial handshakes: if client connects and never sends hello, server closes after handshake timeout.
- Garbage/oversize first frame: bounded by `maxPayload`; server closes immediately on parse failure.
- Duplicate delivery on reconnect: clients must send idempotency keys; Gateway dedupe cache prevents double-send/agent execution.
- Snapshot sufficiency: `hello-ok.snapshot` must contain enough to render UI after reconnect without event replay.
- Client reconnect guidance: exponential backoff with jitter; reuse same `instanceId` across reconnects to avoid duplicate presence; resend idempotency keys for in-flight sends/agents; on seq gap, issue `health`/`system-presence` refresh.
- Presence TTL/defaults: set a concrete TTL (e.g., 5 minutes) and prune periodically; cap the presence map size with LRU if needed.
- Replay policy: if seq gap detected, server does not replay; clients must pull fresh `health` + `system-presence` and continue.
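
The reconnect guidance (exponential backoff with jitter) can be sketched as a pure delay function. The base delay, cap, and "full jitter" strategy are assumptions; `backoffDelayMs` is a hypothetical name, and the `random` parameter exists only to make the function testable.

```typescript
// Delay before reconnect attempt number `attempt` (0-based).
// Full jitter: a uniformly random delay between 0 and the capped exponential value.
function backoffDelayMs(
  attempt: number,
  baseMs: number = 500,
  capMs: number = 30000,
  random: () => number = Math.random,
): number {
  const exponential = Math.min(capMs, baseMs * Math.pow(2, attempt));
  return Math.floor(random() * exponential);
}
```

The same `instanceId` should be reused across these reconnects so that presence is not duplicated.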
## Phase 9 — Testing & validation
- Unit: frame validation, handshake failure, auth/token, stateVersion on presence events, agent stream fanout, send dedupe. ✅
- Integration: connect → snapshot → req/res → streaming agent → shutdown. ✅ Covered in gateway WS tests (hello/health/status/presence, agent ack+final, shutdown broadcast).
- Load: multiple concurrent WS clients; backpressure behavior under burst. ✅ Basic fanout test with 3 clients receiving presence broadcast; heavier soak still recommended.
- Mac app smoke: presence/health render from snapshot; reconnect on tick loss. (Manual: open Instances tab, verify snapshot after connect, induce seq gap by toggling wifi, ensure UI refreshes.)
- WebChat smoke: snapshot seed + event updates; tunnel scenario. ✅ Offline snapshot harness in `src/webchat/server.test.ts` (mock gateway) now passes; live tunnel still recommended for manual.
- Idempotency tests: retry send/agent with same key after forced disconnect; expect deduped result. ✅ send + agent dedupe + reconnect retry covered in gateway tests.
- Seq-gap handling: ✅ clients now detect seq gaps (GatewayClient + mac GatewayChannel) and refresh health/presence (webchat) or trigger UI refresh (mac). Load-test still optional.
## Phase 10 — Rollout
- Version bump; release notes: breaking change to control plane (WS only).
- Ship launchd/systemd templates for `clawdis gateway`.
- Recommend Tailscale/SSH tunnel for remote access; no additional auth layer assumed in this model.
---
## Quick checklist
- [x] Protocol types & schemas (TS + JSON Schema + Swift via quicktype)
- [x] AJV validators wired
- [x] WS server with hello → snapshot → events
- [x] Tick + shutdown events
- [x] stateVersion + presence deltas
- [x] Gateway CLI command
- [x] macOS app WS client (Gateway WS for control; presence events live; agent stream UI pending)
- [x] WebChat WS client
- [x] Remove legacy stdin/TCP paths from default flows (file removed; mac app/CLI on Gateway)
- [x] Tests (unit/integration/load): unit + integration + basic fanout/reconnect; heavier load/soak optional
- [x] Docs updated and legacy docs flagged