From 323200b5514d34cc45105224fcf73fb6efbc7740 Mon Sep 17 00:00:00 2001 From: Peter Steinberger Date: Sun, 11 Jan 2026 04:46:30 +0000 Subject: [PATCH] test(live): harden gateway probes --- CHANGELOG.md | 2 ++ docs/testing.md | 24 ++++++++++++++----- .../gateway-models.profiles.live.test.ts | 16 +++++++------ 3 files changed, 29 insertions(+), 13 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index 81e55bf3c..8a464b7d9 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -34,6 +34,7 @@ - Agents: enforce single-writer session locks and drop orphan tool results to prevent tool-call ID failures (MiniMax/Anthropic-compatible APIs). - Docs: make `clawdbot status` the first diagnostic step, clarify `status --deep` behavior, and document `/whoami` + `/id`. - Docs/Testing: clarify live tool+image probes and how to list your testable `provider/model` ids. +- Tests/Live: make gateway bash+read probes resilient to provider formatting while still validating real tool calls. - WhatsApp: detect @lid mentions in groups using authDir reverse mapping + resolve self JID E.164 for mention gating. (#692) — thanks @peschee. - Gateway/Auth: default to token auth on loopback during onboarding, add doctor token generation flow, and tighten audio transcription config to Whisper-only. - Providers: dedupe inbound messages across providers to avoid duplicate LLM runs on redeliveries/reconnects. (#689) — thanks @adam91holt. @@ -73,6 +74,7 @@ - Models/Auth: allow MiniMax API configs without `models.providers.minimax.apiKey` (auth profiles / `MINIMAX_API_KEY`). (#656) — thanks @mneves75. - Agents: avoid duplicate replies when the message tool sends. (#659) — thanks @mickahouan. - Agents: harden Cloud Code Assist tool ID sanitization (toolUse/toolCall/toolResult) and scrub extra JSON Schema constraints. (#665) — thanks @sebslight. +- Agents: sanitize tool results + Cloud Code Assist tool IDs at context-build time (prevents mid-run strict-provider request rejects). - Agents/Tools: resolve workspace-relative Read/Write/Edit paths; align bash default cwd. (#642) — thanks @mukhtharcm. - Discord: include forwarded message snapshots in agent session context. (#667) — thanks @rubyrunsstuff. - Telegram: add `telegram.draftChunk` to tune draft streaming chunking for `streamMode: "block"`. (#667) — thanks @rubyrunsstuff. diff --git a/docs/testing.md b/docs/testing.md index d49a690c5..7a883ab02 100644 --- a/docs/testing.md +++ b/docs/testing.md @@ -122,6 +122,11 @@ Live tests are split into two layers so we can isolate failures: - a real tool invocation works (read probe) - optional extra tool probes (bash+read probe) - OpenAI regression paths (tool-call-only → follow-up) keep working +- Probe details (so you can explain failures quickly; a minimal sketch of the read probe follows below): + - `read` probe: the test writes a nonce file in the workspace and asks the agent to `read` it and echo the nonce back. + - `bash+read` probe: the test asks the agent to `bash`-write a nonce into a temp file, then `read` it back. + - image probe: the test attaches a generated PNG (cat + randomized code) and expects the model to reply with `cat` followed by that code. + - Implementation reference: `src/gateway/gateway-models.profiles.live.test.ts` and `src/gateway/live-image-probe.ts`. 
- How to enable: - `CLAWDBOT_LIVE_TEST=1` or `LIVE=1` - `CLAWDBOT_LIVE_GATEWAY=1` (required for this test to run) @@ -211,16 +216,19 @@ Narrow, explicit allowlists are fastest and least flaky: - `LIVE=1 CLAWDBOT_LIVE_GATEWAY=1 CLAWDBOT_LIVE_GATEWAY_ALL_MODELS=1 CLAWDBOT_LIVE_GATEWAY_MODELS="openai/gpt-5.2" pnpm test:live src/gateway/gateway-models.profiles.live.test.ts` - Tool calling across several providers (bash + read probe): - - `LIVE=1 CLAWDBOT_LIVE_GATEWAY=1 CLAWDBOT_LIVE_GATEWAY_ALL_MODELS=1 CLAWDBOT_LIVE_GATEWAY_TOOL_PROBE=1 CLAWDBOT_LIVE_GATEWAY_MODELS="openai/gpt-5.2,anthropic/claude-opus-4-5,google/gemini-3-flash-preview,zai/glm-4.7,minimax/minimax-m2.1" pnpm test:live src/gateway/gateway-models.profiles.live.test.ts` + - `LIVE=1 CLAWDBOT_LIVE_GATEWAY=1 CLAWDBOT_LIVE_GATEWAY_ALL_MODELS=1 CLAWDBOT_LIVE_GATEWAY_TOOL_PROBE=1 CLAWDBOT_LIVE_GATEWAY_MODELS="openai/gpt-5.2,anthropic/claude-opus-4-5,google/gemini-3-flash,zai/glm-4.7,minimax/minimax-m2.1" pnpm test:live src/gateway/gateway-models.profiles.live.test.ts` - Google focus (Gemini API key + Antigravity): - - Gemini (API key): `LIVE=1 CLAWDBOT_LIVE_GATEWAY=1 CLAWDBOT_LIVE_GATEWAY_ALL_MODELS=1 CLAWDBOT_LIVE_GATEWAY_TOOL_PROBE=1 CLAWDBOT_LIVE_GATEWAY_IMAGE_PROBE=1 CLAWDBOT_LIVE_GATEWAY_MODELS="google/gemini-3-flash-preview" pnpm test:live src/gateway/gateway-models.profiles.live.test.ts` + - Gemini (API key): `LIVE=1 CLAWDBOT_LIVE_GATEWAY=1 CLAWDBOT_LIVE_GATEWAY_ALL_MODELS=1 CLAWDBOT_LIVE_GATEWAY_TOOL_PROBE=1 CLAWDBOT_LIVE_GATEWAY_IMAGE_PROBE=1 CLAWDBOT_LIVE_GATEWAY_MODELS="google/gemini-3-flash" pnpm test:live src/gateway/gateway-models.profiles.live.test.ts` - Antigravity (OAuth): `LIVE=1 CLAWDBOT_LIVE_GATEWAY=1 CLAWDBOT_LIVE_GATEWAY_ALL_MODELS=1 CLAWDBOT_LIVE_GATEWAY_TOOL_PROBE=1 CLAWDBOT_LIVE_GATEWAY_IMAGE_PROBE=1 CLAWDBOT_LIVE_GATEWAY_MODELS="google-antigravity/claude-opus-4-5-thinking,google-antigravity/gemini-3-pro-high" pnpm test:live src/gateway/gateway-models.profiles.live.test.ts` Notes: - `google/...` uses the Gemini API (API key). - `google-antigravity/...` uses the Antigravity OAuth bridge (Cloud Code Assist-style agent endpoint). - `google-gemini-cli/...` uses the local Gemini CLI on your machine (separate auth + tooling quirks). +- Gemini API vs Gemini CLI: + - API: Clawdbot calls Google’s hosted Gemini API over HTTP (API key / profile auth); this is what most users mean by “Gemini”. + - CLI: Clawdbot shells out to a local `gemini` binary; it has its own auth and can behave differently (streaming/tool support/version skew). 
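To make the probe mechanics concrete, here is a minimal TypeScript sketch of the nonce-style `read` probe described above. It is illustrative only: `runAgentTurn` stands in for whatever sends one message through the gateway and returns the agent's final reply text; the real implementation lives in `src/gateway/gateway-models.profiles.live.test.ts`.

```ts
// Minimal sketch of the nonce-style `read` probe (illustrative shapes only; the
// real logic lives in src/gateway/gateway-models.profiles.live.test.ts).
import { randomUUID } from "node:crypto";
import { mkdtempSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import path from "node:path";

// `runAgentTurn` is a placeholder for the gateway call the real test makes.
export async function readProbe(
  runAgentTurn: (message: string) => Promise<string>,
): Promise<boolean> {
  const nonceA = `nonceA=${randomUUID()}`;
  const nonceB = `nonceB=${randomUUID()}`;
  const dir = mkdtempSync(path.join(tmpdir(), "live-read-probe-"));
  const probePath = path.join(dir, "read-probe.txt");
  writeFileSync(probePath, `${nonceA}\n${nonceB}\n`, "utf8");

  const reply = await runAgentTurn(
    `Use the tool named \`read\` (or \`Read\`) with JSON arguments {"path":"${probePath}"}. ` +
      "Then reply with the two nonce values you read (include both).",
  );

  // Resilient check: both nonces appear somewhere in the reply, regardless of
  // how the provider formats the surrounding text.
  return reply.includes(nonceA) && reply.includes(nonceB);
}
```

The point of the nonce is that a passing reply proves a real tool round-trip happened: the model cannot know the values without actually reading the file.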
## Live: model matrix (what we cover) @@ -232,20 +240,20 @@ This is the “common models” run we expect to keep working: - OpenAI (non-Codex): `openai/gpt-5.2` (optional: `openai/gpt-5.1`) - OpenAI Codex: `openai-codex/gpt-5.2` (optional: `openai-codex/gpt-5.2-codex`) - Anthropic: `anthropic/claude-opus-4-5` (or `anthropic/claude-sonnet-4-5`) -- Google (Gemini API): `google/gemini-3-pro-preview` and `google/gemini-3-flash-preview` (avoid older Gemini 2.x models) +- Google (Gemini API): `google/gemini-3-pro` and `google/gemini-3-flash` (avoid older Gemini 2.x models) - Google (Antigravity): `google-antigravity/claude-opus-4-5-thinking` and `google-antigravity/gemini-3-flash` - Z.AI (GLM): `zai/glm-4.7` - MiniMax: `minimax/minimax-m2.1` Run gateway smoke with tools + image: -`LIVE=1 CLAWDBOT_LIVE_GATEWAY=1 CLAWDBOT_LIVE_GATEWAY_TOOL_PROBE=1 CLAWDBOT_LIVE_GATEWAY_IMAGE_PROBE=1 CLAWDBOT_LIVE_GATEWAY_MODELS="openai/gpt-5.2,openai-codex/gpt-5.2,anthropic/claude-opus-4-5,google/gemini-3-pro-preview,google/gemini-3-flash-preview,google-antigravity/claude-opus-4-5-thinking,google-antigravity/gemini-3-flash,zai/glm-4.7,minimax/minimax-m2.1" pnpm test:live src/gateway/gateway-models.profiles.live.test.ts` +`LIVE=1 CLAWDBOT_LIVE_GATEWAY=1 CLAWDBOT_LIVE_GATEWAY_TOOL_PROBE=1 CLAWDBOT_LIVE_GATEWAY_IMAGE_PROBE=1 CLAWDBOT_LIVE_GATEWAY_MODELS="openai/gpt-5.2,openai-codex/gpt-5.2,anthropic/claude-opus-4-5,google/gemini-3-pro,google/gemini-3-flash,google-antigravity/claude-opus-4-5-thinking,google-antigravity/gemini-3-flash,zai/glm-4.7,minimax/minimax-m2.1" pnpm test:live src/gateway/gateway-models.profiles.live.test.ts` ### Baseline: tool calling (Read + optional Bash) Pick at least one per provider family: - OpenAI: `openai/gpt-5.2` (or `openai/gpt-5-mini`) - Anthropic: `anthropic/claude-opus-4-5` (or `anthropic/claude-sonnet-4-5`) -- Google: `google/gemini-3-flash-preview` (or `google/gemini-3-pro-preview`) +- Google: `google/gemini-3-flash` (or `google/gemini-3-pro`) - Z.AI (GLM): `zai/glm-4.7` - MiniMax: `minimax/minimax-m2.1` @@ -263,7 +271,11 @@ Run with `CLAWDBOT_LIVE_GATEWAY_IMAGE_PROBE=1` and include at least one image-ca If you have keys enabled, we also support testing via: - OpenRouter: `openrouter/...` (hundreds of models; use `clawdbot models scan` to find tool+image capable candidates) -- OpenCode Zen: `opencode-zen/...` (requires `OPENCODE_ZEN_API_KEY`) +- OpenCode Zen: `opencode/...` (auth via `OPENCODE_API_KEY` / `OPENCODE_ZEN_API_KEY`) + +More providers you can include in the live matrix (if you have creds/config): +- Built-in: `openai`, `openai-codex`, `anthropic`, `google`, `google-vertex`, `google-antigravity`, `google-gemini-cli`, `zai`, `openrouter`, `opencode`, `xai`, `groq`, `cerebras`, `mistral`, `github-copilot` +- Via `models.providers` (custom endpoints): `minimax` (cloud/API), plus any OpenAI/Anthropic-compatible proxy (LM Studio, vLLM, LiteLLM, etc.) Tip: don’t try to hardcode “all models” in docs. The authoritative list is whatever `discoverModels(...)` returns on your machine + whatever keys are available. 
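The authoritative model list is whatever discovery returns on your machine plus your keys; `CLAWDBOT_LIVE_GATEWAY_MODELS` only narrows it. A minimal sketch of how such an allowlist can be parsed and applied (the helper names here are illustrative, not the test's actual API):

```ts
// Minimal sketch of applying a CLAWDBOT_LIVE_GATEWAY_MODELS allowlist to
// discovered `provider/model` ids (illustrative helpers, not the test's API).
export function parseModelAllowlist(raw: string | undefined): Set<string> | null {
  if (!raw || raw.trim() === "") return null; // unset => no narrowing at this layer
  return new Set(
    raw
      .split(",")
      .map((id) => id.trim().toLowerCase())
      .filter((id) => id.length > 0),
  );
}

export function filterModels(
  discovered: string[], // e.g. ["openai/gpt-5.2", "zai/glm-4.7", ...]
  allowlist: Set<string> | null,
): string[] {
  if (!allowlist) return discovered;
  return discovered.filter((id) => allowlist.has(id.toLowerCase()));
}

// Example:
//   filterModels(discovered, parseModelAllowlist(process.env.CLAWDBOT_LIVE_GATEWAY_MODELS));
```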
diff --git a/src/gateway/gateway-models.profiles.live.test.ts b/src/gateway/gateway-models.profiles.live.test.ts index 49c187260..c08f3a861 100644 --- a/src/gateway/gateway-models.profiles.live.test.ts +++ b/src/gateway/gateway-models.profiles.live.test.ts @@ -387,8 +387,9 @@ describeLive("gateway live (dev agent, profile keys)", () => { sessionKey, idempotencyKey: `idem-${runIdTool}-tool`, message: - `Call the tool named \`read\` (or \`Read\` if \`read\` is unavailable) with JSON arguments {"path":"${toolProbePath}"}. ` + - `Then reply with exactly: ${nonceA} ${nonceB}. No extra text.`, + "Clawdbot live tool probe (local, safe): " + + `use the tool named \`read\` (or \`Read\`) with JSON arguments {"path":"${toolProbePath}"}. ` + + "Then reply with the two nonce values you read (include both).", deliver: false, }, { expectFinal: true }, @@ -404,7 +405,7 @@ describeLive("gateway live (dev agent, profile keys)", () => { } if (EXTRA_TOOL_PROBES) { - const nonceC = `nonceC=${randomUUID()}`; + const nonceC = randomUUID(); const toolWritePath = path.join( tempDir, `write-${runIdTool}.txt`, @@ -416,10 +417,11 @@ describeLive("gateway live (dev agent, profile keys)", () => { sessionKey, idempotencyKey: `idem-${runIdTool}-bash-read`, message: - `Call the tool named \`bash\` (or \`Bash\` if \`bash\` is unavailable) and run: ` + - `mkdir -p "${tempDir}" && printf '%s' '${nonceC}' > "${toolWritePath}" ` + - `Then call the tool named \`read\` (or \`Read\` if \`read\` is unavailable) with JSON arguments: {"path":"${toolWritePath}"} ` + - `Finally reply with exactly: ${nonceC}.`, + "Clawdbot live tool probe (local, safe): " + + "use the tool named `bash` (or `Bash`) to run this command: " + + `mkdir -p "${tempDir}" && printf '%s' '${nonceC}' > "${toolWritePath}". ` + + `Then use the tool named \`read\` (or \`Read\`) with JSON arguments {"path":"${toolWritePath}"}. ` + + "Finally reply including the nonce text you read back.", deliver: false, }, { expectFinal: true },
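The prompt changes above drop the "reply with exactly: ..." wording, so the checks have to tolerate provider formatting (code fences, extra prose, structured content parts) while still requiring the nonce itself. A minimal sketch of that kind of formatting-tolerant check, with illustrative helper names rather than the test's real assertions:

```ts
// Minimal sketch of the formatting-tolerant nonce check the relaxed prompts imply:
// accept any reply that contains the nonce, instead of requiring an exact string.
export function extractText(reply: unknown): string {
  if (typeof reply === "string") return reply;
  if (Array.isArray(reply)) return reply.map(extractText).join("\n");
  if (reply && typeof reply === "object" && "text" in reply) {
    return String((reply as { text: unknown }).text);
  }
  return String(reply ?? "");
}

export function assertContainsNonce(reply: unknown, nonce: string): void {
  const text = extractText(reply);
  // Case-insensitive substring match: the nonce must appear, the wrapping may vary.
  if (!text.toLowerCase().includes(nonce.toLowerCase())) {
    throw new Error(
      `expected reply to include nonce ${nonce}, got: ${text.slice(0, 200)}`,
    );
  }
}
```

This keeps the probe strict where it matters (the nonce proves the `bash` write and `read` round-trip actually ran) while no longer failing on cosmetic differences between providers.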