docs: expand testing guide
docs/testing.md

# Testing
Clawdbot has three Vitest suites (unit/integration, e2e, live) and a small set of Docker runners for “run with my real keys” smoke checks.

This doc is a “how we test” guide:

- What each suite covers (and what it deliberately does *not* cover)
- Which commands to run for common workflows (local, pre-push, debugging)
- How live tests discover credentials and select models/providers
- How to add regressions for real-world model/provider issues
## Quick start
Most days:

- Full gate (expected before push): `pnpm lint && pnpm build && pnpm test`

When you touch tests or want extra confidence:

- Coverage gate: `pnpm test:coverage`
- E2E suite: `pnpm test:e2e`

When debugging real providers/models (requires real creds; skipped by default):

- Live suite (models only): `CLAWDBOT_LIVE_TEST=1 pnpm test:live`
- Live suite (models + providers): `LIVE=1 pnpm test:live`

Tip: when you only need one failing case, prefer narrowing live tests via the allowlist env vars described below.
## Test suites (what runs where)
Think of the suites as “increasing realism” (and increasing flakiness/cost):
### Unit / integration (default)
- Command: `pnpm test`
- Config: `vitest.config.ts`
- Files: `src/**/*.test.ts`
- Scope:
  - Pure unit tests
  - In-process integration tests (gateway auth, routing, tooling, parsing, config)
  - Deterministic regressions for known bugs
- Expectations:
  - Runs in CI
  - No real keys required
  - Should be fast and stable
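The suite split is plain Vitest config: each config scopes one file glob. A sketch of the idea (the repo's actual configs may set more options):

```ts
// Sketch: vitest.config.ts for the default suite. The e2e/live configs do the
// same with their own globs.
import { defineConfig } from "vitest/config";

export default defineConfig({
  test: {
    include: ["src/**/*.test.ts"],
    // Keep the heavier suites out of the default run; they have their own configs.
    exclude: ["src/**/*.e2e.test.ts", "src/**/*.live.test.ts", "**/node_modules/**"],
  },
});
```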
### E2E (gateway smoke)
- Command: `pnpm test:e2e`
- Config: `vitest.e2e.config.ts`
- Files: `src/**/*.e2e.test.ts`
- Scope:
  - Multi-instance gateway end-to-end behavior
  - WebSocket/HTTP surfaces, node pairing, and heavier networking
- Expectations:
  - Runs in CI (when enabled in the pipeline)
  - No real keys required
  - More moving parts than unit tests (can be slower)
### Live (real providers + real models)
- Command: `pnpm test:live`
- Config: `vitest.live.config.ts`
- Files: `src/**/*.live.test.ts`
- Default: **skipped** unless `CLAWDBOT_LIVE_TEST=1` or `LIVE=1`
- Scope:
  - “Does this provider/model actually work *today* with real creds?”
  - Catch provider format changes, tool-calling quirks, auth issues, and rate limit behavior
- Expectations:
  - Not CI-stable by design (real networks, real provider policies, quotas, outages)
  - Costs money / uses rate limits
  - Prefer running narrowed subsets instead of “everything”
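How the gating works, as a minimal sketch (assumes Vitest's `describe.skipIf`; the suite name and helper below are illustrative, not Clawdbot's actual test code):

```ts
// Sketch: env-gated live suite. The real live tests gate on the same env vars,
// but the structure here is illustrative.
import { describe, expect, it } from "vitest";

const liveEnabled =
  process.env.CLAWDBOT_LIVE_TEST === "1" || process.env.LIVE === "1";

// skipIf keeps the suite listed in reports while skipping it by default.
describe.skipIf(!liveEnabled)("live: provider smoke", () => {
  it("answers a trivial prompt", async () => {
    // ...real provider call goes here (costs money, uses rate limits)...
    expect(liveEnabled).toBe(true);
  });
});
```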
## Which suite should I run?

Use this decision list:

- Editing logic/tests: run `pnpm test` (and `pnpm test:coverage` if you changed a lot)
- Touching gateway networking / WS protocol / pairing: add `pnpm test:e2e`
- Debugging “my bot is down” / provider-specific failures / tool calling: run a narrowed `pnpm test:live`
## Live: model smoke (profile keys)

Live tests are split into two layers so we can isolate failures:

- “Direct model” tells us the provider/model can answer at all with the given key.
- “Gateway smoke” tells us the full gateway+agent pipeline works for that model (sessions, history, tools, sandbox policy, etc.).
### Layer 1: Direct model completion (no gateway)
- Test: `src/agents/models.profiles.live.test.ts`
- Goal:
  - Enumerate discovered models
  - Use `getApiKeyForModel` to select models you have creds for
  - Run a small completion per model (and targeted regressions where needed)
- How to enable:
  - `CLAWDBOT_LIVE_TEST=1` or `LIVE=1`
  - `CLAWDBOT_LIVE_ALL_MODELS=1` (required for this test to run)
- How to select models:
  - `CLAWDBOT_LIVE_MODELS=all` to run everything with keys
  - or `CLAWDBOT_LIVE_MODELS="openai/gpt-5.2,anthropic/claude-opus-4-5,..."` (comma allowlist)
- Where keys come from:
  - By default: profile store and env fallbacks
  - Set `CLAWDBOT_LIVE_REQUIRE_PROFILE_KEYS=1` to require keys from the **profile store** only
- Why this exists:
  - Separates “provider API is broken / key is invalid” from “gateway agent pipeline is broken”
  - Contains small, isolated regressions (example: OpenAI Responses/Codex Responses reasoning replay + tool-call flows)
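A sketch of how the comma allowlist behaves (the helper below is hypothetical; the real selection logic lives in the test file itself):

```ts
// Sketch: turning CLAWDBOT_LIVE_MODELS into a model filter.
type ModelRef = `${string}/${string}`; // e.g. "openai/gpt-5.2"

function modelFilter(env: string | undefined): (id: ModelRef) => boolean {
  if (env === "all") return () => true; // run everything you have keys for
  // Assumption: unset/empty selects nothing; CLAWDBOT_LIVE_ALL_MODELS=1 must
  // also be set before any of this matters.
  const allow = new Set(
    (env ?? "").split(",").map((s) => s.trim()).filter(Boolean),
  );
  return (id) => allow.has(id);
}

const keep = modelFilter(process.env.CLAWDBOT_LIVE_MODELS);
// Only models that pass the allowlist *and* have creds (getApiKeyForModel) run.
```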
### Layer 2: Gateway + dev agent smoke (what “@clawdbot” actually does)

- Test: `src/gateway/gateway-models.profiles.live.test.ts`
- Goal:
  - Spin up an in-process gateway
  - Create/patch an `agent:dev:*` session (model override per run)
  - Iterate models-with-keys and assert:
    - a “meaningful” response (no tools)
    - a real tool invocation works (read probe)
    - optional extra tool probes (bash + read probe)
    - OpenAI regression paths (tool-call-only → follow-up) keep working
- Coverage: providers present in your `models.json`/config (e.g. OpenAI, Anthropic, Google Gemini, `google-antigravity`, etc.), as long as a key/profile is available
- How to enable:
  - `CLAWDBOT_LIVE_TEST=1` or `LIVE=1`
  - `CLAWDBOT_LIVE_GATEWAY=1` (required for this test to run)
- How to select models:
  - `CLAWDBOT_LIVE_GATEWAY_ALL_MODELS=1` to scan all discovered models with keys
  - or set `CLAWDBOT_LIVE_GATEWAY_MODELS="provider/model,provider/model,..."` to narrow quickly
- Optional tool-calling stress:
  - `CLAWDBOT_LIVE_GATEWAY_TOOL_PROBE=1` enables an extra “bash writes file → read reads it back → echo nonce” check.
  - This is specifically meant to catch tool-calling compatibility issues across providers (formatting, history replay, tool_result pairing, etc.).
- Extra regression: for OpenAI Responses/Codex Responses models, force a tool-call-only turn followed by a user question (the exact failure mode that produced `400 … reasoning … required following item`).
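For intuition, the tool probe amounts to something like this (session API and prompt wording are invented for illustration; the real assertions live in the test file):

```ts
// Sketch: bash+read tool probe with a nonce so canned replies can't pass.
import { randomUUID } from "node:crypto";

interface DevSession {
  ask(prompt: string): Promise<string>; // hypothetical gateway session handle
}

async function toolProbe(session: DevSession): Promise<void> {
  const nonce = randomUUID();
  const file = `/tmp/probe-${nonce}.txt`;
  // Forces a real tool round-trip: bash writes, read reads, model echoes.
  const reply = await session.ask(
    `Use bash to write "${nonce}" to ${file}, then read the file back and repeat its exact contents.`,
  );
  if (!reply.includes(nonce)) {
    throw new Error("model never echoed the nonce; tool calling likely broken");
  }
}
```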
### Recommended live recipes

Narrow, explicit allowlists are fastest and least flaky:

- Single model, direct (no gateway):
  - `CLAWDBOT_LIVE_TEST=1 CLAWDBOT_LIVE_ALL_MODELS=1 CLAWDBOT_LIVE_MODELS="openai/gpt-5.2" pnpm test:live src/agents/models.profiles.live.test.ts`
- Single model, gateway smoke:
  - `LIVE=1 CLAWDBOT_LIVE_GATEWAY=1 CLAWDBOT_LIVE_GATEWAY_ALL_MODELS=1 CLAWDBOT_LIVE_GATEWAY_MODELS="openai/gpt-5.2" pnpm test:live src/gateway/gateway-models.profiles.live.test.ts`
- Tool calling across several providers (bash + read probe):
  - `LIVE=1 CLAWDBOT_LIVE_GATEWAY=1 CLAWDBOT_LIVE_GATEWAY_ALL_MODELS=1 CLAWDBOT_LIVE_GATEWAY_TOOL_PROBE=1 CLAWDBOT_LIVE_GATEWAY_MODELS="openai/gpt-5.2,anthropic/claude-opus-4-5,google/gemini-flash-latest,zai/glm-4.7,minimax/minimax-m2.1" pnpm test:live src/gateway/gateway-models.profiles.live.test.ts`
## Credentials (never commit)

Live tests discover credentials the same way the CLI does. Practical implications:

- If the CLI works, live tests should find the same keys.
- If a live test says “no creds”, debug the same way you’d debug `clawdbot models list` / model selection.

Credential sources:

- Profile store: `~/.clawdbot/credentials/` (preferred; what “profile keys” means in the tests)
- Config: `~/.clawdbot/clawdbot.json` (or `CLAWDBOT_CONFIG_PATH`)
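A loose sketch of that lookup order (file layout from above; field names and the resolver itself are assumptions, not the CLI's actual code):

```ts
// Sketch: profile store first, then config; env fallbacks omitted.
import { existsSync, readFileSync } from "node:fs";
import { homedir } from "node:os";
import { join } from "node:path";

function resolveKey(provider: string): string | undefined {
  const profile = join(homedir(), ".clawdbot", "credentials", `${provider}.json`);
  if (existsSync(profile)) {
    return JSON.parse(readFileSync(profile, "utf8")).apiKey; // field name assumed
  }
  const configPath =
    process.env.CLAWDBOT_CONFIG_PATH ?? join(homedir(), ".clawdbot", "clawdbot.json");
  if (existsSync(configPath)) {
    return JSON.parse(readFileSync(configPath, "utf8")).providers?.[provider]?.apiKey;
  }
  return undefined; // env fallbacks would go here
}
```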
Run docs checks after doc edits: `pnpm docs:list`.
## Offline regression (CI-safe)
These are “real pipeline” regressions without real providers:

- Gateway tool calling (mock OpenAI, real gateway + agent loop): `src/gateway/gateway.tool-calling.mock-openai.test.ts`
- Gateway wizard (WS `wizard.start`/`wizard.next`, writes config + auth enforced): `src/gateway/gateway.wizard.e2e.test.ts`
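The mock-provider idea in miniature (a hedged sketch, not the actual mock, which covers more of the API surface):

```ts
// Sketch: canned OpenAI-style endpoint that forces a tool-call turn, so the
// real gateway + agent loop must execute a tool and replay history correctly.
import { createServer } from "node:http";

const server = createServer((req, res) => {
  if (req.url?.endsWith("/chat/completions")) {
    res.setHeader("content-type", "application/json");
    res.end(
      JSON.stringify({
        id: "cmpl-mock",
        choices: [
          {
            index: 0,
            finish_reason: "tool_calls",
            message: {
              role: "assistant",
              tool_calls: [
                {
                  id: "call_1",
                  type: "function",
                  function: { name: "read", arguments: JSON.stringify({ path: "README.md" }) },
                },
              ],
            },
          },
        ],
      }),
    );
    return;
  }
  res.statusCode = 404;
  res.end();
});

server.listen(0); // point the gateway's OpenAI base URL at this port
```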
## Adding regressions (guidance)

When you fix a provider/model issue discovered in live:

- Add a CI-safe regression if possible (mock/stub provider, or capture the exact request-shape transformation)
- If it’s inherently live-only (rate limits, auth policies), keep the live test narrow and opt-in via env vars
- Prefer targeting the smallest layer that catches the bug:
  - provider request conversion/replay bug → direct models test
  - gateway session/history/tool pipeline bug → gateway live smoke or CI-safe gateway mock test