Docs: frame skills eval gap in testing

George Pickett
2026-01-19 12:23:13 -08:00
committed by Peter Steinberger
parent eb5145c5d1
commit e0e33e12d1


@@ -323,6 +323,22 @@ These are “real pipeline” regressions without real providers:
- Gateway tool calling (mock OpenAI, real gateway + agent loop): `src/gateway/gateway.tool-calling.mock-openai.test.ts`
- Gateway wizard (WS `wizard.start`/`wizard.next`, writes config + auth enforced): `src/gateway/gateway.wizard.e2e.test.ts`

## Agent reliability evals (skills)

We already have a few CI-safe tests that behave like “agent reliability evals” (see the sketch after this list):
- Mock tool-calling through the real gateway + agent loop (`src/gateway/gateway.tool-calling.mock-openai.test.ts`).
- End-to-end wizard flows that validate session wiring and config effects (`src/gateway/gateway.wizard.e2e.test.ts`).
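
A minimal sketch of the pattern, assuming a Vitest-style runner; `startMockOpenAI` and `runAgentTurn` are hypothetical stand-ins for the helpers the real tests use, not actual repo APIs:

```ts
import { describe, it, expect } from "vitest";

// Hypothetical helpers: startMockOpenAI serves scripted completions;
// runAgentTurn drives one turn through the real gateway + agent loop.
import { startMockOpenAI, runAgentTurn } from "./test-helpers";

describe("agent reliability (mock provider)", () => {
  it("routes a scripted tool call through the real agent loop", async () => {
    // The mock provider replies with a fixed tool call, so the test is
    // deterministic: no network, no real model.
    const mock = await startMockOpenAI({
      reply: { toolCall: { name: "read_file", arguments: { path: "README.md" } } },
    });

    const result = await runAgentTurn({ provider: mock.url, prompt: "Summarize the README" });

    // Assert the loop executed exactly the tool the mock asked for.
    expect(result.toolCalls.map((c) => c.name)).toEqual(["read_file"]);
    await mock.close();
  });
});
```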
What's still missing for skills (see [Skills](/tools/skills)), sketched in code after this list:

- **Decisioning:** when skills are listed in the prompt, does the agent pick the right skill (or avoid irrelevant ones)?
- **Compliance:** does the agent read `SKILL.md` before use and follow required steps/args?
- **Workflow contracts:** multi-turn scenarios that assert tool order, session history carryover, and sandbox boundaries.
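
One way these could become deterministic assertions; `runSkillScenario` and the event shapes are illustrative, not existing helpers:

```ts
import { it, expect } from "vitest";
import { runSkillScenario } from "./eval-helpers"; // hypothetical scenario helper

it("picks the relevant skill and reads SKILL.md before using it", async () => {
  // Two skills are listed in the prompt; only one is relevant.
  const run = await runSkillScenario({
    skills: ["pdf-extract", "image-resize"],
    prompt: "Pull the tables out of report.pdf",
  });

  // Decisioning: the agent chose the relevant skill and ignored the other.
  expect(run.skillsUsed).toEqual(["pdf-extract"]);

  // Compliance: SKILL.md was read before the skill's first tool call.
  const firstRead = run.events.findIndex(
    (e) => e.type === "file_read" && e.path.endsWith("pdf-extract/SKILL.md"),
  );
  const firstTool = run.events.findIndex((e) => e.type === "tool_call");
  expect(firstRead).toBeGreaterThanOrEqual(0);
  expect(firstRead).toBeLessThan(firstTool);
});
```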
Future evals should stay deterministic-first (a runner contract is sketched after this list):

- A scenario runner using mock providers to assert tool calls + order, skill file reads, and session wiring.
- A small suite of skill-focused scenarios (use vs avoid, gating, prompt injection).
- Optional live evals (opt-in, env-gated) only after the CI-safe suite is in place.
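
A possible shape for that runner's contract; all names and types here are illustrative, not a committed API:

```ts
// Illustrative scenario contract for a deterministic, CI-safe runner.
// None of these types exist yet; they sketch what the suite would assert.
interface ScriptedReply {
  toolCall?: { name: string; arguments: Record<string, unknown> };
  text?: string;
}

interface SkillScenario {
  name: string;
  skills: string[];        // skills listed in the prompt
  prompt: string;
  script: ScriptedReply[]; // mock-provider replies, turn by turn
  expect: {
    toolOrder: string[];   // exact tool-call order
    skillReads: string[];  // SKILL.md paths that must be read before use
    forbidden?: string[];  // tools/skills that must never be touched
  };
}

// The runner would drive the real gateway + agent loop against the mock
// provider and diff observed events against the scenario's expectations.
declare function runScenario(s: SkillScenario): Promise<{ pass: boolean; diff: string }>;
```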
## Adding regressions (guidance)
When you fix a provider/model issue discovered in live runs: