docs: add transcript hygiene reference
@@ -12,6 +12,7 @@ This document explains how Clawdbot manages sessions end-to-end:
- **Session routing** (how inbound messages map to a `sessionKey`)
- **Session store** (`sessions.json`) and what it tracks
- **Transcript persistence** (`*.jsonl`) and its structure
- **Transcript hygiene** (provider-specific fixups before runs)
- **Context limits** (context window vs tracked tokens)
- **Compaction** (manual + auto-compaction) and where to hook pre-compaction work
- **Silent housekeeping** (e.g. memory writes that shouldn’t produce user-visible output)
@@ -20,6 +21,7 @@ If you want a higher-level overview first, start with:
- [/concepts/session](/concepts/session)
- [/concepts/compaction](/concepts/compaction)
- [/concepts/session-pruning](/concepts/session-pruning)
- [/reference/transcript-hygiene](/reference/transcript-hygiene)

---

94
docs/reference/transcript-hygiene.md
Normal file
@@ -0,0 +1,94 @@
---
summary: "Reference: provider-specific transcript sanitization and repair rules"
read_when:
- You are debugging provider request rejections tied to transcript shape
- You are changing transcript sanitization or tool-call repair logic
- You are investigating tool-call id mismatches across providers
---
# Transcript Hygiene (Provider Fixups)

This document describes **provider-specific fixes** applied to transcripts before a run
(building model context). These are **in-memory** adjustments used to satisfy strict
provider requirements. They do **not** rewrite the stored JSONL transcript on disk.

Scope includes:
- Tool call id sanitization
- Tool result pairing repair
- Turn validation / ordering
- Thought signature cleanup
- Image payload sanitization

If you need transcript storage details, see:
- [/reference/session-management-compaction](/reference/session-management-compaction)

---

## Where this runs

All transcript hygiene is centralized in the embedded runner:
- Policy selection: `src/agents/transcript-policy.ts`
- Sanitization/repair application: `sanitizeSessionHistory` in `src/agents/pi-embedded-runner/google.ts`

The policy uses `provider`, `modelApi`, and `modelId` to decide what to apply.
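
As a rough illustration of how a policy could be derived from those three inputs, here is a minimal sketch. The type names, the function name `selectPolicySketch`, and the string matching (for example the `"google"` prefix check) are all assumptions made for this example; they are not taken from `src/agents/transcript-policy.ts`, and the sketch covers only part of the provider matrix.

```typescript
// Hypothetical shapes for illustration only; the real types live in
// src/agents/transcript-policy.ts and may differ.
type ToolCallIdMode = "none" | "strict" | "strict9";

interface TranscriptPolicy {
  sanitizeImages: boolean; // global rule: always on
  toolCallIdMode: ToolCallIdMode;
  repairToolResultPairing: boolean;
  validateTurns: boolean;
}

interface PolicyInput {
  provider: string;
  modelApi: string;
  modelId: string;
}

function selectPolicySketch(input: PolicyInput): TranscriptPolicy {
  const provider = input.provider.toLowerCase();
  const modelApi = input.modelApi.toLowerCase();
  const modelId = input.modelId.toLowerCase();

  // Default for OpenAI and "everything else": image sanitization only.
  const imagesOnly: TranscriptPolicy = {
    sanitizeImages: true,
    toolCallIdMode: "none",
    repairToolResultPairing: false,
    validateTurns: false,
  };

  // Mistral: matched by provider or by model id (assumed substring match).
  if (provider === "mistral" || modelId.indexOf("mistral") !== -1) {
    return { ...imagesOnly, toolCallIdMode: "strict9" };
  }
  // Google family (assumed "google" prefix on the provider or API name).
  if (provider.indexOf("google") === 0 || modelApi.indexOf("google") === 0) {
    return {
      ...imagesOnly,
      toolCallIdMode: "strict",
      repairToolResultPairing: true,
      validateTurns: true,
    };
  }
  // Anthropic and Anthropic-compatible providers.
  if (provider === "anthropic" || provider === "minimax") {
    return { ...imagesOnly, repairToolResultPairing: true, validateTurns: true };
  }
  return imagesOnly;
}
```

Returning a plain policy object keeps the decision separate from the code that applies it, so the sanitizer can stay a list of independent steps gated by flags.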

---

## Global rule: image sanitization

Image payloads are always sanitized to prevent provider-side rejection due to size
limits (downscale/recompress oversized base64 images).

Implementation:
- `sanitizeSessionMessagesImages` in `src/agents/pi-embedded-helpers/images.ts`
- `sanitizeContentBlocksImages` in `src/agents/tool-images.ts`
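
The downscale/recompress step needs an image codec and is not reproduced here. The sketch below only shows the kind of size gate that could decide whether a base64 payload needs that treatment. The function names and the byte limit passed in are hypothetical; the actual thresholds used by `sanitizeSessionMessagesImages` are not stated in this document.

```typescript
// Estimate the decoded size of a base64 string without decoding it.
// Every 4 base64 characters encode 3 bytes; each trailing "=" removes one byte.
function base64DecodedBytes(b64: string): number {
  const clean = b64.replace(/\s+/g, "");
  if (clean.length === 0) return 0;
  let padding = 0;
  if (clean.charAt(clean.length - 1) === "=") padding++;
  if (clean.length > 1 && clean.charAt(clean.length - 2) === "=") padding++;
  return Math.floor((clean.length * 3) / 4) - padding;
}

// Hypothetical gate: maxBytes is a caller-supplied limit, not a documented value.
function isOversizedImage(b64: string, maxBytes: number): boolean {
  return base64DecodedBytes(b64) > maxBytes;
}
```

Checking the encoded length first avoids decoding large payloads just to find out that they are within limits.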

---

## Provider matrix (current behavior)

**OpenAI / OpenAI Codex**
- Image sanitization only.
- No tool call id sanitization.
- No tool result pairing repair.
- No turn validation or reordering.
- No synthetic tool results.
- No thought signature stripping.

**Google (Generative AI / Gemini CLI / Antigravity)**
- Tool call id sanitization: strict alphanumeric.
- Tool result pairing repair and synthetic tool results.
- Turn validation (Gemini-style turn alternation).
- Google turn ordering fixup (prepend a tiny user bootstrap if history starts with assistant).
- Antigravity Claude: normalize thinking signatures; drop unsigned thinking blocks.
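
To make "tool result pairing repair and synthetic tool results" concrete, here is a minimal sketch over a simplified message model. The `Msg` type, the placeholder text, and the choice to drop orphaned results are assumptions for illustration; the real transcript shape and repair rules in the runner may differ.

```typescript
// Simplified, hypothetical message model (not the real transcript types).
type Msg =
  | { role: "user"; text: string }
  | { role: "assistant"; text: string; toolCalls: { id: string; name: string }[] }
  | { role: "toolResult"; toolCallId: string; text: string; synthetic?: boolean };

function repairToolResultPairing(history: Msg[]): Msg[] {
  const out: Msg[] = [];
  let pending: string[] = []; // call ids from the latest assistant turn still awaiting a result

  // Emit a synthetic result for every call that never received a real one.
  const flush = (): void => {
    for (const id of pending) {
      out.push({ role: "toolResult", toolCallId: id, text: "[missing tool result]", synthetic: true });
    }
    pending = [];
  };

  for (const msg of history) {
    if (msg.role === "toolResult") {
      const idx = pending.indexOf(msg.toolCallId);
      if (idx === -1) continue; // orphaned result with no matching call: dropped (assumption)
      pending.splice(idx, 1);
      out.push(msg);
      continue;
    }
    flush(); // a new user/assistant turn closes the previous call window
    out.push(msg);
    if (msg.role === "assistant") pending = msg.toolCalls.map((c) => c.id);
  }
  flush();
  return out;
}
```

The function returns a new array and never mutates its input, which matches the in-memory-only nature of these fixups.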

**Anthropic / Minimax (Anthropic-compatible)**
- Tool result pairing repair and synthetic tool results.
- Turn validation (merge consecutive user turns to satisfy strict alternation).
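
Merging consecutive user turns can be pictured with the sketch below. The `Turn` shape and the function name are hypothetical; the example only illustrates the alternation idea and does not reflect the runner's actual message types.

```typescript
// Hypothetical, simplified turn shape: content is a list of text blocks.
interface Turn {
  role: "user" | "assistant";
  content: string[];
}

// Collapse runs of adjacent user turns into one turn so that roles alternate.
function mergeConsecutiveUserTurns(turns: Turn[]): Turn[] {
  const out: Turn[] = [];
  for (const turn of turns) {
    const last = out[out.length - 1];
    if (last !== undefined && last.role === "user" && turn.role === "user") {
      // Append the blocks to the previous user turn instead of starting a new one.
      last.content.push(...turn.content);
    } else {
      // Copy so the stored transcript objects are never mutated.
      out.push({ role: turn.role, content: [...turn.content] });
    }
  }
  return out;
}
```

For example, a history of `user, user, assistant, user` becomes `user, assistant, user`, with the first user turn carrying both original blocks.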

**Mistral (including model-id based detection)**
- Tool call id sanitization: strict9 (alphanumeric length 9).
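
The two id modes named in this matrix (strict alphanumeric and strict9) could look roughly like the sketch below. The function name, the `"toolcall"` fallback for empty ids, and the zero padding are assumptions for illustration. A real implementation may derive short ids differently, for example by hashing, to avoid collisions after truncation.

```typescript
type IdMode = "strict" | "strict9";

function sanitizeToolCallId(id: string, mode: IdMode): string {
  // Both modes keep only ASCII letters and digits.
  const alnum = id.replace(/[^a-zA-Z0-9]/g, "");
  if (mode === "strict") {
    // Hypothetical fallback so the id is never empty.
    return alnum.length > 0 ? alnum : "toolcall";
  }
  // strict9: truncate or zero-pad to exactly 9 characters.
  return (alnum + "000000000").slice(0, 9);
}
```

Whatever mapping is used, it has to be applied identically to a tool call and to its matching tool result, otherwise sanitization itself would break the pairing.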

**OpenRouter Gemini**
- Thought signature cleanup: strip non-base64 `thought_signature` values (keep base64).
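
A minimal sketch of that cleanup, assuming a simplified `Part` shape and assuming "base64" means the standard padded alphabet (`A-Z`, `a-z`, `0-9`, `+`, `/`, with up to two trailing `=`). Both the shape and the exact validity test are illustrative assumptions; the real check may be more lenient.

```typescript
// Hypothetical content-part shape; only thought_signature matters here.
interface Part {
  text?: string;
  thought_signature?: string;
}

const BASE64_RE = /^[A-Za-z0-9+/]+={0,2}$/;

function looksLikeBase64(value: string): boolean {
  return value.length > 0 && value.length % 4 === 0 && BASE64_RE.test(value);
}

function stripNonBase64ThoughtSignatures(parts: Part[]): Part[] {
  return parts.map((part) => {
    if (part.thought_signature === undefined || looksLikeBase64(part.thought_signature)) {
      return part; // nothing to strip
    }
    // Copy, then remove only the offending field; other fields are kept.
    const copy: Part = { ...part };
    delete copy.thought_signature;
    return copy;
  });
}
```

A signature such as `msg_123` contains `_`, which is outside the standard alphabet, so it would be removed, while a value like `AAAA` would be kept.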

**Everything else**
- Image sanitization only.

---

## Historical behavior (pre-2026.1.22)

Before the 2026.1.22 release, Clawdbot applied multiple layers of transcript hygiene:

- A **transcript-sanitize extension** ran on every context build and could:
  - Repair tool use/result pairing.
  - Sanitize tool call ids (including a non-strict mode that preserved `_`/`-`).
- The runner also performed provider-specific sanitization, which duplicated work.
- Additional mutations occurred outside the provider policy, including:
  - Stripping `<final>` tags from assistant text before persistence.
  - Dropping empty assistant error turns.
  - Trimming assistant content after tool calls.

This complexity caused cross-provider regressions (notably `openai-responses`
`call_id|fc_id` pairing). The 2026.1.22 cleanup removed the extension, centralized
logic in the runner, and made OpenAI **no-touch** beyond image sanitization.