docs: add transcript hygiene reference

2026-01-23 01:34:21 +00:00
parent 17a09cc721
commit 2424404fb4
14 changed files with 120 additions and 146 deletions
--- a/docs/reference/session-management-compaction.md
+++ b/docs/reference/session-management-compaction.md
@@ -12,6 +12,7 @@ This document explains how Clawdbot manages sessions end-to-end:
 - **Session routing** (how inbound messages map to a `sessionKey`)
 - **Session store** (`sessions.json`) and what it tracks
 - **Transcript persistence** (`*.jsonl`) and its structure
+- **Transcript hygiene** (provider-specific fixups before runs)
 - **Context limits** (context window vs tracked tokens)
 - **Compaction** (manual + auto-compaction) and where to hook pre-compaction work
 - **Silent housekeeping** (e.g. memory writes that shouldn’t produce user-visible output)
@@ -20,6 +21,7 @@ If you want a higher-level overview first, start with:
 - [/concepts/session](/concepts/session)
 - [/concepts/compaction](/concepts/compaction)
 - [/concepts/session-pruning](/concepts/session-pruning)
+- [/reference/transcript-hygiene](/reference/transcript-hygiene)

 ---

--- a/docs/reference/transcript-hygiene.md
+++ b/docs/reference/transcript-hygiene.md
@@ -0,0 +1,94 @@
+---
+summary: "Reference: provider-specific transcript sanitization and repair rules"
+read_when:
+  - You are debugging provider request rejections tied to transcript shape
+  - You are changing transcript sanitization or tool-call repair logic
+  - You are investigating tool-call id mismatches across providers
+---
+# Transcript Hygiene (Provider Fixups)
+
+This document describes **provider-specific fixes** applied to transcripts before a run,
+while the model context is being built. These are **in-memory** adjustments that satisfy
+strict provider requirements. They do **not** rewrite the stored JSONL transcript on disk.
+
+Scope includes:
+- Tool call id sanitization
+- Tool result pairing repair
+- Turn validation / ordering
+- Thought signature cleanup
+- Image payload sanitization
+
+If you need transcript storage details, see:
+- [/reference/session-management-compaction](/reference/session-management-compaction)
+
+---
+
+## Where this runs
+
+All transcript hygiene is centralized in the embedded runner:
+- Policy selection: `src/agents/transcript-policy.ts`
+- Sanitization/repair application: `sanitizeSessionHistory` in `src/agents/pi-embedded-runner/google.ts`
+
+The policy uses `provider`, `modelApi`, and `modelId` to decide what to apply.
+
+---
+
+## Global rule: image sanitization
+
+Image payloads are always sanitized: oversized base64 images are downscaled or
+recompressed so that providers do not reject the request for exceeding size limits.
+
+Implementation:
+- `sanitizeSessionMessagesImages` in `src/agents/pi-embedded-helpers/images.ts`
+- `sanitizeContentBlocksImages` in `src/agents/tool-images.ts`
+
+---
+
+## Provider matrix (current behavior)
+
+**OpenAI / OpenAI Codex**
+- Image sanitization only.
+- No tool call id sanitization.
+- No tool result pairing repair.
+- No turn validation or reordering.
+- No synthetic tool results.
+- No thought signature stripping.
+
+**Google (Generative AI / Gemini CLI / Antigravity)**
+- Tool call id sanitization: strict alphanumeric.
+- Tool result pairing repair and synthetic tool results.
+- Turn validation (Gemini-style turn alternation).
+- Google turn ordering fixup (prepend a minimal user bootstrap turn if the history starts with an assistant turn).
+- Antigravity Claude: normalize thinking signatures; drop unsigned thinking blocks.
+
+**Anthropic / Minimax (Anthropic-compatible)**
+- Tool result pairing repair and synthetic tool results.
+- Turn validation (merge consecutive user turns to satisfy strict alternation).
+
+**Mistral (including model-id based detection)**
+- Tool call id sanitization: `strict9` (alphanumeric, length 9).
+
+**OpenRouter Gemini**
+- Thought signature cleanup: strip non-base64 `thought_signature` values (keep base64).
+
+**Everything else**
+- Image sanitization only.
+
+---
+
+## Historical behavior (pre-2026.1.22)
+
+Before the 2026.1.22 release, Clawdbot applied multiple layers of transcript hygiene:
+
+- A **transcript-sanitize extension** ran on every context build and could:
+  - Repair tool use/result pairing.
+  - Sanitize tool call ids (including a non-strict mode that preserved `_`/`-`).
+- The runner also performed provider-specific sanitization, which duplicated work.
+- Additional mutations occurred outside the provider policy, including:
+  - Stripping `<final>` tags from assistant text before persistence.
+  - Dropping empty assistant error turns.
+  - Trimming assistant content after tool calls.
+
+This complexity caused cross-provider regressions (notably `openai-responses`
+`call_id|fc_id` pairing). The 2026.1.22 cleanup removed the extension, centralized
+logic in the runner, and made OpenAI **no-touch** beyond image sanitization.
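
To make two of the provider-matrix rules in the added reference concrete, here is a minimal sketch of tool call id sanitization (the "strict alphanumeric" and `strict9` modes) and of merging consecutive user turns. Everything in it is an assumption for illustration: the names `ToolCallIdMode`, `Message`, `sanitizeToolCallId`, and `mergeConsecutiveUserTurns` are hypothetical, as are the choice to zero-pad short ids and the simplified message shape. It is not the code in `src/agents/transcript-policy.ts` or `sanitizeSessionHistory`.

```typescript
// Hypothetical sketch; none of these names are taken from the Clawdbot source.

type ToolCallIdMode = "none" | "strict" | "strict9";

interface Message {
  role: "user" | "assistant";
  text: string;
}

// "strict": keep only ASCII letters and digits.
// "strict9": same, then force a length of exactly 9 characters.
// Padding short ids with "0" is an assumption made for this sketch.
function sanitizeToolCallId(id: string, mode: ToolCallIdMode): string {
  if (mode === "none") return id;
  const alnum = id.replace(/[^a-zA-Z0-9]/g, "");
  if (mode === "strict") return alnum;
  return (alnum + "000000000").slice(0, 9);
}

// Merge runs of consecutive user turns into one turn so that
// user and assistant roles alternate. Input messages are not mutated.
function mergeConsecutiveUserTurns(messages: Message[]): Message[] {
  const out: Message[] = [];
  for (const msg of messages) {
    const prev = out[out.length - 1];
    if (prev && prev.role === "user" && msg.role === "user") {
      out[out.length - 1] = { role: "user", text: prev.text + "\n\n" + msg.text };
    } else {
      out.push({ role: msg.role, text: msg.text });
    }
  }
  return out;
}
```

Both helpers return new values and leave their inputs alone, which mirrors the in-memory, non-persistent nature of the fixups described above. A real implementation would also need to apply the same id mapping to a tool call and its matching tool result, and to guard against two distinct ids collapsing to the same sanitized value; this sketch does neither.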