clawdbot/docs/refactor/agent-loop.md
2026-01-05 05:55:02 +01:00

summary: Refactor plan: unify agent lifecycle events and wait semantics
read_when: Refactoring agent lifecycle events or wait behavior

Refactor: Agent Loop

Goal: align the Clawdis run lifecycle with pi/mom semantics and remove the ambiguity between "job" and "agent_end".

Problem

  • Two lifecycles today:
    • job (gateway wrapper) => used by agent.wait + chat final
    • pi-agent agent_end (inner loop) => only logged
  • Chat can finalize early (when the job is done) while late assistant deltas are still arriving.
  • afterMs combined with wait timeouts can cause false timeouts in agent.wait.

Reference (mom)

  • Single lifecycle: agent_start/agent_end from pi-agent-core event stream.
  • waitForIdle() resolves on agent_end.
  • No separate job state exposed to clients.

Proposed refactor (breaking allowed)

  1. Replace public job stream with lifecycle stream
    • stream: "lifecycle"
    • data: { phase: "start" | "end" | "error", startedAt, endedAt, error? }
  2. agent.wait waits on lifecycle end/error only
    • remove afterMs
    • return { runId, status, startedAt, endedAt, error? }
  3. Chat final emitted on lifecycle end only
    • deltas still from assistant stream
  4. Centralize run registry
    • one map keyed by runId: sessionKey, startedAt, lastSeq, bufferedText
    • clear on lifecycle end
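
The lifecycle event shape and the centralized registry described above could be sketched roughly as follows. This is a minimal illustration, not the planned implementation: the field names come from the list above, while the class name RunRegistry, its method names, the epoch-millisecond timestamps, and the optional endedAt on "start" events are all assumptions.

```typescript
// Sketch only: field names follow the plan; method names and units are assumed.
type LifecyclePhase = "start" | "end" | "error";

interface LifecycleEvent {
  runId: string;
  stream: "lifecycle";
  data: {
    phase: LifecyclePhase;
    startedAt: number; // epoch milliseconds (assumed unit)
    endedAt?: number; // assumed to be absent on "start"
    error?: string;
  };
}

interface RunEntry {
  sessionKey: string;
  startedAt: number;
  lastSeq: number;
  bufferedText: string;
}

// Hypothetical registry: one map keyed by runId, cleared on lifecycle end.
class RunRegistry {
  private readonly runs = new Map<string, RunEntry>();

  // "start" registers the run; "end" and "error" remove it and return the
  // final entry so the caller can emit the chat final from bufferedText.
  onLifecycle(evt: LifecycleEvent, sessionKey: string): RunEntry | undefined {
    if (evt.data.phase === "start") {
      this.runs.set(evt.runId, {
        sessionKey,
        startedAt: evt.data.startedAt,
        lastSeq: 0,
        bufferedText: "",
      });
      return undefined;
    }
    const run = this.runs.get(evt.runId);
    this.runs.delete(evt.runId);
    return run;
  }

  // Buffer an assistant delta. Returns false (dropping the delta) when the
  // run is unknown or already ended, or when seq is not newer than lastSeq.
  appendDelta(runId: string, seq: number, text: string): boolean {
    const run = this.runs.get(runId);
    if (!run || seq <= run.lastSeq) return false;
    run.lastSeq = seq;
    run.bufferedText += text;
    return true;
  }
}
```

Because the entry is deleted on lifecycle end, a delta that arrives afterwards finds no run and is dropped without any extra bookkeeping.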

Implementation outline

  • src/agents/pi-embedded-subscribe.ts
    • emit lifecycle start/end events (translate pi agent_start/agent_end)
  • src/infra/agent-events.ts
    • add "lifecycle" to stream type
  • src/gateway/protocol/schema.ts
    • update AgentEvent schema; update AgentWait params (remove afterMs, add status)
  • src/gateway/server-methods/agent-job.ts
    • rename to agent-wait.ts or similar; wait on lifecycle end/error
  • src/gateway/server-chat.ts
    • finalize on lifecycle end (not job)
  • src/commands/agent.ts
    • stop emitting job externally (keep internal log if needed)
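
The translation step in pi-embedded-subscribe.ts might look something like the sketch below. Only the agent_start and agent_end names come from this plan. The PiAgentEvent shape, the optional error field on agent_end, and the createLifecycleTranslator helper are hypothetical and would need checking against the real pi-agent-core event types.

```typescript
// Assumed inner-loop event shape; only the two type names come from the plan.
interface PiAgentEvent {
  type: string;
  error?: string; // assumption: agent_end may carry an error message
}

interface LifecycleData {
  phase: "start" | "end" | "error";
  startedAt: number;
  endedAt?: number;
  error?: string;
}

// Hypothetical helper: returns a function that maps pi events to lifecycle
// data, or undefined for events that are not lifecycle boundaries.
function createLifecycleTranslator(now: () => number = Date.now) {
  let startedAt: number | undefined;

  return (evt: PiAgentEvent): LifecycleData | undefined => {
    if (evt.type === "agent_start") {
      startedAt = now();
      return { phase: "start", startedAt };
    }
    if (evt.type === "agent_end") {
      const endedAt = now();
      // Fall back to endedAt if no agent_start was observed for this run.
      const began = startedAt ?? endedAt;
      startedAt = undefined;
      return evt.error
        ? { phase: "error", startedAt: began, endedAt, error: evt.error }
        : { phase: "end", startedAt: began, endedAt };
    }
    return undefined; // deltas and tool events go to their own streams
  };
}
```

Injecting the clock keeps the translator deterministic in tests. The caller would wrap the returned data in an event with stream: "lifecycle" before publishing it.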

Migration notes (breaking)

  • Update all callers of agent.wait to new response shape.
  • Update tests that expect timeout based on job events.
  • If any UI relies on job state, map it to lifecycle events instead.

Risks

  • If lifecycle events are dropped, wait/chat could hang; add a timeout in agent.wait so it fails fast.
  • Late deltas that arrive after lifecycle end should be ignored; keep seq tracking and drop them.
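
One way to get the fail-fast behavior for the first risk is a wait that races the lifecycle end/error event against a timer. The sketch below assumes a tiny in-memory LifecycleBus standing in for the real event infrastructure. The status values "ok", "error", and "timeout" are also assumptions, since the plan does not enumerate them.

```typescript
interface LifecycleData {
  phase: "start" | "end" | "error";
  startedAt: number;
  endedAt?: number;
  error?: string;
}

// Assumed status values; the plan only says the response carries a status.
type WaitStatus = "ok" | "error" | "timeout";

interface WaitResult {
  runId: string;
  status: WaitStatus;
  startedAt?: number;
  endedAt?: number;
  error?: string;
}

type Listener = (runId: string, data: LifecycleData) => void;

// Hypothetical stand-in for the real agent event bus.
class LifecycleBus {
  private listeners: Listener[] = [];

  subscribe(fn: Listener): () => void {
    this.listeners.push(fn);
    return () => {
      this.listeners = this.listeners.filter((l) => l !== fn);
    };
  }

  emit(runId: string, data: LifecycleData): void {
    // Copy first so listeners can unsubscribe while being notified.
    this.listeners.slice().forEach((fn) => fn(runId, data));
  }
}

// Resolves on lifecycle end/error for runId, or with "timeout" if neither
// arrives within timeoutMs. It never rejects.
function waitForLifecycleEnd(
  bus: LifecycleBus,
  runId: string,
  timeoutMs: number,
): Promise<WaitResult> {
  return new Promise((resolve) => {
    const timer = setTimeout(() => {
      unsubscribe();
      resolve({ runId, status: "timeout" });
    }, timeoutMs);

    const unsubscribe = bus.subscribe((id, data) => {
      if (id !== runId || data.phase === "start") return;
      clearTimeout(timer);
      unsubscribe();
      resolve({
        runId,
        status: data.phase === "end" ? "ok" : "error",
        startedAt: data.startedAt,
        endedAt: data.endedAt,
        error: data.error,
      });
    });
  });
}
```

Reporting the timeout as a status rather than a rejection keeps the response shape uniform for callers. Whether the real agent.wait should reject instead is a design choice this sketch does not settle.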

Acceptance

  • One lifecycle visible to clients.
  • agent.wait resolves when agent loop ends, not wrapper completion.
  • Chat final never emits before last assistant delta.

Rollout (if we wanted safety)

  • Gate with config flag agent.lifecycleMode = "legacy"|"refactor".
  • Remove legacy after one release.
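
If the flag route is taken, the gate could be as small as the sketch below. The agent.lifecycleMode key and its two values come from the bullet above. The AgentConfig shape, the helper names, and the choice to default to "legacy" when the flag is missing or unrecognized are assumptions.

```typescript
type LifecycleMode = "legacy" | "refactor";

// Hypothetical config shape; only agent.lifecycleMode is named in the plan.
interface AgentConfig {
  agent?: { lifecycleMode?: string };
}

// Assumption: anything other than an explicit "refactor" falls back to
// "legacy", so existing clients keep working until they opt in.
function resolveLifecycleMode(config: AgentConfig): LifecycleMode {
  return config.agent?.lifecycleMode === "refactor" ? "refactor" : "legacy";
}

// Which stream carries run status to clients under each mode.
function publicRunStream(mode: LifecycleMode): "job" | "lifecycle" {
  return mode === "refactor" ? "lifecycle" : "job";
}
```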