Agent Execution Loop
Agent system authors — what actually happens when a Claude Code conversation runs: the query async generator, the 14-step per-turn pipeline, StreamingToolExecutor, retry / recovery / circuit-break paths, all source-grounded.
The gap this chapter fills
Previous chapters covered the static composition: how prompt is assembled, where memory lives, how compaction works, how permissions decide. But one core question went unanswered: how does a conversation actually run?
- When the user hits Enter, what does Claude Code do internally?
- How is the ReAct loop implemented?
- How are multi-turn tool calls scheduled?
- How does recovery work after a turn errors?
- What’s the machinery behind claude --resume picking up where you left off?
This chapter reconstructs the agent’s runtime lifecycle from source. Main references: query.ts (1729 lines),
QueryEngine.ts (1295 lines), Task.ts (125 lines), and the query/ directory.
Top level: query() is an async generator
Claude Code’s main loop is a single function:
// query.ts line 219
export async function* query(
  params: QueryParams,
): AsyncGenerator<
  | StreamEvent
  | RequestStartEvent
  | Message
  | TombstoneMessage
  | ToolUseSummaryMessage,
  Terminal
> {
  const consumedCommandUuids: string[] = []
  const terminal = yield* queryLoop(params, consumedCommandUuids)
  for (const uuid of consumedCommandUuids) {
    notifyCommandLifecycle(uuid, 'completed')
  }
  return terminal
}
Two key design decisions:
- Async generator instead of Promise<Response>: the UI layer consumes yielded StreamEvents in real time; the character-by-character streaming output users see flows from here.
- Terminal return type: the generator has an explicit exit reason (Terminal), not a black box. Callers can distinguish normal completion, abort, hitting the blocking limit, maxTurns exhaustion, and so on.
queryLoop is the internal implementation; query is a thin shell that wraps it with command lifecycle notifications.
If the loop throws or the generator is closed early with .return(), the “completed” notifications never fire; only a
normal return reaches them. This is a lifecycle asymmetry: started doesn’t guarantee completed.
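Both points can be seen in a minimal sketch. Everything here (demoLoop, demoQuery, drain, the reduced event and terminal shapes) is a hypothetical stand-in for illustration, not the code in query.ts. It shows how a caller reads the generator's return value, and why closing the generator early skips the code that follows yield*.

```typescript
// Hypothetical, simplified shapes; the real StreamEvent and Terminal types are richer.
type DemoEvent = { type: "text"; text: string };
type DemoTerminal = { reason: "done" | "aborted" };

// Stands in for notifyCommandLifecycle(uuid, 'completed').
const completed: string[] = [];

async function* demoLoop(chunks: string[]): AsyncGenerator<DemoEvent, DemoTerminal> {
  for (const text of chunks) {
    yield { type: "text", text }; // handed to the consumer as soon as it exists
  }
  return { reason: "done" }; // explicit exit reason
}

async function* demoQuery(
  chunks: string[],
  commandId: string,
): AsyncGenerator<DemoEvent, DemoTerminal> {
  const terminal = yield* demoLoop(chunks);
  completed.push(commandId); // only reached on a normal return
  return terminal;
}

// `for await` discards the generator's return value, so drive it by hand
// when the caller needs the terminal reason.
async function drain(
  gen: AsyncGenerator<DemoEvent, DemoTerminal>,
): Promise<{ text: string; terminal: DemoTerminal }> {
  let text = "";
  while (true) {
    const step = await gen.next();
    if (step.done) return { text, terminal: step.value };
    text += step.value.text;
  }
}
```

Calling gen.return(...) while the generator is suspended inside yield* ends it at that point, so completed.push never runs. If a notification must fire on every exit path, it belongs in a finally block.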
QueryParams: the full input to one call
export type QueryParams = {
  messages: Message[]                    // Conversation so far
  systemPrompt: SystemPrompt             // Pre-assembled system prompt
  userContext: { [k: string]: string }   // CLAUDE.md / currentDate
  systemContext: { [k: string]: string } // gitStatus / cacheBreaker
  canUseTool: CanUseToolFn               // Permission check callback
  toolUseContext: ToolUseContext         // Tool execution context (mode, allowed tools)
  fallbackModel?: string                 // Fallback model on primary failure
  querySource: QuerySource               // Call source ID (repl_main_thread / compact / ...)
  maxOutputTokensOverride?: number       // Single-turn output override
  maxTurns?: number                      // Loop upper bound
  skipCacheWrite?: boolean               // Skip cache write
  taskBudget?: { total: number }         // API-side task budget (beta)
  deps?: QueryDeps                       // Injectable deps (for testing)
}
Injectable deps is a clever design (query/deps.ts):
export type QueryDeps = {
  callModel: typeof queryModelWithStreaming // LLM call
  microcompact: typeof microcompactMessages // Tool result clearing
  autocompact: typeof autoCompactIfNeeded   // Auto compaction
  uuid: () => string                        // ID generation
}
The source comment explains why: “tests can inject fakes directly instead of spyOn-per-module — the most common mocks (callModel, autocompact) are each spied in 6-8 test files today with module-import-and-spy boilerplate”.
Takeaway for your own agent: the top-level loop function’s dependencies must be injectable, so tests pass fakes directly instead of repeating module-import-and-spy boilerplate in every test file. This is the foundation of testability.
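A minimal sketch of the pattern, assuming a trimmed-down deps object. DemoDeps, productionDeps, runTurn, and fakeDeps are hypothetical names; the real QueryDeps carries callModel, microcompact, autocompact, and uuid as listed above.

```typescript
// Hypothetical, reduced deps object for illustration only.
type DemoDeps = {
  callModel: (prompt: string) => Promise<string>;
  uuid: () => string;
};

// Production wiring would point at the real implementations.
const productionDeps: DemoDeps = {
  callModel: async () => {
    throw new Error("no network call in this sketch");
  },
  uuid: () => Math.random().toString(36).slice(2),
};

// The loop body only ever talks to `deps`, never to module-level imports.
async function runTurn(
  prompt: string,
  deps: DemoDeps = productionDeps,
): Promise<{ id: string; reply: string }> {
  const reply = await deps.callModel(prompt);
  return { id: deps.uuid(), reply };
}

// In a test: pass fakes, with no module spying and deterministic IDs.
const fakeDeps: DemoDeps = {
  callModel: async (prompt) => `echo:${prompt}`,
  uuid: () => "fixed-id",
};
```

Making uuid injectable is what lets a test assert on exact output instead of matching patterns.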
Loop state: a 14-field state machine
queryLoop is an infinite while loop with a State object carrying cross-iteration state:
type State = {
  messages: Message[]                                          // Conversation history
  toolUseContext: ToolUseContext                               // Tool context
  autoCompactTracking: AutoCompactTrackingState | undefined    // Compaction circuit breaker
  maxOutputTokensRecoveryCount: number                         // Max-tokens recovery count
  hasAttemptedReactiveCompact: boolean                         // Has reactive compaction been tried?
  maxOutputTokensOverride: number | undefined                  // Output token override
  pendingToolUseSummary: Promise<ToolUseSummaryMessage | null> // Async tool-use summary
  stopHookActive: boolean | undefined                          // Stop hook running?
  turnCount: number                                            // Current turn
  transition: Continue | undefined                             // Why did the last iteration continue
}
Each field answers one specific question, no redundancy:
- transition: why the last iteration didn’t finish, which directly drives how the next iteration is handled. The source comment: “Lets tests assert recovery paths fired without inspecting message contents”.
- maxOutputTokensRecoveryCount: an independent sub-loop counter. On a max-output-tokens error the loop retries with a larger cap, possibly several times; this count is separate from the global turnCount.
- hasAttemptedReactiveCompact: each turn may try reactive compaction only once, which avoids infinite loops.
- pendingToolUseSummary: the tool-use summary is generated asynchronously in the background; the main loop doesn’t wait for it.
Takeaway for your own agent: a state with many fields isn’t inherently bad — the key is that each field has
a specific semantic responsibility. 14 fields × clear semantics beats 4 fields × { [key: string]: any }.
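The interplay between transition and the recovery counter can be sketched as a small state update. The reason names, the DemoState shape, and the limit of 3 are assumptions made for this example; the real Continue type in query.ts may look different.

```typescript
// Hypothetical transition reasons for illustration.
type DemoContinue =
  | { reason: "tool_results" }
  | { reason: "max_output_tokens_recovery"; attempt: number };

type DemoState = {
  turnCount: number;
  maxOutputTokensRecoveryCount: number;
  transition: DemoContinue | undefined;
};

const MAX_RECOVERY_ATTEMPTS = 3; // assumed limit for this sketch

// Returns the next state, or undefined when recovery is exhausted
// and the loop should stop instead of retrying.
function onMaxOutputTokens(state: DemoState): DemoState | undefined {
  if (state.maxOutputTokensRecoveryCount >= MAX_RECOVERY_ATTEMPTS) return undefined;
  const attempt = state.maxOutputTokensRecoveryCount + 1;
  return {
    ...state, // turnCount is untouched: a recovery retry is not a new turn
    maxOutputTokensRecoveryCount: attempt,
    transition: { reason: "max_output_tokens_recovery", attempt },
  };
}
```

A test can then assert on state.transition.reason to confirm the recovery path fired, which is the property the source comment describes.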
Per-turn 14 steps: the processing pipeline
Each while-loop iteration runs up to these 14 steps (many are conditional and skip):
| # | Step | Source location | Purpose |
|---|---|---|---|
| 1 | State destructure | query.ts lines 311-321 | Pull this turn’s needed fields from State |
| 2 | Skill prefetch | line 331 startSkillDiscoveryPrefetch | Concurrent prefetch of relevant skills, runs during LLM streaming |
| 3 | Yield stream_request_start | line 337 | UI “starting” signal |
| 4 | getMessagesAfterCompactBoundary | line 365 | Only look at messages after the compact boundary; already-compacted skip |
| 5 | applyToolResultBudget | line 379 | Enforce per-message tool-result budget |
| 6 | HISTORY_SNIP (feature-flagged) | line 401 | Strategy-based history clearing |
| 7 | Microcompact | line 414 | Tool result clearing (old Read/Bash results) |
| 8 | Context Collapse (feature-flagged) | line 441 | Alternative context management system |
| 9 | Auto-compact | line 454 | LLM summarization (see Compaction) |
| 10 | Blocking limit check | line 641 | If hitting hard ceiling, yield error and return { reason: 'blocking_limit' } |
| 11 | callModel streaming call | line 659 deps.callModel | Call LLM, stream messages/events |
| 12 | Yield messages | line 708+ | Yield to UI one by one (including tombstone rollback) |
| 13 | Tool execution | StreamingToolExecutor or runTools | Parallel / serial tool execution |
| 14 | Collect toolResults, decide continue | end of loop | needsFollowUp = toolUseBlocks.length > 0 |
Key details:
2. Skill prefetch runs parallel to LLM streaming
const pendingSkillPrefetch = skillPrefetch?.startSkillDiscoveryPrefetch(...)
// ... continue processing
// ... call LLM, streaming receive
// ... skill prefetch runs in background during LLM response
The comment says: “Replaces the blocking assistant_turn path that ran inside getAttachmentMessages (97% of those calls found nothing in prod).” Originally skill discovery blocked; in production 97% of calls found nothing but still blocked the whole turn. Now concurrent, near-zero cost.
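The shape of this change is "start the promise early, await it late". A minimal sketch, with discoverSkills, callModel, and runTurn as hypothetical stand-ins and timers in place of real work:

```typescript
const events: string[] = [];
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Hypothetical stand-in for skill discovery: fast, and usually finds nothing.
async function discoverSkills(): Promise<string[]> {
  events.push("prefetch:start");
  await sleep(1);
  events.push("prefetch:done");
  return [];
}

// Hypothetical stand-in for the streaming model call: the slow part of a turn.
async function callModel(): Promise<string> {
  events.push("model:start");
  await sleep(25);
  events.push("model:done");
  return "assistant reply";
}

async function runTurn(): Promise<{ reply: string; skills: string[] }> {
  // Started but not awaited; a failure degrades to "nothing found"
  // rather than surfacing as an unhandled rejection.
  const pendingSkills = discoverSkills().catch(() => [] as string[]);
  const reply = await callModel(); // the prefetch runs while we wait here
  const skills = await pendingSkills; // normally settled already
  return { reply, skills };
}
```

The turn's wall-clock time is governed by the slower of the two operations, not by their sum.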
4. getMessagesAfterCompactBoundary — compaction boundary protection
After compaction, old messages get replaced with a summary. Here we only take messages after the most recent boundary. Comment: “REPL keeps snipped messages for UI scrollback — project so the compact model doesn’t summarize content that was intentionally removed” — UI’s scrollback keeps “snipped” messages for display, but the model can’t see them (would be re-compacted otherwise).
5. applyToolResultBudget — per-message tool result budget
Enforce per-message budget on aggregate tool result size. Runs BEFORE microcompact — cached MC operates purely by tool_use_id (never inspects content), so content replacement is invisible to it and the two compose cleanly.
Meaning: there’s a per-message total budget for tool results (different tools can have different ceilings);
exceeding replaces content with a placeholder. The ordering is critical — must be before microcompact,
because cached microcompact only inspects tool_use_id not content; the two compose seamlessly.
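To make the composition argument concrete, here is a minimal sketch of a budget pass that replaces content while leaving tool_use_id untouched. The first-come allocation policy, the character-based budget, and the placeholder text are assumptions for this example, not the behavior of the real applyToolResultBudget.

```typescript
// Hypothetical, reduced tool-result shape.
type DemoToolResult = { tool_use_id: string; content: string };

function applyBudget(results: DemoToolResult[], budgetChars: number): DemoToolResult[] {
  let used = 0;
  return results.map((result) => {
    if (used + result.content.length <= budgetChars) {
      used += result.content.length;
      return result; // fits: keep as-is
    }
    // Over budget: swap the content for a placeholder but keep the ID,
    // so any later stage keyed purely on tool_use_id sees no difference.
    return {
      tool_use_id: result.tool_use_id,
      content: `[tool result omitted: ${result.content.length} chars over budget]`,
    };
  });
}
```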
10. Blocking limit — proactive block before hard ceiling
const { isAtBlockingLimit } = calculateTokenWarningState(
  tokenCountWithEstimation(messagesForQuery) - snipTokensFreed,
  model,
)
if (isAtBlockingLimit) {
  yield createAssistantAPIErrorMessage({ content: PROMPT_TOO_LONG_ERROR_MESSAGE, ... })
  return { reason: 'blocking_limit' }
}
When auto-compaction is disabled, this check blocks proactively before the hard limit, reserving
MANUAL_COMPACT_BUFFER_TOKENS = 3000 for a manual /compact. The comment details four cases this gate must skip:
- Just compacted (compactionResult): usage counts are stale
- querySource === 'compact' / 'session_memory': forked agents would deadlock
- Reactive compact enabled: let the actual 413 trigger reactive compaction
- Context collapse enabled: collapse manages itself
Takeaway for your own agent: hard-ceiling interception shouldn’t be a global switch — must be able to precisely exempt special call paths. Otherwise those paths deadlock at blocking limit.
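The four exemptions can be written as one predicate that runs before the limit is consulted. This is a sketch under assumed names: GateInput and shouldBlock are hypothetical, and the real gate reads these conditions from its surrounding scope rather than from a single object.

```typescript
// Hypothetical input shape gathering the conditions named above.
type GateInput = {
  justCompacted: boolean; // a compactionResult exists for this iteration
  querySource: string;
  reactiveCompactEnabled: boolean;
  contextCollapseEnabled: boolean;
  isAtBlockingLimit: boolean;
};

function shouldBlock(input: GateInput): boolean {
  if (input.justCompacted) return false; // usage counts are stale
  if (input.querySource === "compact" || input.querySource === "session_memory") {
    return false; // forked agents would deadlock
  }
  if (input.reactiveCompactEnabled) return false; // let the real 413 trigger recovery
  if (input.contextCollapseEnabled) return false; // collapse manages itself
  return input.isAtBlockingLimit;
}
```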
13. Streaming Tool Executor — execute while streaming
const useStreamingToolExecution = config.gates.streamingToolExecution
let streamingToolExecutor = useStreamingToolExecution
  ? new StreamingToolExecutor(...)
  : null
Two paths:
- Traditional: LLM finishes streaming → parse tool_use → serial / parallel execute → get results
- Streaming (StreamingToolExecutor): a tool_use block starts executing as soon as the LLM streams it out, without waiting for the full assistant message
Latency drops significantly — for multiple independent tool calls, approaches parallel wall-clock time.
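The streaming path can be sketched as a class that starts each tool the moment its block arrives and collects results once the stream ends. DemoStreamingExecutor is a hypothetical simplification: it runs every tool concurrently and ignores the serial-versus-parallel distinction and cancellation that a real executor has to handle.

```typescript
type DemoToolUse = { id: string; run: () => Promise<string> };
type DemoToolOutput = { id: string; output: string };

class DemoStreamingExecutor {
  private readonly pending: { id: string; result: Promise<string> }[] = [];

  // Called as soon as a complete tool_use block arrives from the stream.
  add(block: DemoToolUse): void {
    // Start now; convert failures to a result so nothing rejects unobserved.
    const result = block.run().catch((err: unknown) => `error: ${String(err)}`);
    this.pending.push({ id: block.id, result });
  }

  // Called after the stream ends; results come back in submission order.
  async drain(): Promise<DemoToolOutput[]> {
    const outputs: DemoToolOutput[] = [];
    for (const { id, result } of this.pending) {
      outputs.push({ id, output: await result });
    }
    return outputs;
  }
}
```

By the time drain() is called, the earlier tools have already been running for as long as the rest of the stream took to arrive.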
Model fallback + streaming fallback
Line 654’s while (attemptWithFallback) is double-layer fallback logic:
- Model fallback: primary model fails (API error / throttle) → switch to fallbackModel and retry
- Streaming fallback: streaming mode errors (e.g., thinking block exception) → fall back to non-streaming
One detail in the source is intricate but important: when streaming fallback fires, the half-streamed assistant messages must be tombstoned. They may carry invalid thinking block signatures, and resubmitting them would make the API call fail.
if (streamingFallbackOccured) {
  for (const msg of assistantMessages) {
    yield { type: 'tombstone' as const, message: msg } // UI / transcript: delete this
  }
  assistantMessages.length = 0
  toolResults.length = 0
  // ... discard pending tool results, recreate executor
}
Tombstone messages are UI / transcript “deletion markers” — the messages already streamed can’t be recalled from the client, but tombstones tell downstream “this is void.”
Takeaway for your own agent: streaming output needs a retraction mechanism. When a problem surfaces halfway through a stream, the characters already sent can’t be taken back, so you need an explicit void marker.
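On the consuming side, a tombstone is just an instruction to drop a message that was already recorded. A minimal sketch, assuming messages are matched by a uuid field; the reduced shapes and buildTranscript are hypothetical.

```typescript
// Hypothetical, reduced shapes for illustration.
type DemoMessage = { uuid: string; text: string };
type TranscriptEvent =
  | { type: "message"; message: DemoMessage }
  | { type: "tombstone"; message: DemoMessage };

function buildTranscript(events: TranscriptEvent[]): DemoMessage[] {
  let transcript: DemoMessage[] = [];
  for (const event of events) {
    if (event.type === "message") {
      transcript.push(event.message);
    } else {
      // Tombstone: the message was already shown, so remove it after the fact.
      transcript = transcript.filter((m) => m.uuid !== event.message.uuid);
    }
  }
  return transcript;
}
```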
Termination: 6 Terminal reasons
From all return { reason: ... } branches in query.ts, the loop can terminate with these reasons:
| Reason | Condition | Meaning |
|---|---|---|
| blocking_limit | Hit hard ceiling | Manual compact can’t help |
| max_turns | turnCount > maxTurns | User-set turn upper bound |
| done | No tool_use this turn | Model believes task complete |
| aborted | abortController.signal.aborted | User interrupted |
| stop_hook_blocked | Stop hook returned block | User hook blocked continuation |
| error | Other exceptions | Unrecoverable error |
Different reasons trigger different follow-ups — “done” shows completion in UI, “aborted” shows “cancelled”, “blocking_limit” prompts manual compact, “max_turns” suggests raising maxTurns.
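Because the reasons form a closed set, a caller can handle them with an exhaustive switch, so that adding a new reason becomes a compile error rather than a silent fall-through. The union below copies the six reasons from the table; describeExit and the UI strings are hypothetical.

```typescript
type ExitReason =
  | "blocking_limit"
  | "max_turns"
  | "done"
  | "aborted"
  | "stop_hook_blocked"
  | "error";

function describeExit(reason: ExitReason): string {
  switch (reason) {
    case "done":
      return "Completed";
    case "aborted":
      return "Cancelled";
    case "blocking_limit":
      return "Context is full; try a manual compact";
    case "max_turns":
      return "Turn limit reached; consider raising maxTurns";
    case "stop_hook_blocked":
      return "Blocked by a stop hook";
    case "error":
      return "Stopped on an unrecoverable error";
    default: {
      // Fails to compile if a new ExitReason is added without a case above.
      const unreachable: never = reason;
      return unreachable;
    }
  }
}
```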
Task Layer: 7 task types
Top-level agent invocation is wrapped in Task.ts. 7 TaskType values:
export type TaskType =
| 'local_bash' // Local shell task
| 'local_agent' // Locally running subagent
| 'remote_agent' // CCR cloud agent
| 'in_process_teammate' // In-process teammate
| 'local_workflow' // Local workflow
| 'monitor_mcp' // MCP server monitor
| 'dream' // Nightly memory-curation dream job
5 statuses:
export type TaskStatus = 'pending' | 'running' | 'completed' | 'failed' | 'killed'
Task IDs have prefixes (TASK_ID_PREFIXES):
{
local_bash: 'b', // Kept as 'b' for backward compatibility
local_agent: 'a',
remote_agent: 'r',
in_process_teammate: 't',
local_workflow: 'w',
monitor_mcp: 'm',
dream: 'd',
}
Random 8-char suffix from 36^8 ≈ 2.8 trillion combinations — source comment: “sufficient to resist brute-force symlink attacks”.
Why IDs must resist symlink attacks: task output file paths come from ID (getTaskOutputPath(id)). If an
attacker can predict the ID, they can pre-create a symlink pointing to an arbitrary file, making the task’s
stdout write to that file. 36^8 entropy makes this impractical.
Takeaway for your own agent: any system using IDs as file paths must consider ID predictability — “ID not random enough causing race conditions or attacks” is a common production incident.
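A sketch of generating such an ID with a cryptographically secure source. The alphabet (digits plus lowercase letters) is inferred from the 36^8 figure, and makeTaskId is a hypothetical name; the real implementation may differ.

```typescript
import { randomInt } from "node:crypto";

// 36 symbols, so 8 characters give 36^8 possible suffixes.
const ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz";

function makeTaskId(prefix: string): string {
  let suffix = "";
  for (let i = 0; i < 8; i++) {
    // randomInt draws from a CSPRNG; Math.random would be predictable.
    suffix += ALPHABET.charAt(randomInt(ALPHABET.length));
  }
  return prefix + suffix;
}
```

Entropy only addresses prediction. Opening the output file with an exclusive-create flag is a complementary defense, since it refuses to write through a path that already exists.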
Resume: claude --resume and session storage
The Compaction chapter quoted this line: “Background jobs that summarize previous
conversations for the claude --resume feature”. In other words, the summary that resume relies on is pre-computed in the background.
Mechanism breakdown:
- Session storage: every conversation writes to ~/.claude/projects/<project>/sessions/<sessionId>.jsonl
- Background summary agent: after exiting Claude Code, a background job reads the session file and produces a summary
- On resume: read session + summary, reconstruct State, enter queryLoop
Resume is near-instant — summary already computed. The comment: “subscribers can use /stats to view usage
patterns” — usage data persists, visible cross-session.
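The "read session" step amounts to replaying a JSONL log. The sketch below assumes a record shape of { role, text } and chooses to stop at the first unparseable line, on the theory that a crash mid-write can leave a truncated tail. Both are assumptions for illustration; the actual session record format is not shown in this chapter.

```typescript
// Hypothetical record shape; the real session records are not documented here.
type DemoRecord = { role: "user" | "assistant"; text: string };

function parseSessionLog(jsonl: string): DemoRecord[] {
  const records: DemoRecord[] = [];
  for (const line of jsonl.split("\n")) {
    if (line.trim() === "") continue; // tolerate blank lines
    try {
      records.push(JSON.parse(line) as DemoRecord);
    } catch {
      break; // assumed policy: a malformed line marks a truncated tail
    }
  }
  return records;
}
```

An append-only, line-per-record format keeps everything before a damaged line readable, which is why a partial write costs at most the final record.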
Full-chain abort path
toolUseContext.abortController is the central cancellation point threaded through the call chain:
User presses Esc
→ abortController.abort()
→ signal.aborted = true
→ LLM stream interrupted (AbortSignal passed to fetch)
→ all in-flight tools receive signal (tool.execute's second arg)
→ each tool's cleanup (bwrap process kill, file lock release, ...)
→ queryLoop checks signal.aborted → return { reason: 'aborted' }
Key design: each layer observes the signal itself, not waiting for upper-layer notification. LLM fetch natively supports AbortSignal; tool execute’s second arg always includes the signal; bash process wait watches the signal — cancellation is broadcast across the chain, not forwarded layer-by-layer.
Takeaway for your own agent: AbortController must thread from entry point to every leaf operation. Half-assed abort support is worse than none — users think they cancelled but something’s still running.
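A minimal sketch of the broadcast pattern using the standard AbortController. slowTool and runTools are hypothetical; the point is that each leaf operation subscribes to the same signal and performs its own cleanup.

```typescript
type ToolOutcome = "finished" | "aborted";

// A leaf operation that observes the signal itself instead of waiting to be told.
function slowTool(ms: number, signal: AbortSignal): Promise<ToolOutcome> {
  return new Promise((resolve) => {
    if (signal.aborted) {
      resolve("aborted");
      return;
    }
    const onAbort = () => {
      clearTimeout(timer); // cleanup: release whatever the tool holds
      resolve("aborted");
    };
    const timer = setTimeout(() => {
      signal.removeEventListener("abort", onAbort);
      resolve("finished");
    }, ms);
    signal.addEventListener("abort", onAbort, { once: true });
  });
}

async function runTools(
  signal: AbortSignal,
): Promise<{ reason: "done" | "aborted"; outcomes: ToolOutcome[] }> {
  // Every in-flight tool receives the same signal.
  const outcomes = await Promise.all([slowTool(1000, signal), slowTool(2000, signal)]);
  // The loop checks the signal itself before deciding how it terminated.
  return { reason: signal.aborted ? "aborted" : "done", outcomes };
}
```

One controller.abort() call settles both tools immediately and clears their timers, so nothing keeps running after the user has cancelled.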
QueryEngine: the level of a single LLM call
QueryEngine.ts (1295 lines) sits beneath the callModel function and is dedicated to consuming the stream
of a single LLM call:
- Parse SSE events (content_block_start / content_block_delta / content_block_stop / message_delta / …)
- Build assistant messages
- Handle max_tokens / stop_reason / various API errors
- Streaming fallback (discussed above)
- Thinking block special handling (signature verification)
- Usage tracking (including cache read / cache creation token counts separately)
This layer’s complexity comes from assembling the API’s “low-level event stream” into “high-level assistant messages” while staying cancel-safe, error-safe, and partial-state-safe. This is not toy code: production-grade logic for consuming a streaming API runs to 1000+ lines.
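The core of that assembly can be sketched as a fold over events. The event shapes below are heavily reduced (real events carry more fields, and only text deltas are handled), and assemble is a hypothetical function; the openBlocks count illustrates the partial-state concern.

```typescript
// Reduced event shapes; real events carry more fields (content_block, usage, ...).
type DemoSseEvent =
  | { type: "content_block_start"; index: number }
  | { type: "content_block_delta"; index: number; delta: { type: "text_delta"; text: string } }
  | { type: "content_block_stop"; index: number }
  | { type: "message_delta"; delta: { stop_reason: string | null } };

type Assembled = { blocks: string[]; stopReason: string | null; openBlocks: number };

function assemble(events: DemoSseEvent[]): Assembled {
  const blocks: string[] = [];
  const open = new Set<number>();
  let stopReason: string | null = null;
  for (const event of events) {
    switch (event.type) {
      case "content_block_start":
        blocks[event.index] = "";
        open.add(event.index);
        break;
      case "content_block_delta":
        blocks[event.index] = (blocks[event.index] ?? "") + event.delta.text;
        break;
      case "content_block_stop":
        open.delete(event.index);
        break;
      case "message_delta":
        stopReason = event.delta.stop_reason;
        break;
    }
  }
  // openBlocks > 0 means the stream ended mid-block: partial state the caller must handle.
  return { blocks, stopReason, openBlocks: open.size };
}
```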
Takeaways for building your own agent
- Main loop as async generator, not Promise<Response>: real-time streaming to the UI is the baseline for agent products
- Terminal return value must carry a reason: different termination reasons drive different UX; an opaque undefined can’t support that
- Dependency injection: put high-frequency mocks (callModel, microcompact, autocompact) in a deps object; otherwise spy boilerplate gets repeated across 6-8 test files
- State fields with clear semantics: 14 fields each answering one question beats an any bucket
- Per-turn processing is an explicit pipeline: snip → microcompact → collapse → autocompact → blocking check → model call → tools. The order is a design choice; annotate why this order
- Concurrent prefetch: skill / memory / cache params can run in the background during LLM streaming; manage their lifecycle with Promise.all or a using disposable
- Streaming tool executor: a tool_use block starts executing as the LLM streams it, without waiting for the full assistant message, which significantly lowers latency in multi-tool scenarios
- Tombstone messages as explicit “streamed but void” markers: streaming output can’t be retracted on the client side, so an explicit void marker is needed
- Model fallback + streaming fallback are two layers: one handles API errors, the other handles streaming exceptions
- AbortController threads to leaves: each layer watches the signal itself instead of having it forwarded layer by layer. Half-assed cancel is worse than none
- Task IDs need enough entropy: when IDs become filenames, consider symlink attacks. 36^8 is Claude Code’s choice
- Resume’s compaction is pre-computed in the background: from the user’s side resume is instant, because the summary already exists when they come back
Further reading
- Claude Code source: query.ts, QueryEngine.ts, Task.ts, query/{deps,stopHooks,tokenBudget,config}.ts
- Compaction: details steps 7-9 of the per-turn pipeline
- Execution Environment: worktree / remote implementations behind the Task types
- Permissions: canUseTool’s decision path