Bill Structure Analysis

Decomposition of a $0.19 agent task across five source categories (system prompt, tool descriptions, history, tool results, output), with cost comparison across models.

About the data: all token counts and dollar figures on this page are a synthetic example, derived from Claude Sonnet 4.6’s public pricing (input $3/MTok, cache write $3.75/MTok, cache read $0.30/MTok, output $15/MTok) and the magnitudes of a typical 12-step task. The purpose is to illustrate bill structure; the absolute values are not a quote. Agent product workloads exhibit a comparable source distribution, while absolute values vary by a factor of 3-5 with task complexity.

Task scenario

Sample task: a user asks the agent to triage a Gmail inbox and classify the 20 most recent emails by project. The agent completes the task in 12 steps: 4 extract calls for email metadata and 8 click actions through a browser automation tool to apply labels.

Rationale for selecting 12 steps: this length falls in the medium-task range. Tasks of 1-2 steps exhibit anomalously high system prompt share (startup overhead is not amortized); tasks of dozens of steps exhibit significantly higher history share (cumulative base grows). The conclusions in this section apply most directly to medium-length tasks.

Itemized cost table

Figure: Cost composition of one 12-step task bill (Claude Sonnet 4.6, synthetic example, $0.194 total). History dominates: conversation history 41%, model output 23%, tool results 22%, system + tools 13%, sandbox + storage <1%.

Sources: Conversation history = prior turns re-sent as input; Tool results = function returns (enter history); System+tools = static prefix (cached after first step); Sandbox+storage = compute + R2 (negligible).

Each row represents the cumulative usage and resulting cost for one token category:

| Source | Token usage | Unit price (per MTok) | Cost | Share |
| --- | --- | --- | --- | --- |
| System prompt (cache hits) | 2.5K write + 27.5K hits | write $3.75 / read $0.30 | $0.018 | 9% |
| Tool descriptions (cache hits) | 1.2K write + 13.2K hits | write $3.75 / read $0.30 | $0.008 | 4% |
| Conversation history (uncached) | 26.4K | $3.00 | $0.079 | 41% |
| Tool results (uncached) | 14.4K | $3.00 | $0.043 | 22% |
| Model output | 3.0K | $15 | $0.045 | 23% |
| Sandbox + storage | 1 session, ~1MB R2 | amortized | ~$0.001 | <1% |
| Total | | | $0.194 | 100% |

“Cache hits” refers to Anthropic Prompt Caching’s mechanism for matching identical prefixes on repeated transmission. System prompt and tool descriptions are static prefixes re-sent on every step; after the initial cache write, the subsequent 11 steps all hit.
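The arithmetic behind the table can be checked with a short script. This is a minimal sketch that only multiplies the token counts above by the listed Sonnet 4.6 prices; the row labels and function names are illustrative, and the sandbox line is taken as given rather than computed.

```python
# Sonnet 4.6 prices from this page, in dollars per million tokens (MTok).
PRICE = {"input": 3.00, "cache_write": 3.75, "cache_read": 0.30, "output": 15.00}

# (label, token count, price key) for each token row of the table.
ROWS = [
    ("System prompt, first write", 2_500, "cache_write"),
    ("System prompt, cache hits", 27_500, "cache_read"),
    ("Tool descriptions, first write", 1_200, "cache_write"),
    ("Tool descriptions, cache hits", 13_200, "cache_read"),
    ("Conversation history", 26_400, "input"),
    ("Tool results", 14_400, "input"),
    ("Model output", 3_000, "output"),
]
SANDBOX_COST = 0.001  # amortized sandbox + storage, taken as given


def row_cost(tokens: int, price_key: str) -> float:
    """Dollar cost of `tokens` tokens billed at the given per-MTok price."""
    return tokens * PRICE[price_key] / 1_000_000


def bill() -> dict[str, float]:
    """Map each row label to its dollar cost, plus the sandbox line and the total."""
    costs = {label: row_cost(tokens, key) for label, tokens, key in ROWS}
    costs["Sandbox + storage"] = SANDBOX_COST
    costs["Total"] = sum(costs.values())  # summed before "Total" is added
    return costs


if __name__ == "__main__":
    costs = bill()
    for label, cost in costs.items():
        print(f"{label:32s} ${cost:.4f}  {cost / costs['Total']:6.1%}")
```

Rounded to three decimals, the rows reproduce the table: the total comes to about $0.194, with conversation history at roughly 41% of it.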

Three core observations

History is the largest cost item

41% of the bill is consumed by re-transmitting historical conversation. Every additional step replays all prior user messages, assistant messages, and tool results as input. After 12 steps, accumulated history input reaches 26.4K — approximately 9× the model output and 10× the system prompt.

Cumulative cost grows O(n²) with step count: step k re-sends the entire history of steps 1 through k-1, so the total history re-sent over n steps is proportional to 0 + 1 + 2 + ... + (n-1) = n(n-1)/2. Nearly every mature agent product has prompt compaction as a dedicated subsystem precisely because of this property (reference implementation: compaction); without compaction, bills scale with the square of task length.
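The quadratic growth can be made concrete with a small sketch. It assumes, purely for illustration, that every step adds the same number of tokens to the history; the value of 400 tokens per step is hypothetical and was chosen only because it reproduces the 26.4K figure at 12 steps.

```python
def resent_history_tokens(steps: int, tokens_per_step: int) -> int:
    """Total history tokens sent as input across a task.

    Step k re-sends the history produced by steps 1..k-1, so the total is
    tokens_per_step * (0 + 1 + ... + (steps - 1)).
    """
    return sum(tokens_per_step * (k - 1) for k in range(1, steps + 1))


def closed_form(steps: int, tokens_per_step: int) -> int:
    """Same quantity via the arithmetic-series formula n(n-1)/2."""
    return tokens_per_step * steps * (steps - 1) // 2


if __name__ == "__main__":
    TOKENS_PER_STEP = 400  # hypothetical; reproduces 26.4K at 12 steps
    for n in (6, 12, 24, 48):
        total = resent_history_tokens(n, TOKENS_PER_STEP)
        print(f"{n:3d} steps -> {total:>9,} re-sent history tokens")
```

Under this assumption, doubling the task from 12 to 24 steps raises re-sent history from 26,400 to 110,400 tokens, roughly 4.2× rather than 2×.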

Cache hits on static prefixes yield significant savings

Without cache, the system prompt would be billed at the input price on all 12 steps: 2.5K × 12 = 30K tokens, or $3 × 30K / 1M = $0.09. With cache, the actual cost is $0.018, approximately 80% savings. Tool descriptions follow the same pattern.
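The comparison generalizes to any static prefix and step count. The sketch below assumes the pattern described on this page (one cache write on the first step, cache reads on every later step) and uses the Sonnet 4.6 prices listed above; the function names are illustrative.

```python
INPUT, CACHE_WRITE, CACHE_READ = 3.00, 3.75, 0.30  # $ per MTok, from this page


def uncached_cost(prefix_tokens: int, steps: int) -> float:
    """Cost of sending the prefix at the normal input price on every step."""
    return prefix_tokens * steps * INPUT / 1_000_000


def cached_cost(prefix_tokens: int, steps: int) -> float:
    """Cost with one cache write on step 1 and cache reads on every later step."""
    write = prefix_tokens * CACHE_WRITE
    reads = prefix_tokens * (steps - 1) * CACHE_READ
    return (write + reads) / 1_000_000


def savings(prefix_tokens: int, steps: int) -> float:
    """Fraction of the uncached cost that caching removes."""
    return 1 - cached_cost(prefix_tokens, steps) / uncached_cost(prefix_tokens, steps)


if __name__ == "__main__":
    for steps in (1, 2, 12):
        print(f"{steps:2d} steps: savings {savings(2_500, steps):+.1%}")
```

For the 2.5K system prompt over 12 steps this gives about 80% savings. For a single-step task the result is negative, because the write price ($3.75) is higher than the plain input price ($3.00) and there are no later reads to amortize it.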

prompt-system places all content that repeats on every step at the front of the prompt for this reason: it enables Anthropic / OpenAI prompt cache hits from step two onward. Cache hit rate is a continuous variable rather than a binary switch: the savings from raising hit rate from 60% to 80% typically exceed those from manually shortening prompts, because the percentage applies to the entire static prefix, whereas manual reduction affects only specific segments.
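A rough comparison of the two levers, as a sketch under stated assumptions: every cache miss is assumed to be billed at the cache-write price, the hit rate is treated as a fraction of all prefix sends, and the 20% prompt trim is a hypothetical figure chosen for illustration.

```python
CACHE_WRITE, CACHE_READ = 3.75, 0.30  # $ per MTok, from this page


def prefix_cost(prefix_tokens: int, steps: int, hit_rate: float) -> float:
    """Static-prefix cost when a fraction `hit_rate` of sends are cache reads.

    Assumption: every miss is billed at the cache-write price.
    """
    per_mtok = hit_rate * CACHE_READ + (1 - hit_rate) * CACHE_WRITE
    return prefix_tokens * steps * per_mtok / 1_000_000


PREFIX = 3_700  # system prompt (2.5K) + tool descriptions (1.2K)
STEPS = 12

baseline = prefix_cost(PREFIX, STEPS, hit_rate=0.60)
better_hits = prefix_cost(PREFIX, STEPS, hit_rate=0.80)
# Hypothetical alternative: trim the prefix by 20% and keep the 60% hit rate.
shorter_prompt = prefix_cost(PREFIX * 8 // 10, STEPS, hit_rate=0.60)

if __name__ == "__main__":
    print(f"baseline (60% hits):   ${baseline:.4f}")
    print(f"80% hits, same prompt: ${better_hits:.4f}")
    print(f"60% hits, 20% shorter: ${shorter_prompt:.4f}")
```

With these inputs the prefix costs about $0.0746 at a 60% hit rate, $0.0440 at 80%, and $0.0597 after the 20% trim, so the hit-rate improvement saves roughly twice as much as the trim. A different miss-billing rule or trim size would change the exact gap.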

Output unit price is 5× that of input

$15 / $3 = 5. Effective agent prompts therefore favor reduced reasoning output and increased tool invocation — fewer long-form analyses, more direct tool calls; tools return condensed data rather than letting the model paraphrase.

Common agent product tool design patterns follow this principle — the browser tool uses short uids instead of long CSS selectors, extract returns markdown rather than raw HTML by default, read_file paginates rather than returning full files to the model. Each design compresses input and indirectly compresses output — less visible content yields less paraphrasing.

Cost comparison across models on the same task

Figure: Same task, three models, 15× cost spread (12-step inbox triage, synthetic example). Haiku 4.5: ~$0.065, about 3× cheaper than Sonnet. Sonnet 4.6: $0.194, the baseline and default for most tasks. Opus 4.7: ~$0.970, about 5× Sonnet and 15× Haiku; reserve for deep reasoning only. Estimates assume identical token usage. Smaller models often need more steps, so the real gap is closer to 2× between Haiku and Sonnet and ~4× between Sonnet and Opus.

Only one multiplicative adjustment exists in this table: model selection. Other optimizations are additive (fewer tokens, expanded cache scope); switching models proportionally rescales the entire table.

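| Model | Input (per MTok) | Cache write (per MTok) | Cache read (per MTok) | Output (per MTok) | This task (est.) |
| --- | --- | --- | --- | --- | --- |
| Claude Haiku 4.5 | $1.00 | $1.25 | $0.10 | $5.00 | ~$0.065 |
| Claude Sonnet 4.6 | $3.00 | $3.75 | $0.30 | $15.00 | $0.194 |
| Claude Opus 4.7 | $15.00 | $18.75 | $1.50 | $75.00 | ~$0.970 |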

Estimates assume identical token usage across the three models. In practice, smaller models typically require more steps to complete the same task (weaker reasoning, more circuitous paths), so Haiku’s actual cost is closer to 1/2 of Sonnet rather than 1/3; Opus is the converse — more accurate single inferences with fewer steps, but substantially higher per-step cost.
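The per-model estimates can be reproduced by re-pricing the same token usage, as in the sketch below. Prices are copied from the table above. Two things are assumptions for illustration: the sandbox cost is treated as independent of model choice, and `usage_factor` is a hypothetical knob that crudely scales total token usage to stand in for a model needing more or fewer steps.

```python
# $ per MTok, copied from the pricing table on this page.
PRICES = {
    "Claude Haiku 4.5": {"input": 1.00, "cache_write": 1.25, "cache_read": 0.10, "output": 5.00},
    "Claude Sonnet 4.6": {"input": 3.00, "cache_write": 3.75, "cache_read": 0.30, "output": 15.00},
    "Claude Opus 4.7": {"input": 15.00, "cache_write": 18.75, "cache_read": 1.50, "output": 75.00},
}

# Token usage of the 12-step task, grouped by billing category.
USAGE = {
    "input": 26_400 + 14_400,       # conversation history + tool results
    "cache_write": 2_500 + 1_200,   # first send of system prompt + tool descriptions
    "cache_read": 27_500 + 13_200,  # 11 later sends of the same static prefix
    "output": 3_000,
}
SANDBOX_COST = 0.001  # assumed independent of model choice


def task_cost(model: str, usage_factor: float = 1.0) -> float:
    """Estimated task cost; `usage_factor` scales token usage (hypothetical)."""
    prices = PRICES[model]
    token_cost = sum(USAGE[k] * prices[k] for k in USAGE) / 1_000_000
    return token_cost * usage_factor + SANDBOX_COST


if __name__ == "__main__":
    for model in PRICES:
        print(f"{model:18s} ${task_cost(model):.3f}")
```

With identical usage this lands within rounding of the table's estimates. Setting `usage_factor=1.5` for Haiku, a purely hypothetical value, gives about $0.098, roughly half of Sonnet's $0.194, which matches the direction of the adjustment described above.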

The tier system is fundamentally a layered model budget: Lite defaults to Haiku for routine tasks; Ultra defaults to Opus for complex scenarios requiring deep reasoning. Routing a Haiku-capable task to Opus multiplies the bill by approximately 15.
