Bill Structure Analysis
Decomposition of a $0.19 agent task across five source categories (system prompt, tool descriptions, history, tool results, output), with cost comparison across models.
Regarding the data: All token counts and dollar figures on this page are a synthetic example, derived from Claude Sonnet 4.6’s public pricing (input $3/MTok, cache write $3.75/MTok, cache read $0.30/MTok, output $15/MTok) and the magnitudes of a typical 12-step task. The purpose is to illustrate bill structure — absolute values are not a quote. Agent product workloads exhibit comparable source distribution; absolute values vary within a factor of 3-5 with task complexity.
Task scenario
Sample task: a user asks the agent to triage a Gmail inbox and classify the 20 most recent emails by project. The agent completes the task in 12 steps: 4 extract calls for email metadata and 8 click actions through a browser automation tool to apply labels.
Rationale for selecting 12 steps: this length falls in the medium-task range. Tasks of 1-2 steps exhibit anomalously high system prompt share (startup overhead is not amortized); tasks of dozens of steps exhibit significantly higher history share (cumulative base grows). The conclusions in this section apply most directly to medium-length tasks.
Itemized cost table
Each row represents the cumulative usage and resulting cost for one token category:
| Source | Token usage | Unit price | Cost | Share |
|---|---|---|---|---|
| System prompt (cache hits) | 2.5K write + 27.5K hits | write $3.75 / read $0.30 | $0.018 | 9% |
| Tool descriptions (cache hits) | 1.2K write + 13.2K hits | write $3.75 / read $0.30 | $0.008 | 4% |
| Conversation history (uncached) | 26.4K | $3.00 / MTok | $0.079 | 41% |
| Tool results (uncached) | 14.4K | $3.00 / MTok | $0.043 | 22% |
| Model output | 3.0K | $15 / MTok | $0.045 | 23% |
| Sandbox + storage | 1 session, ~1MB R2 | amortized | ~$0.001 | <1% |
| Total | — | — | $0.194 | 100% |
“Cache hits” refers to Anthropic Prompt Caching’s mechanism for matching identical prefixes on repeated transmission. System prompt and tool descriptions are static prefixes re-sent on every step; after the initial cache write, the subsequent 11 steps all hit.
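The arithmetic behind the itemized table can be reproduced with a short script. This is a minimal sketch that uses only the token counts and prices quoted on this page; the names `PRICE`, `ROWS`, `row_cost`, and `bill` are illustrative and not part of any product API.

```python
# Prices in USD per million tokens (the Sonnet 4.6 figures quoted on this page).
PRICE = {"input": 3.00, "cache_write": 3.75, "cache_read": 0.30, "output": 15.00}

# Each row: (source, [(tokens, price key), ...]), copied from the itemized table.
ROWS = [
    ("System prompt", [(2_500, "cache_write"), (27_500, "cache_read")]),
    ("Tool descriptions", [(1_200, "cache_write"), (13_200, "cache_read")]),
    ("Conversation history", [(26_400, "input")]),
    ("Tool results", [(14_400, "input")]),
    ("Model output", [(3_000, "output")]),
]
SANDBOX_USD = 0.001  # amortized flat line, not token-based


def row_cost(parts):
    """Sum tokens * unit price over every (tokens, price key) pair in a row."""
    return sum(tokens * PRICE[key] / 1_000_000 for tokens, key in parts)


def bill():
    """Return {source: cost in USD}, including the flat sandbox line."""
    costs = {name: row_cost(parts) for name, parts in ROWS}
    costs["Sandbox + storage"] = SANDBOX_USD
    return costs


if __name__ == "__main__":
    costs = bill()
    total = sum(costs.values())
    for name, usd in costs.items():
        print(f"{name:<22} ${usd:.4f}  {usd / total:5.1%}")
    print(f"{'Total':<22} ${total:.4f}")
```

The unrounded total differs from the table's $0.194 only in the fourth decimal place, because the table sums already-rounded row costs.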
Three core observations
History is the largest cost item
41% of the bill goes to re-transmitting conversation history. Every additional step replays all prior user messages, assistant messages, and tool results as input. After 12 steps, accumulated history input reaches 26.4K tokens, approximately 9× the model output (3.0K) and 10× a single copy of the system prompt (2.5K).
Cumulative cost grows O(n²) with step count: step N re-sends the entire history of steps 1 through N-1, so the total is proportional to 1 + 2 + ... + (n-1) = n(n-1)/2. This is why nearly every mature agent product has a prompt compaction mechanism as a dedicated subsystem (reference implementation: compaction): without compaction, bills scale with the square of task length.
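The quadratic growth can be checked numerically. The sketch below assumes every step adds the same amount of history; the figure of 400 tokens per step is a hypothetical value, chosen only because it reproduces the 26.4K total for 12 steps. Real steps vary in size.

```python
def history_tokens_resent(steps: int, tokens_per_step: int) -> int:
    """Total history tokens sent as input across a whole task.

    Step k re-sends the k-1 earlier steps, so the total is
    tokens_per_step * (0 + 1 + ... + (steps - 1)).
    """
    return sum((k - 1) * tokens_per_step for k in range(1, steps + 1))


def closed_form(steps: int, tokens_per_step: int) -> int:
    """Same quantity via the arithmetic-series formula n(n-1)/2."""
    return tokens_per_step * steps * (steps - 1) // 2


if __name__ == "__main__":
    PER_STEP = 400  # hypothetical uniform history growth per step
    for n in (6, 12, 24, 48):
        tokens = history_tokens_resent(n, PER_STEP)
        usd = tokens * 3.00 / 1_000_000  # uncached input at $3 / MTok
        print(f"{n:>2} steps: {tokens:>7} history tokens, ${usd:.3f}")
```

Doubling the step count from 12 to 24 raises re-sent history from 26,400 to 110,400 tokens under this assumption, slightly more than 4×.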
Cache hits on static prefixes yield significant savings
Without cache, the system prompt would be billed as regular input on all 12 steps: 2.5K × 12 = 30K tokens, or $3 × 30K / 1M = $0.09. With cache, the actual cost is $0.018, a saving of approximately 80%. Tool descriptions follow the same pattern.
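The same comparison as a small function, as a sketch under this page's assumptions: one cache write on the first step, then a cache read on every later step. The function name `prefix_cost` is illustrative.

```python
INPUT, CACHE_WRITE, CACHE_READ = 3.00, 3.75, 0.30  # USD per MTok, from this page


def prefix_cost(prefix_tokens: int, steps: int, cached: bool) -> float:
    """Cost in USD of sending a static prefix on every step of a task."""
    if not cached:
        # Billed as ordinary input on every step.
        return prefix_tokens * steps * INPUT / 1_000_000
    # One cache write on step 1, then cache reads on the remaining steps.
    write = prefix_tokens * CACHE_WRITE / 1_000_000
    reads = prefix_tokens * (steps - 1) * CACHE_READ / 1_000_000
    return write + reads


if __name__ == "__main__":
    for name, tokens in (("System prompt", 2_500), ("Tool descriptions", 1_200)):
        plain = prefix_cost(tokens, 12, cached=False)
        cached = prefix_cost(tokens, 12, cached=True)
        print(f"{name}: ${plain:.4f} uncached, ${cached:.4f} cached, "
              f"{1 - cached / plain:.0%} saved")
```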
prompt-system prepends all per-step repeated content for this reason — to enable Anthropic / OpenAI prompt cache hits from step two onward. Cache hit rate is a continuous variable rather than a binary switch: the savings from raising hit rate from 60% to 80% typically exceed those from manually shortening prompts, because the percentage applies to the entire static prefix, whereas manual reduction affects only specific segments.
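To make the hit-rate comparison concrete, the sketch below treats hit rate as a continuous variable. It assumes every miss is billed at the cache-write price and every hit at the cache-read price; that billing model and the helper names are simplifying assumptions, not a description of any provider's exact behavior.

```python
CACHE_WRITE, CACHE_READ = 3.75, 0.30  # USD per MTok, from this page


def static_prefix_cost(total_prefix_tokens: int, hit_rate: float) -> float:
    """Cost in USD of all static-prefix tokens sent during a task.

    Assumption: a miss is billed as a cache write, a hit as a cache read.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    blended = hit_rate * CACHE_READ + (1.0 - hit_rate) * CACHE_WRITE
    return total_prefix_tokens * blended / 1_000_000


def equivalent_prompt_cut(low: float, high: float) -> float:
    """Fraction of the prefix that would have to be deleted at hit rate
    `low` to match the cost reached by raising the hit rate to `high`."""
    return 1.0 - static_prefix_cost(1_000_000, high) / static_prefix_cost(1_000_000, low)


if __name__ == "__main__":
    total = (2_500 + 1_200) * 12  # system prompt + tool descriptions, 12 steps
    for rate in (0.6, 0.8, 11 / 12):
        print(f"hit rate {rate:.0%}: ${static_prefix_cost(total, rate):.4f}")
    print(f"equivalent prompt cut for 60% -> 80%: {equivalent_prompt_cut(0.6, 0.8):.0%}")
```

Under these assumptions, moving from a 60% to an 80% hit rate saves as much as deleting roughly two fifths of the static prefix. A hit rate of 11/12 reproduces the combined cost of the two cached rows in the itemized table.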
Output unit price is 5× that of input
$15 / $3 = 5. Effective agent prompts therefore favor reduced reasoning output and increased tool invocation — fewer long-form analyses, more direct tool calls; tools return condensed data rather than letting the model paraphrase.
Common agent product tool design patterns follow this principle — the browser tool uses short uids instead of long CSS selectors, extract returns markdown rather than raw HTML by default, read_file paginates rather than returning full files to the model. Each design compresses input and indirectly compresses output — less visible content yields less paraphrasing.
Cost comparison across models on the same task
Model selection is the only multiplicative adjustment available: other optimizations are additive (fewer tokens, expanded cache scope), whereas switching models proportionally rescales the entire bill.
| Model | Input ($/MTok) | Cache write ($/MTok) | Cache read ($/MTok) | Output ($/MTok) | This task (est.) |
|---|---|---|---|---|---|
| Claude Haiku 4.5 | $1.00 | $1.25 | $0.10 | $5.00 | ~$0.065 |
| Claude Sonnet 4.6 | $3.00 | $3.75 | $0.30 | $15.00 | $0.194 |
| Claude Opus 4.7 | $15.00 | $18.75 | $1.50 | $75.00 | ~$0.970 |
Estimates assume identical token usage across the three models. In practice, smaller models typically require more steps to complete the same task (weaker reasoning, more circuitous paths), so Haiku’s actual cost is closer to 1/2 of Sonnet rather than 1/3; Opus is the converse — more accurate single inferences with fewer steps, but substantially higher per-step cost.
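The estimates in the table can be reproduced by repricing the same token usage per model. In the sketch below, `token_multiplier` is a hypothetical knob that stands in for "more steps on a smaller model"; the value 1.5 is an illustrative assumption, not a measurement.

```python
# USD per MTok, copied from the comparison table.
PRICES = {
    "Claude Haiku 4.5": {"input": 1.00, "cache_write": 1.25, "cache_read": 0.10, "output": 5.00},
    "Claude Sonnet 4.6": {"input": 3.00, "cache_write": 3.75, "cache_read": 0.30, "output": 15.00},
    "Claude Opus 4.7": {"input": 15.00, "cache_write": 18.75, "cache_read": 1.50, "output": 75.00},
}

# Token usage of the 12-step task, grouped by billing category.
USAGE = {
    "input": 26_400 + 14_400,       # history + tool results
    "cache_write": 2_500 + 1_200,   # system prompt + tool descriptions
    "cache_read": 27_500 + 13_200,
    "output": 3_000,
}
SANDBOX_USD = 0.001  # flat line, independent of the model


def task_cost(model: str, token_multiplier: float = 1.0) -> float:
    """Estimated task cost in USD; token_multiplier scales all token usage."""
    prices = PRICES[model]
    token_usd = sum(USAGE[kind] * prices[kind] for kind in USAGE) / 1_000_000
    return token_usd * token_multiplier + SANDBOX_USD


if __name__ == "__main__":
    for model in PRICES:
        print(f"{model:<18} ${task_cost(model):.3f}")
    # Hypothetical: Haiku needs 1.5x the tokens to finish the same task.
    print(f"Haiku at 1.5x tokens: ${task_cost('Claude Haiku 4.5', 1.5):.3f}")
```

With the multiplier at 1.5, Haiku lands at roughly half of Sonnet's cost rather than a third, which is the effect described above.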
The tier system is fundamentally a layered model budget: Lite defaults to Haiku for routine tasks; Ultra defaults to Opus for complex scenarios requiring deep reasoning. Routing a Haiku-capable task to Opus multiplies the bill by approximately 15.
Further reading
- Cost cap configuration and ROI estimation: controls-and-roi
- History runaway mechanism and compaction strategy implementation: compaction, memory-system
- Real data from your own environment: Grafana’s agent_token_usage panel (operations/dashboards)