What's your biggest LLM cost multiplier?

Question

"Tokens per request" has been a misleading cost model for us in production. The real drivers seem to be multipliers: retries/429s, tool fanout, P95 context growth, and safety passes.What&rsquo;s been the biggest cost multiplier in your prod LLM systems, and what policies worked (caps, degraded mode, fallback, hard fail)?

teilom · Accepted Answer

If you're trying to estimate before prod, logging these 4 things in a pilot gets you 80% there:
- tokens/run (in+out)
- tool calls/run (and fanout)
- retry rate (timeouts/429s)
- context length over turns (P50/P95)

Fanout × retries is the classic "bill exploder", and P95 context growth is the stealth one. The point of "budget as contract" is deciding in advance what happens at the limit (degraded mode / fallback / partial answer / hard fail), not discovering it from the invoice.
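For anyone who wants a starting point, here is a minimal sketch of logging those four pilot metrics. Everything in it is an assumption for illustration: `RunLog`, `percentile`, and `summarize` are hypothetical names, retry rate is taken as retries divided by total LLM calls, and percentiles use the nearest-rank method. It is not taken from any particular library.

```python
import math
from dataclasses import dataclass, field


@dataclass
class RunLog:
    """Per-run counters for the four pilot metrics (all names are illustrative)."""
    tokens_in: int = 0
    tokens_out: int = 0
    tool_calls: int = 0
    llm_calls: int = 0
    retries: int = 0  # calls repeated after a timeout or a 429
    context_tokens: list = field(default_factory=list)  # context length per turn


def percentile(values, pct):
    """Nearest-rank percentile; returns 0 for an empty list."""
    if not values:
        return 0
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(runs):
    """Aggregate the pilot metrics across a non-empty list of runs."""
    n = len(runs)
    total_calls = sum(r.llm_calls for r in runs)
    contexts = [c for r in runs for c in r.context_tokens]
    return {
        "tokens_per_run": sum(r.tokens_in + r.tokens_out for r in runs) / n,
        "tool_calls_per_run": sum(r.tool_calls for r in runs) / n,
        # Assumption: retry rate = retried calls / all LLM calls.
        "retry_rate": sum(r.retries for r in runs) / total_calls if total_calls else 0.0,
        "context_p50": percentile(contexts, 50),
        "context_p95": percentile(contexts, 95),
    }
```

Comparing `context_p95` against `context_p50` is what surfaces the stealth growth: a median that looks fine can sit next to a tail several times larger.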

teilom · Answer

Background note I wrote (framing + "budget as contract"): https://github.com/teilomillet/enzu/blob/main/docs/BUDGETS_A...

zhug3 · Answer

In my experience the biggest multiplier isn't any single variable; it's the interaction between them. Fanout × retries × context growth compounds in ways that linear cost models completely miss.
The fix that worked for us: treat budget as a hard constraint, not a target. When you're approaching the limit, degrade gracefully (shorter context, fewer tool calls, fallback to a smaller model) rather than letting costs explode and cleaning up later.
Also worth tracking: the 90th percentile request often costs 10x the median. A handful of pathological queries can dominate your bill. Capping max tokens per request is crude but effective.
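To make the compounding and the degrade-before-fail idea concrete, here is a rough sketch. The names (`Action`, `compounded_multiplier`, `next_action`) and the 70%/90% thresholds are hypothetical, and the retry term is a deliberate simplification that assumes each call is retried at most once.

```python
from enum import Enum


class Action(Enum):
    PROCEED = "proceed"
    TRIM_CONTEXT = "trim_context"    # shorter context, fewer tool calls
    SMALLER_MODEL = "smaller_model"  # fall back to a cheaper model
    HARD_FAIL = "hard_fail"


def compounded_multiplier(fanout, retry_rate, context_growth):
    """Cost relative to one call with no retries and baseline context.

    Simplification: each call is retried at most once, with probability
    retry_rate, so the retry factor is (1 + retry_rate).
    """
    return fanout * (1 + retry_rate) * context_growth


def next_action(spent, budget, soft=0.7, hard=0.9):
    """Choose a degradation step from the fraction of budget already spent."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    used = spent / budget
    if used >= 1.0:
        return Action.HARD_FAIL
    if used >= hard:
        return Action.SMALLER_MODEL
    if used >= soft:
        return Action.TRIM_CONTEXT
    return Action.PROCEED
```

With a fanout of 4, a 25% retry rate, and context that has doubled, `compounded_multiplier(4, 0.25, 2.0)` gives 10.0, even though no single factor looks alarming on its own. Checking `next_action` before every model or tool call is what makes the budget a constraint rather than a target.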

rishabhpoddar · Answer

- Tool calling: This is unavoidable, but I try to structure the tools so that the total number of tool calls per input is minimised.
- Using UUIDs in the prompt (which can happen if you serialise a data structure that contains UUIDs into a prompt): Just don't use UUIDs, or if you must, map them onto unique numbers (in memory) before adding them to a prompt.
- Putting everything in one LLM chat history: Use sub agents with their own chat history, and discard it after the sub agent finishes.
- Structure your system prompt to maximize cached input tokens: You can do this by putting all the variable parts of the system prompt towards the end of it, if possible.
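
Two of the points above (UUID mapping and cache-friendly prompt ordering) can be sketched roughly as follows. `UuidAliaser` and `build_system_prompt` are hypothetical helpers, not part of any library, and the caching benefit assumes your provider caches on an identical prompt prefix.

```python
import re

UUID_RE = re.compile(
    r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-"
    r"[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
)


class UuidAliaser:
    """Swap UUIDs for small integers before prompting; the map stays in memory."""

    def __init__(self):
        self._alias_by_uuid = {}
        self._uuid_by_alias = {}

    def shorten(self, text):
        """Replace every UUID in text with a stable integer alias."""
        def replace(match):
            uuid = match.group(0).lower()
            if uuid not in self._alias_by_uuid:
                alias = len(self._alias_by_uuid) + 1
                self._alias_by_uuid[uuid] = alias
                self._uuid_by_alias[alias] = uuid
            return str(self._alias_by_uuid[uuid])
        return UUID_RE.sub(replace, text)

    def resolve(self, alias):
        """Map an alias from the model's output back to the original UUID."""
        return self._uuid_by_alias[int(alias)]


def build_system_prompt(static_parts, variable_parts):
    """Static text first, per-request text last, so the leading prefix is identical."""
    return "\n\n".join([*static_parts, *variable_parts])
```

A UUID typically costs many tokens while a small integer costs one or two, so the saving adds up when the same IDs recur across a serialised structure. Call `resolve` on any ID the model returns before acting on it.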