LLM · Cost Control · Open Source

Why I Built CostPlan: Deterministic Economics for Probabilistic Systems

February 13, 2026 · 8 min

The Mismatch Nobody Talks About

Here's something that should bother more people than it does:

LLMs are non-deterministic. You send the same prompt twice, you get different outputs, different token counts, different costs. They're recursive — an agent can call itself. They're parallelizable — ten calls can fire at once. And they're capable of runaway loops where a confused agent retries the same failing action indefinitely.

But billing is deterministic. It's real. It's enforced. It's financially consequential.

That mismatch — probabilistic systems, deterministic billing — is a reliability gap. And right now, almost nobody has infrastructure to close it.

The Problem I Kept Seeing

I kept seeing the same pattern play out:

  1. Developer builds an agent loop. The agent calls an LLM, processes the response, decides what to do next, calls the LLM again.
  2. During development, costs are low. A few cents per run. No big deal.
  3. The agent goes into production — or gets left running overnight — or hits a retry loop on a bad API response.
  4. The developer wakes up to a $200 bill. Or $800. Or worse.

This isn't hypothetical. Autonomous AI agents have cost users up to $300/day in API fees, and users report burning through 1–3 million tokens in minutes.

The common advice? "Be more careful." "Set billing alerts." "Use a cheaper model."

That's not engineering. That's hope.

Why Billing Alerts Don't Work

Billing alerts tell you after you've already spent the money. They're monitoring, not enforcement. The difference matters:

  • Monitoring says: "You spent $50 today." You read this email tomorrow morning.
  • Enforcement says: "This call would put you over $50. Rejected." The call never executes.

If your agent is in a loop burning $5/minute, a billing alert that fires every hour isn't going to save you. You need the call to fail before the money is spent.
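The distinction is easy to make concrete. Here is a minimal sketch of an enforcement check, decided before any money moves; the names are illustrative, not CostPlan's actual API:

```python
# Illustrative sketch of enforcement vs. monitoring; these names are mine,
# not CostPlan's actual API.
class BudgetEnforcer:
    def __init__(self, session_budget: float):
        self.remaining = session_budget

    def check(self, estimated_cost: float) -> bool:
        """Enforcement: decide BEFORE the call executes."""
        return estimated_cost <= self.remaining

    def record(self, actual_cost: float) -> None:
        """Charge the real cost after the response arrives."""
        self.remaining -= actual_cost

enforcer = BudgetEnforcer(session_budget=50.00)
if enforcer.check(estimated_cost=0.75):
    # ...make the LLM call here, then charge what it actually cost:
    enforcer.record(actual_cost=0.82)
else:
    raise RuntimeError("Call would exceed budget; rejected before any spend")
```

Monitoring would only log `enforcer.remaining` somewhere; enforcement is the `else` branch actually firing.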

The Design: A Transparent Proxy

CostPlan is a transparent HTTP proxy that sits between your application and the LLM API. Your app thinks it's talking to Anthropic or OpenAI. It's actually talking to CostPlan, which enforces budget limits and forwards the request.

Your Agent  ->  CostPlan Proxy  ->  Anthropic/OpenAI API
                    |
              Budget check
              (reject if over)
                    |
              Forward request
                    |
              Track actual cost
              from response

The integration is two lines:

costplan proxy --per-call 1.00 --session 10.00
export ANTHROPIC_BASE_URL=http://localhost:8080

No code changes to your agent. No SDK to integrate. No config files to write. Set an environment variable and start the proxy.

When the budget runs out, the proxy returns HTTP 429. Your agent gets an error and stops. That's the circuit breaker.
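The 429 deserves different treatment than a rate-limit 429: a budget rejection is permanent for the session, so retrying only burns time. A hypothetical handling sketch for an agent loop (the function and exception names are mine, not CostPlan's):

```python
# Hypothetical sketch: treat the proxy's 429 as a circuit-breaker trip,
# not a transient error worth retrying. Names are illustrative.
class BudgetExhausted(Exception):
    pass

def handle_status(status: int, attempt: int, max_retries: int = 3) -> str:
    if status == 429:
        # The proxy's 429 means the budget is gone; retrying cannot help.
        raise BudgetExhausted("session budget exhausted; stopping agent loop")
    if status >= 500 and attempt < max_retries:
        return "retry"  # server hiccups are worth retrying
    if 200 <= status < 300:
        return "ok"
    return "fail"
```

The point is the asymmetry: a 502 gets a retry, a 429 from the budget proxy unwinds the loop entirely.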

The Hard Technical Problems

Building a budget proxy for LLM APIs sounds simple. It's not. Here's what made it interesting:

Streaming Responses

Claude Code and most modern LLM integrations use Server-Sent Events (SSE) for streaming. The response arrives as a sequence of chunks over seconds or minutes. You can't buffer the whole response and then decide the cost — the client needs chunks in real time.

So CostPlan forwards each SSE chunk the moment it arrives (no buffering delay) while simultaneously parsing the event stream. It extracts token usage from two specific events:

  • message_start — contains input token counts (including cache tokens)
  • message_delta — the final event, contains output token count

The cost is calculated after the stream ends, and the session budget is updated. If this call put you over budget, the next call gets rejected. You can't un-send a stream that's already flowing, but you can prevent the next one.
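The extraction itself can be sketched in a few lines. The event shapes below follow Anthropic's documented Messages API streaming format (usage fields inside `message_start` and `message_delta`); check current API docs before relying on them, and note this is an illustration, not CostPlan's parser:

```python
# Sketch: pull token usage out of an Anthropic-style SSE stream.
# Event/field shapes follow Anthropic's streaming format as described above.
import json

def extract_usage(sse_lines):
    """Collect usage fields from message_start and message_delta events."""
    usage = {}
    event = None
    for line in sse_lines:
        if line.startswith("event: "):
            event = line[len("event: "):].strip()
        elif line.startswith("data: "):
            data = json.loads(line[len("data: "):])
            if event == "message_start":
                usage.update(data["message"]["usage"])  # input + cache tokens
            elif event == "message_delta":
                usage.update(data.get("usage", {}))     # final output_tokens
    return usage

stream = [
    "event: message_start",
    'data: {"message": {"usage": {"input_tokens": 120, '
    '"cache_read_input_tokens": 900, "cache_creation_input_tokens": 0}}}',
    "event: message_delta",
    'data: {"usage": {"output_tokens": 310}}',
]
```

In the real proxy this runs alongside forwarding, so the client never waits on the parse.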

Cache-Aware Pricing

Anthropic charges different rates for different types of input tokens:

  • Regular input tokens: full price
  • Cache read tokens: approximately 10% of input price (you're reusing cached context)
  • Cache creation tokens: approximately 125% of input price (you're writing to cache)

If you ignore cache tokens, your cost tracking is wrong — potentially by a large margin for agents that use prompt caching heavily (which is most of them, including Claude Code). CostPlan tracks all four token types from the SSE stream and prices them correctly.
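Putting the multipliers above into one formula makes the error margin obvious. This sketch uses the ~10% and ~125% figures from the list; the per-million-token base rates in the test are placeholders, since real rates vary by model:

```python
# Pricing sketch using the cache multipliers described above.
# Rates are dollars per million tokens; actual rates depend on the model.
def call_cost(usage: dict, input_rate: float, output_rate: float) -> float:
    return (
        usage.get("input_tokens", 0) * input_rate
        + usage.get("cache_read_input_tokens", 0) * input_rate * 0.10      # ~10%
        + usage.get("cache_creation_input_tokens", 0) * input_rate * 1.25  # ~125%
        + usage.get("output_tokens", 0) * output_rate
    ) / 1_000_000
```

A cache-heavy call can have ten times more cache-read tokens than regular input tokens, so dropping those terms misprices the call badly.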

Pre-Check vs. Post-Track

There's a fundamental tension: you want to reject over-budget calls before they execute, but you don't know the exact cost until after you get the response (because you don't know how many output tokens the model will generate).

CostPlan handles this with a two-phase approach:

  1. Pre-check: Estimate input cost from message content length (conservative heuristic). If the estimate alone exceeds the remaining budget, reject immediately with 429.
  2. Post-track: After the response (or stream) completes, calculate actual cost from real token counts and update the session budget.

The pre-check catches obviously over-budget calls. The post-track catches everything else. If a call's actual cost exceeds the remaining budget, the session locks and subsequent calls are rejected.
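The two phases fit in a small state machine. This is an illustrative sketch, not CostPlan's internals; the chars/4 token estimate is a common rough heuristic, assumed here rather than taken from the source:

```python
# Sketch of the pre-check / post-track flow described above. Illustrative only;
# the chars/4 token estimate is an assumed heuristic, not CostPlan's.
class SessionBudget:
    def __init__(self, limit: float, input_rate_per_mtok: float):
        self.limit = limit
        self.input_rate = input_rate_per_mtok
        self.spent = 0.0
        self.locked = False

    def pre_check(self, prompt: str) -> bool:
        """Phase 1: reject before the call if the estimate alone busts the budget."""
        if self.locked:
            return False
        est_tokens = len(prompt) / 4  # rough chars-to-tokens heuristic
        est_cost = est_tokens * self.input_rate / 1_000_000
        return self.spent + est_cost <= self.limit

    def post_track(self, actual_cost: float) -> None:
        """Phase 2: charge real token costs; lock the session if over budget."""
        self.spent += actual_cost
        if self.spent >= self.limit:
            self.locked = True  # every subsequent call is rejected
```

The lock is what turns "that call overspent" into "no further calls will".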

Thread Safety and Concurrency

Agent frameworks often fire multiple LLM calls concurrently. The budget state needs to be safe under concurrent access — both for the proxy (async requests hitting the same budget) and the SDK (threaded agent loops).

CostPlan uses asyncio.Lock in the proxy and threading.Lock in the SDK. Budget mutations are atomic. Two concurrent calls can't both "fit" in a budget that only has room for one.
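The "can't both fit" property requires the check and the reservation to happen under one lock, not as two separate steps. A sketch of that atomic reserve-then-settle pattern, illustrative rather than CostPlan's actual code:

```python
# Sketch: atomic check-and-reserve under a lock, so two concurrent calls
# cannot both "fit" in a budget with room for only one. Illustrative.
import threading

class ConcurrentBudget:
    def __init__(self, limit: float):
        self._lock = threading.Lock()
        self.limit = limit
        self.committed = 0.0

    def try_reserve(self, estimated: float) -> bool:
        with self._lock:  # check + reserve in one critical section
            if self.committed + estimated > self.limit:
                return False
            self.committed += estimated
            return True

    def settle(self, estimated: float, actual: float) -> None:
        with self._lock:  # replace the estimate with the real cost
            self.committed += actual - estimated
```

Without the lock, two threads could both pass the `if` before either increments `committed`, and the budget would be oversubscribed.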

What It's Not

CostPlan is not a dashboard. It's not an analytics platform. It's not a cost optimization tool.

It's an enforcement layer. A circuit breaker. It guarantees that your LLM workflow won't spend more than you allowed. That's a strong invariant, and strong invariants are what make systems reliable.

Two Integration Paths

Path A: SDK — for when you control the code.

from costplan import BudgetedLLM, BudgetExceededError

llm = BudgetedLLM(
    provider="anthropic",
    model="claude-sonnet-4-20250514",
    per_call_budget=0.50,
    session_budget=5.00,
)

try:
    response = llm.generate("Summarize this document.")
except BudgetExceededError:
    print("Budget hit. Stopping.")

Thread-safe. Async variant available. Dynamic max_tokens from remaining budget.
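The "dynamic max_tokens" idea can be read as a simple inversion of the output price: cap the completion so even a maximal response cannot overspend what's left. A hypothetical sketch; the formula and the hard cap are my assumptions, not the SDK's documented behavior:

```python
# Hypothetical sketch of deriving max_tokens from the remaining budget.
# Formula and hard cap are assumptions, not CostPlan's documented logic.
def dynamic_max_tokens(remaining_budget: float,
                       output_rate_per_mtok: float,
                       hard_cap: int = 8192) -> int:
    """Largest output-token count the remaining budget can pay for."""
    affordable = int(remaining_budget / output_rate_per_mtok * 1_000_000)
    return max(0, min(affordable, hard_cap))
```

With $0.015 left and output at $15 per million tokens, the call gets capped at 1,000 output tokens instead of being allowed to overshoot.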

Path B: Proxy — for when you don't control the code (Claude Code, or any third-party agent).

costplan proxy --per-call 1.00 --session 10.00
export ANTHROPIC_BASE_URL=http://localhost:8080

Zero code changes. Works with anything that talks to the OpenAI or Anthropic API.

The Takeaway

LLMs are powerful, unpredictable, and expensive. The tooling around them assumes you'll be careful. But "careful" doesn't scale to autonomous agents, CI pipelines, and overnight batch jobs.

CostPlan makes one simple guarantee: your workflow won't spend more than you said it could. That's it. Deterministic economics for probabilistic systems.


CostPlan is open source on GitHub.