CostPlan – LLM Cost Enforcement Proxy
Open-source circuit breaker for autonomous agent API spend
The Setup
You're running an autonomous coding agent — let's say Moltbot, or Claude Code, or your own agent loop. The agent reads a codebase, plans changes, writes code, runs tests, reads errors, fixes bugs, and repeats. Every iteration is an API call. Every API call costs money.
Here's what a typical agent session looks like in terms of token usage:
| Iteration | Input Tokens | Output Tokens | Cost |
|-----------|--------------|---------------|------|
| 1 | 8,000 | 2,000 | $0.09 |
| 2 | 15,000 | 3,000 | $0.17 |
| 3 | 28,000 | 4,000 | $0.30 |
| 4 | 45,000 | 5,000 | $0.47 |
| 5 | 70,000 | 6,000 | $0.71 |
| ... | ... | ... | ... |
| 20 | 200,000 | 8,000 | $1.88 |
Notice the pattern: input tokens grow every iteration because the agent includes prior context (conversation history, file contents, error logs) in each call. By iteration 20, you're sending 200k input tokens per call.
Sum the cost column over all 20 iterations and that session comes to roughly $20. That's one task. A productive developer might run 30–40 sessions a day. That's $600–$800/day in API costs.
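The compounding is mechanical: every call re-sends the accumulated history, so total tokens sent grow roughly quadratically with iteration count. A toy model makes the session-level volume concrete (the linear context growth here is an illustrative assumption; the table above grows faster still):

```python
# Illustrative model, not CostPlan code: iteration i re-sends the whole
# history, modeled here as base + growth * i input tokens per call.
def total_input_tokens(iterations: int, base: int, growth: int) -> int:
    return sum(base + growth * i for i in range(iterations))

# Across a 20-iteration session the agent sends over 2M input tokens in
# total, even though no single call exceeds 200k.
session_tokens = total_input_tokens(20, 8_000, 10_000)
```

You pay for every one of those re-sent tokens, every iteration.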
Now imagine the agent hits a bug it can't fix. It tries the same approach, gets the same error, retries with a slightly different prompt, gets the same error again. Ten iterations burned on a single failing test. That's $15 wasted in a loop.
The Problem
There's no built-in way to say: "Stop after $10."
The Anthropic API doesn't have a budget parameter. max_tokens limits output length, not cost. There's no session concept. Each API call is independent — the server doesn't know or care that you've already spent $50 today.
Billing alerts exist, but they're:
- Delayed — minutes to hours behind real spend
- Non-blocking — the alert doesn't stop the next call
- Account-level — no per-agent or per-session granularity
If your agent is burning $5/minute in a retry loop, a billing alert that fires hourly won't help.
The Fix: Two Lines
Start the CostPlan proxy with a session budget:
```
costplan proxy --per-call 1.00 --session 10.00 --port 8080
```
Point your agent at it:
```
export ANTHROPIC_BASE_URL=http://localhost:8080
```
That's it. No code changes. No configuration files. The agent doesn't know or care that it's talking to a proxy instead of the real API.
What Happens Now
Normal Operation
The proxy forwards every request to Anthropic, streaming SSE chunks back in real-time. Zero added latency on the response stream. After each response completes, the proxy calculates the actual cost (including cache read and cache creation tokens) and deducts it from the session budget.
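The accounting step can be sketched in a few lines (a minimal sketch with hypothetical names, not CostPlan's actual internals):

```python
# Minimal sketch of post-response budget accounting: price the usage
# reported in the completed response, then deduct it from the session.
class SessionBudget:
    def __init__(self, limit: float):
        self.limit = limit    # e.g. --session 10.00
        self.spent = 0.0

    def charge(self, cost: float) -> float:
        """Deduct the actual cost of a completed call; return what's left."""
        self.spent += cost
        return self.limit - self.spent

    def exhausted(self) -> bool:
        return self.spent >= self.limit

budget = SessionBudget(10.00)
budget.charge(0.09)              # iteration 1 from the table above
remaining = budget.charge(0.17)  # iteration 2: $9.74 left
```

Because the deduction happens after the response has streamed back, the stream itself is never delayed; enforcement only affects the next call.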
Your agent works exactly as before — until it shouldn't.
Budget Approaching
The proxy tracks remaining budget and returns it in response headers:
```
X-Budget-Remaining: 2.35
X-Budget-Session-Total: 10.00
```
If your agent framework reads headers (most don't, but yours could), you can implement graceful wind-down: "You have $2 left, wrap up."
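A wind-down check on those headers takes only a few lines. This sketch is illustrative client code: the header names come from the proxy's responses, but the function and the 25% threshold are arbitrary choices, not part of CostPlan.

```python
# Decide whether to start wrapping up based on CostPlan's budget headers.
def should_wind_down(headers: dict, threshold_fraction: float = 0.25) -> bool:
    if "X-Budget-Remaining" not in headers or "X-Budget-Session-Total" not in headers:
        return False  # proxy not in the path; no budget signal to act on
    remaining = float(headers["X-Budget-Remaining"])
    total = float(headers["X-Budget-Session-Total"])
    return remaining <= total * threshold_fraction

# With $2.35 left of $10.00, it's time to tell the agent to wrap up:
should_wind_down({"X-Budget-Remaining": "2.35", "X-Budget-Session-Total": "10.00"})
```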
Budget Exceeded
When the session budget is exhausted, the next API call gets:
```
HTTP 429 Too Many Requests
```
The agent receives an error. Most agent frameworks handle API errors by stopping or retrying with backoff — either way, the spending stops.
This is the circuit breaker. The session cannot spend more than $10.
Per-Call Protection
The --per-call 1.00 flag catches a different failure mode: a single gigantic call. If your agent accidentally sends a 500k-token context (maybe it read an entire repository into the prompt), the proxy estimates the cost before forwarding and rejects it if over the per-call limit.
The expensive call never reaches Anthropic. You're not charged.
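The pre-flight check has to estimate, because the request hasn't been tokenized or answered yet. Here is one plausible approach (hypothetical names throughout; the ~4 characters/token heuristic and the use of max_tokens as an output bound are assumptions, not CostPlan's documented algorithm):

```python
# Estimate the worst-case cost of a request before forwarding it upstream.
IN_RATE, OUT_RATE = 3.00e-6, 15.00e-6  # Claude Sonnet 4 list prices per token

def estimate_call_cost(body: dict) -> float:
    # Rough heuristic: ~4 characters per token for English prose and code.
    prompt_chars = sum(len(str(m.get("content", ""))) for m in body.get("messages", []))
    est_input_tokens = prompt_chars / 4
    max_output_tokens = body.get("max_tokens", 4096)  # worst case: model uses it all
    return est_input_tokens * IN_RATE + max_output_tokens * OUT_RATE

def admit(body: dict, per_call_limit: float = 1.00) -> bool:
    return estimate_call_cost(body) <= per_call_limit

# A 500k-token context (~2M characters) estimates to about $1.56, so a
# $1.00 per-call limit rejects it before it ever reaches Anthropic:
big = {"messages": [{"role": "user", "content": "x" * 2_000_000}], "max_tokens": 4096}
admit(big)
```

Overestimating slightly is the safe failure mode here: a rejected call costs nothing, while an underestimate lets an expensive call through.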
The Numbers
Here's the same 20-iteration scenario, with CostPlan enforcing a $10 session budget:
| Iteration | Cumulative Cost | Budget Remaining | Status |
|-----------|-----------------|------------------|--------|
| 1 | $0.09 | $9.91 | Forwarded |
| 5 | $1.74 | $8.26 | Forwarded |
| 10 | $5.20 | $4.80 | Forwarded |
| 13 | $8.45 | $1.55 | Forwarded |
| 14 | $9.80 | $0.20 | Forwarded |
| 15 | — | $0.20 | Rejected (429) |
The agent gets 14 productive iterations. The 15th is rejected because the estimated cost exceeds the remaining $0.20. Total spend: $9.80 instead of unbounded.
If the agent was in a retry loop (iterations 10–20 all hitting the same bug), the circuit breaker would have tripped at iteration 13 or 14 instead of letting it burn through $7+ more on hopeless retries.
Cache-Aware Accuracy
A naive cost tracker would look at input tokens and output tokens and multiply by the standard rates. But Anthropic's actual pricing has four token types:
| Token Type | Rate (Claude Sonnet 4) | Description |
|------------|------------------------|-------------|
| Input | $0.003/1K | Regular input tokens |
| Output | $0.015/1K | Generated output tokens |
| Cache Read | $0.0003/1K | Reusing cached context (90% discount) |
| Cache Creation | $0.00375/1K | Writing new context to cache (25% surcharge) |
Agents that use prompt caching (including Claude Code) might have 80% of their input tokens come from cache reads. If you price those at the full input rate, your cost tracker reports three to four times the actual spend, which means your budget limit triggers way too early, killing productive sessions.
CostPlan parses cache token counts from the SSE stream and prices each type correctly. The budget reflects real spend, not estimates.
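The difference the four rates make can be shown with simplified usage fields (the key names here are illustrative shorthand, not the API's exact field names):

```python
# Cache-aware vs naive pricing, using the per-token rates from the table.
RATES = {"input": 3.00e-6, "output": 15.00e-6,
         "cache_read": 0.30e-6, "cache_creation": 3.75e-6}

def cache_aware_cost(usage: dict) -> float:
    return sum(usage.get(kind, 0) * rate for kind, rate in RATES.items())

def naive_cost(usage: dict) -> float:
    # Mistake: price every input-side token at the full input rate.
    input_side = (usage.get("input", 0) + usage.get("cache_read", 0)
                  + usage.get("cache_creation", 0))
    return input_side * RATES["input"] + usage.get("output", 0) * RATES["output"]

# A cache-heavy call: 80% of input-side tokens are cache reads.
usage = {"input": 20_000, "cache_read": 80_000, "output": 2_000}
actual = cache_aware_cost(usage)  # $0.114
naive = naive_cost(usage)         # $0.33, nearly 3x the real spend
```

A budget enforced against the naive figure would trip long before the real money ran out.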
When To Use This
You should use CostPlan if:
- You're running autonomous agents (Moltbot, Claude Code, custom loops)
- You're letting LLMs run unattended (CI/CD, batch jobs, overnight tasks)
- You're building multi-agent systems where total cost is hard to predict
- You want to give team members API access without unlimited spend
You don't need CostPlan if:
- You're making a few manual API calls per day
- Your LLM usage is already predictable and low-cost
- You're fine with billing alerts and manual monitoring
The Broader Point
The AI ecosystem has great tools for building agents. It has almost no tools for constraining agents. The assumption seems to be that developers will be careful, that usage will be reasonable, that loops will terminate.
In production, that assumption costs real money.
CostPlan is infrastructure, not a feature. It's the ulimit for LLM spend — boring, invisible, and exactly the thing you wish you'd had the morning after your agent burned $200 overnight.
CostPlan is open source and takes two lines to set up. Get started on GitHub.

