Agent SDK Budget Gate: How to Calculate Token, Time, and Cost Limits Before Every Tool Call


The Answer at a Glance

A budget gate in the Agent SDK is a guard function that runs immediately before each tool call and inspects three numbers: tokens consumed so far, seconds elapsed since session start, and estimated cumulative API cost. If any of the three values reaches its preset ceiling, the function raises a BudgetExceeded signal that terminates the agent loop right away. Because the gate sits in a middleware position rather than inside the agent body itself, it can be reused across projects without touching core logic.

Why This Matters Now

Agents running repetitive loops can exhaust budgets faster than you expect. Calling a search tool twenty times in a row or repeatedly summarizing a large document can burn millions of tokens in a single session. Anthropic's guidance on building agents recommends stopping conditions, such as a maximum number of iterations, to keep an agent under control. Beyond cost, uncapped execution time degrades user experience and can trigger cascading timeouts in downstream services.

Think of the budget gate like a safety barrier on a construction site. Without it, workers can unknowingly walk into a hazardous zone. With it, everyone must pause and verify status before proceeding.

Step-by-Step Implementation

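At session start, initialize a BudgetState object with the three ceilings; it also carries the running counters, such as used_tokens, that the gate compares against them. Example: budget = BudgetState(max_tokens=50000, max_seconds=120, max_usd=0.50)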
Write a wrapper function check_budget(budget, estimated_tokens) that reads the current timestamp to compute elapsed seconds and multiplies estimated tokens by the model unit price to project cost.
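Inside check_budget, evaluate conditions in this order: if budget.used_tokens + estimated_tokens > budget.max_tokens: raise BudgetExceeded('token'), then apply the same pattern to elapsed seconds ('time') and projected cost ('cost'). All three checks are cheap, so the order does not change whether the gate trips; it only determines which reason is reported when more than one ceiling is crossed at once. Keeping a fixed order of tokens, then time, then cost makes that reason predictable in logs.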
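Register check_budget with the hook that fires before each tool call in the agent loop (called before_tool_call in this article; in the Claude Agent SDK the corresponding hook event is PreToolUse). The gate itself is local arithmetic and makes no API call, so it can stop the loop before the next tool spends any tokens or time.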
Wrap the agent loop in a BudgetExceeded handler at the outermost level and serialize partial results there. Returning 'stopped due to budget limit, partial results below' is far more useful than returning nothing.
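The five steps above can be sketched as a small, SDK-independent program. This is a minimal sketch under stated assumptions, not the SDK's own API: the fields used_usd and started_at, the usd_per_1k_tokens and now parameters, and the toy run_agent loop are all hypothetical names introduced for illustration. In a real agent, the call to check_budget would sit inside the pre-tool hook rather than in a hand-written loop.

```python
import time
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    """Raised when a ceiling would be crossed; .reason names which one."""

    def __init__(self, reason: str):
        super().__init__(f"budget exceeded: {reason}")
        self.reason = reason


@dataclass
class BudgetState:
    max_tokens: int
    max_seconds: float
    max_usd: float
    used_tokens: int = 0          # running counter (assumed field)
    used_usd: float = 0.0         # running counter (assumed field)
    started_at: float = field(default_factory=time.monotonic)


def check_budget(budget, estimated_tokens, usd_per_1k_tokens, now=None):
    """Raise BudgetExceeded if the next tool call would cross a ceiling."""
    now = time.monotonic() if now is None else now
    projected_usd = budget.used_usd + estimated_tokens / 1000 * usd_per_1k_tokens
    # Fixed order: tokens, then time, then cost.
    if budget.used_tokens + estimated_tokens > budget.max_tokens:
        raise BudgetExceeded("token")
    if now - budget.started_at > budget.max_seconds:
        raise BudgetExceeded("time")
    if projected_usd > budget.max_usd:
        raise BudgetExceeded("cost")


def run_agent(budget, planned_calls, usd_per_1k_tokens=0.003):
    """Toy loop: planned_calls is a list of (tool_name, estimated_tokens)."""
    results = []
    try:
        for tool_name, estimated_tokens in planned_calls:
            check_budget(budget, estimated_tokens, usd_per_1k_tokens)  # the gate
            results.append(f"{tool_name}: ok")  # stand-in for the real tool call
            budget.used_tokens += estimated_tokens
            budget.used_usd += estimated_tokens / 1000 * usd_per_1k_tokens
    except BudgetExceeded as exc:
        # Return what was finished instead of an empty response.
        return {"status": f"stopped due to {exc.reason} budget limit", "partial": results}
    return {"status": "completed", "partial": results}


budget = BudgetState(max_tokens=50000, max_seconds=120, max_usd=0.50)
outcome = run_agent(budget, [("search", 20000), ("summarize", 20000), ("search", 20000)])
print(outcome)
```

The third planned call would push the total to 60,000 tokens, so the gate stops the loop before it runs and the handler returns the two results that were already completed.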

Real-World Examples

Two patterns appear most often in production.

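First, manage per-model unit prices in a dictionary: PRICE_PER_1K = {'claude-3-5-sonnet': 0.003, 'claude-opus-4': 0.015}. At call time, look up the price by model name. This way the gate logic never needs to change when you switch models. Note that a single number per model covers only one direction: input and output tokens are billed at different rates, so a production table needs a separate output price for each model.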
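A price lookup along these lines might look as follows. The helper name estimate_usd is hypothetical, and the prices are simply the ones quoted in this article, so treat them as placeholders to be checked against current pricing.

```python
# Values copied from the article; placeholders, not a maintained price list.
PRICE_PER_1K = {"claude-3-5-sonnet": 0.003, "claude-opus-4": 0.015}


def estimate_usd(model: str, tokens: int) -> float:
    """Project the cost of `tokens` tokens for `model` (hypothetical helper)."""
    try:
        unit_price = PRICE_PER_1K[model]
    except KeyError:
        # Failing loudly beats silently treating an unknown model as free.
        raise ValueError(f"no unit price configured for model {model!r}") from None
    return tokens / 1000 * unit_price


print(estimate_usd("claude-3-5-sonnet", 20000))
print(estimate_usd("claude-opus-4", 20000))
```

Raising on an unknown model is a deliberate choice here: a missing table entry should stop the session rather than let the cost ceiling go unenforced.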

Second, adjust the time ceiling dynamically based on user tier: budget.max_seconds = 300 if context.user_tier == 'premium' else 60. Set this once at session initialization and it applies for the entire loop.

Also log the burn rate as a percentage: spent_pct = round(budget.used_tokens / budget.max_tokens * 100, 1). Including this in structured logs lets you aggregate averages over time and calibrate your limits against real usage data.
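The tier-based ceiling and the burn-rate log can be combined in a few lines. This is a sketch only: max_seconds_for, burn_rate_record, the session_id field, and the JSON field names are assumptions made for illustration, and SimpleNamespace stands in for the real BudgetState.

```python
import json
from types import SimpleNamespace


def max_seconds_for(user_tier: str) -> int:
    """Pick the time ceiling once, at session initialization."""
    return 300 if user_tier == "premium" else 60


def burn_rate_record(budget, session_id: str) -> str:
    """Build one structured log line with the token burn rate as a percentage."""
    spent_pct = round(budget.used_tokens / budget.max_tokens * 100, 1)
    return json.dumps({
        "event": "budget_burn",
        "session_id": session_id,
        "used_tokens": budget.used_tokens,
        "max_tokens": budget.max_tokens,
        "spent_pct": spent_pct,
    })


# SimpleNamespace stands in for a BudgetState instance.
budget = SimpleNamespace(
    used_tokens=12345,
    max_tokens=50000,
    max_seconds=max_seconds_for("premium"),
)
record = burn_rate_record(budget, "demo-session")
print(record)
```

Emitting one JSON object per line keeps the records easy to aggregate later, for example to compute the average spent_pct per user tier.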

Common Mistakes

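The most frequent mistake is counting only input tokens and ignoring output tokens. Every turn bills both: the text the model generates, including tool-call arguments, counts as output, and each tool response is fed back to the model as input on the next turn. The budget gate must accumulate both input and output tokens for accurate tracking.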
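One way to avoid this mistake is to update the counter from the usage figures reported with each model response. In the sketch below, Usage is a stand-in dataclass that mirrors the input_tokens and output_tokens fields of a response's usage block, and TokenCounter is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    """Stand-in for the usage block reported with a model response."""
    input_tokens: int
    output_tokens: int


@dataclass
class TokenCounter:
    used_tokens: int = 0

    def record(self, usage: Usage) -> None:
        # Count both directions; skipping output_tokens under-reports spend.
        self.used_tokens += usage.input_tokens + usage.output_tokens


counter = TokenCounter()
counter.record(Usage(input_tokens=1200, output_tokens=300))  # model requests a tool call
counter.record(Usage(input_tokens=4700, output_tokens=450))  # tool result comes back as input
print(counter.used_tokens)  # 6650
```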

The second mistake is setting limits so tight that normal tasks get cut off. Start by collecting actual consumption logs for two to three weeks, then set the limit at roughly 130% of the P95 value.
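The calibration rule can be turned into a small calculation. The sketch below assumes you have one total token count per logged session and uses the nearest-rank definition of P95; other percentile definitions give slightly different values. The function names and the sample data are invented for illustration.

```python
import math


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(len(ordered) * 95 / 100)  # 1-based rank
    return ordered[rank - 1]


def recommended_limit(session_token_totals, headroom=1.3):
    """Set the ceiling at roughly 130% of the observed P95."""
    return round(p95(session_token_totals) * headroom)


# Made-up sample: 20 sessions using 1000, 2000, ..., 20000 tokens.
samples = list(range(1000, 21000, 1000))
print(p95(samples))                # 19000
print(recommended_limit(samples))  # 24700
```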

The third mistake is treating BudgetExceeded as a plain error and returning nothing to the user. Always return partial results up to the point of interruption. This dramatically improves perceived reliability.

Checklist

BudgetState initialized with all three ceilings: tokens, seconds, and cost
Evaluation order is tokens first, then time, then cost
Both input and output tokens are added to the counter
BudgetExceeded handler returns partial results rather than an empty response
Model unit price table is kept up to date
Burn rate is logged in structured format
Limit values are periodically recalibrated using real measurement data

FAQ

Q. How do you estimate output tokens before a tool call?

Output tokens cannot be known exactly in advance. In practice, maintain a rolling average of output tokens from the previous three calls and use that as the estimate. Multiplying the average by 1.2 adds a safety buffer that accounts for variance.
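The rolling-average estimate can be sketched with a fixed-size deque. The class name and the default used before any call has been observed are assumptions; the window of three and the 1.2 multiplier come from the answer above.

```python
from collections import deque


class OutputTokenEstimator:
    """Rolling average of recent output-token counts, padded by a safety factor."""

    def __init__(self, window=3, safety=1.2, default=1000):
        self.recent = deque(maxlen=window)  # keeps only the last `window` values
        self.safety = safety
        self.default = default  # assumed fallback before anything is observed

    def observe(self, output_tokens: int) -> None:
        self.recent.append(output_tokens)

    def estimate(self) -> int:
        if not self.recent:
            return self.default
        average = sum(self.recent) / len(self.recent)
        return round(average * self.safety)


estimator = OutputTokenEstimator()
for tokens in (400, 500, 600, 700):
    estimator.observe(tokens)
print(estimator.estimate())  # 720: average of 500, 600, 700 is 600, times 1.2
```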

Q. Does the budget gate add noticeable overhead?

The gate function performs only arithmetic and comparison operations, so its cost is in the microsecond range. Compared to the hundreds of milliseconds required for a real API round trip, the overhead is negligible.

Q. How do you share a budget across multiple parallel agents?

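Store BudgetState in a shared store such as an in-memory counter or an external key-value store, and update it with atomic operations so that the ceiling check and the increment happen as one step; otherwise two agents can both pass the check and overshoot the limit together. Apart from that, the design pattern is identical to the single-agent case; only the storage layer moves outside the process.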
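For agents running as threads in one process, the shared counter can be sketched with a lock that makes the check and the increment a single step. SharedTokenBudget and try_reserve are hypothetical names. With an external key-value store, the same reserve operation would need to be atomic on the store's side, which this in-process sketch does not cover.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class SharedTokenBudget:
    """Token budget shared by several agents in one process."""

    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.used_tokens = 0
        self._lock = threading.Lock()

    def try_reserve(self, tokens: int) -> bool:
        """Atomically check and claim tokens; False means the gate should stop."""
        with self._lock:
            if self.used_tokens + tokens > self.max_tokens:
                return False
            self.used_tokens += tokens
            return True


shared = SharedTokenBudget(max_tokens=10000)


def agent(_):
    """Toy agent: keeps reserving 300 tokens per tool call until refused."""
    granted = 0
    while shared.try_reserve(300):
        granted += 1
    return granted


with ThreadPoolExecutor(max_workers=4) as pool:
    grants = list(pool.map(agent, range(4)))

print(sum(grants), shared.used_tokens)  # 33 9900
```

However the threads interleave, exactly 33 reservations succeed and the total never exceeds the ceiling, because no agent can pass the check without also recording its claim.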

Wrapping Up

A budget gate is the seatbelt of your agent system. Things will usually work fine without it, but when something goes wrong you will be very glad it is there. Three components — a BudgetState object, a check function, and an exception handler — add almost no code complexity while preventing cost overruns and infinite loops at the source. Start by collecting burn-rate logs, then refine your limits with real data, and the gate will become more precise over time.


Seunghyeon's Agentic Lab
