Multi-resource rate limiting for LLM APIs. Reserve tokens before you call, refund what you don't use, stay under the limit across workers.
Works with any LLM provider and any client library — token-throttle limits the rate, not the client.
```bash
pip install "token-throttle[redis,tiktoken]>=0.5.0,<0.6.0"  # OpenAI + Redis (recommended)
pip install "token-throttle[redis]>=0.5.0,<0.6.0"           # Any provider + Redis
pip install "token-throttle>=0.5.0,<0.6.0"                  # Any provider + in-memory
```

```python
from openai import AsyncOpenAI
from token_throttle import create_openai_redis_rate_limiter

client = AsyncOpenAI()
limiter = create_openai_redis_rate_limiter(
    redis_client, rpm=10_000, tpm=2_000_000,
)

# 1. Reserve capacity (blocks until available)
request = dict(model="openai/gpt-4.1", messages=[{"role": "user", "content": "Hi"}])
reservation = await limiter.acquire_capacity_for_request(**request, extra_usage=None)

# 2. Make the API call
response = await client.chat.completions.create(**request)

# 3. Refund unused tokens
await limiter.refund_capacity_from_response(reservation, response)
```

```python
from token_throttle import RateLimiter, Quota, UsageQuotas, RedisBackendBuilder
from token_throttle import PerModelConfig

limiter = RateLimiter(
    lambda model: PerModelConfig(
        quotas=UsageQuotas([
            Quota(metric="requests", limit=1_000, per_seconds=60),
            Quota(metric="input_tokens", limit=80_000, per_seconds=60),
            Quota(metric="output_tokens", limit=20_000, per_seconds=60),
        ]),
    ),
    backend=RedisBackendBuilder(redis_client),
)

# Works with Anthropic, Gemini, local models — anything
reservation = await limiter.acquire_capacity(
    model="claude-sonnet-4-20250514",
    usage={"requests": 1, "input_tokens": 500, "output_tokens": 4_000},
)

response = await call_your_llm(...)  # Use whatever client you want

await limiter.refund_capacity(
    actual_usage={"requests": 1, "input_tokens": 480, "output_tokens": 1_200},
    reservation=reservation,
)
# Unused 2,800 output tokens returned to the pool
```

The problem: You're running parallel LLM calls (batch processing, agents, multiple services sharing a key). Simple rate limiters waste throughput because they reserve worst-case tokens and never give them back. You hit 429s or crawl at half capacity.
The solution: Reserve before you call, refund after. Actual usage is tracked, not estimated maximums.
| Feature | Details |
|---|---|
| Multi-resource limits | Limit requests, tokens, input/output tokens — simultaneously, each with its own quota |
| Multiple time windows | e.g., 1,000 req/min AND 10,000 req/day on the same resource |
| Reserve & refund | Reserve max expected usage upfront, refund the difference after the call completes |
| Distributed | Redis backend with atomic locks — safe across workers and processes |
| Per-model quotas | Different limits per model via `model_family`; the built-in OpenAI helper auto-groups date-suffixed variants (e.g. `gpt-4o-20241203` → `gpt-4o`) |
| Pluggable | Bring your own backend (ships with Redis and in-memory). Sync and async APIs |
| Observability | Callbacks for wait-start, wait-end, consume, refund, and missing-state events |
token-throttle implements a token bucket algorithm (capacity refills linearly over time, capped at the quota limit).
- Acquire — blocks until enough capacity is available, then atomically reserves it
- Call — make your API request with any client
- Refund — report actual usage; unused tokens return to the pool immediately
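The acquire/refund cycle on a single linear-refill bucket can be sketched in a few lines. This is an illustrative standalone class, not token-throttle's internals (the names `TokenBucket`, `try_acquire`, and `refund` are made up here):

```python
import time

class TokenBucket:
    """Capacity refills linearly over time, capped at the quota limit."""

    def __init__(self, limit: float, per_seconds: float) -> None:
        self.limit = limit
        self.rate = limit / per_seconds  # units refilled per second
        self.level = limit               # start with a full bucket
        self.updated = time.monotonic()

    def _refill(self) -> None:
        now = time.monotonic()
        self.level = min(self.limit, self.level + (now - self.updated) * self.rate)
        self.updated = now

    def try_acquire(self, amount: float) -> bool:
        """Reserve `amount` if available; returns False instead of blocking."""
        self._refill()
        if self.level >= amount:
            self.level -= amount
            return True
        return False

    def refund(self, amount: float) -> None:
        """Return unused capacity to the pool, never exceeding the limit."""
        self._refill()
        self.level = min(self.limit, self.level + amount)
```

Refunding is what keeps throughput high: a worst-case reservation of 4,000 output tokens that actually used 1,200 frees 2,800 tokens for other workers immediately, instead of waiting for the window to refill.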
The Redis backend uses sorted locking to prevent deadlocks when acquiring multiple resource buckets simultaneously.
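The idea behind sorted locking: if every worker takes per-bucket locks in one global order (sorted key order), two workers can never hold them in conflicting order, so circular waits are impossible. A minimal in-process sketch (the lock registry and key format are illustrative, not the library's Redis implementation):

```python
from contextlib import ExitStack
from threading import Lock

# Hypothetical per-bucket locks keyed by "model:metric:window".
BUCKET_LOCKS: dict[str, Lock] = {
    "gpt-4o:requests:60": Lock(),
    "gpt-4o:tokens:60": Lock(),
}

def acquire_buckets(keys: list[str]) -> ExitStack:
    """Lock several buckets together by taking locks in sorted key order."""
    stack = ExitStack()
    for key in sorted(keys):  # the global order that prevents deadlock
        stack.enter_context(BUCKET_LOCKS[key])
    return stack

# Callers may pass keys in any order; locks are always taken in the same order:
with acquire_buckets(["gpt-4o:tokens:60", "gpt-4o:requests:60"]):
    pass  # reserve capacity in every bucket, then release together
```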
```python
from token_throttle import Quota, UsageQuotas, SecondsIn

quotas = UsageQuotas([
    Quota(metric="requests", limit=2_000, per_seconds=SecondsIn.MINUTE),
    Quota(metric="tokens", limit=3_000_000, per_seconds=SecondsIn.MINUTE),
    Quota(metric="requests", limit=10_000_000, per_seconds=SecondsIn.DAY),
])
```

`per_seconds` accepts integer seconds. Use `SecondsIn.MINUTE` (60), `SecondsIn.HOUR` (3600), `SecondsIn.DAY` (86400), or any integer.
```python
def get_config(model_name: str) -> PerModelConfig:
    if model_name.startswith("gpt"):
        return PerModelConfig(
            quotas=UsageQuotas([
                Quota(metric="requests", limit=10_000, per_seconds=60),
                Quota(metric="tokens", limit=2_000_000, per_seconds=60),
            ]),
            usage_counter=OpenAIUsageCounter(),  # auto-counts tokens from messages
            model_family=openai_model_family_getter(model_name),
        )
    # ... other providers

limiter = RateLimiter(get_config, backend=RedisBackendBuilder(redis_client))
```

```python
# Distributed (multiple workers/processes)
from token_throttle import RedisBackendBuilder
backend = RedisBackendBuilder(redis_client)

# Single process (no Redis needed)
from token_throttle import MemoryBackendBuilder
backend = MemoryBackendBuilder()
```

Both backends are available in sync (`SyncRedisBackendBuilder`, `SyncMemoryBackendBuilder`) and async variants.
Adjust bucket limits at runtime without rebuilding the limiter — useful for adaptive rate limiting (e.g., reacting to `x-ratelimit-*` response headers):

```python
# After at least one acquire/record call for this model:
await limiter.set_max_capacity(
    model="gpt-4o",
    metric="tokens",
    per_seconds=60,
    value=5000,
)
```

For Redis backends the new limit is written to Redis, so all processes sharing the same Redis see the change within ~1 second.
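The header-driven pattern might look like this. The helper name, the header name (`x-ratelimit-limit-tokens`), and the plain-dict headers are assumptions for illustration — providers expose different header names:

```python
async def sync_limit_from_headers(limiter, model: str, headers: dict[str, str]) -> None:
    """Mirror the provider's advertised per-minute token limit into the local bucket."""
    advertised = headers.get("x-ratelimit-limit-tokens")  # header name varies by provider
    if advertised is None:
        return
    await limiter.set_max_capacity(
        model=model,
        metric="tokens",
        per_seconds=60,
        value=int(advertised),
    )
```

Called after each response, this keeps local buckets in step with limits the provider changes server-side.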
```python
from token_throttle import SyncRateLimiter, SyncMemoryBackendBuilder

limiter = SyncRateLimiter(get_config, backend=SyncMemoryBackendBuilder())

reservation = limiter.acquire_capacity(model="gpt-4.1", usage={"requests": 1, "tokens": 500})
response = call_llm_sync(...)
limiter.refund_capacity(actual_usage={"requests": 1, "tokens": 320}, reservation=reservation)
```

- Originally a rewrite of openlimit