One env var. 60–80% cheaper dev loops. A localhost proxy that caches identical OpenAI and Anthropic API calls on disk and replays them for free. Works with every existing tool.
You're iterating on a prompt. You run the same call 40 times tweaking wording. That's 40× the spend on identical requests.
Or: you have a long-running script that re-fetches the same tool definitions every run during development. Or: your test suite calls the API.
llm-cache-proxy sits on localhost, speaks the OpenAI and Anthropic REST
protocols, and caches every successful response in a single SQLite file.
Identical requests (same method, path, body) get served from disk —
no network call, no spend.
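As a rough illustration of the hit/miss flow described above, here is a minimal sketch using Python's built-in sqlite3 module. The table layout and the `open_cache` and `fetch` names are hypothetical, not the proxy's actual internals, and `key` stands in for a hash of the request's method, path, and body.

```python
import sqlite3


def open_cache(path=":memory:"):
    """Open (or create) a cache database with one table of responses."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS responses "
        "(cache_key TEXT PRIMARY KEY, body BLOB NOT NULL)"
    )
    return conn


def fetch(conn, key, call_upstream):
    """Return (cache_status, body), calling upstream only on a miss.

    call_upstream is a zero-argument callable returning (status, body).
    """
    row = conn.execute(
        "SELECT body FROM responses WHERE cache_key = ?", (key,)
    ).fetchone()
    if row is not None:
        return "HIT", row[0]  # served from disk, no network call
    status, body = call_upstream()
    if 200 <= status < 300:  # only successful responses are stored
        conn.execute(
            "INSERT OR REPLACE INTO responses (cache_key, body) VALUES (?, ?)",
            (key, body),
        )
        conn.commit()
    return "MISS", body
```

Calling `fetch` twice with the same key should reach the upstream only once; an error response is passed through and never stored.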
It works with every tool because you only change one env var:
export OPENAI_BASE_URL=http://127.0.0.1:9001/openai/v1
export ANTHROPIC_BASE_URL=http://127.0.0.1:9001/anthropic

Cursor, Claude Code, your scripts, your tests, the OpenAI Python SDK, and the Anthropic SDK all start using the cache automatically.
pip install llm-cache-proxy
llm-cache-proxy
# or
uvx llm-cache-proxy

Default port: 9001. Default cache: ~/.cache/llm-cache-proxy/cache.db.
In whatever shell launches your tool / script:
# OpenAI
export OPENAI_BASE_URL=http://127.0.0.1:9001/openai/v1
# Anthropic
export ANTHROPIC_BASE_URL=http://127.0.0.1:9001/anthropic
# now run anything: Cursor, your script, pytest, etc.

Responses include an X-LLM-Cache: HIT|MISS header so you can see what happened.
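If you want to check the header from a script, something like the following standard-library sketch should work. The helper names (`proxied_chat_request`, `cache_status`, `send`) are made up for illustration, and the request assumes the usual OpenAI chat completions payload with the key taken from OPENAI_API_KEY.

```python
import json
import os
import urllib.request


def proxied_chat_request(payload, base_url=None):
    """Build a POST to the chat completions endpoint behind the proxy."""
    base = base_url or os.environ.get(
        "OPENAI_BASE_URL", "http://127.0.0.1:9001/openai/v1"
    )
    return urllib.request.Request(
        base.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + os.environ.get("OPENAI_API_KEY", ""),
        },
        method="POST",
    )


def cache_status(headers):
    """Return 'HIT' or 'MISS' from response headers, or None if absent."""
    for name, value in headers.items():
        if name.lower() == "x-llm-cache":
            return value.strip().upper()
    return None


def send(request):
    """Send the request and return (cache_status, decoded JSON body)."""
    with urllib.request.urlopen(request) as response:
        return cache_status(response.headers), json.load(response)
```

With the proxy running, calling `send(proxied_chat_request(payload))` twice with the same payload should report MISS the first time and HIT the second.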
Bypass the cache for a single request:
curl -H "x-llm-cache-bypass: 1" http://127.0.0.1:9001/openai/v1/chat/completions ...

Fetch cache statistics:
curl http://127.0.0.1:9001/stats
{
  "hits": 312,
  "misses": 87,
  "bytes_served_from_cache": 4182404,
  "entries": 87,
  "cached_response_bytes": 1205211,
  "by_model": {"gpt-4o": 41, "claude-sonnet-4": 46}
}

Clear the cache:
curl -X DELETE http://127.0.0.1:9001/cache
curl -X DELETE http://127.0.0.1:9001/stats

| Env var | Default | Description |
|---|---|---|
| LLM_CACHE_PORT | 9001 | Listen port. |
| LLM_CACHE_HOST | 127.0.0.1 | Listen host. |
| LLM_CACHE_DIR | ~/.cache/llm-cache-proxy | Where to put the SQLite file. |
| LLM_CACHE_TTL | 0 | TTL in seconds (0 = forever). |
| LLM_CACHE_TIMEOUT | 300 | Upstream request timeout. |
| OPENAI_UPSTREAM | https://api.openai.com | Override the upstream. |
| ANTHROPIC_UPSTREAM | https://api.anthropic.com | Override the upstream. |
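To make the defaults in the table concrete, here is a small sketch of how such settings could be read from the environment, plus the "0 = forever" TTL rule. `ProxyConfig`, `load_config`, and `is_expired` are hypothetical names; how the proxy actually parses these variables is not shown here, and treating the timeout as a float is an assumption.

```python
import os
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class ProxyConfig:
    port: int
    host: str
    cache_dir: Path
    ttl: int
    timeout: float
    openai_upstream: str
    anthropic_upstream: str


def load_config(env=None):
    """Read settings from env (a mapping), falling back to the documented defaults."""
    env = os.environ if env is None else env
    cache_dir = env.get("LLM_CACHE_DIR", "~/.cache/llm-cache-proxy")
    return ProxyConfig(
        port=int(env.get("LLM_CACHE_PORT", "9001")),
        host=env.get("LLM_CACHE_HOST", "127.0.0.1"),
        cache_dir=Path(os.path.expanduser(cache_dir)),
        ttl=int(env.get("LLM_CACHE_TTL", "0")),
        timeout=float(env.get("LLM_CACHE_TIMEOUT", "300")),
        openai_upstream=env.get("OPENAI_UPSTREAM", "https://api.openai.com"),
        anthropic_upstream=env.get(
            "ANTHROPIC_UPSTREAM", "https://api.anthropic.com"
        ),
    )


def is_expired(stored_at, now, ttl):
    """Return True if an entry is older than ttl seconds; a ttl of 0 never expires."""
    return ttl > 0 and (now - stored_at) > ttl
```

For example, `load_config({"LLM_CACHE_TTL": "3600"})` keeps every default except the TTL, and an entry stored more than an hour ago would then count as expired.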
Per-request:
- Header x-llm-cache-bypass: 1 skips both read and write for this call.
- Header x-llm-cache-extra-key: <string> adds an extra dimension to the cache key (e.g., a user id, a session id).
sha256(method + "|" + path + "|" + body + optional extra_key)
Method + path + body is enough to make identical requests collide
deterministically. Headers are not included in the default key (so API
key rotation doesn't invalidate the cache) — set x-llm-cache-extra-key
if you want extra dimensions.
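The key formula above can be sketched in a few lines of Python. The function name `cache_key`, the UTF-8 encoding of the string parts, and the hex digest are assumptions for illustration; only the order of the parts comes from the formula.

```python
import hashlib


def cache_key(method, path, body, extra_key=""):
    """Hash method, path, and raw body bytes, plus an optional extra key."""
    material = (
        method.encode("utf-8")
        + b"|"
        + path.encode("utf-8")
        + b"|"
        + body
        + extra_key.encode("utf-8")
    )
    return hashlib.sha256(material).hexdigest()


body = b'{"model": "gpt-4o", "messages": []}'
same_a = cache_key("POST", "/openai/v1/chat/completions", body)
same_b = cache_key("POST", "/openai/v1/chat/completions", body)
per_user = cache_key("POST", "/openai/v1/chat/completions", body, "user-42")
```

Here `same_a` and `same_b` are equal, while `per_user` differs because of the extra key. If the raw body bytes are hashed as assumed, two JSON bodies that differ only in whitespace or key order would produce different keys, so keep request serialization stable to get hits.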
Only 2xx responses are cached. Errors always go through.
- Streaming responses: when the upstream returns text/event-stream, the full SSE body is captured and replayed verbatim on cache hit. That works, but you lose the per-token streaming feel.
- Tool / function-calling responses cache fine: the whole completion object is one entry.
- Don't expose this proxy to the public internet — it has no auth and your API key flows through it.
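On the streaming note above: because a cached SSE body arrives in one piece, a client still sees the same sequence of events, just all at once. The sketch below splits a complete body into its data payloads; `iter_sse_data` is a hypothetical helper, the sample payloads are invented, and it handles only `data:` fields.

```python
def iter_sse_data(body):
    """Yield the data payload of each event in a complete SSE body."""
    for block in body.replace("\r\n", "\n").split("\n\n"):
        data_lines = []
        for line in block.split("\n"):
            if line.startswith("data:"):
                value = line[5:]
                # Per the SSE format, one leading space after the colon is dropped.
                if value.startswith(" "):
                    value = value[1:]
                data_lines.append(value)
        if data_lines:
            yield "\n".join(data_lines)


replayed = 'data: {"delta": "Hel"}\n\ndata: {"delta": "lo"}\n\ndata: [DONE]\n\n'
events = list(iter_sse_data(replayed))
```

Here `events` holds three payloads in their original order, which is what a streaming client would reassemble whether the events trickle in live or are replayed from the cache.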
- mcp-rec — VCR for MCP servers (similar idea, MCP layer).
- ai-first-scraper — clean Markdown for LLM agents.
MIT © yubinkim444