A proxy for self-hosted LLMs that caps how much the model thinks per query.
Reasoning models burn thousands of tokens on internal monologue before answering. A greeting gets the same GPU time as a proof. ThinkBudget fixes that.
"What is 2+2?" → TRIVIAL → 32 thinking tokens → $0.000002
"Debug this race → COMPLEX → 2048 thinking tokens → $0.000180
condition..."
It classifies the query, sets a token budget, forwards to your backend, and tracks GPU cost. No LLM call for classification. Under a millisecond.
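That sub-millisecond classification can be done with plain string heuristics. A minimal sketch of the idea; the scoring rules below are invented for illustration, the real scorer lives in `classifier.py`:

```python
import re

# Tier names match the README table; the weights here are illustrative.
TIERS = ["TRIVIAL", "SIMPLE", "MODERATE", "COMPLEX", "DEEP"]

REASONING = re.compile(r"\b(prove|debug|design|derive|optimi[sz]e)\b", re.I)

def classify(query: str) -> str:
    """Score a query with cheap string features; no model call."""
    score = 0
    if REASONING.search(query):          # reasoning verbs push the tier up
        score += 2
    if len(query.split()) > 8:           # longer prompts tend to need more thought
        score += 1
    if any(tok in query for tok in ("```", "stack trace", "theorem")):
        score += 1                       # code or math context
    return TIERS[min(score, len(TIERS) - 1)]

print(classify("Hello"))   # → TRIVIAL
```

Because everything is string matching, the whole pass is a few microseconds per query.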
DeepSeek-R1, Qwen-QwQ, and similar models wrap their reasoning in `<think>` tags. Without a budget, every query gets the same treatment. "Hello" triggers 4,000 tokens of internal deliberation. You pay for all of it.
Several papers describe adaptive reasoning (Ares, AutoThink, Conformal Thinking). None ship a tool you can deploy.
```
┌─────────────┐      ┌──────────────────┐      ┌─────────────────┐
│ Application │─────▶│   ThinkBudget    │─────▶│  vLLM / SGLang  │
│ (any client)│◀─────│      Proxy       │◀─────│    / Ollama     │
└─────────────┘      └──────────────────┘      └─────────────────┘
```
- Classify. Heuristic scorer reads the query. No model call. Assigns a tier.
- Budget. Sets a thinking token cap: 32, 128, 512, 2048, or 8192.
- Forward. Sends the request to your backend with the cap applied.
- Enforce. During streaming, injects `</think>` when the budget runs out.
- Track. Samples GPU power via `pynvml`. Records cost per query in dollars and joules.
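The Enforce step can be pictured with a toy stream filter. Two simplifying assumptions for illustration only: each chunk counts as one token, and the `<think>` tags arrive as standalone chunks; the real proxy operates on SSE deltas from the backend.

```python
from typing import Iterable, Iterator

def enforce_budget(chunks: Iterable[str], budget: int) -> Iterator[str]:
    """Stream chunks through, but once the thinking budget is spent,
    inject </think> and drop the rest of the reasoning."""
    in_think = False
    truncated = False
    used = 0
    for chunk in chunks:
        if chunk == "<think>":
            in_think = True
            yield chunk
        elif chunk == "</think>":
            in_think = False
            if not truncated:          # skip if we already closed the block
                yield chunk
        elif in_think:
            if truncated:
                continue               # overflow reasoning is discarded
            used += 1
            if used > budget:
                truncated = True
                yield "</think>"       # cap hit: close the reasoning block
            else:
                yield chunk
        else:
            yield chunk                # answer tokens always pass through

stream = ["<think>", "step1", "step2", "step3", "</think>", "answer"]
print(list(enforce_budget(stream, budget=2)))
# → ['<think>', 'step1', 'step2', '</think>', 'answer']
```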
```shell
# Install the CLI
uv tool install thinkbudget

# Or add to a project as a dependency
uv add thinkbudget

# With GPU monitoring
uv tool install 'thinkbudget[gpu]'

# One-off run without installing
uvx thinkbudget classify "Hello"
```

Don't have uv? Install it or use pip: `pip install thinkbudget`
```shell
thinkbudget serve \
  --backend-url http://localhost:8000 \
  --model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
  --gpu-cost 0.39 \
  --gpu-name "RTX 4090"
```

Point your app at http://localhost:9100 instead of the backend:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:9100/v1", api_key="unused")

# 32-token thinking budget
client.chat.completions.create(
    model="thinkbudget",
    messages=[{"role": "user", "content": "Hello"}],
)

# 2048-token thinking budget
client.chat.completions.create(
    model="thinkbudget",
    messages=[{"role": "user", "content": "Design a distributed task queue with at-least-once delivery"}],
)
```

Every response carries cost headers:
```
X-ThinkBudget-Tier: complex
X-ThinkBudget-Budget: 2048
X-ThinkBudget-Thinking-Tokens: 1847
X-ThinkBudget-Cost: $0.00018200
X-ThinkBudget-Energy-J: 42.3100
```
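Both figures are straightforward accounting: energy is average power times wall time, and cost is wall time at the hourly GPU rate (at $0.39/hour, the $0.000182 above implies roughly 1.7 s of GPU time). A sketch of the arithmetic with hypothetical numbers; the pynvml sampling itself is omitted:

```python
def query_cost(avg_power_w: float, seconds: float, rate_per_hour: float) -> tuple[float, float]:
    """Energy in joules (watts x seconds) and dollar cost (seconds x hourly rate)."""
    energy_j = avg_power_w * seconds
    dollars = rate_per_hour * seconds / 3600.0
    return energy_j, dollars

# Hypothetical query: 250 W average draw for 1.68 s at $0.39/hour.
energy, dollars = query_cost(250.0, 1.68, 0.39)
print(f"{energy:.1f} J, ${dollars:.8f}")   # → 420.0 J, $0.00018200
```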
http://localhost:9100/dashboard shows a live feed: queries, tiers, budgets, tokens used, GPU power, cost per query, cumulative savings.
```shell
# With thinkbudget installed
thinkbudget classify "What is the capital of France?"
# Tier: SIMPLE | Budget: 128 tokens

# Or as a one-off
uvx thinkbudget classify "Prove the halting problem is undecidable"
# Tier: DEEP | Budget: 8,192 tokens
```

| Tier | Budget | Example |
|---|---|---|
| TRIVIAL | 32 | "Hello" |
| SIMPLE | 128 | "What is the capital of France?" |
| MODERATE | 512 | "Compare REST vs GraphQL" |
| COMPLEX | 2,048 | "Debug this race condition" |
| DEEP | 8,192 | "Prove the halting problem is undecidable" |
All budgets are configurable. Per tier, per user, per session.
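The override chain can be pictured as layered dictionaries, with later scopes shadowing earlier ones. An illustrative model only; the actual configuration keys live in `config.py`:

```python
# Defaults from the tier table above.
DEFAULT_BUDGETS = {"TRIVIAL": 32, "SIMPLE": 128, "MODERATE": 512,
                   "COMPLEX": 2048, "DEEP": 8192}

def budget_for(tier: str, *overrides: dict) -> int:
    """Resolve a tier's budget: defaults first, then e.g. a per-user map,
    then a per-session map, each shadowing the previous layer."""
    merged = dict(DEFAULT_BUDGETS)
    for layer in overrides:
        merged.update(layer)
    return merged[tier]

print(budget_for("COMPLEX"))                      # → 2048
print(budget_for("COMPLEX", {"COMPLEX": 1024}))   # → 1024
```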
Built for CloudRift GPU instances.
```shell
git clone https://github.com/Siddhant-K-code/ThinkBudget
cd ThinkBudget
./cloudrift/run.sh
```

This starts vLLM with DeepSeek-R1-Distill-Qwen-7B and ThinkBudget in front of it. One command.
For a different model or GPU:
```shell
MODEL=Qwen/QwQ-32B GPU_COST=0.65 GPU_NAME="RTX 5090" ./cloudrift/run.sh
```

Docker works too:
```shell
docker build -t thinkbudget .
docker run --gpus all -p 9100:9100 -p 8000:8000 thinkbudget
```

The benchmark suite holds 85 queries across all five tiers. Run them against your backend with and without ThinkBudget:
```shell
python benchmarks/run_benchmark.py --mode compare \
  --backend-url http://localhost:8000 \
  --proxy-url http://localhost:9100 \
  --model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
```

Generate a markdown report:
```shell
python benchmarks/compare.py \
  benchmarks/results/summary_baseline_*.json \
  benchmarks/results/summary_budgeted_*.json \
  benchmarks/results/report.md
```

| Endpoint | Method | What it does |
|---|---|---|
| /v1/chat/completions | POST | Proxies with budget control |
| /v1/models | GET | Lists backend models |
| /health | GET | Status and GPU info |

| Endpoint | Method | What it does |
|---|---|---|
| /api/stats | GET | Totals and distributions |
| /api/history | GET | Recent queries |
| /api/gpu | GET | Live GPU metrics |
| /api/classify | POST | Classify without forwarding |
| Variable | Default |
|---|---|
| THINKBUDGET_BACKEND_URL | http://localhost:8000 |
| THINKBUDGET_MODEL | default |
| THINKBUDGET_PORT | 9100 |
| THINKBUDGET_GPU_COST_PER_HOUR | 0.39 |
| THINKBUDGET_GPU_NAME | RTX 4090 |
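Resolving these variables amounts to env lookups that fall back to the defaults in the table. An illustrative sketch, not the actual loader in `config.py`:

```python
import os
from dataclasses import dataclass

@dataclass
class Config:
    backend_url: str
    model: str
    port: int
    gpu_cost_per_hour: float
    gpu_name: str

def load_config(env=None) -> Config:
    """Read each THINKBUDGET_* variable, falling back to the documented default."""
    env = os.environ if env is None else env
    return Config(
        backend_url=env.get("THINKBUDGET_BACKEND_URL", "http://localhost:8000"),
        model=env.get("THINKBUDGET_MODEL", "default"),
        port=int(env.get("THINKBUDGET_PORT", "9100")),
        gpu_cost_per_hour=float(env.get("THINKBUDGET_GPU_COST_PER_HOUR", "0.39")),
        gpu_name=env.get("THINKBUDGET_GPU_NAME", "RTX 4090"),
    )

print(load_config({}).port)   # → 9100
```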
```
src/thinkbudget/
  classifier.py      # Scores query complexity. No LLM call.
  budget.py          # Sets and enforces token budgets.
  gpu_monitor.py     # Reads GPU power via pynvml. Computes cost.
  proxy.py           # OpenAI-compatible proxy. Streams with enforcement.
  dashboard.py       # Serves the live dashboard.
  models.py          # Data types.
  config.py          # Loads config from file or env.
  cli.py             # Entry point.
benchmarks/
  datasets/          # 85 queries, 5 tiers.
  run_benchmark.py   # Runs baseline vs budgeted.
  compare.py         # Generates comparison report.
cloudrift/
  run.sh             # One-command GPU deploy.
```
See ARCHITECTURE.md for the full design.
ThinkBudget fits between Distill and LLMTraceFX:

```
Distill     → fewer input tokens    (clean context)
ThinkBudget → fewer thinking tokens (adaptive budget)
LLMTraceFX  → faster inference      (GPU kernel profiling)
```
AGPL-3.0