Teaching a model to become the router β replacing hand-crafted heuristics with learned tool-selection behavior via QLoRA distillation.
π Read the blog post: From Heuristics to Fine-Tuning
Autonomous AI agents need to decide which tool to call for every user query. Most implementations rely on:
- β Regex/keyword matching (brittle, unmaintainable)
- β Zero-shot LLM prompting (expensive, slow, inconsistent)
- β Embedding similarity (loses argument extraction)
ToolForge solves this by fine-tuning a small LLM (7-8B) via QLoRA on synthetic tool-call traces, achieving 86% tool-selection accuracy with sub-second latency β replacing a heuristic router with a learned one.
| Run | Base Model | LoRA r | LR | Test Accuracy | Eval Loss |
|---|---|---|---|---|---|
| π₯ qwen7b-r64 | Qwen2.5-7B-Instruct | 64 | 2e-4 | 86.2% | 0.141 |
| π₯ mistral-r64 | Mistral-7B-Instruct-v0.3 | 64 | 2e-4 | 82.8% | 0.670 |
| π₯ mistral-r16 | Mistral-7B-Instruct-v0.3 | 16 | 2e-4 | 81.9% | 0.648 |
| β mistral-lr5e4 | Mistral-7B-Instruct-v0.3 | 64 | 5e-4 | 60.3% | 0.730 |
Note: True accuracy is estimated at ~92%+ after accounting for noisy teacher labels in the test set (model correctly routes queries that were mislabeled by the teacher).
| Tool | Accuracy | Tool | Accuracy |
|---|---|---|---|
| datetime | 100% | web_search | 91.7% |
| unit_converter | 100% | wikipedia | 86.7% |
| web_reader | 100% | translate | 80.0% |
| calculator | 94.1% | multi_tool | 50.0% |
| dictionary | 93.8% | no_tool | 41.7% |
| weather | 92.3% |
- 7/9 tools above 90% β single-tool routing is near-production quality
- Adapter size has minimal impact β r=16 (81.9%) vs r=64 (82.8%); smaller adapter is deployable for efficiency
- Learning rate is critical β 5e-4 causes divergence; 2e-4 is the sweet spot
- Student surpasses teacher β the fine-tuned model correctly routes queries that Gemini mislabeled in the training set
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β ToolForge Pipeline β
β β
β Phase 1: Data Generation β
β ββββββββββββββββ ββββββββββββββββ ββββββββββββββββββ β
β β Template β + β Gemini β β β 1,173 labeled β β
β β Generator β β Teacher β β examples β β
β β (498 seed) β β (679 dist.) β β (train/val/ β β
β β β β flash+lite β β test/hard) β β
β ββββββββββββββββ ββββββββββββββββ ββββββββββββββββββ β
β β
β Phase 2: QLoRA Training β
β ββββββββββββββββ ββββββββββββββββ ββββββββββββββββββ β
β β Base Model β + β LoRA r=64 β β β Fine-tuned β β
β β (4-bit NF4) β β Adapter β β Router β β
β β Qwen/Mistral β β ~335-646 MB β β 86.2% acc β β
β ββββββββββββββββ ββββββββββββββββ ββββββββββββββββββ β
β β
β Phase 3: Evaluation β
β ββββββββββββββββ ββββββββββββββββ ββββββββββββββββββ β
β β Tool Acc. β β Per-Categoryβ β W&B β β
β β Arg Match β β Breakdown β β Dashboard β β
β β Multi-Tool β β Error β β 4 ablation β β
β β Latency β β Analysis β β runs β β
β ββββββββββββββββ ββββββββββββββββ ββββββββββββββββββ β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
The model learns to route queries to these tools (or respond directly):
| Tool | Description | Input Schema |
|---|---|---|
web_search |
Search the internet | {query: str} |
calculator |
Mathematical expressions | {expression: str} |
weather |
Current weather data | {location: str} |
wikipedia |
Encyclopedia lookup | {query: str} |
datetime |
Date/time operations | {action: str, ...} |
dictionary |
Word definitions | {word: str} |
translate |
Language translation | {text: str, to_lang: str} |
unit_converter |
Unit conversion | {value: float, from: str, to: str} |
web_reader |
Extract webpage content | {url: str} |
Plus no_tool (direct response) and multi_tool (chained calls).
toolforge/
βββ README.md
βββ requirements.txt
βββ configs/
β βββ mistral_r64.yaml # Default training config
β βββ mistral_r16.yaml # Small adapter ablation
β βββ llama_r64.yaml # Alternative base model
βββ data/
β βββ synthetic/
β β βββ queries.json # 1,894 generated queries
β β βββ teacher.jsonl # 679 Gemini-labeled examples
β βββ train.jsonl # 918 training examples
β βββ val.jsonl # 114 validation examples
β βββ test.jsonl # 116 test examples
β βββ hard_test.jsonl # 25 multi-tool edge cases
βββ src/
β βββ data_gen/
β β βββ template_generator.py # Deterministic seed data (498 examples)
β β βββ teacher_labeler.py # Gemini distillation with multi-key rotation
β β βββ build_dataset.py # Merge, dedup, split into train/val/test
β βββ training/
β β βββ train.py # QLoRA fine-tuning with SFTTrainer
β β βββ merge.py # LoRA β base model merge for deployment
β βββ eval/
β βββ evaluate.py # Tool accuracy, F1, per-category breakdown
βββ kaggle_ablation.py # Self-contained Kaggle notebook with W&B
βββ kaggle_notebook.py # Single-run training notebook
# Install dependencies
pip install -r requirements.txt
# Generate seed queries + label with Gemini
# (requires API keys in .env β get free keys at aistudio.google.com)
python -m src.data_gen.teacher_labeler --n 2500
# Build final dataset splits
python -m src.data_gen.build_dataset- Upload
data/*.jsonlas a Kaggle Dataset - Create a new notebook with GPU T4 enabled
- Paste cells from
kaggle_ablation.pyand run
# Or train locally with a GPU
python -m src.training.train --config configs/mistral_r64.yamlpython -m src.eval.evaluate \
--checkpoint checkpoints/qwen7b-r64-lr2e4/final \
--test-set data/test.jsonl| Source | Count | Method | Quality |
|---|---|---|---|
| Template Generator | 498 | Deterministic rules, 100% clean labels | βββ |
| Gemini Distillation | 679 | gemini-2.5-flash + flash-lite function calling |
ββ |
The teacher labeler (teacher_labeler.py) is designed for zero-cost, zero-data-loss operation:
- Multi-key round-robin: 6 API keys Γ 2 models = 12 independent quota slots
- Incremental saves: Every label is flushed to disk immediately
- Smart retry logic: Distinguishes daily quota (mark key dead) vs transient 503 (exponential backoff)
- Resume support:
--resumeflag continues from exactly where you left off
# Resume after quota exhaustion β add fresh keys to .env and re-run
python -m src.data_gen.teacher_labeler --resume| Parameter | Value |
|---|---|
| Quantization | 4-bit NF4, double quantization |
| LoRA rank | 64 (best), 16 (ablation) |
| LoRA alpha | 128 |
| Target modules | q, k, v, o, gate, up, down projections |
| Optimizer | AdamW |
| Learning rate | 2e-4 (cosine schedule) |
| Batch size | 4 Γ 4 gradient accumulation = 16 effective |
| Epochs | 3 |
| Trainable params | ~335M / 7.2B (4.6%) |
Step Train Loss Eval Loss
50 0.724 0.698
100 0.581 0.687
150 0.495 0.672
All runs are logged to Weights & Biases under the toolforge project:
- Training loss curves (per-step)
- Validation loss at each checkpoint
- Test accuracy and per-category breakdown
- Hyperparameter comparison across ablation runs
- System metrics (GPU utilization, memory)
With 918 training examples and a 7B model, full fine-tuning would catastrophically overfit. QLoRA freezes 95%+ of weights and only trains ~335M adapter parameters β enough capacity for tool routing without destroying the base model's knowledge.
Cost. Gemini's free tier provides 20+ requests/day per model per key. With 6 keys Γ 2 models = 12 quota slots, we labeled 679 examples at zero cost. The multi-key rotation system makes this fully automated.
The model sees 27/30 correct labels for patterns like "define X β dictionary" and learns the dominant signal. The 3 noisy labels from Gemini's inconsistency are treated as noise β a well-known property of neural network training on noisy supervision.
- Python 3.12+
- PyTorch 2.x with CUDA
- transformers, peft, trl, bitsandbytes
- Google API keys (free tier) for data generation
- GPU: T4 (16GB) minimum for training
MIT