ayushh0110/toolforge


🔧 ToolForge: Fine-Tuning Small LLMs for Autonomous Tool Routing

Teaching a model to become the router: replacing hand-crafted heuristics with learned tool-selection behavior via QLoRA distillation.

Python 3.12 PyTorch HuggingFace W&B

📖 Read the blog post: From Heuristics to Fine-Tuning


🎯 Problem

Autonomous AI agents need to decide which tool to call for every user query. Most implementations rely on:

  • ❌ Regex/keyword matching (brittle, unmaintainable)
  • ❌ Zero-shot LLM prompting (expensive, slow, inconsistent)
  • ❌ Embedding similarity (loses argument extraction)

ToolForge solves this by fine-tuning a small LLM (7-8B) via QLoRA on synthetic tool-call traces, achieving 86% tool-selection accuracy with sub-second latency and replacing a heuristic router with a learned one.


📊 Results

Ablation Study (4 runs, W&B tracked)

| Run | Base Model | LoRA r | LR | Test Accuracy | Eval Loss |
|---|---|---|---|---|---|
| 🥇 qwen7b-r64 | Qwen2.5-7B-Instruct | 64 | 2e-4 | 86.2% | 0.141 |
| 🥈 mistral-r64 | Mistral-7B-Instruct-v0.3 | 64 | 2e-4 | 82.8% | 0.670 |
| 🥉 mistral-r16 | Mistral-7B-Instruct-v0.3 | 16 | 2e-4 | 81.9% | 0.648 |
| ❌ mistral-lr5e4 | Mistral-7B-Instruct-v0.3 | 64 | 5e-4 | 60.3% | 0.730 |

Note: True accuracy is estimated at roughly 92% or higher after accounting for noisy teacher labels in the test set (the model correctly routes some queries that the teacher mislabeled).
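As a rough cross-check, the reported percentages are consistent with whole-number counts over the 116-example test set (for example, 100/116 ≈ 86.2%). The sketch below recovers those counts and shows how an adjusted accuracy could be computed if some test labels were re-judged in the model's favor. The figure of 7 re-judged examples is purely illustrative and is not reported in this README; it is simply the smallest count that lifts 100/116 above 92%.

```python
TEST_SIZE = 116  # size of data/test.jsonl per the project structure

def correct_count(accuracy_pct: float, total: int = TEST_SIZE) -> int:
    """Recover the whole-number count of correct predictions from a percentage."""
    return round(accuracy_pct / 100 * total)

def adjusted_accuracy(correct: int, rejudged: int, total: int = TEST_SIZE) -> float:
    """Accuracy in percent after counting `rejudged` mislabeled examples as correct."""
    return 100 * (correct + rejudged) / total

reported = {
    "qwen7b-r64": 86.2,
    "mistral-r64": 82.8,
    "mistral-r16": 81.9,
    "mistral-lr5e4": 60.3,
}
counts = {run: correct_count(acc) for run, acc in reported.items()}
print(counts)

# Hypothetical: 7 test labels re-judged in the model's favor (an assumption, not a reported number)
print(round(adjusted_accuracy(counts["qwen7b-r64"], 7), 1))
```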

Per-Tool Accuracy (Best Model: Qwen2.5-7B)

| Tool | Accuracy | Tool | Accuracy |
|---|---|---|---|
| datetime | 100% | web_search | 91.7% |
| unit_converter | 100% | wikipedia | 86.7% |
| web_reader | 100% | translate | 80.0% |
| calculator | 94.1% | multi_tool | 50.0% |
| dictionary | 93.8% | no_tool | 41.7% |
| weather | 92.3% | | |
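A per-tool breakdown like this one can be computed by grouping test examples by their expected tool. The following is a minimal sketch, assuming each evaluation record carries an expected and a predicted tool name; the field names `expected` and `predicted` are hypothetical and may not match what evaluate.py actually uses.

```python
from collections import defaultdict

def per_tool_accuracy(examples):
    """Return {tool: accuracy in percent}, grouped by the expected tool."""
    totals, hits = defaultdict(int), defaultdict(int)
    for ex in examples:
        totals[ex["expected"]] += 1
        hits[ex["expected"]] += ex["expected"] == ex["predicted"]
    return {tool: round(100 * hits[tool] / totals[tool], 1) for tool in totals}

# Toy records, made up for illustration (not taken from data/test.jsonl)
toy = [
    {"expected": "weather", "predicted": "weather"},
    {"expected": "weather", "predicted": "web_search"},
    {"expected": "no_tool", "predicted": "no_tool"},
]
print(per_tool_accuracy(toy))
```

Grouping by the expected label (rather than the predicted one) gives per-class recall, which is what a table of per-tool accuracy usually reports.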

Key Findings

  • 7/9 tools above 90%: single-tool routing is near-production quality
  • Adapter size has minimal impact: r=16 (81.9%) vs r=64 (82.8%), so the smaller adapter can be deployed for efficiency
  • Learning rate is critical: raising it to 5e-4 drops accuracy to 60.3%; 2e-4 is the sweet spot
  • Student surpasses teacher: the fine-tuned model correctly routes queries that Gemini mislabeled in the training set

🏗️ Architecture

┌─────────────────────────────────────────────────────────────┐
│                     ToolForge Pipeline                      │
│                                                             │
│  Phase 1: Data Generation                                   │
│  ┌───────────────┐   ┌──────────────┐   ┌────────────────┐  │
│  │   Template    │ + │   Gemini     │ → │  1,173 labeled │  │
│  │  Generator    │   │  Teacher     │   │   examples     │  │
│  │  (498 seed)   │   │  (679 dist.) │   │  (train/val/   │  │
│  │               │   │  flash+lite  │   │   test/hard)   │  │
│  └───────────────┘   └──────────────┘   └────────────────┘  │
│                                                             │
│  Phase 2: QLoRA Training                                    │
│  ┌───────────────┐   ┌──────────────┐   ┌────────────────┐  │
│  │  Base Model   │ + │  LoRA r=64   │ → │   Fine-tuned   │  │
│  │  (4-bit NF4)  │   │  Adapter     │   │   Router       │  │
│  │  Qwen/Mistral │   │  ~335-646 MB │   │   86.2% acc    │  │
│  └───────────────┘   └──────────────┘   └────────────────┘  │
│                                                             │
│  Phase 3: Evaluation                                        │
│  ┌───────────────┐   ┌──────────────┐   ┌────────────────┐  │
│  │  Tool Acc.    │   │  Per-Category│   │   W&B          │  │
│  │  Arg Match    │   │  Breakdown   │   │   Dashboard    │  │
│  │  Multi-Tool   │   │  Error       │   │   4 ablation   │  │
│  │  Latency      │   │  Analysis    │   │   runs         │  │
│  └───────────────┘   └──────────────┘   └────────────────┘  │
└─────────────────────────────────────────────────────────────┘

🛠️ The 9 Tools

The model learns to route queries to these tools (or respond directly):

| Tool | Description | Input Schema |
|---|---|---|
| web_search | Search the internet | {query: str} |
| calculator | Mathematical expressions | {expression: str} |
| weather | Current weather data | {location: str} |
| wikipedia | Encyclopedia lookup | {query: str} |
| datetime | Date/time operations | {action: str, ...} |
| dictionary | Word definitions | {word: str} |
| translate | Language translation | {text: str, to_lang: str} |
| unit_converter | Unit conversion | {value: float, from: str, to: str} |
| web_reader | Extract webpage content | {url: str} |

Plus no_tool (direct response) and multi_tool (chained calls).
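To make the schemas concrete, here is a minimal sketch of how a predicted tool call could be checked against the table above. The `TOOL_SCHEMAS` dictionary and `validate_call` helper are hypothetical names introduced for illustration; they are not taken from the repository. The datetime schema is elided ("...") in the table, so only its `action` argument is checked, and no_tool and multi_tool are left out because the table gives no schema for them.

```python
# Required argument names and types, transcribed from the tool table.
TOOL_SCHEMAS = {
    "web_search": {"query": str},
    "calculator": {"expression": str},
    "weather": {"location": str},
    "wikipedia": {"query": str},
    "datetime": {"action": str},  # remaining fields are elided in the table
    "dictionary": {"word": str},
    "translate": {"text": str, "to_lang": str},
    "unit_converter": {"value": float, "from": str, "to": str},
    "web_reader": {"url": str},
}

def validate_call(tool: str, args: dict) -> list[str]:
    """Return a list of problems with a predicted call; an empty list means it looks valid."""
    if tool not in TOOL_SCHEMAS:
        return [f"unknown tool: {tool}"]
    problems = []
    for name, expected_type in TOOL_SCHEMAS[tool].items():
        # Accept ints where a float is expected (e.g. value=5 for unit_converter).
        accepted = (int, float) if expected_type is float else expected_type
        if name not in args:
            problems.append(f"missing argument: {name}")
        elif not isinstance(args[name], accepted):
            problems.append(f"{name} should be {expected_type.__name__}")
    return problems

print(validate_call("translate", {"text": "good morning"}))
print(validate_call("unit_converter", {"value": 5, "from": "km", "to": "mi"}))
```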


📁 Project Structure

toolforge/
├── README.md
├── requirements.txt
├── configs/
│   ├── mistral_r64.yaml            # Default training config
│   ├── mistral_r16.yaml            # Small adapter ablation
│   └── llama_r64.yaml              # Alternative base model
├── data/
│   ├── synthetic/
│   │   ├── queries.json            # 1,894 generated queries
│   │   └── teacher.jsonl           # 679 Gemini-labeled examples
│   ├── train.jsonl                 # 918 training examples
│   ├── val.jsonl                   # 114 validation examples
│   ├── test.jsonl                  # 116 test examples
│   └── hard_test.jsonl             # 25 multi-tool edge cases
├── src/
│   ├── data_gen/
│   │   ├── template_generator.py   # Deterministic seed data (498 examples)
│   │   ├── teacher_labeler.py      # Gemini distillation with multi-key rotation
│   │   └── build_dataset.py        # Merge, dedup, split into train/val/test
│   ├── training/
│   │   ├── train.py                # QLoRA fine-tuning with SFTTrainer
│   │   └── merge.py                # LoRA → base model merge for deployment
│   └── eval/
│       └── evaluate.py             # Tool accuracy, F1, per-category breakdown
├── kaggle_ablation.py              # Self-contained Kaggle notebook with W&B
└── kaggle_notebook.py              # Single-run training notebook

🚀 Quick Start

1. Generate Training Data

# Install dependencies
pip install -r requirements.txt

# Generate seed queries + label with Gemini
# (requires API keys in .env; get free keys at aistudio.google.com)
python -m src.data_gen.teacher_labeler --n 2500

# Build final dataset splits
python -m src.data_gen.build_dataset

2. Train on Kaggle (Free GPU)

  1. Upload data/*.jsonl as a Kaggle Dataset
  2. Create a new notebook with GPU T4 enabled
  3. Paste cells from kaggle_ablation.py and run

# Or train locally with a GPU
python -m src.training.train --config configs/mistral_r64.yaml

3. Evaluate

python -m src.eval.evaluate \
    --checkpoint checkpoints/qwen7b-r64-lr2e4/final \
    --test-set data/test.jsonl

🔬 Data Pipeline

Two-Source Strategy

| Source | Count | Method | Quality |
|---|---|---|---|
| Template Generator | 498 | Deterministic rules, 100% clean labels | ⭐⭐⭐ |
| Gemini Distillation | 679 | gemini-2.5-flash + flash-lite function calling | ⭐⭐ |
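The two sources are then merged, deduplicated, and split, which is the job the project structure assigns to build_dataset.py. A minimal sketch of that step follows. The `query` field name, the case-insensitive dedup key, the 10% validation and test fractions, and the rule that template rows win over teacher rows on duplicates are all assumptions made for illustration, not details confirmed by the repository.

```python
import random

def build_splits(template_rows, teacher_rows, seed=42, val_frac=0.1, test_frac=0.1):
    """Merge both sources, drop duplicate queries, and split into train/val/test."""
    seen, merged = set(), []
    # Template rows come first, so their clean labels are kept when a query repeats.
    for row in template_rows + teacher_rows:
        key = row["query"].strip().lower()
        if key not in seen:
            seen.add(key)
            merged.append(row)
    random.Random(seed).shuffle(merged)  # seeded for reproducible splits
    n_val = round(len(merged) * val_frac)
    n_test = round(len(merged) * test_frac)
    return {
        "val": merged[:n_val],
        "test": merged[n_val:n_val + n_test],
        "train": merged[n_val + n_test:],
    }

# Toy data for illustration only
templates = [{"query": f"q{i}", "tool": "calculator"} for i in range(10)]
teacher = [{"query": " Q0 ", "tool": "web_search"}]
teacher += [{"query": f"t{i}", "tool": "weather"} for i in range(10)]
splits = build_splits(templates, teacher)
print({name: len(rows) for name, rows in splits.items()})
```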

Crash-Proof Distillation

The teacher labeler (teacher_labeler.py) is designed for zero-cost, zero-data-loss operation:

  • Multi-key round-robin: 6 API keys × 2 models = 12 independent quota slots
  • Incremental saves: Every label is flushed to disk immediately
  • Smart retry logic: Distinguishes daily quota exhaustion (mark key dead) from transient 503 errors (exponential backoff)
  • Resume support: --resume flag continues from exactly where you left off

# Resume after quota exhaustion: add fresh keys to .env and re-run
python -m src.data_gen.teacher_labeler --resume
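The four behaviors listed above can be sketched in a few dozen lines. This is an illustrative outline, not the code in teacher_labeler.py: the exception classes, the `call_model` callback, and the JSONL record shape are all hypothetical stand-ins for whatever the real labeler and the Gemini client use.

```python
import itertools
import json
import os
import time

class QuotaExhausted(Exception):
    """Daily quota used up for one (key, model) slot."""

class TransientError(Exception):
    """Temporary failure such as an HTTP 503."""

def label_queries(queries, slots, call_model, out_path,
                  max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Label queries, rotating over slots and appending each result to out_path."""
    done = set()
    if os.path.exists(out_path):  # resume: skip queries already on disk
        with open(out_path) as f:
            done = {json.loads(line)["query"] for line in f if line.strip()}
    live = list(slots)
    rotation = itertools.count()
    with open(out_path, "a") as out:
        for query in queries:
            if query in done:
                continue
            attempt = 0
            while live:
                slot = live[next(rotation) % len(live)]  # round-robin over live slots
                try:
                    label = call_model(slot, query)
                except QuotaExhausted:
                    live.remove(slot)  # mark this slot dead for the day
                    continue
                except TransientError:
                    attempt += 1
                    if attempt > max_retries:
                        break  # give up on this query for now
                    sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
                    continue
                out.write(json.dumps({"query": query, "label": label}) + "\n")
                out.flush()  # write out every label before moving on
                done.add(query)
                break
            if not live:
                break  # every slot exhausted; re-run later to resume
    return done
```

Because each label is appended and flushed before the next request, an interrupted run loses at most the request in flight, and a later run only needs the output file to know where to pick up.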

⚙️ Training Details

QLoRA Configuration

| Parameter | Value |
|---|---|
| Quantization | 4-bit NF4, double quantization |
| LoRA rank | 64 (best), 16 (ablation) |
| LoRA alpha | 128 |
| Target modules | q, k, v, o, gate, up, down projections |
| Optimizer | AdamW |
| Learning rate | 2e-4 (cosine schedule) |
| Batch size | 4 × 4 gradient accumulation = 16 effective |
| Epochs | 3 |
| Trainable params | ~335M / 7.2B (4.6%) |
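For readers who want to sanity-check adapter sizes, the sketch below counts LoRA weights when adapters are attached to all seven projection matrices in every layer. The Mistral-7B dimensions are taken from its published model config and are an assumption here, since this README does not state them. Under these assumptions rank 64 comes to about 168M adapter parameters, or roughly 335 MB at two bytes per weight, so the ~335 figure may be closer to the adapter's size in MB than to a parameter count. Treat this as a back-of-the-envelope check, not a measurement of the actual checkpoints.

```python
def lora_params(rank, hidden, kv_dim, intermediate, layers):
    """Count LoRA parameters with adapters on all seven projection matrices."""
    shapes = {  # (input_dim, output_dim) of each adapted linear layer
        "q_proj": (hidden, hidden),
        "k_proj": (hidden, kv_dim),
        "v_proj": (hidden, kv_dim),
        "o_proj": (hidden, hidden),
        "gate_proj": (hidden, intermediate),
        "up_proj": (hidden, intermediate),
        "down_proj": (intermediate, hidden),
    }
    # Each adapter adds A (rank x in) and B (out x rank): rank * (in + out) weights.
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * layers

# Assumed Mistral-7B dimensions (from the public model config, not from this README)
MISTRAL_7B = dict(hidden=4096, kv_dim=1024, intermediate=14336, layers=32)

for r in (16, 64):
    n = lora_params(r, **MISTRAL_7B)
    print(f"r={r}: {n / 1e6:.1f}M params, {n * 2 / 1e6:.1f} MB at 2 bytes per weight")
```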

Training Curves (Mistral-7B, r=64)

Step   Train Loss   Eval Loss
  50     0.724        0.698
 100     0.581        0.687
 150     0.495        0.672
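The logged steps (50, 100, 150) line up with a rough estimate of the total number of optimizer updates. The sketch below derives that estimate from the training set size and batch settings in this README, assuming a single GPU and one update per effective batch; the exact count depends on how the trainer handles the final partial batch of each epoch.

```python
import math

def optimizer_steps(n_examples, per_device_batch, grad_accum, epochs):
    """Rough count of optimizer updates: one per effective batch, per epoch."""
    effective_batch = per_device_batch * grad_accum
    steps_per_epoch = math.ceil(n_examples / effective_batch)
    return effective_batch, steps_per_epoch, steps_per_epoch * epochs

# 918 training examples, batch 4, gradient accumulation 4, 3 epochs
print(optimizer_steps(918, 4, 4, 3))  # (16, 58, 174)
```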

📈 Experiment Tracking

All runs are logged to Weights & Biases under the toolforge project:

  • Training loss curves (per-step)
  • Validation loss at each checkpoint
  • Test accuracy and per-category breakdown
  • Hyperparameter comparison across ablation runs
  • System metrics (GPU utilization, memory)

🧠 Key Technical Decisions

Why QLoRA over full fine-tuning?

With 918 training examples and a 7B model, full fine-tuning would catastrophically overfit. QLoRA freezes 95%+ of weights and only trains ~335M adapter parameters, which is enough capacity for tool routing without destroying the base model's knowledge.

Why Gemini as teacher instead of GPT-4?

Cost. Gemini's free tier provides 20+ requests/day per model per key. With 6 keys × 2 models = 12 quota slots, we labeled 679 examples at zero cost. The multi-key rotation system makes this fully automated.
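The quota arithmetic can be made explicit. The sketch below assumes exactly 20 requests per day per slot (the lower end of "20+"), one request per label, and no failed or retried calls, so the result is only a rough guide to how long a labeling run takes.

```python
import math

def days_to_label(n_examples, keys=6, models=2, requests_per_day=20):
    """Daily capacity and days needed, assuming one request per label and no failures."""
    daily_capacity = keys * models * requests_per_day
    return daily_capacity, math.ceil(n_examples / daily_capacity)

print(days_to_label(679))  # (240, 3)
```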

Why the student outperforms the teacher's labels

The model sees 27/30 correct labels for patterns like "define X → dictionary" and learns the dominant signal. The 3 inconsistent labels from Gemini are effectively averaged out, a well-known property of neural network training on noisy supervision.


📋 Requirements

  • Python 3.12+
  • PyTorch 2.x with CUDA
  • transformers, peft, trl, bitsandbytes
  • Google API keys (free tier) for data generation
  • GPU: T4 (16GB) minimum for training

📝 License

MIT
