Orchestration for Multi-Agent Systems

Comparing Deep Learning vs Classical ML for Agent Coordination

Krishna Kishore Buddi


Problem Statement

When building AI systems that use multiple specialized agents, determining which agent should handle which task remains a challenging coordination problem. Current approaches rely on either basic rule-based systems that frequently fail or expensive API calls for every routing decision. This project addresses the fundamental question: does neural network complexity provide meaningful advantages over classical machine learning for agent coordination tasks?

Motivation

The field lacks systematic comparison between neural orchestration and classical ML approaches for multi-agent coordination. While many assume deep learning is necessary for effective orchestration, no rigorous empirical study has validated this assumption. Our research fills this gap by implementing both approaches and conducting controlled experiments to determine whether neural complexity justifies the additional implementation and computational costs.

Research Question

Does neural network complexity provide meaningful performance gains over classical machine learning approaches for agent coordination tasks in multi-domain environments?

System Architecture

Task Generation → Task Representation → Feature Construction
                                              ↓
                                    ┌─────────┴─────────┐
                                    ↓                   ↓
                          Neural Network        Tree Models
                                    ↓                   ↓
                                    └─────────┬─────────┘
                                              ↓
                                    Agent Execution
                                              ↓
                                    Fuzzy Evaluation
                                    (Completeness, Relevance, Confidence)
                                              ↓
                                    Supervision Signal → Training Loop

Implementation

Neural Orchestration System

We implemented a feedforward neural network with three hidden layers (256, 128, 64 neurons) that learns optimal agent selection through supervised learning. The network takes task descriptions, agent profiles, and contextual information as input, then predicts which agent should handle each task.
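
The forward pass of such a network can be sketched in NumPy as follows. The hidden sizes (256, 128, 64) and six-agent output come from the description above; the input width of 32 and the random weights are illustrative assumptions, not the repository's actual values:

```python
import numpy as np

rng = np.random.default_rng(42)

def relu(x):
    return np.maximum(0.0, x)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical input width: the README does not state the concatenated
# feature dimension, so 32 is an assumption for illustration.
IN_DIM, N_AGENTS = 32, 6
DIMS = (IN_DIM, 256, 128, 64, N_AGENTS)

# Randomly initialised weights stand in for trained parameters.
layers = [(rng.normal(0.0, 0.05, (d_in, d_out)), np.zeros(d_out))
          for d_in, d_out in zip(DIMS[:-1], DIMS[1:])]

def predict_agent_probs(x):
    """Forward pass: three ReLU hidden layers, then softmax over agents."""
    h = x
    for W, b in layers[:-1]:
        h = relu(h @ W + b)
    W, b = layers[-1]
    return softmax(h @ W + b)

probs = predict_agent_probs(rng.normal(size=IN_DIM))
```

The softmax output is what the calibration metrics later in this README evaluate: a full probability distribution over the six agents, not just an argmax.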

The system includes a fuzzy evaluation module that scores agent performance on three weighted components:

Completeness (30%): How thoroughly the agent handled the task

Relevance (50%): How well the agent's skills matched task requirements

Confidence (20%): How reliably the agent performed

These fuzzy scores generate training labels for the neural network, creating a supervision signal based on actual agent performance rather than arbitrary rules.
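
The weighted combination above reduces to a single expression; a minimal sketch (the function name and sample component scores are illustrative, only the 30/50/20 weights come from the text):

```python
def fuzzy_score(completeness, relevance, confidence,
                weights=(0.30, 0.50, 0.20)):
    """Weighted fuzzy evaluation: 30% completeness, 50% relevance, 20% confidence."""
    w_c, w_r, w_f = weights
    return w_c * completeness + w_r * relevance + w_f * confidence

# Hypothetical component scores for one agent on one task:
score = fuzzy_score(completeness=0.8, relevance=0.9, confidence=0.7)  # ≈ 0.83
```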

Agent Profiles

Each agent in our system has three key characteristics:

Skill Vector (8 dimensions): Quantifies what the agent is capable of doing

Domain Expertise (4 dimensions): Indicates where the agent specializes (emergency, document, analytics, creative)

Reliability Score: Affects execution consistency through noise modeling
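
A hedged sketch of how such a profile might be represented. The dataclass and the sample numbers are assumptions for illustration; only the vector dimensions and QuickBot's 0.85 reliability come from the text:

```python
from dataclasses import dataclass

import numpy as np

@dataclass
class AgentProfile:
    name: str
    skills: np.ndarray      # 8-dim skill vector
    expertise: np.ndarray   # 4-dim: emergency, document, analytics, creative
    reliability: float      # scales execution noise in the simulator

quick_bot = AgentProfile(
    name="QuickBot",
    skills=np.full(8, 0.6),     # generalist: moderate skill everywhere (illustrative)
    expertise=np.full(4, 0.5),  # no single-domain specialisation (illustrative)
    reliability=0.85,           # highest reliability per the README
)
```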

We implemented six specialized agents:

EmergencyBot: Expert in urgent, time-sensitive tasks (skills focused on dimensions 0-1)

DocumentBot: Specialized in document processing and structured content (skills focused on dimensions 4-7)

AnalyticsBot: Expert in data analysis and statistical processing (skills focused on dimensions 2-3)

CreativeBot: Specialized in creative content and innovative solutions (skills focused on dimensions 6-7)

QuickBot: Fast generalist with highest reliability (0.85 vs 0.78 for others)

GeneralistBot: Balanced mediocre performance across all domains

Classical ML Baselines

We trained multiple classical ML models on identical features and data:

Random Forest: 300 tree ensemble with max depth 20

XGBoost: Gradient boosting with 200 estimators

Ensemble: Soft voting combination of Random Forest and XGBoost
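
These baselines map directly onto standard scikit-learn components. The sketch below uses synthetic stand-in data, and substitutes scikit-learn's GradientBoostingClassifier for XGBoost so the example needs only one dependency (`xgboost.XGBClassifier(n_estimators=200)` would slot in the same way):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)

# Stand-in data: a synthetic 6-class "which agent?" problem.
X, y = make_classification(n_samples=300, n_features=16, n_informative=8,
                           n_classes=6, random_state=42)

rf = RandomForestClassifier(n_estimators=300, max_depth=20, random_state=42)
# Gradient boosting stands in here for XGBoost with 200 estimators.
gb = GradientBoostingClassifier(n_estimators=200, random_state=42)

# Soft voting averages the two models' predicted probability distributions.
ensemble = VotingClassifier([("rf", rf), ("gb", gb)], voting="soft")
ensemble.fit(X, y)
probs = ensemble.predict_proba(X[:5])  # one probability row per task
```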

Additional Baselines

We implemented three simple strategies for comparison:

Random Selection: Uniform random agent selection (theoretical accuracy ~16.7%)

Round-Robin: Cyclic rotation through agents

Static-Best: Always select the historically best-performing agent
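
All three strategies fit in a few lines; a minimal sketch (function names and the history dictionary are illustrative):

```python
import itertools
import random

AGENTS = ["EmergencyBot", "DocumentBot", "AnalyticsBot",
          "CreativeBot", "QuickBot", "GeneralistBot"]

_rng = random.Random(42)
def random_select():
    """Uniform random choice: ~16.7% expected accuracy with six agents."""
    return _rng.choice(AGENTS)

_cycle = itertools.cycle(AGENTS)
def round_robin_select():
    """Cycle through the agents in a fixed order."""
    return next(_cycle)

def static_best_select(history):
    """Always pick the agent with the best mean historical quality."""
    return max(history, key=lambda a: sum(history[a]) / len(history[a]))

best = static_best_select({"QuickBot": [0.6, 0.7], "DocumentBot": [0.4, 0.5]})
```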

Task Domains

We created five task types to evaluate model performance across different scenarios:

Emergency Tasks (20%): Require fast response and emergency handling skills

Document Tasks (20%): Need strong writing and formatting abilities

Analytics Tasks (20%): Require data processing and statistical skills

Creative Tasks (20%): Need innovative thinking and artistic capabilities

General Tasks (20%): Balanced requirements with no particular specialization

Experimental Setup

Training Data: 5000 synthetic tasks distributed evenly across five domains

Test Data: 1000 held-out tasks for evaluation

Random Seed: Fixed at 42 for reproducibility

Validation Split: 15% of training data for early stopping and hyperparameter tuning

Batch Size: 32 tasks per training batch

Training Procedure: All models trained on identical data with fuzzy evaluation providing supervision

Results

Overall Performance

Model         Accuracy  Avg Quality  Train Time  Inference Time  KL(p*||q)  CE(p*,q)
Ensemble      61.0%     0.389        0s          59ms            0.954      2.744
XGBoost       60.9%     0.389        5.6s        0.67ms          2.131      3.920
Attention     60.4%     0.390        295s        0.47ms          0.001      1.790
RandomForest  60.2%     0.388        2.2s        40ms            0.858      2.647
Neural        60.1%     0.389        179s        0.16ms          0.001      1.790
StaticBest    29.2%     0.335        0s          0ms             -          -
Random        17.7%     0.309        0s          0µs             -          -
RoundRobin    15.8%     0.306        0s          0µs             -          -

Key Findings

Similar Accuracy Across ML Models: All machine learning models achieved 60-61% accuracy, clustering within 1% of each other. This suggests that for this coordination task, model architecture matters less than feature quality and training data.

Classical ML Competitive with Neural: Random Forest and XGBoost matched neural network accuracy while training 30-80 times faster (2.2s and 5.6s, respectively, vs 179s for the neural network). This challenges the assumption that neural complexity is necessary for effective orchestration.

Significant Improvement Over Baselines: All ML approaches achieved roughly a 2x improvement over the strongest baseline (Static-Best at 29.2%) and more than 3x over random selection, demonstrating that learned orchestration provides substantial value over simple heuristics.

Dramatic Difference in Probability Calibration: While accuracy metrics were similar, probability calibration revealed a massive gap between neural and tree-based models. Neural and attention models achieved KL divergence of 0.001, while tree-based models ranged from 0.858 to 2.131. This represents a difference of three orders of magnitude in calibration quality.

Inference Speed Varies Dramatically: Neural network inference (0.16ms) was 250x faster than Random Forest (40ms), suggesting neural models are better suited for latency-critical production environments despite longer training times.

Understanding KL Divergence in Multi-Agent Orchestration

KL divergence (Kullback-Leibler divergence) measures how well a model's predicted probability distribution matches the oracle's probability distribution. In our system, the oracle distribution comes from applying temperature-scaled softmax to the fuzzy evaluation scores of all agents on each task.
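
Under those definitions, the oracle construction and the KL computation can be sketched as follows (the fuzzy scores and the temperature value are illustrative assumptions):

```python
import numpy as np

def softmax(scores, temperature=0.1):
    """Temperature-scaled softmax: lower temperature sharpens the distribution."""
    z = np.asarray(scores, dtype=float) / temperature
    z -= z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; eps guards against log(0)."""
    p, q = np.asarray(p), np.asarray(q)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# Hypothetical fuzzy evaluation scores for the six agents on one task.
fuzzy = [0.83, 0.41, 0.55, 0.38, 0.62, 0.47]
oracle = softmax(fuzzy)                                   # p*
model_q = softmax([0.80, 0.40, 0.50, 0.40, 0.60, 0.45])   # a model's prediction

kl = kl_divergence(oracle, model_q)  # 0 only if the distributions match exactly
```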

Why KL Divergence Matters:

When an orchestrator makes a prediction, it doesn't just pick one agent. It outputs a probability distribution over all six agents. For example, it might say "EmergencyBot has 70% probability of being best, AnalyticsBot 20%, and others 10%." KL divergence tells us whether these probabilities reflect reality.

Low KL divergence (neural models: 0.001): The model's probability distribution closely matches the oracle's distribution. When the model is 90% confident, it's right about 90% of the time. When it's uncertain (say 40-60% split between two agents), that uncertainty is justified.

High KL divergence (tree models: 0.858-2.131): The model's probabilities don't match reality. Tree-based models tend to be overconfident, often outputting a 95-5 split when the true situation is closer to 60-40. This happens because decision trees inherently produce sharp splits.

Practical Implications:

For production deployments where you need to know how confident the orchestrator is, this difference is critical. Consider these scenarios:

Scenario 1 - Confident Decision: The task clearly requires AnalyticsBot. A well-calibrated model (neural/attention) outputs {Analytics: 0.92, others: 0.08}. A poorly calibrated model (XGBoost) outputs {Analytics: 0.98, others: 0.02}. Both pick the right agent, but the neural model's probabilities are more honest about the remaining uncertainty.

Scenario 2 - Uncertain Decision: The task could reasonably be handled by either EmergencyBot or QuickBot. A well-calibrated model outputs {Emergency: 0.52, Quick: 0.45, others: 0.03}, correctly reflecting the uncertainty. A poorly calibrated model outputs {Emergency: 0.89, Quick: 0.08, others: 0.03}, appearing certain when it shouldn't be.

The second scenario is where calibration matters most. When an orchestrator needs to route tasks but isn't sure, downstream systems need accurate uncertainty estimates to make good decisions. Should we route to the second-best agent if the first is busy? Should we ask for human review? These decisions require honest probability estimates.

Why Neural Networks Calibrate Better:

Neural networks trained with soft labels (probability distributions) using KL divergence loss learn to match the entire distribution, not just pick the maximum. The continuous optimization process with gradient descent naturally learns well-calibrated probabilities.

Tree-based models make binary splits at each node, creating sharp decision boundaries. Even when using probability mode (counting training samples in leaf nodes), they tend toward overconfidence. The ensemble averaging in Random Forest helps slightly (KL: 0.858) but doesn't solve the fundamental issue.

Cross-Entropy Relationship:

Cross-entropy CE(p*,q) measures the average number of bits needed to encode the oracle distribution p* using the model's distribution q. Lower is better. Neural models achieved CE of 1.790 compared to 2.647-3.920 for tree models. This confirms the same pattern: neural models represent the oracle distribution much more efficiently.
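
This connection is the identity CE(p*, q) = H(p*) + KL(p*||q): cross-entropy is the oracle's irreducible entropy plus the calibration gap. A quick numerical check (the distributions are illustrative; values are in nats):

```python
import numpy as np

def entropy(p, eps=1e-12):
    p = np.asarray(p)
    return float(-np.sum(p * np.log(p + eps)))

def cross_entropy(p, q, eps=1e-12):
    """CE(p*, q) = -sum p* log q, in nats."""
    p, q = np.asarray(p), np.asarray(q)
    return float(-np.sum(p * np.log(q + eps)))

p = np.array([0.7, 0.2, 0.1])  # oracle p*
q = np.array([0.5, 0.3, 0.2])  # model q
kl = float(np.sum(p * np.log(p / q)))

# Cross-entropy decomposes into entropy plus the KL calibration gap:
assert abs(cross_entropy(p, q) - (entropy(p) + kl)) < 1e-6
```

Because H(p*) is fixed by the oracle, ranking models by cross-entropy and by KL divergence yields the same ordering, which is why the two columns in the results table tell the same story.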

Per-Domain Performance

Analytics Domain (68% accuracy): Easiest for all models due to strong agent specialization (AnalyticsBot clearly dominates)

Creative Domain (62% accuracy): Moderate difficulty with some overlap between specialist and generalist agents

Emergency Domain (60% accuracy): Complicated by competition between EmergencyBot (specialized) and QuickBot (high reliability)

General Domain (59% accuracy): Difficult because all agents are reasonably viable with no clear specialist

Document Domain (57% accuracy): Hardest domain due to DocumentBot's less distinctive skill profile

Error Analysis

Confusion matrix analysis revealed:

AnalyticsBot and CreativeBot: Correctly selected 200+ times each (strong diagonal elements)

EmergencyBot: Correctly selected 187 times with some confusion with AnalyticsBot

DocumentBot: Only correctly selected 22 times, frequently confused with other agents

QuickBot: Never correctly selected, even when it was the optimal choice (0 correct selections)

This suggests models learned to favor strong specialists but struggle with edge cases where generalists or highly reliable agents might be better choices.

Installation

cd multi-agent-orchestration
pip install -r requirements.txt

Running Experiments

Complete comparison of all models:

python main.py --mode comparison

Train individual models:

python main.py --mode neural      # Neural network only
python main.py --mode attention   # Attention model only
python main.py --mode rf          # Random Forest only

Custom configuration:

python main.py --config custom_config.yaml

Project Structure

multi-agent-orchestration/
├── README.md
├── requirements.txt
├── config.yaml
├── data.py              # Task generation across five domains
├── agents.py            # Six specialized agent implementations
├── models.py            # All five orchestrator implementations
├── evaluation.py        # Fuzzy evaluation system
├── train.py             # Training loops with early stopping
├── baselines.py         # Simple baseline strategies
├── experiments.py       # Experimental pipeline orchestration
├── visualize.py         # Result visualization
├── main.py              # Command-line interface
└── streamlit_app.py     # Interactive dashboard

Configuration

The config.yaml file controls all experimental parameters:

Data Generation: Task counts, dimensions, domain distributions

Agent Profiles: Skills, expertise, reliability for each agent

Neural Network: Architecture, learning rate, epochs, dropout

Attention Network: Number of heads, hidden dimensions

Random Forest: Tree count, max depth, splitting criteria

XGBoost: Estimators, learning rate, regularization

Fuzzy Evaluation: Component weights (completeness, relevance, confidence)

Training: Batch size, validation split, early stopping patience
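
A hypothetical shape for config.yaml, mirroring the sections above. The key names and unstated values (learning rate, dropout) are illustrative assumptions, not the repository's actual file:

```yaml
data:
  n_train_tasks: 5000
  n_test_tasks: 1000
  skill_dims: 8
  domain_dims: 4

neural:
  hidden_layers: [256, 128, 64]
  learning_rate: 0.001   # illustrative
  dropout: 0.2           # illustrative

random_forest:
  n_estimators: 300
  max_depth: 20

xgboost:
  n_estimators: 200

fuzzy_evaluation:
  weights: {completeness: 0.3, relevance: 0.5, confidence: 0.2}

training:
  batch_size: 32
  validation_split: 0.15
  seed: 42
```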

Evaluation Metrics

Accuracy Metrics

Agent Selection Accuracy: Percentage of tasks where the model selected the oracle agent

Per-Domain Breakdown: Accuracy separated by emergency, document, analytics, creative, and general tasks

Confusion Matrix: Visualization of prediction patterns showing which agents get confused

Task Quality Scores: Average fuzzy evaluation scores when following model recommendations

Efficiency Metrics

Training Time: Duration required to train each model from scratch

Inference Speed: Latency for making a single routing decision

Probability Calibration Metrics

KL Divergence KL(p*||q): Measures how well the model's predicted distribution q matches the oracle distribution p*. Lower values indicate better calibration. Neural models achieved 0.001 while tree models ranged from 0.858 to 2.131.

Cross-Entropy CE(p*,q): Measures the average bits needed to encode the oracle distribution using the model's distribution. Lower values are better. Neural models achieved 1.790 compared to 2.647-3.920 for tree models.

Soft Oracle Accuracy: Probability mass the model assigns to the oracle's top choice. Higher is better. XGBoost led with 55.1%, while neural models achieved 17.7%.

Expected Quality E[Quality|q]: Expected fuzzy quality score under the model's probability distribution. Measures whether the model's uncertainties align with actual quality variations.
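
The expected-quality metric is a dot product between the model's distribution and the per-agent quality scores; a minimal sketch (both vectors here are illustrative):

```python
import numpy as np

def expected_quality(q_probs, quality_scores):
    """E[Quality | q]: mean fuzzy quality if the agent is sampled from q."""
    return float(np.dot(q_probs, quality_scores))

quality = np.array([0.83, 0.41, 0.55, 0.38, 0.62, 0.47])  # per-agent fuzzy scores
q = np.array([0.70, 0.05, 0.10, 0.05, 0.07, 0.03])        # model's distribution

eq = expected_quality(q, quality)  # ≈ 0.733
```

A model that concentrates probability on genuinely high-quality agents scores well here even when its top pick is occasionally wrong.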

Statistical Analysis

Ablation Studies: Testing whether fuzzy evaluation components contribute to performance

Significance Testing: Determining whether performance differences are statistically meaningful

Error Analysis: Detailed examination of failure modes and edge cases

Interactive Dashboard

Launch the Streamlit dashboard for interactive result exploration:

streamlit run streamlit_app.py

The dashboard provides:

Summary tables comparing all models

Side-by-side confusion matrices

Per-domain performance breakdowns

Training curve visualizations

Task-level exploration tool for understanding individual predictions

Reproducing Results

To reproduce the exact results from our experiments:

python main.py --mode comparison --seed 42
cat results/metrics/metrics.csv

Fixed random seed ensures reproducible task generation and agent execution.

Testing Individual Modules

Each module can be tested independently:

python data.py        # Test task generation
python agents.py      # Test agent execution  
python evaluation.py  # Test fuzzy evaluation

Conclusions

Our systematic comparison reveals that classical machine learning approaches remain highly competitive with neural networks for multi-agent orchestration tasks in terms of accuracy. However, probability calibration tells a different story entirely.

The Accuracy-Calibration Tradeoff

All ML models achieved similar accuracy (60-61%), but their probability calibration differed by three orders of magnitude. Neural networks achieved KL divergence of 0.001 while tree-based models ranged from 0.858 to 2.131. This massive gap has important practical implications.

For applications where you only care about picking the right agent most of the time, Random Forest or XGBoost are excellent choices. They train 30-80 times faster than the neural network and achieve virtually identical accuracy.

For applications where you need to know how confident the orchestrator is, neural networks are essential. Their well-calibrated probabilities enable downstream decision-making systems to appropriately handle uncertainty. When the orchestrator outputs 55% confidence versus 95% confidence, you need those numbers to mean something real.

Model Selection Guidelines

Choose Neural Networks when:

Inference latency is critical (0.16ms vs 40ms for Random Forest)

Well-calibrated probabilities are essential for downstream systems

You need accurate uncertainty quantification

You're building systems that make decisions based on confidence levels

Choose Random Forest when:

Fast training is important for rapid iteration (2.2s vs 179s for neural)

Accuracy is the only metric that matters

Interpretability is valuable

You want simpler deployment without deep learning frameworks

Choose XGBoost when:

You need the absolute best single-model accuracy (60.9%)

You can tolerate moderate training times (5.6s)

You want feature importance analysis

Choose Ensemble when:

Maximum accuracy is critical (61.0%)

You can afford the inference overhead (59ms)

You want robustness from model diversity

The Calibration Lesson

Perhaps the most important finding from this research is that accuracy alone is insufficient for evaluating orchestration systems. Two models can have identical accuracy (60.1% vs 60.2%) while having completely different calibration quality (KL: 0.001 vs 0.858). If your application needs to reason about uncertainty, calibration metrics like KL divergence are just as important as accuracy.

This insight generalizes beyond multi-agent orchestration. Any machine learning system that outputs probabilities for downstream decision-making should measure and optimize calibration, not just accuracy.

References

This project builds on concepts from neural orchestration frameworks for multi-agent systems. Our contribution is the systematic comparison of neural approaches against classical machine learning methods, along with the development of a comprehensive fuzzy evaluation framework for generating supervision signals and rigorous analysis of probability calibration in orchestration systems.

License

Academic project for educational purposes.
