Agent-Scaling: Multi-Agent Scaling with Heterogeneous LLM Ensembles

This repository provides the codebase for studying how scaling the number of heterogeneous LLM agents with diverse reasoning personas improves collective performance through debate and voting mechanisms. We introduce the K* metric (effective diversity) based on embedding eigenvalue entropy to quantify semantic diversity among agents, and show that persona-guided multi-agent collaboration yields consistent gains across reasoning benchmarks.

Project Structure

.
├── src/                          # Core source code
│   ├── main.py                   # Main orchestration: debate/voting loop
│   ├── evaluator.py              # Answer extraction & scoring (math, MCQ)
│   ├── data/                     # Dataset loaders
│   │   ├── data_utils.py         # Central data router
│   │   ├── gsm8k.py              # Grade School Math 8K
│   │   ├── arc.py                # ARC-Challenge / ARC-Easy
│   │   ├── hellaswag.py          # HellaSwag
│   │   ├── truthfulqa.py         # TruthfulQA
│   │   ├── winogrande.py         # WinoGrande
│   │   ├── mmlu_pro_medicine.py  # MMLU-Pro Medicine
│   │   └── mmlu_formal_logic.py  # MMLU Formal Logic
│   └── model/                    # Model wrappers
│       ├── model_utils.py        # Agent factory, persona definitions, unified engine
│       ├── llama.py              # LLaMA (v2/v3) wrapper via HuggingFace
│       ├── qwen.py               # Qwen wrapper via HuggingFace
│       ├── openai_compat.py      # OpenAI-compatible API client (vLLM, etc.)
│       └── azure_openai.py       # Azure OpenAI wrapper
├── scripts/                      # Experiment runner scripts
│   ├── add*.sh                   # Heterogeneous multi-agent experiments
│   ├── add*_noperspn.sh          # Same experiments without personas
│   └── ablation*.sh              # Ablation studies (persona impact, agent count)
├── K_star_analysis/              # K* diversity metric computation
│   ├── analysis.py               # Core N* (effective diversity) from embeddings
│   ├── analysis_improved.py      # Extended metrics: N*_conditioned, N*_weighted, Delta-N*
│   ├── exp2_embedding_robustness.py  # Cross-embedding-model robustness validation
│   ├── analysis.sh               # Runner for analysis.py
│   └── analysis_improved.sh      # Runner for analysis_improved.py
└── .gitignore

Installation

Requirements

Python >= 3.9
CUDA-compatible GPU(s)

Dependencies

pip install torch transformers accelerate peft
pip install sentence-transformers datasets
pip install openai numpy pandas scipy tqdm

Quick Start

1. Homogeneous Multi-Agent Voting

python src/main.py \
    --data gsm8k \
    --num_agents 5 \
    --model qwen2.5-7b \
    --solver vote \
    --debate_rounds 0

2. Heterogeneous Multi-Agent Debate with Personas

python src/main.py \
    --data formal_logic \
    --num_agents 4 \
    --agent_models "llama3.1-8b,qwen2.5-7b,mistral-7b,qwen3-8b" \
    --multi_persona \
    --solver debate \
    --debate_rounds 3 \
    --use_vllm \
    --vllm_base_urls "http://127.0.0.1:8001/v1,http://127.0.0.1:8002/v1,http://127.0.0.1:8003/v1,http://127.0.0.1:8004/v1"

3. Using Azure OpenAI Models

export AZURE_OPENAI_ENDPOINT="https://your-endpoint.openai.azure.com/"
export AZURE_OPENAI_API_KEY_ENV="your-api-key"

python src/main.py \
    --data gsm8k \
    --num_agents 4 \
    --agent_models "gpt-4o,gpt-4o-mini,o3-mini,gpt-4.1" \
    --multi_persona \
    --solver debate \
    --debate_rounds 3

4. Run K* Analysis

python K_star_analysis/analysis.py \
    --jsonl_dir out/history/ \
    --mode round_agent_avg \
    --output_dir analysis/results/

Key Concepts

Multi-Agent Debate

Agents iteratively refine their answers by observing peer responses:

Round 0: Each agent generates an initial answer guided by its persona.
Rounds 1–N: Agents receive peer responses as context and produce refined answers.
Final answer: Determined by majority voting or debate consensus.

Three communication topologies are supported:

Decentralized (default): Full mesh — every agent sees all peers.
Centralized: Hub-and-spoke — one coordinator aggregates.
Sparse: Each agent only sees neighboring agents.

Reasoning Personas

Each agent is assigned a distinct reasoning persona that shapes its problem-solving style. For example, on math tasks:

Persona	Strategy
Conservative Verifier	Step-by-step verification, double-checking
Creative Explorer	Pattern recognition, unconventional shortcuts
Rigorous Formalist	Precise notation, logical completeness
Intuitive Estimator	Rough estimates as sanity checks
Systematic Decomposer	Divide-and-conquer sub-problems

K* Metric (Effective Diversity)

K* measures the effective number of distinct reasoning strategies among agents using embedding-space analysis:

Embed all agent responses with a sentence transformer (e.g., NV-Embed-v2).
Compute eigenvalues of the embedding covariance matrix.
N* = exp(H) where H is the Shannon entropy of the normalized eigenvalue distribution.

Extended metrics include:

N*_conditioned: Diversity within correct vs. incorrect answer groups.
N*_weighted: Correctness-weighted diversity.
Delta-N*: Marginal diversity contribution of each agent.

Baselines

Baseline A: Length-matched neutral padding replaces persona prompts (isolates persona effect).
Baseline B: All agents share a single persona (tests heterogeneity benefit).

Supported Models

Category	Models
Local (HuggingFace)	LLaMA 3.1-8B, Qwen 2.5-7B/32B
vLLM Serving	Any model served via OpenAI-compatible API
Azure OpenAI	GPT-4o, GPT-4o-mini, o1, o3-mini, o3, o4-mini, GPT-4.1 series
OpenAI-Compatible	GPT-5-mini, Gemini-2.5-Flash, etc.

Supported Datasets

Task Type	Datasets
Mathematical Reasoning	GSM8K
Multiple Choice QA	ARC, HellaSwag, TruthfulQA, WinoGrande
Domain-Specific	MMLU-Pro Medicine, MMLU Formal Logic

Running Large-Scale Experiments

The scripts/ directory contains bash scripts for running experiment sweeps:

# Heterogeneous agents across multiple agent counts
bash scripts/add.sh

# Same experiment without personas (control)
bash scripts/add_noperspn.sh

# Ablation: persona vs. no-persona across models
bash scripts/ablation.sh

These scripts manage multi-GPU scheduling and run experiments across varying agent counts (2, 4, 8, 12, 16) with both debate and voting solvers.

Key Arguments

Argument	Description	Default
`--data`	Dataset name	(required)
`--num_agents`	Number of agents	5
`--agent_models`	Comma-separated model list (heterogeneous)	`""`
`--multi_persona`	Enable diverse personas per agent	`False`
`--baseline_a`	Baseline A: neutral padding	`False`
`--baseline_b`	Baseline B: shared persona	`False`
`--solver`	Aggregation method: `vote` or `debate`	`vote`
`--debate_rounds`	Number of debate rounds	4
`--sparse`	Sparse communication topology	`False`
`--centralized`	Centralized communication topology	`False`
`--use_vllm`	Use vLLM backend	`False`
`--temperature`	Sampling temperature	0
`--top_p`	Nucleus sampling threshold	0.9
`--load_in_4bit`	4-bit quantization (bitsandbytes)	`False`
`--load_in_8bit`	8-bit quantization (bitsandbytes)	`False`

License

This project is for research purposes.

Citation

If you use Agent-Scaling in your research, please cite:

@misc{agentscaling2025,
    title={Understanding Agent Scaling in LLM-Based Multi-Agent Systems via Diversity},
    author={Yingxuan Yang, Chengrui Qu, Muning Wen, Laixi Shi, Ying Wen, Weinan Zhang, Adam Wierman, Shangding Gu},
    year = {2025},
    publisher = {GitHub},
    journal = {GitHub repository},
    howpublished = {\url{https://github.com/SafeRL-Lab/Agent-Scaling}},
}

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
K_star_analysis		K_star_analysis
scripts		scripts
src		src
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Agent-Scaling: Multi-Agent Scaling with Heterogeneous LLM Ensembles

Project Structure

Installation

Requirements

Dependencies

Quick Start

1. Homogeneous Multi-Agent Voting

2. Heterogeneous Multi-Agent Debate with Personas

3. Using Azure OpenAI Models

4. Run K* Analysis

Key Concepts

Multi-Agent Debate

Reasoning Personas

K* Metric (Effective Diversity)

Baselines

Supported Models

Supported Datasets

Running Large-Scale Experiments

Key Arguments

License

Citation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors 2

Uh oh!

Languages

License

SafeRL-Lab/Agent-Scaling

Folders and files

Latest commit

History

Repository files navigation

Agent-Scaling: Multi-Agent Scaling with Heterogeneous LLM Ensembles

Project Structure

Installation

Requirements

Dependencies

Quick Start

1. Homogeneous Multi-Agent Voting

2. Heterogeneous Multi-Agent Debate with Personas

3. Using Azure OpenAI Models

4. Run K* Analysis

Key Concepts

Multi-Agent Debate

Reasoning Personas

K* Metric (Effective Diversity)

Baselines

Supported Models

Supported Datasets

Running Large-Scale Experiments

Key Arguments

License

Citation

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 2

Uh oh!

Languages

Packages