A multi-LLM deliberation system where multiple LLMs collaboratively answer questions through peer review and synthesis. Available as a Python library, MCP server, or HTTP API.
Instead of asking a single LLM for answers, this MCP server:
- Stage 1: Sends your question to multiple LLMs in parallel (GPT, Claude, Gemini, Grok, etc.)
- Stage 2: Each LLM reviews and ranks the other responses (anonymized to prevent bias)
- Stage 3: A Chairman LLM synthesizes all responses into a final, high-quality answer
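The same three-stage flow is exposed to Python callers through `run_full_council` (shown in more detail later in this README). A minimal sketch, assuming an OpenRouter key is already configured; the exact shape of the returned stages may differ from this illustration:

```python
# Minimal sketch of the three-stage flow via the Python library.
# Assumes OPENROUTER_API_KEY (or another configured gateway key) is set.
import asyncio

from llm_council.council import run_full_council


async def main() -> None:
    # Stages 1-3 run inside run_full_council; the return values mirror the stages.
    stage1, stage2, stage3, metadata = await run_full_council(
        "What are the trade-offs between microservices and a monolith?"
    )
    print(stage3)            # Chairman synthesis (final answer)
    print(metadata.keys())   # Rankings, quality metrics, etc.


if __name__ == "__main__":
    asyncio.run(main())
```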
Deploy your own LLM Council instance:
| Platform | Best For |
|---|---|
| Railway | Production, webhooks |
| Render | Evaluation, free tier |
Required Environment Variables:
- OPENROUTER_API_KEY - Your OpenRouter API key
- LLM_COUNCIL_API_TOKEN - A secure token for API authentication (generate with `openssl rand -hex 16`)
Note: Railway is recommended for n8n integration (no cold start). Render's free tier spins down after 15 minutes of inactivity, which may cause webhook timeouts.
For detailed deployment instructions, see the Deployment Guide.
This project is a derivative work based on the original llm-council by Andrej Karpathy.
Karpathy's original README stated:
"I'm not going to support it in any way, it's provided here as is for other people's inspiration and I don't intend to improve it. Code is ephemeral now and libraries are over, ask your LLM to change it in whatever way you like."
We acknowledge the irony of producing a derivative work that packages the core concept for broader use via the Model Context Protocol.
pip install "llm-council-core[mcp]"For core library only (no MCP server):
pip install llm-council-coreThe council supports three gateway options for accessing LLMs:
| Gateway | Best For | API Keys Needed |
|---|---|---|
| OpenRouter (default) | Easiest setup, 100+ models via single key | OPENROUTER_API_KEY |
| Requesty | BYOK mode, analytics, cost tracking | REQUESTY_API_KEY + provider keys |
| Direct | Maximum control, direct provider APIs | Provider keys (Anthropic, OpenAI, Google) |
Quick Start (OpenRouter):
# Sign up at openrouter.ai and get your API key
export OPENROUTER_API_KEY="sk-or-v1-..."

Direct Provider Access:
# Use your existing provider API keys directly
export ANTHROPIC_API_KEY="sk-ant-..."
export OPENAI_API_KEY="sk-..."
export GOOGLE_API_KEY="..."
export LLM_COUNCIL_DEFAULT_GATEWAY=direct

Requesty with BYOK:
export REQUESTY_API_KEY="..."
export ANTHROPIC_API_KEY="sk-ant-..." # Your own key, routed through Requesty
export LLM_COUNCIL_DEFAULT_GATEWAY=requesty

Choose one of these options (in order of recommendation):
Store keys encrypted in your OS keychain:
# Install with keychain support
pip install "llm-council-core[mcp,secure]"
# Store key securely (prompts for key, no echo)
llm-council setup-key
# For CI/CD automation, pipe from stdin:
echo "$OPENROUTER_API_KEY" | llm-council setup-key --stdin

Set in your shell profile (~/.zshrc, ~/.bashrc):
# OpenRouter (default gateway)
export OPENROUTER_API_KEY="sk-or-v1-..."
# Or use direct provider APIs
export ANTHROPIC_API_KEY="sk-ant-..."
export OPENAI_API_KEY="sk-..."
export GOOGLE_API_KEY="..."

Create a .env file (ensure it's in .gitignore):
# For OpenRouter
echo "OPENROUTER_API_KEY=sk-or-v1-..." > .env
# Or for direct APIs
cat > .env << 'EOF'
ANTHROPIC_API_KEY=sk-ant-...
OPENAI_API_KEY=sk-...
GOOGLE_API_KEY=...
LLM_COUNCIL_DEFAULT_GATEWAY=direct
EOF

Security Note: Never put API keys in command-line arguments or JSON config files that might be committed to version control.
You can customize which models participate in the council using three methods (in priority order):
# Comma-separated list of council models
export LLM_COUNCIL_MODELS="openai/gpt-4,anthropic/claude-3-opus,google/gemini-pro"
# Chairman model (synthesizes final response)
export LLM_COUNCIL_CHAIRMAN="anthropic/claude-3-opus"

Create llm_council.yaml in your project root or ~/.config/llm-council/llm_council.yaml:
council:
# Tier configuration (ADR-022)
tiers:
default: high
pools:
quick:
models:
- openai/gpt-4o-mini
- anthropic/claude-3-5-haiku-20241022
timeout_seconds: 30
balanced:
models:
- openai/gpt-4o
- anthropic/claude-3-5-sonnet-20241022
timeout_seconds: 90
high:
models:
- openai/gpt-4o
- anthropic/claude-opus-4-6
- google/gemini-3-pro
timeout_seconds: 180
# Triage configuration (ADR-020)
triage:
enabled: false
wildcard:
enabled: true
prompt_optimization:
enabled: true
# Gateway configuration (ADR-023, ADR-025a)
gateways:
default: openrouter
fallback:
enabled: true
chain: [openrouter, ollama] # Can use Ollama as fallback
# Provider-specific configuration
providers:
ollama:
enabled: true
base_url: http://localhost:11434
timeout_seconds: 120.0
hardware_profile: recommended # minimum|recommended|professional|enterprise
openrouter:
enabled: true
base_url: https://openrouter.ai/api/v1/chat/completions
# Webhook notifications (ADR-025a)
webhooks:
enabled: false # Opt-in
timeout_seconds: 5.0
max_retries: 3
https_only: true
default_events:
- council.complete
- council.error
observability:
  log_escalations: true

Priority: YAML config > Environment variables > Defaults
Create ~/.config/llm-council/config.json:
{
"council_models": [
"openai/gpt-4-turbo",
"anthropic/claude-3-opus",
"google/gemini-pro",
"meta-llama/llama-3-70b-instruct"
],
"chairman_model": "anthropic/claude-3-opus",
"synthesis_mode": "consensus",
"exclude_self_votes": true,
"style_normalization": false,
"max_reviewers": null
}

If you don't configure anything, these defaults are used:
- Council: GPT-5.1, Gemini 3 Pro, Claude Sonnet 4.5, Grok 4
- Chairman: Gemini 3 Pro
- Mode: consensus
- Self-vote exclusion: enabled
Finding Models:
- OpenRouter: openrouter.ai/models
- Anthropic: docs.anthropic.com/models
- OpenAI: platform.openai.com/docs/models
- Google: ai.google.dev/gemini-api/docs/models
# First, store your API key securely (one-time setup)
llm-council setup-key
# Then add the MCP server (key is read from keychain or environment)
claude mcp add --transport stdio llm-council --scope user -- llm-council

Then in Claude Code:
Consult the LLM council about best practices for error handling
First ensure your API key is available (via keychain, environment variable, or .env file).
Add to ~/Library/Application Support/Claude/claude_desktop_config.json:
{
"mcpServers": {
"llm-council": {
"command": "llm-council"
}
}
}

Note: No env block is needed; the key is resolved from your system keychain or environment automatically.
Any MCP client can use the server by running:
llm-council

Ask the LLM council a question and get synthesized guidance.
Arguments:
- query (string, required): The question to ask the council
- confidence (string, optional): Response quality level (default: "high")
  - "quick": Fast models (mini/flash/haiku), ~30 seconds - fast responses for simple questions
  - "balanced": Mid-tier models (GPT-4o, Sonnet), ~90 seconds - good balance of speed and quality
  - "high": Full council (Opus, GPT-4o), ~180 seconds - comprehensive deliberation
  - "reasoning": Deep thinking models (GPT-5.2, o1, DeepSeek-R1), ~600 seconds - complex reasoning
- include_details (boolean, optional): Include individual model responses and rankings (default: false)
- verdict_type (string, optional): Type of verdict to render (default: "synthesis")
  - "synthesis": Free-form natural language synthesis
  - "binary": Go/no-go decision (approved/rejected) with confidence score
  - "tie_breaker": Chairman resolves deadlocked decisions
- include_dissent (boolean, optional): Extract minority opinions from Stage 2 (default: false)
Example:
Use consult_council to ask: "What are the trade-offs between microservices and monolithic architecture?"
Example with confidence level:
Use consult_council with confidence="quick" to ask: "What's the syntax for a Python list comprehension?"
Example with Jury Mode (binary verdict):
Use consult_council with verdict_type="binary" and include_dissent=true to ask: "Should we approve this PR that adds caching?"
Verify the council is working before expensive operations. Returns API connectivity status, configured models, and estimated response times.
Arguments: None
Returns:
- api_key_configured: Whether an API key was found
- key_source: Where the key came from ("environment", "keychain", or "config_file")
- council_size: Number of models in the council
- estimated_duration: Expected response times for each confidence level
- ready: Whether the council is ready to accept queries
Example:
Run council_health_check to verify the LLM council is working
The council uses a multi-stage process inspired by ensemble methods and peer review:
User Query
↓
┌─────────────────────────────────────────────┐
│ STAGE 1: Independent Responses │
│ • All council models queried in parallel │
│ • No knowledge of other responses │
│ • Graceful degradation if some fail │
└─────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────┐
│ STAGE 1.5: Style Normalization (optional) │
│ • Rewrites responses in neutral style │
│ • Removes AI preambles and fingerprints │
│ • Strengthens anonymization │
└─────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────┐
│ STAGE 2: Anonymous Peer Review │
│ • Responses labeled A, B, C (randomized) │
│ • XML sandboxing prevents prompt injection │
│ • JSON-structured rankings with scores │
│ • Self-votes excluded from aggregation │
└─────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────┐
│ STAGE 3: Chairman Synthesis │
│ • Receives all responses + rankings │
│ • Consensus mode: single best answer │
│ • Debate mode: highlights disagreements │
└─────────────────────────────────────────────┘
↓
Final Response + Metadata
This approach helps surface diverse perspectives, identify consensus, and produce more balanced, well-reasoned answers.
By default, each model's vote for its own response is excluded from the aggregate rankings. This prevents self-preference bias.
export LLM_COUNCIL_EXCLUDE_SELF_VOTES=true  # default

Consensus Mode (default): Chairman synthesizes a single best answer.
Debate Mode: Chairman highlights areas of agreement, key disagreements, and trade-offs between perspectives.
export LLM_COUNCIL_MODE=debate

Optional preprocessing that rewrites all responses in a neutral style before peer review. This strengthens anonymization by removing stylistic "fingerprints" that might allow models to recognize each other.
export LLM_COUNCIL_STYLE_NORMALIZATION=true
export LLM_COUNCIL_NORMALIZER_MODEL=google/gemini-2.0-flash-001  # fast/cheap

For councils with more than 5 models, you can limit the number of reviewers per response to reduce API costs (O(N²) → O(N×k)):

export LLM_COUNCIL_MAX_REVIEWERS=3

The council includes built-in reliability features for long-running operations:
Tiered Timeouts: Graceful degradation under time pressure:
- Per-model soft deadline: 15s (start planning fallback)
- Per-model hard deadline: 25s (abandon slow model)
- Global synthesis trigger: 40s (must start synthesis)
- Response deadline: 50s (must return something)
Partial Results: If some models timeout, the council returns results from the models that responded, with a clear warning indicating which models were excluded.
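The timeout machinery is internal to the library, but the behavior can be pictured with plain asyncio: enforce a per-model hard deadline and keep whatever has finished. A rough sketch under that assumption (the helper names and deadline here are illustrative, not the library's internals):

```python
# Illustrative sketch of per-model hard deadlines with partial results.
import asyncio

HARD_DEADLINE_S = 25  # per-model hard deadline from the list above


async def query_model(model: str, query: str) -> str:
    # Placeholder: replace with a real gateway call.
    await asyncio.sleep(1)
    return f"{model} response"


async def gather_with_deadline(models: list[str], query: str) -> dict[str, str]:
    tasks = {m: asyncio.create_task(query_model(m, query)) for m in models}
    done, pending = await asyncio.wait(tasks.values(), timeout=HARD_DEADLINE_S)
    for task in pending:          # abandon slow models
        task.cancel()
    results = {}
    for model, task in tasks.items():
        if task in done and not task.cancelled() and task.exception() is None:
            results[model] = task.result()
    # Models missing from `results` are the ones reported as excluded.
    return results
```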
Confidence Levels: Use the confidence parameter to trade off speed vs. thoroughness:
- quick: ~20-30 seconds (fastest models)
- balanced: ~45-60 seconds (most models)
- high: ~60-90 seconds (full council, default)
Progress Feedback: During deliberation, progress updates show which models have responded and which are still pending:
✓ claude-opus-4.6 (1/4) | waiting: gpt-5.1, gemini-3-pro, grok-4
✓ gemini-3-pro (2/4) | waiting: gpt-5.1, grok-4
Transform the council from a "summary generator" to a "decision engine" with structured verdicts.
Verdict Types:
| Mode | Output | Use Case |
|---|---|---|
| synthesis (default) | Free-form synthesis | General questions, exploration |
| binary | approved/rejected + confidence | CI/CD gates, PR reviews, policy checks |
| tie_breaker | Chairman decides on deadlock | Contentious decisions |
Binary Verdict Mode:
Returns structured go/no-go decisions with confidence scores:
# MCP Tool
result = await consult_council(
query="Should we approve this architectural change?",
verdict_type="binary"
)
# Returns:
{
"verdict": "approved", # or "rejected"
"confidence": 0.75, # 0.0-1.0 based on council agreement
"rationale": "Council agreed the change improves modularity..."
}

Tie-Breaker Mode:
When the council is deadlocked (top scores within 0.1 of each other), the chairman casts the deciding vote:
result = await consult_council(
query="Option A vs Option B for caching strategy?",
verdict_type="binary"
)
# If deadlocked, returns:
{
"verdict": "approved",
"confidence": 0.60,
"rationale": "Chairman resolved deadlock based on...",
"deadlocked": true # Flag indicates tie-breaker was needed
}

Constructive Dissent:
Extract minority opinions from Stage 2 peer reviews:
result = await consult_council(
query="Should we migrate to microservices?",
verdict_type="binary",
include_dissent=True
)
# Returns:
{
"verdict": "approved",
"confidence": 0.70,
"rationale": "Majority supports migration...",
"dissent": "Minority perspective: One reviewer noted scalability concerns with current team size."
}

Dissent Algorithm (sketched below):
- Collect all scores from Stage 2 peer reviews
- Calculate median and standard deviation per response
- Identify outliers (score < median - 1.5 × std)
- Extract evaluation text from outliers
- Format as minority perspective
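A rough sketch of that outlier rule (the function and field names here are illustrative, not the library's implementation):

```python
# Illustrative sketch of the dissent-extraction outlier rule described above.
from statistics import median, pstdev

DISSENT_THRESHOLD = 1.5  # std deviations below the median (LLM_COUNCIL_DISSENT_THRESHOLD)


def find_dissenting_reviews(reviews: list[dict]) -> list[str]:
    """reviews: [{"reviewer": str, "score": float, "evaluation": str}, ...]"""
    scores = [r["score"] for r in reviews]
    if len(scores) < 3:
        return []  # not enough reviewers to call anything an outlier
    med, std = median(scores), pstdev(scores)
    cutoff = med - DISSENT_THRESHOLD * std
    # Reviews scoring well below the median become the "minority perspective".
    return [r["evaluation"] for r in reviews if r["score"] < cutoff]


print(find_dissenting_reviews([
    {"reviewer": "A", "score": 8.0, "evaluation": "Solid plan."},
    {"reviewer": "B", "score": 8.5, "evaluation": "Agree, approve."},
    {"reviewer": "C", "score": 3.0, "evaluation": "Team too small to operate microservices."},
]))
```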
Example: CI/CD Gate Integration:
import asyncio
from llm_council.council import run_full_council
from llm_council.verdict import VerdictType
async def review_pull_request(pr_diff: str, pr_description: str):
"""Use LLM Council as a PR review gate."""
query = f"""
Review this pull request and determine if it should be approved.
Description: {pr_description}
Changes:
{pr_diff}
"""
stage1, stage2, stage3, metadata = await run_full_council(
query,
verdict_type=VerdictType.BINARY,
include_dissent=True
)
verdict = metadata.get("verdict", {})
if verdict.get("verdict") == "approved" and verdict.get("confidence", 0) >= 0.7:
return True, verdict.get("rationale")
else:
        return False, f"{verdict.get('rationale')}\n\nDissent: {verdict.get('dissent', 'None')}"

Environment Variables:
| Variable | Description | Default |
|---|---|---|
| LLM_COUNCIL_VERDICT_TYPE | Default verdict type | synthesis |
| LLM_COUNCIL_DEADLOCK_THRESHOLD | Borda score difference for deadlock | 0.1 |
| LLM_COUNCIL_DISSENT_THRESHOLD | Std deviations below median for outlier | 1.5 |
| LLM_COUNCIL_MIN_BORDA_SPREAD | Minimum spread to surface dissent | 0.0 |
By default, reviewers provide a single 1-10 holistic score. With rubric scoring enabled, reviewers score each response on five dimensions:
| Dimension | Weight | Description |
|---|---|---|
| Accuracy | 35% | Factual correctness, no hallucinations |
| Relevance | 10% | Addresses the actual question asked |
| Completeness | 20% | Covers all aspects of the question |
| Conciseness | 15% | Efficient communication, no padding |
| Clarity | 20% | Well-organized, easy to understand |
Accuracy Ceiling: When enabled (default), accuracy acts as a ceiling on the overall score:
- Accuracy < 5: Score caps at 4.0 (40%) — "significant errors or worse"
- Accuracy 5-6: Score caps at 7.0 (70%) — "mixed accuracy"
- Accuracy ≥ 7: No ceiling — "mostly accurate or better"
This prevents well-written hallucinations from ranking well.
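A minimal sketch of the weighted aggregation with the accuracy ceiling applied; the weights mirror the table above, but the function itself is illustrative rather than the library's implementation:

```python
# Illustrative sketch of rubric aggregation with the accuracy ceiling.
WEIGHTS = {
    "accuracy": 0.35,
    "relevance": 0.10,
    "completeness": 0.20,
    "conciseness": 0.15,
    "clarity": 0.20,
}


def rubric_score(scores: dict[str, float], accuracy_ceiling: bool = True) -> float:
    """scores: per-dimension values on the 1-10 scale."""
    overall = sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)
    if accuracy_ceiling:
        if scores["accuracy"] < 5:
            overall = min(overall, 4.0)   # significant errors or worse
        elif scores["accuracy"] <= 6:
            overall = min(overall, 7.0)   # mixed accuracy
    return round(overall, 2)


# A fluent but inaccurate answer cannot outrank an accurate one:
print(rubric_score({"accuracy": 3, "relevance": 9, "completeness": 9,
                    "conciseness": 9, "clarity": 9}))  # weighted 6.9, capped at 4.0
```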
Scoring Anchors: Each score level has defined behavioral meaning (see ADR-016):
- 9-10: Excellent (completely accurate, comprehensive, crystal clear)
- 7-8: Good (mostly accurate, covers main points)
- 5-6: Mixed (some errors or gaps)
- 3-4: Poor (significant issues)
- 1-2: Failing (fundamentally flawed)
# Enable rubric scoring
export LLM_COUNCIL_RUBRIC_SCORING=true
# Customize weights (must sum to 1.0)
export LLM_COUNCIL_WEIGHT_ACCURACY=0.40
export LLM_COUNCIL_WEIGHT_RELEVANCE=0.10
export LLM_COUNCIL_WEIGHT_COMPLETENESS=0.20
export LLM_COUNCIL_WEIGHT_CONCISENESS=0.10
export LLM_COUNCIL_WEIGHT_CLARITY=0.20

When enabled, a pass/fail safety check runs before rubric scoring to filter harmful content:
| Pattern | Description |
|---|---|
| dangerous_instructions | Weapons, explosives, harmful devices |
| weapon_making | Firearm/weapon construction |
| malware_hacking | Unauthorized access, malware |
| self_harm | Self-harm encouragement |
| pii_exposure | Personal information leakage |
Responses that fail safety checks are capped at score 0 regardless of other dimension scores. Educational/defensive content is context-aware and allowed.
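A simplified sketch of how such a pre-check can cap scores. The pattern expressions and the function are illustrative only; the real gate is context-aware rather than a bare keyword match:

```python
# Illustrative sketch of a pass/fail safety gate applied before rubric scoring.
import re

SAFETY_PATTERNS = {
    "malware_hacking": re.compile(r"\b(keylogger|ransomware|bypass authentication)\b", re.I),
    "self_harm": re.compile(r"\bhow to harm (yourself|myself)\b", re.I),
}
SAFETY_SCORE_CAP = 0.0  # LLM_COUNCIL_SAFETY_SCORE_CAP


def apply_safety_gate(response_text: str, rubric_score: float) -> tuple[float, list[str]]:
    hits = [name for name, pat in SAFETY_PATTERNS.items() if pat.search(response_text)]
    if hits:
        # Failed safety check: cap the score regardless of rubric quality.
        return min(rubric_score, SAFETY_SCORE_CAP), hits
    return rubric_score, []
```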
# Enable safety gate
export LLM_COUNCIL_SAFETY_GATE=true
# Customize score cap (default: 0)
export LLM_COUNCIL_SAFETY_SCORE_CAP=0.0

When enabled, the council reports per-session bias indicators from peer review scoring:
| Bias Type | Description | Detection Threshold |
|---|---|---|
| Length-Score Correlation | Do longer responses score higher? | \|r\| > 0.3 |
| Reviewer Calibration | Are some reviewers harsh/generous relative to peers? | Mean ± 1 std from median |
| Position Bias | Does presentation order affect scores? | Variance > 0.5 |
Output: The metadata includes a bias_audit object with:
- length_score_correlation: Pearson correlation coefficient
- length_bias_detected: Boolean flag
- position_score_variance: Variance of mean scores by presentation position
- position_bias_detected: Boolean flag (derived from anonymization labels A, B, C...)
- harsh_reviewers / generous_reviewers: Lists of biased reviewers
- overall_bias_risk: "low", "medium", or "high"
Important Limitations: With only 4-5 models per session, these metrics have limited statistical power:
- Length correlation with n=5 data points can only detect extreme biases
- Position bias from a single ordering cannot distinguish position effects from quality differences
- Reviewer calibration is relative to the current session only
These are per-session indicators (red flags for extreme anomalies), not statistically robust proof of systematic bias. Interpret with appropriate caution.
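For intuition, the length-score indicator is essentially a Pearson correlation over the session's responses, flagged when |r| exceeds the threshold. A sketch (illustrative, not the library's code):

```python
# Illustrative sketch of the per-session length-score bias indicator.
from statistics import correlation  # Python 3.10+

LENGTH_CORRELATION_THRESHOLD = 0.3  # LLM_COUNCIL_LENGTH_CORRELATION_THRESHOLD


def length_bias_indicator(responses: list[str], mean_scores: list[float]) -> dict:
    lengths = [len(r) for r in responses]
    r = correlation(lengths, mean_scores)
    return {
        "length_score_correlation": round(r, 3),
        "length_bias_detected": abs(r) > LENGTH_CORRELATION_THRESHOLD,
    }


# With only 4-5 responses per session, treat this as a red flag, not proof.
print(length_bias_indicator(
    ["short answer", "a much longer and more detailed answer " * 5, "medium length answer"],
    [6.0, 8.5, 7.0],
))
```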
# Enable bias auditing
export LLM_COUNCIL_BIAS_AUDIT=true
# Customize thresholds (optional)
export LLM_COUNCIL_LENGTH_CORRELATION_THRESHOLD=0.3
export LLM_COUNCIL_POSITION_VARIANCE_THRESHOLD=0.5

For statistically meaningful bias detection across multiple sessions, enable bias persistence:
# Enable bias persistence (stores metrics locally)
export LLM_COUNCIL_BIAS_PERSISTENCE=true
# Consent level: 0=off, 1=local only (default), 2=anonymous, 3=enhanced, 4=research
export LLM_COUNCIL_BIAS_CONSENT=1
# Store path (default: ~/.llm-council/bias_metrics.jsonl)
export LLM_COUNCIL_BIAS_STORE=~/.llm-council/bias_metrics.jsonl

Generate cross-session bias reports:
# Text report (default)
llm-council bias-report
# JSON output
llm-council bias-report --format json
# Include detailed reviewer profiles
llm-council bias-report --verbose
# Limit to last 50 sessions
llm-council bias-report --sessions 50

Statistical confidence tiers:
| Sessions | Confidence | UI Treatment |
|---|---|---|
| N < 10 | Insufficient | "Collecting data..." |
| 10-19 | Preliminary | Warning shown |
| 20-49 | Moderate | CIs displayed |
| N >= 50 | High | Full analysis |
Quantify the reliability and quality of council outputs with three core metrics:
| Metric | Range | Description |
|---|---|---|
| Consensus Strength Score (CSS) | 0.0-1.0 | Agreement among council members in Stage 2 rankings |
| Deliberation Depth Index (DDI) | 0.0-1.0 | Thoroughness of the deliberation process |
| Synthesis Attribution Score (SAS) | 0.0-1.0 | How well synthesis traces back to source responses |
CSS Interpretation:
| Score | Interpretation | Action |
|---|---|---|
| 0.85+ | Strong consensus | High confidence in synthesis |
| 0.70-0.84 | Moderate consensus | Note minority views |
| 0.50-0.69 | Weak consensus | Consider include_dissent=true |
| <0.50 | Significant disagreement | Use debate mode |
SAS Components:
- winner_alignment: Similarity to top-ranked responses
- max_source_alignment: Best match to any response
- hallucination_risk: 1 - max_source_alignment
- grounded: True if synthesis traces to sources (threshold: 0.6)
Usage:
Quality metrics are automatically included in metadata:
stage1, stage2, stage3, metadata = await run_full_council(query)
quality = metadata.get("quality_metrics", {})
core = quality.get("core", {})
print(f"Consensus: {core['consensus_strength']:.2f}")
print(f"Depth: {core['deliberation_depth']:.2f}")
print(f"Grounded: {core['synthesis_attribution']['grounded']}")MCP Tool Output:
The consult_council MCP tool displays quality metrics with visual bars:
### Quality Metrics
- **Consensus Strength**: 0.82 [████████░░]
- **Deliberation Depth**: 0.74 [███████░░░]
- **Synthesis Grounded**: ✓ (alignment: 0.89)
Environment Variables:
# Enable/disable quality metrics (default: true)
export LLM_COUNCIL_QUALITY_METRICS=true
# Quality tier: core (OSS), standard, enterprise
export LLM_COUNCIL_QUALITY_TIER=core

The gateway layer provides an abstraction over LLM API requests with multiple gateway options:
Available Gateways:
| Gateway | Description | Key Features |
|---|---|---|
| OpenRouterGateway | Routes through OpenRouter | 100+ models, single API key |
| RequestyGateway | Routes through Requesty | BYOK support, analytics |
| DirectGateway | Direct provider APIs | Anthropic, OpenAI, Google |
| OllamaGateway | Local LLMs via Ollama | Air-gapped, cost-free, privacy-first |
Core Features:
- Circuit Breaker: Prevents cascading failures (CLOSED → OPEN → HALF_OPEN → CLOSED)
- Fallback Chains: Automatic retry with secondary gateways on failure
- Per-Gateway Metrics: Track failure counts, latency, and health status
- BYOK Support: Bring Your Own Key for Requesty and Direct gateways
Basic Usage:
from llm_council.gateway import (
GatewayRouter, GatewayRequest, CanonicalMessage, ContentBlock,
OpenRouterGateway, RequestyGateway, DirectGateway
)
# Single gateway (OpenRouter)
router = GatewayRouter()
# Multi-gateway with fallback
router = GatewayRouter(
gateways={
"openrouter": OpenRouterGateway(),
"requesty": RequestyGateway(byok_enabled=True, byok_keys={"anthropic": "sk-ant-..."}),
"direct": DirectGateway(provider_keys={"openai": "sk-..."}),
},
default_gateway="openrouter",
fallback_chains={"openrouter": ["requesty", "direct"]}
)
request = GatewayRequest(
model="openai/gpt-4o",
messages=[CanonicalMessage(role="user", content=[ContentBlock(type="text", text="Hello")])]
)
response = await router.complete(request)

Enable the gateway layer:
export LLM_COUNCIL_USE_GATEWAY=true
# Optional: Configure specific gateways
export REQUESTY_API_KEY=your-requesty-key # For Requesty
export ANTHROPIC_API_KEY=sk-ant-... # For Direct (Anthropic)
export OPENAI_API_KEY=sk-... # For Direct (OpenAI)
export GOOGLE_API_KEY=...          # For Direct (Google)

Circuit Breaker Behavior (see the sketch after this list):
- Default: 5 failures to trip the circuit
- Recovery timeout: 60 seconds
- Half-open state allows test requests to check recovery
- Open circuits are skipped in fallback chain
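The state machine can be pictured as follows; this is a simplified illustration of the CLOSED → OPEN → HALF_OPEN cycle with the defaults above, not the library's internal class:

```python
# Simplified circuit breaker: trips after N failures, probes again after a cooldown.
import time


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, recovery_timeout: float = 60.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        if self.state == "OPEN" and time.monotonic() - self.opened_at >= self.recovery_timeout:
            self.state = "HALF_OPEN"  # let one test request through
        return self.state != "OPEN"   # OPEN circuits are skipped in the fallback chain

    def record_success(self) -> None:
        self.failures, self.state = 0, "CLOSED"

    def record_failure(self) -> None:
        self.failures += 1
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state, self.opened_at = "OPEN", time.monotonic()
```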
The gateway layer is currently opt-in (default: disabled) for backward compatibility.
The triage layer provides query classification, model selection optimizations, and confidence-gated routing:
- Confidence-Gated Fast Path: Routes simple queries to a single model, escalating to full council when confidence is low
- Shadow Council Sampling: Random 5% sampling validates fast path quality against full council
- Rollback Monitoring: Automatic rollback when disagreement/escalation rates breach thresholds
- Wildcard Selection: Adds domain-specialized models to the council based on query classification
- Prompt Optimization: Per-model prompt adaptation (Claude gets XML, OpenAI gets Markdown)
- Complexity Classification: Heuristic-based with optional Not Diamond API integration
Domain Categories:
| Domain | Description | Specialist Models |
|---|---|---|
| CODE | Programming, debugging, algorithms | DeepSeek, Codestral |
| REASONING | Math, logic, proofs | o1-preview, DeepSeek-R1 |
| CREATIVE | Stories, poems, fiction | Claude Opus, Command-R+ |
| MULTILINGUAL | Translation, language | GPT-4o, Command-R+ |
| GENERAL | General knowledge | Llama 3 (fallback) |
Enable wildcard selection:
export LLM_COUNCIL_WILDCARD_ENABLED=true

This automatically adds a domain specialist to the council based on query classification. For example, a Python coding question will add a DeepSeek model alongside the default council.
Enable prompt optimization:
export LLM_COUNCIL_PROMPT_OPTIMIZATION_ENABLED=true

This applies per-model prompt formatting. Claude receives XML-structured prompts, while other providers receive their preferred format.
Enable confidence-gated fast path:
export LLM_COUNCIL_FAST_PATH_ENABLED=true
export LLM_COUNCIL_FAST_PATH_CONFIDENCE_THRESHOLD=0.92  # default

When enabled, simple queries are routed to a single model. If the model's confidence is below the threshold, the query automatically escalates to the full council. This can reduce costs by 45-55% on simple queries while maintaining quality.
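Conceptually, the routing looks like the sketch below. `ask_single_model` is a hypothetical helper standing in for the internal fast-path call; `run_full_council` is the library entry point shown elsewhere in this README:

```python
# Illustrative sketch of confidence-gated routing with escalation to the council.
from llm_council.council import run_full_council

CONFIDENCE_THRESHOLD = 0.92  # LLM_COUNCIL_FAST_PATH_CONFIDENCE_THRESHOLD


async def ask_single_model(query: str) -> tuple[str, float]:
    ...  # query one cheap model and have it self-report a confidence score


async def answer(query: str) -> str:
    answer_text, confidence = await ask_single_model(query)
    if confidence >= CONFIDENCE_THRESHOLD:
        return answer_text                      # fast path: single-model answer
    # Confidence below threshold: escalate to the full three-stage council.
    _, _, stage3, _ = await run_full_council(query)
    return stage3
```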
Fast Path Quality Monitoring:
The fast path includes built-in quality monitoring:
| Metric | Threshold | Action |
|---|---|---|
| Shadow disagreement rate | > 8% | Automatic rollback |
| User escalation rate | > 15% | Automatic rollback |
| Error rate | > 1.5x baseline | Automatic rollback |
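A rolling-window rollback check can be pictured like this; the thresholds mirror the table above, but the class itself is illustrative rather than the library's monitor:

```python
# Illustrative sketch of rolling-window rollback monitoring for the fast path.
from collections import deque

WINDOW = 100                    # LLM_COUNCIL_ROLLBACK_WINDOW
DISAGREEMENT_THRESHOLD = 0.08   # shadow disagreement rate
ESCALATION_THRESHOLD = 0.15     # user escalation rate


class RollbackMonitor:
    def __init__(self) -> None:
        self.shadow_disagreements: deque[bool] = deque(maxlen=WINDOW)
        self.user_escalations: deque[bool] = deque(maxlen=WINDOW)

    def record(self, shadow_disagreed: bool, user_escalated: bool) -> None:
        self.shadow_disagreements.append(shadow_disagreed)
        self.user_escalations.append(user_escalated)

    def should_roll_back(self) -> bool:
        def rate(window: deque[bool]) -> float:
            return sum(window) / len(window) if window else 0.0
        return (rate(self.shadow_disagreements) > DISAGREEMENT_THRESHOLD
                or rate(self.user_escalations) > ESCALATION_THRESHOLD)
```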
Configure monitoring:
export LLM_COUNCIL_SHADOW_SAMPLE_RATE=0.05 # 5% shadow sampling
export LLM_COUNCIL_ROLLBACK_ENABLED=true
export LLM_COUNCIL_ROLLBACK_WINDOW=100         # rolling window size

Optional Not Diamond Integration:
For advanced model routing, integrate with Not Diamond:
export NOT_DIAMOND_API_KEY="your-key"
export LLM_COUNCIL_USE_NOT_DIAMOND=true

When Not Diamond is unavailable, the system gracefully falls back to heuristic-based classification.
The triage layer is currently opt-in (default: disabled) for backward compatibility.
Run council deliberations entirely on local hardware using Ollama:
# Install with Ollama support
pip install "llm-council-core[ollama]"
# Start Ollama (if not already running)
ollama serve
# Pull a model
ollama pull llama3.2

Use local models in your council:
# Mix local and cloud models
export LLM_COUNCIL_MODELS="ollama/llama3.2,openai/gpt-4o,anthropic/claude-3-5-sonnet"
# Or use only local models (air-gapped mode)
export LLM_COUNCIL_MODELS="ollama/llama3.2,ollama/mistral,ollama/codellama"
export LLM_COUNCIL_CHAIRMAN="ollama/llama3.2"Hardware Requirements:
| Profile | Hardware | Models | Use Case |
|---|---|---|---|
| Minimum | 8+ core CPU, 16GB RAM | 7B quantized | Dev/testing |
| Recommended | M-series Pro, 32GB unified | 7B-13B | Small council |
| Professional | 2x RTX 4090, 64GB+ | 70B quantized | Production |
| Enterprise | Mac Studio 64GB+ | Multiple 70B | Air-gapped |
Quality Degradation Notice: Local models typically have reduced capabilities compared to cloud-hosted frontier models. The gateway includes quality notices in responses to inform users when local models are used.
Configuration:
# Ollama endpoint (default: http://localhost:11434)
export LLM_COUNCIL_OLLAMA_BASE_URL=http://localhost:11434
# Timeout for local models (default: 120s - first load can be slow)
export LLM_COUNCIL_OLLAMA_TIMEOUT=120.0

Receive real-time notifications as the council deliberates:
from llm_council.webhooks import WebhookConfig, WebhookDispatcher, WebhookPayload
# Configure webhook endpoint
config = WebhookConfig(
url="https://your-server.com/webhook",
events=["council.complete", "council.error"],
secret="your-hmac-secret" # Optional, for signature verification
)
# Dispatch events (used internally by the council);
# construct `payload` as a WebhookPayload describing the event being sent
dispatcher = WebhookDispatcher()
result = await dispatcher.dispatch(config, payload)

Event Types:
| Event | Description | Payload |
|---|---|---|
| council.deliberation_start | Council begins | request_id, models |
| council.stage1.complete | All initial responses received | response_count |
| model.vote_cast | A model submitted rankings | voter, ranking |
| council.stage2.complete | All rankings complete | aggregate_rankings |
| council.complete | Final answer ready | stage3_response, duration_ms |
| council.error | Error occurred | error, partial_results |
HMAC Signature Verification:
Webhooks are signed using HMAC-SHA256 for security:
from llm_council.webhooks import verify_webhook_request
# Verify incoming webhook (in your server)
is_valid = verify_webhook_request(
payload=request.body,
headers=request.headers,
secret="your-hmac-secret"
)

Headers included:
- X-Council-Signature: sha256=<hex-digest>
- X-Council-Timestamp: Unix timestamp
- X-Council-Version: 1.0
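For example, a FastAPI receiver might verify the signature before trusting the payload. This is a sketch: the `verify_webhook_request` call follows the usage shown above, but the FastAPI request handling around it is an assumption and your framework's API may differ:

```python
# Sketch of a webhook receiver that verifies the HMAC signature before use.
from fastapi import FastAPI, HTTPException, Request

from llm_council.webhooks import verify_webhook_request

app = FastAPI()


@app.post("/webhook")
async def council_webhook(request: Request):
    body = await request.body()
    if not verify_webhook_request(
        payload=body,
        headers=dict(request.headers),
        secret="your-hmac-secret",   # same secret as in WebhookConfig
    ):
        raise HTTPException(status_code=401, detail="invalid signature")
    event = await request.json()
    # Handle council.complete, council.error, etc.
    return {"ok": True}
```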
Configuration:
# Webhook timeout (default: 5s)
export LLM_COUNCIL_WEBHOOK_TIMEOUT=5.0
# Max retry attempts (default: 3)
export LLM_COUNCIL_WEBHOOK_RETRIES=3
# Require HTTPS (except localhost) - default: true
export LLM_COUNCIL_WEBHOOK_HTTPS_ONLY=true

Stream council events in real-time using Server-Sent Events (SSE).
Built-in HTTP Server:
The library includes a built-in HTTP server with SSE streaming:
# Install with HTTP server support
pip install "llm-council-core[http]"
# Start the server
llm-council serve

The SSE endpoint is available at GET /v1/council/stream:
# Stream council deliberation
curl -N "http://localhost:8000/v1/council/stream?prompt=What+is+AI"

Custom Integration:
For custom FastAPI/Starlette applications:
from llm_council.webhooks import (
council_event_generator,
SSE_CONTENT_TYPE,
get_sse_headers
)
# In your FastAPI/Starlette endpoint
@app.get("/v1/council/stream")
async def stream_council(prompt: str):
return StreamingResponse(
council_event_generator(prompt, models=None, api_key=None),
media_type=SSE_CONTENT_TYPE,
headers=get_sse_headers()
    )

Client-side (JavaScript):
const source = new EventSource('/v1/council/stream?prompt=...');
source.addEventListener('council.deliberation_start', (e) => {
console.log('Started:', JSON.parse(e.data));
});
source.addEventListener('council.complete', (e) => {
const result = JSON.parse(e.data);
console.log('Final answer:', result.stage3_response);
source.close();
});

Run LLM Council without any external metadata calls using the bundled model registry:
# Enable offline mode
export LLM_COUNCIL_OFFLINE=trueWhen offline mode is enabled:
- Uses
StaticRegistryProviderexclusively with 31 bundled models - No external API calls for metadata (context windows, pricing, capabilities)
- All core council operations continue to work
- Unknown models use safe defaults (4096 context window)
This implements the "Sovereign Orchestrator" philosophy: the system must function as a complete, independent utility without external dependencies.
Bundled Models (31 total):
| Provider | Count | Examples |
|---|---|---|
| OpenAI | 9 | gpt-4o, gpt-4o-mini, gpt-5.2, gpt-5-mini, o1, o1-preview, o1-mini, o3-mini |
| Anthropic | 5 | claude-opus-4.6, claude-3-5-sonnet, claude-3-5-haiku |
| 5 | gemini-3-pro, gemini-2.5-pro, gemini-2.0-flash, gemini-1.5-pro | |
| xAI | 2 | grok-4, grok-4.1-fast |
| DeepSeek | 2 | deepseek-r1, deepseek-chat |
| Meta | 2 | llama-3.3-70b, llama-3.1-405b |
| Mistral | 2 | mistral-large-2411, mistral-medium |
| Ollama | 6 | llama3.2, mistral, qwen2.5:14b, codellama, phi3 |
Model Metadata API:
from llm_council.metadata import get_provider
# Get the metadata provider
provider = get_provider()
# Query model information
info = provider.get_model_info("openai/gpt-4o")
print(f"Context window: {info.context_window}") # 128000
print(f"Quality tier: {info.quality_tier}") # frontier
# Check capabilities
window = provider.get_context_window("anthropic/claude-opus-4.6") # 200000
supports = provider.supports_reasoning("openai/o1") # True
# List all available models
models = provider.list_available_models()  # 31 models

LLM Council includes agent skills for AI-assisted code verification, review, and CI/CD quality gates. Skills use progressive disclosure to minimize token usage while providing detailed scoring rubrics when needed.
Available Skills:
| Skill | Category | Use Case |
|---|---|---|
| council-verify | verification | General work verification with multi-dimensional scoring |
| council-review | code-review | PR reviews with security, performance, and testing focus |
| council-gate | ci-cd | Quality gates for pipelines with exit codes (0=PASS, 1=FAIL, 2=UNCLEAR) |
Exit Codes for CI/CD:
# In GitHub Actions or any CI pipeline
llm-council gate --snapshot $GITHUB_SHA --rubric-focus Security
# Exit codes:
# 0 = PASS (confidence >= threshold, no blocking issues)
# 1 = FAIL (blocking issues present)
# 2 = UNCLEAR (needs human review)

Skills are located in .github/skills/ and work with Claude Code, VS Code Copilot, Cursor, and other MCP-compatible clients.
For detailed documentation, see the Skills Guide.
| Variable | Description | Default |
|---|---|---|
| OPENROUTER_API_KEY | OpenRouter API key | Required for OpenRouter gateway |
| REQUESTY_API_KEY | Requesty API key | Required for Requesty gateway |
| ANTHROPIC_API_KEY | Anthropic API key | Required for Direct gateway (Anthropic) |
| OPENAI_API_KEY | OpenAI API key | Required for Direct gateway (OpenAI) |
| GOOGLE_API_KEY | Google API key | Required for Direct gateway (Google) |
| LLM_COUNCIL_DEFAULT_GATEWAY | Default gateway (openrouter/requesty/direct) | openrouter |
| LLM_COUNCIL_USE_GATEWAY | Enable gateway layer with circuit breaker | false |
| Variable | Description | Default |
|---|---|---|
| LLM_COUNCIL_OLLAMA_BASE_URL | Ollama API endpoint | http://localhost:11434 |
| LLM_COUNCIL_OLLAMA_TIMEOUT | Timeout for Ollama requests (seconds) | 120.0 |
| LLM_COUNCIL_USE_LITELLM | Enable LiteLLM wrapper for Ollama | true |
| Variable | Description | Default |
|---|---|---|
| LLM_COUNCIL_WEBHOOK_TIMEOUT | Webhook POST timeout (seconds) | 5.0 |
| LLM_COUNCIL_WEBHOOK_RETRIES | Max retry attempts | 3 |
| LLM_COUNCIL_WEBHOOK_HTTPS_ONLY | Require HTTPS (except localhost) | true |
| Variable | Description | Default |
|---|---|---|
| LLM_COUNCIL_MODELS | Comma-separated model list | GPT-5.1, Gemini 3 Pro, Claude 4.5, Grok 4 |
| LLM_COUNCIL_CHAIRMAN | Chairman model | google/gemini-3-pro-preview |
| LLM_COUNCIL_MODE | consensus or debate | consensus |
| LLM_COUNCIL_EXCLUDE_SELF_VOTES | Exclude self-votes | true |
| LLM_COUNCIL_STYLE_NORMALIZATION | Enable style normalization | false |
| LLM_COUNCIL_NORMALIZER_MODEL | Model for normalization | google/gemini-2.0-flash-001 |
| LLM_COUNCIL_MAX_REVIEWERS | Max reviewers per response | null (all) |
| LLM_COUNCIL_RUBRIC_SCORING | Enable multi-dimensional rubric scoring | false |
| LLM_COUNCIL_ACCURACY_CEILING | Use accuracy as score ceiling | true |
| LLM_COUNCIL_WEIGHT_* | Rubric dimension weights (ACCURACY, RELEVANCE, COMPLETENESS, CONCISENESS, CLARITY) | See above |
| LLM_COUNCIL_SAFETY_GATE | Enable safety pre-check gate | false |
| LLM_COUNCIL_SAFETY_SCORE_CAP | Score cap for failed safety checks | 0.0 |
| LLM_COUNCIL_BIAS_AUDIT | Enable bias auditing (ADR-015) | false |
| LLM_COUNCIL_LENGTH_CORRELATION_THRESHOLD | Length-score correlation threshold for bias detection | 0.3 |
| LLM_COUNCIL_POSITION_VARIANCE_THRESHOLD | Position variance threshold for bias detection | 0.5 |
| LLM_COUNCIL_BIAS_PERSISTENCE | Enable cross-session bias storage (ADR-018) | false |
| LLM_COUNCIL_BIAS_STORE | Path to bias metrics JSONL file | ~/.llm-council/bias_metrics.jsonl |
| LLM_COUNCIL_BIAS_CONSENT | Consent level: 0=off, 1=local, 2=anonymous, 3=enhanced, 4=research | 1 |
| LLM_COUNCIL_BIAS_WINDOW_SESSIONS | Rolling window: max sessions for aggregation | 100 |
| LLM_COUNCIL_BIAS_WINDOW_DAYS | Rolling window: max days for aggregation | 30 |
| LLM_COUNCIL_MIN_BIAS_SESSIONS | Minimum sessions for aggregation analysis | 20 |
| LLM_COUNCIL_HASH_SECRET | Secret for query hashing (RESEARCH consent only) | dev-secret |
| LLM_COUNCIL_SUPPRESS_WARNINGS | Suppress security warnings | false |
| LLM_COUNCIL_MODELS_QUICK | Models for quick tier (ADR-022) | gpt-4o-mini, haiku, gemini-flash |
| LLM_COUNCIL_MODELS_BALANCED | Models for balanced tier (ADR-022) | gpt-4o, sonnet, gemini-pro |
| LLM_COUNCIL_MODELS_HIGH | Models for high tier (ADR-022) | gpt-4o, opus, gemini-3-pro, grok-4 |
| LLM_COUNCIL_MODELS_REASONING | Models for reasoning tier (ADR-022) | gpt-5.2, opus, o1-preview, deepseek-r1 |
| LLM_COUNCIL_WILDCARD_ENABLED | Enable wildcard specialist selection (ADR-020) | false |
| LLM_COUNCIL_PROMPT_OPTIMIZATION_ENABLED | Enable per-model prompt optimization (ADR-020) | false |
| LLM_COUNCIL_FAST_PATH_ENABLED | Enable confidence-gated fast path (ADR-020) | false |
| LLM_COUNCIL_FAST_PATH_CONFIDENCE_THRESHOLD | Confidence threshold for fast path (0.0-1.0) | 0.92 |
| LLM_COUNCIL_FAST_PATH_MODEL | Model for fast path routing | auto |
| LLM_COUNCIL_SHADOW_SAMPLE_RATE | Shadow sampling rate (0.0-1.0) | 0.05 |
| LLM_COUNCIL_SHADOW_DISAGREEMENT_THRESHOLD | Disagreement threshold for shadow samples | 0.08 |
| LLM_COUNCIL_ROLLBACK_ENABLED | Enable rollback metric tracking | true |
| LLM_COUNCIL_ROLLBACK_WINDOW | Rolling window size for metrics | 100 |
| LLM_COUNCIL_ROLLBACK_DISAGREEMENT_THRESHOLD | Shadow disagreement rollback threshold | 0.08 |
| LLM_COUNCIL_ROLLBACK_ESCALATION_THRESHOLD | User escalation rollback threshold | 0.15 |
| NOT_DIAMOND_API_KEY | Not Diamond API key (optional) | - |
| LLM_COUNCIL_USE_NOT_DIAMOND | Enable Not Diamond API integration | false |
| LLM_COUNCIL_NOT_DIAMOND_TIMEOUT | Not Diamond API timeout in seconds | 5.0 |
| LLM_COUNCIL_NOT_DIAMOND_CACHE_TTL | Not Diamond response cache TTL in seconds | 300 |
This project is a derivative work based on the original llm-council by Andrej Karpathy.
Original Work:
- Concept and 3-stage council orchestration: Andrej Karpathy
- Core council logic (Stage 1-3 process)
- OpenRouter integration
Derivative Work by Amiable:
- MCP (Model Context Protocol) server implementation
- Removal of web frontend (focus on MCP functionality)
- Python package structure for PyPI distribution
- User-configurable model selection
- Enhanced features (style normalization, self-vote exclusion, synthesis modes)
- Test suite and modern packaging standards
MIT License - see LICENSE file for details.