Multi-Agent PR Reviewer - Features & Reference

Core Features

1. Multi-Agent Debate System

Three specialized AI agents collaborate to review code:

OpenAI Architect (GPT-5.4-mini)

Correctness and logical soundness
Design patterns and maintainability
Code clarity and readability
Backward compatibility concerns
API design best practices

Anthropic Security Auditor (Claude Sonnet)

Vulnerability detection
Threat modeling
Authentication/authorization issues
Input validation and sanitization
Data protection compliance

Gemini Runtime Tester (Gemini 3.1 Flash)

Test coverage analysis
Performance implications
Database migration safety
Deployment readiness
Environment-specific issues

2. Consensus Mechanism

Voting Rules:

All 3 agents must approve to merge
No critical/high severity issues allowed
Configurable confidence threshold (default: 80%)
Up to 3 rebuttal rounds if agents disagree

Decision Matrix:

All Approve        → ✅ APPROVED (merge ready)
Any Disapprove     → ❌ REQUEST_CHANGES (rework needed)
Debate Continues   → Agents provide rebuttals (up to 3 rounds)
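
The voting rules and decision matrix above can be sketched as one small function. This is an illustration only, not the project's implementation: AgentVote, evaluate_consensus, and the rounds_done parameter are hypothetical names, and applying the threshold to the mean confidence (rather than to each agent) is an assumption.

```python
from dataclasses import dataclass, field

BLOCKING_SEVERITIES = {"CRITICAL", "HIGH"}  # assumed labels for "critical/high"


@dataclass
class AgentVote:
    """One agent's vote (hypothetical container, not the project's schema class)."""
    agent: str
    verdict: str                  # "APPROVED" or "REQUEST_CHANGES"
    confidence: float             # 0-100
    severities: list = field(default_factory=list)  # severities of reported issues


def evaluate_consensus(votes, rounds_done, max_rounds=3, min_confidence=80.0):
    """Return "APPROVED", "DEBATE", or "REQUEST_CHANGES" for the current state."""
    if not votes:
        raise ValueError("at least one vote is required")

    all_approve = all(v.verdict == "APPROVED" for v in votes)
    blocking = any(s.upper() in BLOCKING_SEVERITIES
                   for v in votes for s in v.severities)
    mean_confidence = sum(v.confidence for v in votes) / len(votes)

    if all_approve and not blocking and mean_confidence >= min_confidence:
        return "APPROVED"

    # Agents disagree and rebuttal rounds remain: keep debating.
    disagree = len({v.verdict for v in votes}) > 1
    if disagree and rounds_done < max_rounds:
        return "DEBATE"

    # Fail closed: anything short of a clean unanimous approval requests changes.
    return "REQUEST_CHANGES"
```

Here rounds_done counts completed rebuttal rounds, so a split vote returns "DEBATE" for values 0 through 2 and "REQUEST_CHANGES" once it reaches 3.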

3. Confidence Scoring

Each agent provides confidence metrics (0-100%):

Correctness (OpenAI) - Design soundness
Security (Anthropic) - Absence of vulnerabilities
Tests (Gemini) - Coverage and performance

Aggregate metrics:

Average: Mean of all agent confidence scores
Per-category: Each agent's score for its own area (correctness, security, tests)

Example Output:

{
  "confidence_scores": {
    "average": 91.7,
    "by_agent": {
      "openai_architect": 95.0,
      "anthropic_security": 88.0,
      "gemini_runtime": 92.0
    }
  }
}
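
A minimal sketch of how the aggregate could be computed from per-agent scores. The function name aggregate_confidence is hypothetical, and rounding to one decimal place is an assumption based on the example output.

```python
from statistics import mean


def aggregate_confidence(by_agent):
    """Build a confidence_scores object from per-agent scores on a 0-100 scale."""
    for agent, score in by_agent.items():
        if not 0.0 <= score <= 100.0:
            raise ValueError(f"{agent}: confidence {score} is outside 0-100")
    return {
        "confidence_scores": {
            "average": round(mean(list(by_agent.values())), 1),
            "by_agent": dict(by_agent),
        }
    }


result = aggregate_confidence({
    "openai_architect": 95.0,
    "anthropic_security": 88.0,
    "gemini_runtime": 92.0,
})
print(result["confidence_scores"]["average"])  # 91.7
```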

4. API Request & Response Capture

What Gets Captured:

All LLM API calls (requests and responses)
Model, tokens, latency
Complete prompt and completion
Tool calls and reasoning steps

Files Generated:

api_interactions_*.jsonl - Raw request/response pairs
api_capture_*.log - Human-readable API log

Example Interaction:

{
  "request_id": "req_001",
  "timestamp": "2024-05-03T23:50:00Z",
  "provider": "openai",
  "model": "gpt-5.4-mini",
  "system_prompt": "You are a code review architect...",
  "user_message": "Review this PR...",
  "temperature": 0.7,
  "max_tokens": 2000,
  "response": "The code looks correct...",
  "tokens_used": {"prompt": 1250, "completion": 850},
  "latency_ms": 3421
}
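
One way a capture step could produce such a record is to wrap the model call, time it, and append a single JSON line. This is a sketch under assumptions: capture_interaction and the call_model callback are hypothetical names, and the real api_capture.py may work differently.

```python
import json
import time
from datetime import datetime, timezone
from pathlib import Path


def capture_interaction(path, request, call_model):
    """Run call_model(request), time it, and append one JSON line to path.

    call_model is a stand-in for a provider call; it is assumed to return
    (response_text, tokens_used).
    """
    started = time.perf_counter()
    response_text, tokens_used = call_model(request)
    latency_ms = int((time.perf_counter() - started) * 1000)

    record = {
        **request,
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "response": response_text,
        "tokens_used": tokens_used,
        "latency_ms": latency_ms,
    }
    # Append mode keeps earlier interactions; each record occupies one line.
    with Path(path).open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record
```

Appending one object per line keeps the file readable with line-oriented tools such as head and wc.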

5. Fine-Tuning Dataset Generation

What Gets Generated:

fine_tuning_dataset.jsonl - Training data in OpenAI format
Paired request/response samples
All agent perspectives (architect, security, tester)

Format (OpenAI Fine-tuning):

{
  "messages": [
    {
      "role": "system",
      "content": "You are a code review architect..."
    },
    {
      "role": "user",
      "content": "Review this PR:\n<diff>"
    },
    {
      "role": "assistant",
      "content": "The code looks correct..."
    }
  ]
}
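
Mapping a captured interaction to this format is mostly a field rename. The sketch below assumes the interaction fields shown in the API capture example (system_prompt, user_message, response); the function names are hypothetical.

```python
import json


def to_fine_tuning_example(interaction):
    """Map one captured interaction to a chat-format training example."""
    return {
        "messages": [
            {"role": "system", "content": interaction["system_prompt"]},
            {"role": "user", "content": interaction["user_message"]},
            {"role": "assistant", "content": interaction["response"]},
        ]
    }


def write_dataset(interactions, path):
    """Write one example per line; skip interactions that have no response."""
    written = 0
    with open(path, "w", encoding="utf-8") as fh:
        for item in interactions:
            if not item.get("response"):
                continue  # a failed or empty call makes a useless sample
            fh.write(json.dumps(to_fine_tuning_example(item)) + "\n")
            written += 1
    return written
```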

Use Cases:

Fine-tune open-source models (Llama, Mistral, etc.)
Train domain-specific code reviewers
Improve review consistency
Reduce API costs with smaller fine-tuned models

6. GitHub Integration

Webhook Events Handled:

pull_request.opened - New PR submitted
pull_request.synchronize - Commits pushed
pull_request.reopened - PR reopened
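
A webhook handler typically verifies the request signature and then filters on the event and action. The sketch below relies on GitHub's documented X-Hub-Signature-256 header (HMAC-SHA256 of the raw body, prefixed with "sha256=") and X-GitHub-Event header; the function names are hypothetical and not taken from app/main.py.

```python
import hashlib
import hmac

HANDLED_ACTIONS = {"opened", "synchronize", "reopened"}


def signature_is_valid(secret, body, signature_header):
    """Compare the X-Hub-Signature-256 header with an HMAC of the raw body."""
    digest = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    expected = ("sha256=" + digest).encode("utf-8")
    received = (signature_header or "").encode("utf-8")
    # compare_digest runs in constant time to avoid leaking the match position.
    return hmac.compare_digest(expected, received)


def should_review(event_name, payload):
    """True only for the pull_request actions listed above.

    event_name is the X-GitHub-Event header; the action is in the JSON payload.
    """
    return event_name == "pull_request" and payload.get("action") in HANDLED_ACTIONS
```

The secret would come from GITHUB_WEBHOOK_SECRET, and body must be the raw request bytes, not re-serialized JSON.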

Check Run Status:

Creates GitHub Check with detailed output
Real-time status updates
Links to full review results

PR Review Comments:

Posts formatted markdown review
Includes verdict and confidence
Lists specific issues per agent
Structured rebuttals if agents disagree

7. Batch Review Mode

Review multiple PRs programmatically with confidence tracking and dataset generation.

Command:

python batch_review_enhanced.py owner repo \
  --installation-id ID \
  [--no-post] [--max-prs N] [--no-capture]

Options:

--installation-id - GitHub App installation ID
--no-post - Dry run (no GitHub updates)
--max-prs N - Limit to N PRs (for testing)
--no-capture - Skip API capture (faster)

Output:

Batch Review Results:
├── Reviewed: 40/40 PRs
├── Approved: 32 (80%)
├── Requested Changes: 8 (20%)
├── Average Confidence: 87.3%
└── Total Time: 28 minutes

Generated Files:
├── batch_review_results_enhanced.json
├── api_interactions_*.jsonl
├── api_capture_*.log
└── fine_tuning_dataset.jsonl

Configuration Reference

Environment Variables

# LLM API Keys
OPENAI_API_KEY=sk-proj-...
ANTHROPIC_API_KEY=sk-ant-...
GEMINI_API_KEY=AIzaSy...

# Models
OPENAI_MODEL=gpt-4o                           # Default: gpt-4o
ANTHROPIC_MODEL=claude-sonnet-4-6             # Default: claude-sonnet-4-6
GEMINI_MODEL=gemini-3.1-flash-lite-preview   # Default: gemini-3.1-flash-lite-preview

# GitHub App
GITHUB_APP_ID=3592372
GITHUB_PRIVATE_KEY_PATH=./private-key.pem
GITHUB_WEBHOOK_SECRET=your-secret-here
GITHUB_INSTALLATION_ID=<installation_id>      # Optional, can pass as CLI arg

# Review Configuration
MAX_DEBATE_ROUNDS=3                           # Max rebuttal rounds (default: 3)
MIN_CONFIDENCE_TO_APPROVE=0.80                # Minimum confidence as a fraction (default: 0.80 = 80%)

# Optional
LOG_LEVEL=INFO
DEBUG=false
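
A minimal sketch of reading the review settings with the defaults listed above. ReviewSettings and load_settings are hypothetical names; the project may load its configuration differently (for example through a .env loader).

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewSettings:
    openai_model: str
    anthropic_model: str
    gemini_model: str
    max_debate_rounds: int
    min_confidence_to_approve: float  # fraction, 0.0-1.0


def load_settings(env=None):
    """Read settings from env (defaults to os.environ) using the documented defaults."""
    env = os.environ if env is None else env
    settings = ReviewSettings(
        openai_model=env.get("OPENAI_MODEL", "gpt-4o"),
        anthropic_model=env.get("ANTHROPIC_MODEL", "claude-sonnet-4-6"),
        gemini_model=env.get("GEMINI_MODEL", "gemini-3.1-flash-lite-preview"),
        max_debate_rounds=int(env.get("MAX_DEBATE_ROUNDS", "3")),
        min_confidence_to_approve=float(env.get("MIN_CONFIDENCE_TO_APPROVE", "0.80")),
    )
    # Catch the common mistake of writing 80 instead of 0.80.
    if not 0.0 <= settings.min_confidence_to_approve <= 1.0:
        raise ValueError("MIN_CONFIDENCE_TO_APPROVE must be between 0 and 1")
    if settings.max_debate_rounds < 0:
        raise ValueError("MAX_DEBATE_ROUNDS must not be negative")
    return settings
```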

Review Schema

Each agent returns JSON with this structure:

{
  "verdict": "APPROVED or REQUEST_CHANGES",
  "confidence": 92.5,
  "issues": [
    {
      "severity": "HIGH or MEDIUM or LOW",
      "category": "security or correctness or performance or tests",
      "title": "Issue title",
      "description": "Detailed explanation",
      "location": "filename:line_number",
      "suggestion": "How to fix"
    }
  ],
  "summary": "Brief summary of review",
  "positives": ["What went well"],
  "rebuttal": "Response to other agents (if applicable)"
}
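
Because the agents return free-form model output, it can help to validate each reply against this structure before using it. The checker below is a sketch with hypothetical names; it covers only the enumerated fields shown above and is not the validation in app/schemas.py.

```python
import json

VERDICTS = ("APPROVED", "REQUEST_CHANGES")
SEVERITIES = ("HIGH", "MEDIUM", "LOW")
CATEGORIES = ("security", "correctness", "performance", "tests")


def validate_review(raw):
    """Parse one agent reply; return a list of problems (empty when valid)."""
    try:
        review = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(review, dict):
        return ["top-level value must be a JSON object"]

    problems = []
    if review.get("verdict") not in VERDICTS:
        problems.append("verdict must be APPROVED or REQUEST_CHANGES")

    confidence = review.get("confidence")
    is_number = isinstance(confidence, (int, float)) and not isinstance(confidence, bool)
    if not is_number or not 0 <= confidence <= 100:
        problems.append("confidence must be a number between 0 and 100")

    issues = review.get("issues", [])
    if not isinstance(issues, list):
        return problems + ["issues must be a list"]
    for index, issue in enumerate(issues):
        if not isinstance(issue, dict):
            problems.append(f"issues[{index}] must be an object")
            continue
        if issue.get("severity") not in SEVERITIES:
            problems.append(f"issues[{index}].severity is not HIGH, MEDIUM, or LOW")
        if issue.get("category") not in CATEGORIES:
            problems.append(f"issues[{index}].category is not recognized")
    return problems
```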

API Architecture

Request Flow

GitHub PR Event
    ↓
FastAPI Webhook Handler
    ↓
GitHub Context Builder (fetch PR metadata, diffs)
    ↓
LangGraph Workflow
    ├─→ OpenAI Architect Agent
    ├─→ Anthropic Security Agent
    └─→ Gemini Runtime Agent
    ↓
Consensus Evaluator
    ├─→ All approved? → Return APPROVED
    └─→ Disagreement? → Start debate rounds (max 3)
    ↓
API Capture & Logging
    ├─→ Save interactions to JSONL
    └─→ Generate confidence metrics
    ↓
GitHub Feedback
    ├─→ Create/update Check Run
    ├─→ Post PR Review
    └─→ Set approval status

LangGraph Workflow

Nodes:

initial_review - All agents review in parallel
evaluate_consensus - Check if consensus reached
start_debate - Begin rebuttal rounds
debate_round - Each agent provides rebuttal
finalize_review - Compile final verdict

Edges:

initial_review
    ↓
evaluate_consensus
    ├─→ consensus_reached → finalize_review ✅
    ├─→ not_reached (< 3 rounds) → start_debate → debate_round → evaluate_consensus
    └─→ max_rounds_reached → finalize_review (all agents must approve; otherwise request changes)
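
The control flow of these nodes and edges can be sketched in plain Python, without LangGraph, to make the loop explicit. This is not the code in app/graph/workflow.py: run_review and the agent callable signature are hypothetical, and the agents run sequentially here rather than in parallel.

```python
def run_review(agents, pr_context, max_rounds=3):
    """Run the review loop.

    agents maps a name to a callable(pr_context, other_reviews) that returns
    a review dict with at least a "verdict" key.
    """
    def all_approved(reviews):
        return all(r["verdict"] == "APPROVED" for r in reviews.values())

    # initial_review: no agent sees the others' output yet.
    reviews = {name: agent(pr_context, {}) for name, agent in agents.items()}
    rounds = 0

    # evaluate_consensus, then start_debate / debate_round while rounds remain.
    while not all_approved(reviews) and rounds < max_rounds:
        rounds += 1
        previous = reviews
        reviews = {
            name: agent(pr_context,
                        {k: v for k, v in previous.items() if k != name})
            for name, agent in agents.items()
        }

    # finalize_review: fail closed unless every agent approves.
    verdict = "APPROVED" if all_approved(reviews) else "REQUEST_CHANGES"
    return {"verdict": verdict, "rounds": rounds, "agent_reviews": reviews}
```

Each debate round passes an agent the previous round's reviews from the other agents, which is one plausible way to give it something to rebut.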

Performance Characteristics

Time Complexity

Per PR review:

API calls: 3 initial + up to 9 rebuttals (3 agents × 3 rounds) = 3-12 calls
Time per call: 5-15 seconds
Total time: 30-60 seconds per PR

Factors affecting speed:

PR size (larger = slower)
Code complexity
API provider latency
Debate rounds (increases time)

Cost Characteristics

Typical tokens per review:

Prompt tokens: 1,500-3,000 (depends on diff size)
Completion tokens: 300-800 per agent
Total: ~5,000-10,000 tokens

Cost per PR:

OpenAI GPT-4o: $0.02-0.10 (input/output rates)
Anthropic Claude: $0.03-0.08
Google Gemini: $0.01-0.05
Average: $0.10-0.30 per PR

For 40 PRs: approximately $4-12 total

Scalability

Batch Review Performance:

10 PRs  →  5-10 minutes    →  $1-3
20 PRs  →  10-20 minutes   →  $2-6
40 PRs  →  20-40 minutes   →  $4-12
100 PRs →  50-100 minutes  →  $10-30
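
These figures scale linearly from the per-PR estimates (30-60 seconds and $0.10-0.30 per PR), so other batch sizes can be estimated the same way. The helper below is a hypothetical sketch that assumes PRs are reviewed one after another.

```python
def estimate_batch(pr_count, seconds_per_pr=(30, 60), cost_per_pr=(0.10, 0.30)):
    """Return (min_minutes, max_minutes, min_cost, max_cost) for a batch."""
    low_seconds, high_seconds = seconds_per_pr
    low_cost, high_cost = cost_per_pr
    return (
        pr_count * low_seconds / 60,
        pr_count * high_seconds / 60,
        round(pr_count * low_cost, 2),
        round(pr_count * high_cost, 2),
    )


print(estimate_batch(40))  # (20.0, 40.0, 4.0, 12.0)
```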

Optimization Tips:

Use --no-capture to skip logging (2x faster, but no fine-tuning dataset is generated)
Review in batches (avoid hitting rate limits)
Schedule during off-peak hours
Use smaller models for initial screening

Data Formats

Batch Review Results JSON

{
  "summary": {
    "total_prs": 40,
    "approved": 32,
    "requested_changes": 8,
    "average_confidence": 87.3,
    "total_time_seconds": 1680
  },
  "reviews": [
    {
      "pr_number": 123,
      "title": "Add user authentication",
      "verdict": "APPROVED",
      "confidence_scores": {
        "average": 91.7,
        "by_agent": {
          "openai_architect": 95.0,
          "anthropic_security": 88.0,
          "gemini_runtime": 92.0
        }
      },
      "agent_reviews": {
        "openai_architect": {...},
        "anthropic_security": {...},
        "gemini_runtime": {...}
      }
    }
  ]
}
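
The summary object can be derived from the reviews list. A minimal sketch, assuming each review carries the verdict and confidence_scores.average fields shown above; summarize is a hypothetical name.

```python
def summarize(reviews, total_time_seconds):
    """Compute the batch summary fields from a list of per-PR reviews."""
    approved = sum(1 for r in reviews if r["verdict"] == "APPROVED")
    confidences = [r["confidence_scores"]["average"] for r in reviews]
    average = round(sum(confidences) / len(confidences), 1) if confidences else 0.0
    return {
        "total_prs": len(reviews),
        "approved": approved,
        "requested_changes": len(reviews) - approved,
        "average_confidence": average,
        "total_time_seconds": total_time_seconds,
    }
```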

API Interactions JSONL

{"request_id":"req_001","timestamp":"...","provider":"openai","model":"gpt-4o","system_prompt":"...","user_message":"...","temperature":0.7,"max_tokens":2000}
{"request_id":"req_001","timestamp":"...","provider":"openai","status":200,"response":"...","tokens_used":{"prompt":1250,"completion":850},"latency_ms":3421}
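
In this layout the request and the response are separate lines that share a request_id, so analysis usually starts by joining them. A sketch with a hypothetical function name; it assumes every line has a request_id.

```python
import json


def pair_interactions(lines):
    """Merge request and response lines that share a request_id.

    Later lines overwrite duplicate keys, so the merged "timestamp" is the
    response timestamp when both lines carry one.
    """
    merged = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        merged.setdefault(record["request_id"], {}).update(record)
    return list(merged.values())
```

For a file on disk, pass the open file object directly, since iterating over it yields lines.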

Fine-Tuning Dataset JSONL

{"messages":[{"role":"system","content":"You are a code review architect..."},{"role":"user","content":"Review this PR:\n..."},{"role":"assistant","content":"..."}]}
{"messages":[{"role":"system","content":"You are a security auditor..."},{"role":"user","content":"Review this PR:\n..."},{"role":"assistant","content":"..."}]}

Advanced Usage

Custom Models

Edit .env to use different models:

# Use latest models
OPENAI_MODEL=gpt-5.5-medium
ANTHROPIC_MODEL=claude-opus-4-7-thinking-xhigh
GEMINI_MODEL=gemini-pro

# Or use specific versions
OPENAI_MODEL=gpt-5.4-mini
ANTHROPIC_MODEL=claude-3-opus-20240229

Fine-Tuning Workflow

Generate dataset:

python batch_review_enhanced.py owner repo --installation-id ID

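Fine-tune model (example with OpenAI; the legacy fine_tunes.create command predates the chat "messages" format, so upload the dataset and create a fine-tuning job instead):

openai api files.create --file fine_tuning_dataset.jsonl --purpose fine-tune
openai api fine_tuning.jobs.create \
  --training-file <uploaded_file_id> \
  --model gpt-5.4-mini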

Use fine-tuned model:
```
OPENAI_MODEL=ft-ABCxyz123
```

Webhook Setup (Production)

For continuous webhook reviews:

Deploy the FastAPI app (Heroku, Railway, AWS Lambda)
Set the webhook URL in the GitHub App settings
The app then reviews all new and updated PRs automatically
No batch script is needed for ongoing reviews

Files Overview

Core Application

app/main.py - FastAPI webhook server
app/graph/workflow.py - LangGraph workflow
app/github_client.py - GitHub API client
app/diff_builder.py - PR context builder
app/schemas.py - JSON schema definitions

Providers (AI Agents)

app/providers/openai_agent.py - GPT-4o architect
app/providers/anthropic_agent.py - Claude security auditor
app/providers/gemini_agent.py - Gemini runtime tester

Batch Review

batch_review_enhanced.py - Main batch review script
batch_review_prs.py - Faster batch review variant
api_capture.py - API request/response logging
confidence_tracker.py - Confidence metrics

Utilities

verify_setup.py - Verify environment configuration
find_installation_id.py - Locate GitHub App installation ID
cleanup.py - Remove temporary files and caches

Configuration

.env.example - Environment template
.gitignore - Git ignore patterns
pyproject.toml - Python project config

Documentation

SETUP.md - Initial setup & usage guide
FEATURES.md - Features & technical reference

Support & Debugging

Verbose Output

To see detailed logs during batch review, raise the log level with the LOG_LEVEL environment variable (configuring logging in a separate python -c process has no effect on the batch script):

LOG_LEVEL=DEBUG python batch_review_enhanced.py owner repo --installation-id ID

Check API Capture

Review raw API interactions:

# See formatted log
cat api_capture_*.log | head -50

# Count interactions
wc -l api_interactions_*.jsonl

# View first interaction
head -1 api_interactions_*.jsonl | python -m json.tool

Validate Fine-Tuning Dataset

# Check format
python -c "import json; [json.loads(line) for line in open('fine_tuning_dataset.jsonl')]"

# Count examples
wc -l fine_tuning_dataset.jsonl

# Sample entry
head -1 fine_tuning_dataset.jsonl | python -m json.tool
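
The one-liner above only confirms that each line parses as JSON. A slightly stricter check can also confirm the system/user/assistant structure. This is a sketch with hypothetical names, and the expected role order is taken from the format example in this document.

```python
import json

EXPECTED_ROLES = ["system", "user", "assistant"]


def check_dataset(path):
    """Return (valid_example_count, errors) for a chat-format JSONL file."""
    count, errors = 0, []
    with open(path, encoding="utf-8") as fh:
        for number, line in enumerate(fh, start=1):
            if not line.strip():
                continue
            try:
                messages = json.loads(line)["messages"]
                roles = [m["role"] for m in messages]
                contents = [m["content"] for m in messages]
            except (json.JSONDecodeError, KeyError, TypeError) as exc:
                errors.append(f"line {number}: malformed ({type(exc).__name__})")
                continue
            if roles != EXPECTED_ROLES:
                errors.append(f"line {number}: unexpected roles {roles}")
            elif not all(isinstance(c, str) and c for c in contents):
                errors.append(f"line {number}: empty or non-string content")
            else:
                count += 1
    return count, errors
```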

Common Questions

Q: How long does batch review take? A: ~30-60 seconds per PR. 40 PRs = 20-40 minutes.

Q: How much does it cost? A: ~$0.10-0.30 per PR. 40 PRs = ~$4-12.

Q: Can I use my own models? A: Yes, edit the model names in .env or use fine-tuned models after training.

Q: Does it work offline? A: No, requires LLM API access (OpenAI, Anthropic, Gemini).

Q: Can I use it in production? A: Yes, deploy the FastAPI app to a server and configure the GitHub App webhook.

Q: How do I reduce costs? A: Use cheaper models (set smaller model names in .env) or fine-tune smaller models. Note that --no-capture makes runs faster but does not reduce LLM API usage.

Q: What if agents disagree? A: They debate for up to 3 rounds. If they still disagree after the final round, the review fails closed: it requests changes unless all three agents approve.

Q: Can I customize the review criteria? A: Yes, modify agent system prompts in app/providers/*_agent.py.

Troubleshooting Guide

Issue: "Installation not found"

Run: python find_installation_id.py owner repo
Or manually check: https://github.com/settings/apps > Installations

Issue: Reviews not posting to GitHub

Test with --no-post --max-prs 1 first
Verify app permissions: Pull requests set to Read & write
Check that the app is installed on the repository

Issue: API rate limits exceeded

GitHub API: wait for the rate limit to reset (up to 1 hour) before retrying
LLM providers: check dashboard for limits
Run batch reviews during off-peak hours

Issue: Fine-tuning dataset is empty

Ensure --no-capture is NOT used
Check api_interactions_*.jsonl has data
Run: python cleanup.py then retry

Issue: Takes too long

Use --max-prs 5 to test speed
Larger PRs take longer (normal)
Use cheaper models if cost is a concern
Use --no-capture if speed is critical

This completes the comprehensive features and reference documentation.
