Three specialized AI agents collaborate to review code:
OpenAI Architect (GPT-5.4-mini)
- Correctness and logical soundness
- Design patterns and maintainability
- Code clarity and readability
- Backward compatibility concerns
- API design best practices
Anthropic Security Auditor (Claude Sonnet)
- Vulnerability detection
- Threat modeling
- Authentication/authorization issues
- Input validation and sanitization
- Data protection compliance
Gemini Runtime Tester (Gemini 3.1 Flash)
- Test coverage analysis
- Performance implications
- Database migration safety
- Deployment readiness
- Environment-specific issues
Voting Rules:
- All 3 agents must approve to merge
- No critical/high severity issues allowed
- Configurable confidence threshold (default: 80%)
- Up to 3 rebuttal rounds if agents disagree
Decision Matrix:
All Approve → ✅ APPROVED (merge ready)
Any Disapprove → ❌ REQUEST_CHANGES (rework needed)
Debate Continues → Agents provide rebuttals (up to 3 rounds)
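The voting rules and decision matrix above can be read as one decision function. The following is a minimal sketch, not the project's implementation: `AgentReview`, `decide`, and `BLOCKING_SEVERITIES` are hypothetical names, and it assumes that the confidence threshold applies to the mean score and that a debate starts only when verdicts are mixed and rounds remain.

```python
from dataclasses import dataclass, field

# Assumption: these severities block a merge ("No critical/high severity issues allowed").
BLOCKING_SEVERITIES = {"CRITICAL", "HIGH"}


@dataclass
class AgentReview:
    agent: str
    verdict: str                     # "APPROVED" or "REQUEST_CHANGES"
    confidence: float                # 0-100
    severities: list[str] = field(default_factory=list)


def decide(reviews, min_confidence=80.0, rounds_used=0, max_rounds=3):
    """Return "APPROVED", "REQUEST_CHANGES", or "DEBATE" for one evaluation pass."""
    verdicts = {r.verdict for r in reviews}
    all_approve = verdicts == {"APPROVED"}
    blocking = any(
        s.upper() in BLOCKING_SEVERITIES for r in reviews for s in r.severities
    )
    mean_confidence = sum(r.confidence for r in reviews) / len(reviews)

    if all_approve and not blocking and mean_confidence >= min_confidence:
        return "APPROVED"
    # Assumption: only mixed verdicts trigger another rebuttal round.
    if len(verdicts) > 1 and rounds_used < max_rounds:
        return "DEBATE"
    # Fail closed: anything short of a clean unanimous approval requests changes.
    return "REQUEST_CHANGES"
```

Keeping the function free of side effects makes it easy to call once per round with the latest reviews.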
Each agent provides confidence metrics (0-100%):
- Correctness (OpenAI) - Design soundness
- Security (Anthropic) - Vulnerability-free
- Tests (Gemini) - Coverage and performance
Aggregate metrics:
- Average: Mean of all agent confidence scores
- Per-category: Clarity on specific concerns
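As a rough illustration of the aggregate metric, the average is the arithmetic mean of the per-agent scores. The helper name `aggregate_confidence` is hypothetical, and rounding to one decimal place is an assumption.

```python
from statistics import mean


def aggregate_confidence(by_agent):
    """Build a confidence_scores payload from per-agent scores (0-100)."""
    return {
        "confidence_scores": {
            "average": round(mean(by_agent.values()), 1),
            "by_agent": dict(by_agent),
        }
    }


scores = {
    "openai_architect": 95.0,
    "anthropic_security": 88.0,
    "gemini_runtime": 92.0,
}
print(aggregate_confidence(scores)["confidence_scores"]["average"])  # 91.7
```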
Example Output:
{
"confidence_scores": {
"average": 91.7,
"by_agent": {
"openai_architect": 95.0,
"anthropic_security": 88.0,
"gemini_runtime": 92.0
}
}
}
What Gets Captured:
- All LLM API calls (requests and responses)
- Model, tokens, latency
- Complete prompt and completion
- Tool calls and reasoning steps
Files Generated:
- api_interactions_*.jsonl - Raw request/response pairs
- api_capture_*.log - Human-readable API log
Example Interaction:
{
"request_id": "req_001",
"timestamp": "2024-05-03T23:50:00Z",
"provider": "openai",
"model": "gpt-5.4-mini",
"system_prompt": "You are a code review architect...",
"user_message": "Review this PR...",
"temperature": 0.7,
"max_tokens": 2000,
"response": "The code looks correct...",
"tokens_used": {"prompt": 1250, "completion": 850},
"latency_ms": 3421
}
What Gets Generated:
- fine_tuning_dataset.jsonl - Training data in OpenAI format
- Paired request/response samples
- All agent perspectives (architect, security, tester)
Format (OpenAI Fine-tuning):
{
"messages": [
{
"role": "system",
"content": "You are a code review architect..."
},
{
"role": "user",
"content": "Review this PR:\n<diff>"
},
{
"role": "assistant",
"content": "The code looks correct..."
}
]
}
Use Cases:
- Fine-tune open-source models (Llama, Mistral, etc.)
- Train domain-specific code reviewers
- Improve review consistency
- Reduce API costs with smaller fine-tuned models
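Before using the dataset for any of these purposes, it may help to check that every line follows the chat format shown above. This is a minimal sketch: `validate_record` and `validate_file` are hypothetical helpers, and the assumption that each record holds exactly one system, one user, and one assistant message comes from the example, not from a documented schema.

```python
import json

# Assumption: every sample has exactly these three roles, in this order.
EXPECTED_ROLES = ["system", "user", "assistant"]


def validate_record(line):
    """Return a list of problems found in one JSONL line (empty if none)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    messages = record.get("messages")
    if not isinstance(messages, list):
        return ["missing 'messages' list"]

    problems = []
    roles = [m.get("role") for m in messages if isinstance(m, dict)]
    if roles != EXPECTED_ROLES:
        problems.append(f"unexpected roles: {roles}")
    for m in messages:
        content = m.get("content") if isinstance(m, dict) else None
        if not isinstance(content, str) or not content.strip():
            problems.append("message with empty or non-string content")
    return problems


def validate_file(path):
    """Map 1-based line numbers to their problems; an empty dict means the file passed."""
    report = {}
    with open(path, encoding="utf-8") as fh:
        for number, line in enumerate(fh, start=1):
            if line.strip() and (problems := validate_record(line)):
                report[number] = problems
    return report
```

For example, `validate_file("fine_tuning_dataset.jsonl")` returns `{}` when every line passes.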
Webhook Events Handled:
- pull_request.opened - New PR submitted
- pull_request.synchronize - Commits pushed
- pull_request.reopened - PR reopened
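The event list above amounts to a small filter on incoming deliveries. GitHub sends the event name in the `X-GitHub-Event` header and the action in the JSON payload's `action` field. The function name `should_review` is hypothetical; this is a sketch of the filter, not the handler in app/main.py.

```python
# Actions of the pull_request event that trigger a review, per the list above.
HANDLED_ACTIONS = {"opened", "synchronize", "reopened"}


def should_review(event_name, payload):
    """Decide whether a webhook delivery should start a review.

    event_name: value of the X-GitHub-Event header.
    payload: parsed JSON body of the delivery.
    """
    return event_name == "pull_request" and payload.get("action") in HANDLED_ACTIONS


print(should_review("pull_request", {"action": "opened"}))   # True
print(should_review("pull_request", {"action": "closed"}))   # False
print(should_review("push", {"ref": "refs/heads/main"}))     # False
```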
Check Run Status:
- Creates GitHub Check with detailed output
- Real-time status updates
- Links to full review results
PR Review Comments:
- Posts formatted markdown review
- Includes verdict and confidence
- Lists specific issues per agent
- Structured rebuttals if agents disagree
Review multiple PRs programmatically with confidence tracking and dataset generation.
Command:
python batch_review_enhanced.py owner/repo \
--installation-id ID \
[--no-post] [--max-prs N] [--no-capture]
Options:
- --installation-id - GitHub App installation ID
- --no-post - Dry run (no GitHub updates)
- --max-prs N - Limit to N PRs (for testing)
- --no-capture - Skip API capture (faster)
Output:
Batch Review Results:
├── Reviewed: 40/40 PRs
├── Approved: 32 (80%)
├── Requested Changes: 8 (20%)
├── Average Confidence: 87.3%
└── Total Time: 28 minutes
Generated Files:
├── batch_review_results_enhanced.json
├── api_interactions_*.jsonl
├── api_capture_*.log
└── fine_tuning_dataset.jsonl
# LLM API Keys
OPENAI_API_KEY=sk-proj-...
ANTHROPIC_API_KEY=sk-ant-...
GEMINI_API_KEY=AIzaSy...
# Models
OPENAI_MODEL=gpt-4o # Default: gpt-4o
ANTHROPIC_MODEL=claude-sonnet-4-6 # Default: claude-sonnet-4-6
GEMINI_MODEL=gemini-3.1-flash-lite-preview # Default: gemini-3.1-flash-lite-preview
# GitHub App
GITHUB_APP_ID=3592372
GITHUB_PRIVATE_KEY_PATH=./private-key.pem
GITHUB_WEBHOOK_SECRET=your-secret-here
GITHUB_INSTALLATION_ID=<installation_id> # Optional, can pass as CLI arg
# Review Configuration
MAX_DEBATE_ROUNDS=3 # Max rebuttal rounds (default: 3)
MIN_CONFIDENCE_TO_APPROVE=0.80 # Minimum confidence threshold (default: 0.80)
# Optional
LOG_LEVEL=INFO
DEBUG=false
Each agent returns JSON with this structure:
{
"verdict": "APPROVED or REQUEST_CHANGES",
"confidence": 92.5,
"issues": [
{
"severity": "HIGH or MEDIUM or LOW",
"category": "security or correctness or performance or tests",
"title": "Issue title",
"description": "Detailed explanation",
"location": "filename:line_number",
"suggestion": "How to fix"
}
],
"summary": "Brief summary of review",
"positives": ["What went well"],
"rebuttal": "Response to other agents (if applicable)"
}
GitHub PR Event
↓
FastAPI Webhook Handler
↓
GitHub Context Builder (fetch PR metadata, diffs)
↓
LangGraph Workflow
├─→ OpenAI Architect Agent
├─→ Anthropic Security Agent
└─→ Gemini Runtime Agent
↓
Consensus Evaluator
├─→ All approved? → Return APPROVED
└─→ Disagreement? → Start debate rounds (max 3)
↓
API Capture & Logging
├─→ Save interactions to JSONL
└─→ Generate confidence metrics
↓
GitHub Feedback
├─→ Create/update Check Run
├─→ Post PR Review
└─→ Set approval status
Nodes:
- initial_review - All agents review in parallel
- evaluate_consensus - Check if consensus reached
- start_debate - Begin rebuttal rounds
- debate_round - Each agent provides rebuttal
- finalize_review - Compile final verdict
Edges:
initial_review
↓
evaluate_consensus
├─→ consensus_reached → finalize_review ✅
└─→ not_reached (< 3 rounds) → start_debate → debate_round → evaluate_consensus
└─→ max_rounds_reached → finalize_review (all agents must approve; otherwise request changes)
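The control flow of these edges can be sketched in plain Python, without the LangGraph API. The names `run_review` and `consensus_reached` and the callable signatures are hypothetical, and the sketch assumes that "consensus" means every agent returned the same verdict.

```python
from typing import Callable

Verdicts = dict[str, str]  # agent name -> "APPROVED" or "REQUEST_CHANGES"


def consensus_reached(verdicts: Verdicts) -> bool:
    # Assumption: consensus means all agents returned the same verdict.
    return len(set(verdicts.values())) == 1


def run_review(
    initial_review: Callable[[], Verdicts],
    debate_round: Callable[[int, Verdicts], Verdicts],
    max_rounds: int = 3,
) -> tuple[str, int]:
    """Run initial_review, then debate until consensus or max_rounds is hit."""
    verdicts = initial_review()
    rounds = 0
    while not consensus_reached(verdicts) and rounds < max_rounds:
        rounds += 1
        verdicts = debate_round(rounds, verdicts)
    # finalize_review: fail closed unless every agent approves.
    approved = all(v == "APPROVED" for v in verdicts.values())
    return ("APPROVED" if approved else "REQUEST_CHANGES", rounds)
```

Passing the agents in as callables keeps the loop testable with stubs, with no API calls involved.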
Per PR review:
- API calls: 3 initial + up to 6 rebuttals = 3-9 calls
- Time per call: 5-15 seconds
- Total time: 30-60 seconds per PR
Factors affecting speed:
- PR size (larger = slower)
- Code complexity
- API provider latency
- Debate rounds (increases time)
Typical tokens per review:
- Prompt tokens: 1,500-3,000 (depends on diff size)
- Completion tokens: 300-800 per agent
- Total: ~5,000-10,000 tokens
Cost per PR:
- OpenAI GPT-4o: $0.02-0.10 (input/output rates)
- Anthropic Claude: $0.03-0.08
- Google Gemini: $0.01-0.05
- Average: $0.10-0.30 per PR
For 40 PRs: approximately $5-15 total
Batch Review Performance:
10 PRs → 5-10 minutes → $1-3
20 PRs → 10-20 minutes → $3-6
40 PRs → 20-40 minutes → $5-15
100 PRs → 50-100 minutes → $15-30
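For batch sizes not listed, a rough estimate can be extrapolated from the per-PR ranges quoted earlier (30-60 seconds and $0.10-0.30 per PR). The function `estimate_batch` is a hypothetical helper. It is a straight multiplication, so its output can differ slightly from the rounded figures listed above.

```python
def estimate_batch(n_prs, secs_per_pr=(30, 60), usd_per_pr=(0.10, 0.30)):
    """Linearly extrapolate (low, high) time and cost ranges for n_prs reviews."""
    minutes = tuple(round(n_prs * s / 60, 1) for s in secs_per_pr)
    usd = tuple(round(n_prs * c, 2) for c in usd_per_pr)
    return {"minutes": minutes, "usd": usd}


print(estimate_batch(10))  # {'minutes': (5.0, 10.0), 'usd': (1.0, 3.0)}
```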
Optimization Tips:
- Use --no-capture to skip logging (2x faster)
- Review in batches (avoid hitting rate limits)
- Schedule during off-peak hours
- Use smaller models for initial screening
{
"summary": {
"total_prs": 40,
"approved": 32,
"requested_changes": 8,
"average_confidence": 87.3,
"total_time_seconds": 1680
},
"reviews": [
{
"pr_number": 123,
"title": "Add user authentication",
"verdict": "APPROVED",
"confidence_scores": {
"average": 91.7,
"by_agent": {
"openai_architect": 95.0,
"anthropic_security": 88.0,
"gemini_runtime": 92.0
}
},
"agent_reviews": {
"openai_architect": {...},
"anthropic_security": {...},
"gemini_runtime": {...}
}
}
]
}
api_interactions_*.jsonl:
{"request_id":"req_001","timestamp":"...","provider":"openai","model":"gpt-4o","system_prompt":"...","user_message":"...","temperature":0.7,"max_tokens":2000}
{"request_id":"req_001","timestamp":"...","provider":"openai","status":200,"response":"...","tokens_used":{"prompt":1250,"completion":850},"latency_ms":3421}
fine_tuning_dataset.jsonl:
{"messages":[{"role":"system","content":"You are a code review architect..."},{"role":"user","content":"Review this PR:\n..."},{"role":"assistant","content":"..."}]}
{"messages":[{"role":"system","content":"You are a security auditor..."},{"role":"user","content":"Review this PR:\n..."},{"role":"assistant","content":"..."}]}
Edit .env to use different models:
# Use latest models
OPENAI_MODEL=gpt-5.5-medium
ANTHROPIC_MODEL=claude-opus-4-7-thinking-xhigh
GEMINI_MODEL=gemini-pro
# Or use specific versions
OPENAI_MODEL=gpt-5.4-mini
ANTHROPIC_MODEL=claude-3-opus-20240229
- Generate dataset:
python batch_review_enhanced.py owner/repo --installation-id ID
- Fine-tune model (example with OpenAI; fine_tunes.create is legacy CLI syntax, so check the command against your installed openai version):
openai api fine_tunes.create \
-t fine_tuning_dataset.jsonl \
-m gpt-5.4-mini
- Use fine-tuned model:
OPENAI_MODEL=ft-ABCxyz123
For continuous webhook reviews:
- Deploy FastAPI app (Heroku, Railway, AWS Lambda)
- Set webhook URL in GitHub App settings
- App automatically reviews all new/updated PRs
- No batch script needed for ongoing reviews
- app/main.py - FastAPI webhook server
- app/graph/workflow.py - LangGraph workflow
- app/github_client.py - GitHub API client
- app/diff_builder.py - PR context builder
- app/schemas.py - JSON schema definitions
- app/providers/openai_agent.py - GPT-4o architect
- app/providers/anthropic_agent.py - Claude security auditor
- app/providers/gemini_agent.py - Gemini runtime tester
- batch_review_enhanced.py - Main batch review script
- batch_review_prs.py - Faster batch review variant
- api_capture.py - API request/response logging
- confidence_tracker.py - Confidence metrics
- verify_setup.py - Verify environment configuration
- find_installation_id.py - Locate GitHub App installation ID
- cleanup.py - Remove temporary files and caches
- .env.example - Environment template
- .gitignore - Git ignore patterns
- pyproject.toml - Python project config
- SETUP.md - Initial setup & usage guide
- FEATURES.md - Features & technical reference
To see detailed logs during batch review, set LOG_LEVEL=DEBUG (in .env, or inline for a single run):
LOG_LEVEL=DEBUG python batch_review_enhanced.py owner/repo --installation-id ID
Review raw API interactions:
# See formatted log
cat api_capture_*.log | head -50
# Count interactions
wc -l api_interactions_*.jsonl
# View first interaction
head -1 api_interactions_*.jsonl | python -m json.tool
# Check format
python -c "import json; [json.loads(line) for line in open('fine_tuning_dataset.jsonl')]"
# Count examples
wc -l fine_tuning_dataset.jsonl
# Sample entry
head -1 fine_tuning_dataset.jsonl | python -m json.tool
Q: How long does batch review take?
A: ~30-60 seconds per PR. 40 PRs = 20-40 minutes.
Q: How much does it cost?
A: ~$0.10-0.30 per PR. 40 PRs = ~$5-15.
Q: Can I use my own models?
A: Yes, edit the model names in .env or use fine-tuned models after training.
Q: Does it work offline?
A: No, it requires LLM API access (OpenAI, Anthropic, Gemini).
Q: Can I use it in production?
A: Yes. Deploy the FastAPI app to a server and configure the GitHub App webhook.
Q: How do I reduce costs?
A: Use cheaper models (Gemini instead of GPT-4), use --no-capture flag, or fine-tune smaller models.
Q: What if agents disagree?
A: They debate for up to 3 rounds. The review is fail-closed: if any agent still does not approve after the final round, the verdict is REQUEST_CHANGES.
Q: Can I customize the review criteria?
A: Yes, modify agent system prompts in app/providers/*_agent.py.
- Run: python find_installation_id.py owner repo
- Or manually check: https://github.com/settings/apps > Installations
- Test with --no-post --max-prs 1 first
- Verify app permissions: Pull requests (Read & write)
- Check that the app is installed on the repository
- GitHub API: wait 1 hour before retrying
- LLM providers: check dashboard for limits
- Run batch reviews during off-peak hours
- Ensure --no-capture is NOT used
- Check that api_interactions_*.jsonl has data
- Run: python cleanup.py, then retry
- Use --max-prs 5 to test speed
- Larger PRs take longer (normal)
- Use cheaper models if cost is a concern
- Use --no-capture if speed is critical
This completes the comprehensive features and reference documentation.