divarun/second_opinion

Second Opinion

AI-powered pre-mortem review for engineering teams. Catch distributed-systems failure modes before you ship — grounded in your own org's incident history.

Python 3.11+ | FastAPI | License: MIT | Deploy to Vercel

Second Opinion analyzes architecture and design documents against 24 curated distributed-systems failure patterns, then matches findings against your team's past incidents — so every design review is grounded in your org's real production history, not just generic best practices.

Before you ship, ask for a second opinion.


Screenshots

[Screenshots: analysis form for a design doc, org incident matches, failure mode detail, incident library]

Why Second Opinion?

The failures that page you at 3 a.m. are rarely the obvious ones. They're the thundering herd that happens when three caches expire simultaneously. The poison message that wedges a queue. The cascading timeout that turns a 500ms dependency into a 30-second outage.

Your post-mortems already document those failures — but most teams read them once and move on. Second Opinion turns your incident history into institutional memory that participates in every future design review.

What it does:

  • Evaluates a design document against 24 distributed-systems failure archetypes
  • Matches findings against your org's stored incidents, explaining exactly how the new design could reproduce a past failure
  • Surfaces implicit assumptions and critical information gaps the design doesn't address
  • Produces a structured, exportable report with evidence, trigger conditions, and discussion questions per finding

Features

  • Org Incident Memory — paste post-mortems once; every future review is grounded in your real failure history
  • 24 Failure Patterns — covering load, data, timing, resource, dependency, and distributed failure classes
  • Multi-provider LLM — NVIDIA NIM (free tier, default), OpenAI GPT-4o, Anthropic Claude, or local Ollama
  • Vercel-ready — step-based API keeps every serverless function call under 10s
  • PDF + Markdown upload — accepts .pdf, .md, .txt, .rst, .adoc
  • Bulk incident import — upload multiple post-mortem files or paste several at once (--- separated)
  • Mobile-first UI — works on phones; useful during live design review meetings
  • Export — copy as Markdown or download as JSON
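
The `---` separator used for bulk incident import could be handled roughly as follows. This is a minimal sketch, not the project's actual code: `split_postmortems` is a hypothetical helper, and it assumes the separator is a line containing only `---`.

```python
import re

# Hypothetical helper: the real splitting logic in Second Opinion may differ.
# A separator is assumed to be a line that contains only '---'.
_SEPARATOR = re.compile(r"^[ \t]*---[ \t]*$", flags=re.MULTILINE)


def split_postmortems(pasted: str) -> list[str]:
    """Split pasted text on '---' lines and drop empty chunks."""
    chunks = _SEPARATOR.split(pasted)
    return [chunk.strip() for chunk in chunks if chunk.strip()]


sample = """Incident A: cache stampede after deploy
---
Incident B: queue wedged by a malformed message
---
"""
docs = split_postmortems(sample)
for number, text in enumerate(docs, start=1):
    print(number, text)
```

One caveat with this approach: a post-mortem that uses `---` as a Markdown horizontal rule would be split at that point too.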

How It Works

┌─────────────────────────────────────────────────────────┐
│  1. Build your Incident Library                          │
│     Paste past post-mortems → AI extracts structured     │
│     failure data → stored in Postgres                    │
└──────────────────────────┬──────────────────────────────┘
                           │
┌──────────────────────────▼──────────────────────────────┐
│  2. Paste or upload a design doc                         │
│     Add context: scale, SLOs, dependencies               │
└──────────────────────────┬──────────────────────────────┘
                           │
┌──────────────────────────▼──────────────────────────────┐
│  3. Analysis runs in parallel (3 rounds, ~20s total)     │
│                                                          │
│  Round 1 (parallel):                                     │
│    • Pattern matching against 24 failure archetypes      │
│    • Implicit assumption extraction                      │
│    • Known unknowns / information gaps                   │
│                                                          │
│  Round 2 (parallel, uses Round 1 findings):              │
│    • Ruled-out risk detection                            │
│    • Org incident library matching                       │
│                                                          │
│  Round 3: Summary                                        │
└──────────────────────────┬──────────────────────────────┘
                           │
┌──────────────────────────▼──────────────────────────────┐
│  4. Structured report                                    │
│     • Org incident matches (with relevance explanation)  │
│     • Failure modes (confidence, evidence, triggers)     │
│     • Implicit assumptions                               │
│     • Known unknowns                                     │
│     • Ruled-out risks                                    │
└─────────────────────────────────────────────────────────┘
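
The three rounds in the diagram can be sketched with `asyncio.gather`. This is an illustration under stated assumptions rather than the code in `app/analyzer.py`: every function name below is hypothetical, and the step bodies are stubs standing in for LLM calls.

```python
import asyncio

# Stand-ins for the LLM-backed steps. All names are hypothetical, and each
# real step would call a model instead of returning canned values.
async def match_patterns(doc: str) -> list[str]:
    return ["Thundering Herd"] if "cache" in doc else []

async def extract_assumptions(doc: str) -> list[str]:
    return ["Caches expire independently"]

async def find_unknowns(doc: str) -> list[str]:
    return ["Peak request rate is not stated"]

async def detect_ruled_out(doc: str, findings: list[str]) -> list[str]:
    return ["Poison Message"] if "queue" not in doc else []

async def match_incidents(doc: str, findings: list[str]) -> list[str]:
    return [f"Past incident resembling {name}" for name in findings]

def summarize(report: dict) -> str:
    return (f"{len(report['failure_modes'])} failure mode(s), "
            f"{len(report['incident_matches'])} incident match(es)")

async def analyze(doc: str) -> dict:
    # Round 1: three independent steps run concurrently.
    findings, assumptions, unknowns = await asyncio.gather(
        match_patterns(doc), extract_assumptions(doc), find_unknowns(doc)
    )
    # Round 2: both steps need the Round 1 findings, but not each other.
    ruled_out, incidents = await asyncio.gather(
        detect_ruled_out(doc, findings), match_incidents(doc, findings)
    )
    report = {
        "incident_matches": incidents,
        "failure_modes": findings,
        "implicit_assumptions": assumptions,
        "known_unknowns": unknowns,
        "ruled_out_risks": ruled_out,
    }
    # Round 3: the summary is built from the collected results.
    report["summary"] = summarize(report)
    return report

report = asyncio.run(analyze("Three caches share one TTL."))
print(report["summary"])
```

Because the steps within a round do not depend on each other, the wall-clock time of a round is roughly that of its slowest step rather than the sum of all of them.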

Failure Patterns

24 curated distributed-systems failure archetypes:

| Category | Patterns |
|---|---|
| Load | Thundering Herd, Load Shedding Blind Spot, Retry Storm, Hotspot/Hot Shard, Fan-out Amplification |
| Dependency | Hidden Synchronous Dependency, Degraded but Not Dead, Single Point of Failure, Bulkhead Absence |
| Data | Silent Data Loss, Metadata Corruption, Poison Message, State Machine Explosion, Dual Write Inconsistency, Missing Idempotency |
| Timing | Cascading Timeout, Clock Skew Issues |
| Resource | Resource Exhaustion, Unbounded Growth, Noisy Neighbor |
| Distributed | Partial Outage Inconsistency, Version Skew, Coordination Overhead, Event Ordering Assumption |
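
For scripting against the catalog, the same categories and pattern names can be written down as plain data. This is only a transcription of the names listed in this section; the definitions in `app/patterns.py` presumably carry more detail per pattern, and `category_of` is a hypothetical helper.

```python
# Category -> pattern names, transcribed from the Failure Patterns section.
FAILURE_PATTERNS: dict[str, list[str]] = {
    "Load": ["Thundering Herd", "Load Shedding Blind Spot", "Retry Storm",
             "Hotspot/Hot Shard", "Fan-out Amplification"],
    "Dependency": ["Hidden Synchronous Dependency", "Degraded but Not Dead",
                   "Single Point of Failure", "Bulkhead Absence"],
    "Data": ["Silent Data Loss", "Metadata Corruption", "Poison Message",
             "State Machine Explosion", "Dual Write Inconsistency",
             "Missing Idempotency"],
    "Timing": ["Cascading Timeout", "Clock Skew Issues"],
    "Resource": ["Resource Exhaustion", "Unbounded Growth", "Noisy Neighbor"],
    "Distributed": ["Partial Outage Inconsistency", "Version Skew",
                    "Coordination Overhead", "Event Ordering Assumption"],
}


def category_of(pattern: str) -> str:
    """Return the category a pattern belongs to (hypothetical helper)."""
    for category, names in FAILURE_PATTERNS.items():
        if pattern in names:
            return category
    raise KeyError(pattern)


total = sum(len(names) for names in FAILURE_PATTERNS.values())
print(total, category_of("Poison Message"))
```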

Quick Start

Option 1 — NVIDIA NIM (recommended, free tier, no GPU)

git clone https://github.com/divarun/second_opinion.git
cd second_opinion

python -m venv venv
source venv/bin/activate        # Windows: venv\Scripts\activate
pip install -r requirements.txt

# Get a free key at https://build.nvidia.com
cat > .env << EOF
LLM_PROVIDER=nvidia
NVIDIA_API_KEY=nvapi-...
DATABASE_URL=postgresql://second_opinion:second_opinion@localhost:5432/second_opinion
EOF

docker compose up postgres -d   # start Postgres
uvicorn app.app:app --reload

Open http://localhost:8000

Option 2 — OpenAI

echo "LLM_PROVIDER=openai" >> .env
echo "OPENAI_API_KEY=sk-..." >> .env
uvicorn app.app:app --reload

Option 3 — Anthropic Claude

echo "LLM_PROVIDER=anthropic" >> .env
echo "ANTHROPIC_API_KEY=sk-ant-..." >> .env
uvicorn app.app:app --reload

Option 4 — Local models (Ollama, no API key)

# Start Postgres + Ollama together
docker compose --profile ollama up

# App runs separately
echo "LLM_PROVIDER=ollama" >> .env
echo "DATABASE_URL=postgresql://second_opinion:second_opinion@localhost:5432/second_opinion" >> .env
uvicorn app.app:app --reload

Option 5 — Deploy to Vercel

See Vercel Deployment.


Vercel Deployment

Second Opinion is built for Vercel Hobby (free tier). The API is split into per-step endpoints so that each serverless function call makes at most one LLM call (the summary step makes none) and stays well under the 10-second timeout.

1. Fork and import to Vercel

Fork this repo, then import it in the Vercel dashboard.

2. Add Neon Postgres

In your Vercel project: Storage → Add → Neon. This sets POSTGRES_URL automatically.

3. Set LLM environment variables

In Settings → Environment Variables:

LLM_PROVIDER=nvidia
NVIDIA_API_KEY=nvapi-...

4. Deploy

npm i -g vercel
vercel --prod

Vercel picks up vercel.json automatically. That's it.


LLM Providers

| Provider | Free Tier | Setup |
|---|---|---|
| NVIDIA NIM ✅ default | Yes (generous) | build.nvidia.com |
| OpenAI | No | platform.openai.com |
| Anthropic | No | console.anthropic.com |
| Ollama | Local only | ollama.com |

NVIDIA NIM default model: nvidia/llama-3.1-nemotron-70b-instruct — reasoning-optimized, reliable JSON output, free tier.


API Reference

The step endpoints are what the browser uses. Each makes at most one LLM call; the summary step makes none.

| Method | Endpoint | Description |
|---|---|---|
| POST | /api/analyze/step/patterns | Pattern matching (one LLM call) |
| POST | /api/analyze/step/assumptions | Implicit assumptions (one LLM call) |
| POST | /api/analyze/step/unknowns | Known unknowns / gaps (one LLM call) |
| POST | /api/analyze/step/ruledout | Ruled-out risks (one LLM call, needs findings) |
| POST | /api/analyze/step/incidents | Org incident matching (one LLM call, needs findings) |
| POST | /api/analyze/step/summary | Summary (no LLM) |
| POST | /api/analyze | Full analysis, single call (local dev only) |
| POST | /api/extract-pdf | PDF → text (no LLM) |
| GET | /api/incidents | List incident library |
| POST | /api/incidents | Add incident from post-mortem text |
| DELETE | /api/incidents/{id} | Remove incident |
| GET | /api/patterns | List all 24 failure patterns |
| GET | /api/health | LLM connectivity check |
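
A script could drive the step endpoints in the same order the browser does. The sketch below is hedged heavily: the endpoint paths come from the list above, but the request field names (`document`, `findings`, `results`) and the shape of the responses are assumptions, not the documented schema. Check the Pydantic models in `app/models.py` before relying on it.

```python
import json
from typing import Callable
from urllib import request

BASE_URL = "http://localhost:8000"  # local dev server from the Quick Start


def http_post(path: str, payload: dict) -> dict:
    """POST JSON with the standard library and decode the JSON response."""
    req = request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.load(resp)


def run_steps(document: str,
              post: Callable[[str, dict], dict] = http_post) -> dict:
    """Call each step endpoint once, feeding findings into the later steps."""
    base = {"document": document}  # assumed field name
    results: dict = {}
    # Steps that only need the document.
    for step in ("patterns", "assumptions", "unknowns"):
        results[step] = post(f"/api/analyze/step/{step}", base)
    # Steps that also need the pattern findings (assumed field name).
    with_findings = {**base, "findings": results["patterns"]}
    for step in ("ruledout", "incidents"):
        results[step] = post(f"/api/analyze/step/{step}", with_findings)
    # Final step: no LLM call; payload shape is an assumption.
    results["summary"] = post("/api/analyze/step/summary",
                              {**base, "results": dict(results)})
    return results
```

With the server running, `run_steps(open("design.md").read())` would issue six short requests (the file name is a placeholder). The `post` parameter exists so the call order can be exercised without a network. This version runs the steps sequentially for simplicity; the first three are independent and could be issued concurrently.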

Configuration

| Variable | Default | Description |
|---|---|---|
| LLM_PROVIDER | nvidia | One of nvidia, openai, anthropic, ollama |
| NVIDIA_API_KEY | | Required when LLM_PROVIDER=nvidia |
| NVIDIA_MODEL | nvidia/llama-3.1-nemotron-70b-instruct | NIM model ID |
| OPENAI_API_KEY | | Required when LLM_PROVIDER=openai |
| OPENAI_MODEL | gpt-4o | OpenAI model |
| ANTHROPIC_API_KEY | | Required when LLM_PROVIDER=anthropic |
| ANTHROPIC_MODEL | claude-sonnet-4-6 | Anthropic model |
| OLLAMA_MODEL | llama3 | Ollama model |
| OLLAMA_BASE_URL | http://localhost:11434 | Ollama server URL |
| DATABASE_URL | | Postgres URL (local/Docker) |
| POSTGRES_URL | | Postgres URL (set by Vercel/Neon) |
| CONFIDENCE_THRESHOLD | 0.6 | Minimum score to include a finding |
| MAX_FAILURE_MODES | 10 | Max findings returned |
| MAX_DOCUMENT_SIZE | 50000 | Max input characters |

Project Structure

second_opinion/
├── api/
│   └── index.py              # Vercel entry point
├── app/
│   ├── app.py                # FastAPI routes
│   ├── analyzer.py           # Analysis pipeline + step methods
│   ├── patterns.py           # 24 failure pattern definitions
│   ├── llm.py                # Multi-provider LLM client (lazy singleton)
│   ├── models.py             # Pydantic models
│   ├── config.py             # Settings from environment variables
│   ├── database.py           # asyncpg connection pool
│   ├── incident_store.py     # Incident CRUD
│   ├── incident_extractor.py # LLM extraction from post-mortems
│   ├── templates/            # Jinja2 HTML
│   └── static/               # CSS + JS (no build step)
├── samples/
│   ├── design-doc/           # Example design documents to analyze
│   └── postmortem/           # Example post-mortems to load into incident library
├── docs/
│   └── screenshots/          # UI screenshots for README
├── Dockerfile
├── docker-compose.yml        # Postgres (default) + Ollama (--profile ollama)
├── vercel.json
└── requirements.txt

Roadmap

✅ Phase 1 — Incident Memory (complete)

Build an org-specific incident library from past post-mortems. Design reviews are grounded in your real failure history, not generic patterns.

🔜 Phase 2 — Code-Doc Drift Detection

Accept a code snippet or PR diff alongside the design doc. Detect divergences between what the doc claims and what the code actually implements.

Example: "The doc says circuit breakers are in place on all external calls. payment_client.py has no circuit breaker."

🔜 Phase 3 — Production Grounding

Connect to Prometheus or Datadog. When the design says "will handle 10K RPS", pull real metrics for the named services and flag the delta between assumed and actual.

Example: "Design assumes 4x headroom to peak. Current P99 at 500 req/min is 1.8s. Connection pool is at 90% utilization."


Contributing

Contributions are welcome. See CONTRIBUTING.md.

Key areas: new failure patterns, test suite, .docx support, OCR for scanned PDFs, more LLM providers.


License

MIT — see LICENSE.


Second Opinion assists in design reviews but does not guarantee correctness or completeness. Always apply human judgment to the results.
