AI-powered pre-mortem review for engineering teams. Catch distributed-systems failure modes before you ship — grounded in your own org's incident history.
Second Opinion analyzes architecture and design documents against 24 curated distributed-systems failure patterns, then matches findings against your team's past incidents — so every design review is grounded in your org's real production history, not just generic best practices.
Before you ship, ask for a second opinion.
Screenshots: Analyze a design doc, Org incident matches, Failure mode detail, Incident library.
The failures that page you at 3 a.m. are rarely the obvious ones. They're the thundering herd that happens when three caches expire simultaneously. The poison message that wedges a queue. The cascading timeout that turns a 500ms dependency into a 30-second outage.
Your post-mortems already document those failures — but most teams read them once and move on. Second Opinion turns your incident history into institutional memory that participates in every future design review.
What it does:
- Evaluates a design document against 24 distributed-systems failure archetypes
- Matches findings against your org's stored incidents, explaining exactly how the new design could reproduce a past failure
- Surfaces implicit assumptions and critical information gaps the design doesn't address
- Produces a structured, exportable report with evidence, trigger conditions, and discussion questions per finding

Features:
- Org Incident Memory: paste post-mortems once; every future review is grounded in your real failure history
- 24 Failure Patterns: covering load, data, timing, resource, dependency, and distributed failure classes
- Multi-provider LLM: NVIDIA NIM (free tier, default), OpenAI GPT-4o, Anthropic Claude, or local Ollama
- Vercel-ready: step-based API keeps every serverless function call under 10s
- PDF + Markdown upload: accepts `.pdf`, `.md`, `.txt`, `.rst`, `.adoc`
- Bulk incident import: upload multiple post-mortem files or paste several at once (`---` separated)
- Mobile-first UI: works on phones; useful during live design review meetings
- Export: copy as Markdown or download as JSON
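
The bulk-import feature above accepts several post-mortems pasted at once. A minimal sketch of how such a paste could be split, assuming each separator is a line containing only `---`; the function name `split_postmortems` is hypothetical and not taken from the project:

```python
import re

# Assumption: a separator is a line holding exactly three dashes,
# optionally padded with spaces or tabs.
_SEPARATOR = re.compile(r"^[ \t]*---[ \t]*$", re.MULTILINE)


def split_postmortems(pasted: str) -> list:
    """Split pasted text into individual post-mortems, dropping empty chunks."""
    chunks = _SEPARATOR.split(pasted)
    return [chunk.strip() for chunk in chunks if chunk.strip()]
```

One caveat with this approach: a post-mortem written in Markdown may itself contain a `---` horizontal rule or YAML front matter, which this sketch would treat as a boundary.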
┌─────────────────────────────────────────────────────────┐
│ 1. Build your Incident Library │
│ Paste past post-mortems → AI extracts structured │
│ failure data → stored in Postgres │
└──────────────────────────┬──────────────────────────────┘
│
┌──────────────────────────▼──────────────────────────────┐
│ 2. Paste or upload a design doc │
│ Add context: scale, SLOs, dependencies │
└──────────────────────────┬──────────────────────────────┘
│
┌──────────────────────────▼──────────────────────────────┐
│ 3. Analysis runs in parallel (3 rounds, ~20s total) │
│ │
│ Round 1 (parallel): │
│ • Pattern matching against 24 failure archetypes │
│ • Implicit assumption extraction │
│ • Known unknowns / information gaps │
│ │
│ Round 2 (parallel, uses Round 1 findings): │
│ • Ruled-out risk detection │
│ • Org incident library matching │
│ │
│ Round 3: Summary │
└──────────────────────────┬──────────────────────────────┘
│
┌──────────────────────────▼──────────────────────────────┐
│ 4. Structured report │
│ • Org incident matches (with relevance explanation) │
│ • Failure modes (confidence, evidence, triggers) │
│ • Implicit assumptions │
│ • Known unknowns │
│ • Ruled-out risks │
└─────────────────────────────────────────────────────────┘
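
The three rounds in the diagram can be sketched with `asyncio.gather`. This is an illustration of the round structure only, not the project's `analyzer.py`: every function below is a hypothetical stand-in that returns canned data where the real pipeline would call an LLM or the incident store.

```python
import asyncio

# Hypothetical incident library keyed by failure pattern name.
INCIDENTS = {"Retry Storm": "hypothetical incident: retries overloaded the auth service"}


async def match_patterns(doc: str) -> list:
    # Stand-in for the LLM call that checks the 24 archetypes.
    return ["Retry Storm"] if "retry" in doc.lower() else []


async def extract_assumptions(doc: str) -> list:
    return ["Downstream services tolerate duplicate requests"]


async def find_unknowns(doc: str) -> list:
    return ["Peak request rate is not stated"]


async def detect_ruled_out(doc: str, findings: list) -> list:
    candidates = ("Retry Storm", "Clock Skew Issues")
    return [name for name in candidates if name not in findings]


async def match_incidents(doc: str, findings: list) -> list:
    return [INCIDENTS[name] for name in findings if name in INCIDENTS]


def summarize(report: dict) -> str:
    return (f"{len(report['failure_modes'])} failure mode(s), "
            f"{len(report['incident_matches'])} incident match(es)")


async def analyze(doc: str) -> dict:
    # Round 1: three independent steps run concurrently.
    findings, assumptions, unknowns = await asyncio.gather(
        match_patterns(doc), extract_assumptions(doc), find_unknowns(doc))
    # Round 2: both steps need the Round 1 findings, but not each other.
    ruled_out, incidents = await asyncio.gather(
        detect_ruled_out(doc, findings), match_incidents(doc, findings))
    report = {
        "incident_matches": incidents,
        "failure_modes": findings,
        "implicit_assumptions": assumptions,
        "known_unknowns": unknowns,
        "ruled_out_risks": ruled_out,
    }
    # Round 3: the summary is computed from the assembled report.
    report["summary"] = summarize(report)
    return report
```

Calling `asyncio.run(analyze("Clients retry on failure"))` returns a dictionary whose keys mirror the five report sections listed in step 4.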
24 curated distributed-systems failure archetypes:
| Category | Patterns |
|---|---|
| Load | Thundering Herd, Load Shedding Blind Spot, Retry Storm, Hotspot/Hot Shard, Fan-out Amplification |
| Dependency | Hidden Synchronous Dependency, Degraded but Not Dead, Single Point of Failure, Bulkhead Absence |
| Data | Silent Data Loss, Metadata Corruption, Poison Message, State Machine Explosion, Dual Write Inconsistency, Missing Idempotency |
| Timing | Cascading Timeout, Clock Skew Issues |
| Resource | Resource Exhaustion, Unbounded Growth, Noisy Neighbor |
| Distributed | Partial Outage Inconsistency, Version Skew, Coordination Overhead, Event Ordering Assumption |
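
For readers who want the catalogue in machine-readable form, the table above can be written as a plain mapping. This is a sketch only: the structure and the helper names `all_patterns` and `category_of` are assumptions, and the project's `patterns.py` may define its patterns differently.

```python
from typing import Optional

# Category -> pattern names, copied from the table above.
PATTERNS_BY_CATEGORY = {
    "Load": ["Thundering Herd", "Load Shedding Blind Spot", "Retry Storm",
             "Hotspot/Hot Shard", "Fan-out Amplification"],
    "Dependency": ["Hidden Synchronous Dependency", "Degraded but Not Dead",
                   "Single Point of Failure", "Bulkhead Absence"],
    "Data": ["Silent Data Loss", "Metadata Corruption", "Poison Message",
             "State Machine Explosion", "Dual Write Inconsistency",
             "Missing Idempotency"],
    "Timing": ["Cascading Timeout", "Clock Skew Issues"],
    "Resource": ["Resource Exhaustion", "Unbounded Growth", "Noisy Neighbor"],
    "Distributed": ["Partial Outage Inconsistency", "Version Skew",
                    "Coordination Overhead", "Event Ordering Assumption"],
}


def all_patterns() -> list:
    """Flatten the catalogue into a single list of pattern names."""
    return [name for names in PATTERNS_BY_CATEGORY.values() for name in names]


def category_of(pattern: str) -> Optional[str]:
    """Return the category a pattern belongs to, or None if it is unknown."""
    for category, names in PATTERNS_BY_CATEGORY.items():
        if pattern in names:
            return category
    return None
```

The six categories hold 5, 4, 6, 2, 3, and 4 patterns respectively, which adds up to the 24 the tool advertises.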
git clone https://github.com/divarun/second_opinion.git
cd second_opinion
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install -r requirements.txt
# Get a free key at https://build.nvidia.com
cat > .env << EOF
LLM_PROVIDER=nvidia
NVIDIA_API_KEY=nvapi-...
DATABASE_URL=postgresql://second_opinion:second_opinion@localhost:5432/second_opinion
EOF
docker compose up postgres -d # start Postgres
uvicorn app.app:app --reload
To use OpenAI instead:
echo "LLM_PROVIDER=openai" >> .env
echo "OPENAI_API_KEY=sk-..." >> .env
uvicorn app.app:app --reload
To use Anthropic instead:
echo "LLM_PROVIDER=anthropic" >> .env
echo "ANTHROPIC_API_KEY=sk-ant-..." >> .env
uvicorn app.app:app --reload
To run a local model with Ollama:
# Start Postgres + Ollama together
docker compose --profile ollama up
# App runs separately
echo "LLM_PROVIDER=ollama" >> .env
echo "DATABASE_URL=postgresql://second_opinion:second_opinion@localhost:5432/second_opinion" >> .env
uvicorn app.app:app --reload
To deploy on Vercel, see Vercel Deployment.
Second Opinion is built for Vercel Hobby (free tier). The API is split into per-step endpoints so each serverless function call makes exactly one LLM call and stays well under the 10-second timeout.
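
One way to keep a single call inside a serverless time limit is to wrap it in a deadline. The sketch below uses `asyncio.wait_for`; the helper `call_with_budget`, the 8-second default, and the `fake_llm_call` stand-in are all assumptions for illustration, and the project may enforce its budget differently or not at all.

```python
import asyncio
from typing import Awaitable, Optional, TypeVar

T = TypeVar("T")


async def call_with_budget(call: Awaitable[T], budget_s: float = 8.0) -> Optional[T]:
    """Await one call, returning None if it overruns the time budget.

    The 8.0 second default is an assumed value that leaves headroom
    below a 10-second function timeout.
    """
    try:
        return await asyncio.wait_for(call, timeout=budget_s)
    except asyncio.TimeoutError:
        return None


async def fake_llm_call(delay_s: float) -> str:
    # Stand-in for a real provider request.
    await asyncio.sleep(delay_s)
    return "ok"
```

Returning `None` on overrun lets the caller decide whether to retry the step or report a partial result, rather than letting the platform kill the function.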
1. Fork and import to Vercel
Fork this repo, then import it in the Vercel dashboard.
2. Add Neon Postgres
In your Vercel project: Storage → Add → Neon. This sets POSTGRES_URL automatically.
3. Set LLM environment variables
In Settings → Environment Variables:
LLM_PROVIDER=nvidia
NVIDIA_API_KEY=nvapi-...
4. Deploy
npm i -g vercel
vercel --prod
Vercel picks up vercel.json automatically. That's it.
| Provider | Free Tier | Setup |
|---|---|---|
| NVIDIA NIM ✅ default | Yes (generous) | build.nvidia.com |
| OpenAI | No | platform.openai.com |
| Anthropic | No | console.anthropic.com |
| Ollama | Local only | ollama.com |
NVIDIA NIM default model: nvidia/llama-3.1-nemotron-70b-instruct — reasoning-optimized, reliable JSON output, free tier.
The step endpoints are what the browser uses. Each makes at most one LLM call (the summary step makes none).
| Method | Endpoint | Description |
|---|---|---|
| `POST` | `/api/analyze/step/patterns` | Pattern matching (one LLM call) |
| `POST` | `/api/analyze/step/assumptions` | Implicit assumptions (one LLM call) |
| `POST` | `/api/analyze/step/unknowns` | Known unknowns / gaps (one LLM call) |
| `POST` | `/api/analyze/step/ruledout` | Ruled-out risks (one LLM call, needs findings) |
| `POST` | `/api/analyze/step/incidents` | Org incident matching (one LLM call, needs findings) |
| `POST` | `/api/analyze/step/summary` | Summary (no LLM) |
| `POST` | `/api/analyze` | Full analysis, single call (local dev only) |
| `POST` | `/api/extract-pdf` | PDF → text (no LLM) |
| `GET` | `/api/incidents` | List incident library |
| `POST` | `/api/incidents` | Add incident from post-mortem text |
| `DELETE` | `/api/incidents/{id}` | Remove incident |
| `GET` | `/api/patterns` | List all 24 failure patterns |
| `GET` | `/api/health` | LLM connectivity check |
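
The step endpoints can be driven from a script as well as from the browser. The sketch below shows the call order implied by the table. The endpoint paths come from the table, but the request and response shapes are not documented here, so the payload keys `document`, `findings`, and `results` are assumptions, as are the helper names `http_post` and `run_review`.

```python
import json
from typing import Callable
from urllib import request

# A transport takes (path, payload) and returns the decoded JSON response.
Post = Callable[[str, dict], dict]

ROUND_1 = ["patterns", "assumptions", "unknowns"]
ROUND_2 = ["ruledout", "incidents"]  # these need the Round 1 findings


def http_post(base_url: str) -> Post:
    """Build a transport that sends JSON over HTTP using the standard library."""
    def post(path: str, payload: dict) -> dict:
        req = request.Request(
            base_url + path,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with request.urlopen(req, timeout=10) as resp:
            return json.load(resp)
    return post


def run_review(post: Post, document: str) -> dict:
    """Call each step endpoint in dependency order and collect the responses."""
    results = {}
    for step in ROUND_1:
        results[step] = post(f"/api/analyze/step/{step}", {"document": document})
    findings = results["patterns"]  # assumed to carry the findings
    for step in ROUND_2:
        results[step] = post(f"/api/analyze/step/{step}",
                             {"document": document, "findings": findings})
    results["summary"] = post("/api/analyze/step/summary",
                              {"document": document, "results": dict(results)})
    return results
```

Against a local server started with `uvicorn`, which listens on port 8000 by default, a call would look like `run_review(http_post("http://localhost:8000"), design_doc_text)`. The sketch runs the steps sequentially for clarity; the steps within a round are independent and could be issued concurrently.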
| Variable | Default | Description |
|---|---|---|
| `LLM_PROVIDER` | `nvidia` | `nvidia`, `openai`, `anthropic`, or `ollama` |
| `NVIDIA_API_KEY` | (none) | Required when `LLM_PROVIDER=nvidia` |
| `NVIDIA_MODEL` | `nvidia/llama-3.1-nemotron-70b-instruct` | NIM model ID |
| `OPENAI_API_KEY` | (none) | Required when `LLM_PROVIDER=openai` |
| `OPENAI_MODEL` | `gpt-4o` | OpenAI model |
| `ANTHROPIC_API_KEY` | (none) | Required when `LLM_PROVIDER=anthropic` |
| `ANTHROPIC_MODEL` | `claude-sonnet-4-6` | Anthropic model |
| `OLLAMA_MODEL` | `llama3` | Ollama model |
| `OLLAMA_BASE_URL` | `http://localhost:11434` | Ollama server URL |
| `DATABASE_URL` | (none) | Postgres URL (local/Docker) |
| `POSTGRES_URL` | (none) | Postgres URL (set by Vercel/Neon) |
| `CONFIDENCE_THRESHOLD` | `0.6` | Minimum score to include a finding |
| `MAX_FAILURE_MODES` | `10` | Max findings returned |
| `MAX_DOCUMENT_SIZE` | `50000` | Max input characters |
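
To make the interaction between these variables concrete, here is a minimal sketch of reading a few of them and applying the two finding limits. It is not the project's `config.py`: the `Settings` class, the precedence of `DATABASE_URL` over `POSTGRES_URL`, the `confidence` key on each finding, and the sort-before-truncate order are all assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    llm_provider: str
    database_url: Optional[str]
    confidence_threshold: float
    max_failure_modes: int
    max_document_size: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from the environment, using the defaults from the table."""
    return Settings(
        llm_provider=env.get("LLM_PROVIDER", "nvidia"),
        # Assumption: DATABASE_URL wins when both variables are set.
        database_url=env.get("DATABASE_URL") or env.get("POSTGRES_URL"),
        confidence_threshold=float(env.get("CONFIDENCE_THRESHOLD", "0.6")),
        max_failure_modes=int(env.get("MAX_FAILURE_MODES", "10")),
        max_document_size=int(env.get("MAX_DOCUMENT_SIZE", "50000")),
    )


def select_findings(findings: list, settings: Settings) -> list:
    """Drop findings below the threshold, then keep the highest-scoring ones."""
    kept = [f for f in findings if f["confidence"] >= settings.confidence_threshold]
    kept.sort(key=lambda f: f["confidence"], reverse=True)
    return kept[: settings.max_failure_modes]
```

With the defaults, a finding scored 0.5 would be dropped and at most ten of the remaining findings would be returned.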
second_opinion/
├── api/
│ └── index.py # Vercel entry point
├── app/
│ ├── app.py # FastAPI routes
│ ├── analyzer.py # Analysis pipeline + step methods
│ ├── patterns.py # 24 failure pattern definitions
│ ├── llm.py # Multi-provider LLM client (lazy singleton)
│ ├── models.py # Pydantic models
│ ├── config.py # Settings from environment variables
│ ├── database.py # asyncpg connection pool
│ ├── incident_store.py # Incident CRUD
│ ├── incident_extractor.py # LLM extraction from post-mortems
│ ├── templates/ # Jinja2 HTML
│ └── static/ # CSS + JS (no build step)
├── samples/
│ ├── design-doc/ # Example design documents to analyze
│ └── postmortem/ # Example post-mortems to load into incident library
├── docs/
│ └── screenshots/ # UI screenshots for README
├── Dockerfile
├── docker-compose.yml # Postgres (default) + Ollama (--profile ollama)
├── vercel.json
└── requirements.txt
Build an org-specific incident library from past post-mortems. Design reviews are grounded against your real failure history, not generic patterns.
Accept a code snippet or PR diff alongside the design doc. Detect divergences between what the doc claims and what the code actually implements.
Example: "The doc says circuit breakers are in place on all external calls. payment_client.py has no circuit breaker."
Connect to Prometheus or Datadog. When the design says "will handle 10K RPS", pull real metrics for the named services and flag the delta between assumed and actual.
Example: "Design assumes 4x headroom to peak. Current P99 at 500 req/min is 1.8s. Connection pool is at 90% utilization."
Contributions are welcome. See CONTRIBUTING.md.
Key areas: new failure patterns, test suite, .docx support, OCR for scanned PDFs, more LLM providers.
MIT — see LICENSE.
Second Opinion assists in design reviews but does not guarantee correctness or completeness. Always apply human judgment to the results.



