A three-stage LLM pipeline for turning meeting transcripts into structured, decision-grade analysis — with explicit grounding, observed-vs-inferred labeling, and an audit pass that catches fabrication before it reaches the reader.
Most LLM-powered analysis tools optimize for output that looks confident. This one optimizes for output you can act on — which means letting the system say "not enough signal here" and giving the audit stage no power to invent.
LLMs are good at producing prose that reads like analysis. They are not, by default, good at producing analysis that holds up under scrutiny. The failure modes are well-known: fabricated quantities, manufactured commitments, plausible-sounding but ungrounded business framing, recommendations padded with "align with stakeholders" filler.
This repo is one bet about how to do better: explicit grounding plus a bounded audit beats trying to make an LLM be careful in prose.
```
transcript + (optional) business context
                │
                ▼
        ┌───────────────┐
        │ 1. EXTRACT    │  high-fidelity structured extraction
        │   (OpenAI)    │  meeting_type, decision_state, evidence_map, ...
        └───────┬───────┘
                ▼
        ┌───────────────┐
        │ 2. SYNTHESIZE │  structured claim objects in 8 sections
        │  (Anthropic)  │  every claim labeled observed | inferred | recommendation
        └───────┬───────┘
                ▼
        ┌───────────────┐
        │ 3. AUDIT      │  five mutation ops only:
        │  (Anthropic)  │  delete, downgrade, move, replace, collapse
        └───────┬───────┘
                ▼
        ┌───────────────┐
        │ 4. RENDER     │  deterministic Python; claim objects → prose
        │   (no LLM)    │
        └───────┬───────┘
                ▼
         Markdown report
```
Three LLM stages plus a deterministic Python renderer. Each stage has a narrow job and constrained authority.
These are the calls that make the pipeline different from a generic "summarize this meeting" tool:
- Sparse is good. Empty is allowed. Sections may legitimately be short or empty. The model has explicit permission to abstain. "No recommendation yet, here is what we need first" is a valid output.
- Observed vs inferred is always labeled. Every claim is a structured object that declares what it is and where it came from. Inferred claims show as `[INFERRED]` in the rendered output.
- No quantity unless grounded. Numbers, percentages, and timelines must trace to the transcript or the business context packet. Invented precision (e.g. "audit the past 60-90 days") is what the audit strips.
- Business framing is source-bounded. Only the transcript and the provided business context packet may be used for the Business Read. No generic SaaS framing, no default exec vocabulary.
- Audit cannot create. It can only `delete`, `downgrade`, `move`, `replace_with_insufficient_evidence`, or `collapse_section`. Its failure modes are weakening or removal — never fabrication.
- Commitments are sacred. A commitment requires someone in the meeting to have committed. Analyst suggestions live in a separate section.
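The observed/inferred labeling and the audit's bounded authority can be made concrete with a small sketch. Field names here are assumptions inferred from the behavior described above, not the repo's actual schema:

```python
from dataclasses import dataclass, field

# Illustrative claim object: every claim declares its epistemic status and
# its grounding, so the renderer and audit never have to parse prose.
@dataclass
class Claim:
    text: str
    label: str                                   # "observed" | "inferred" | "recommendation"
    sources: list = field(default_factory=list)  # transcript/context anchors

def downgrade(claim: Claim) -> Claim:
    # One of the five audit ops: weaken "observed" to "inferred" when the
    # grounding is too thin. Note that no op constructs a brand-new claim.
    return Claim(claim.text, "inferred", claim.sources)

weakened = downgrade(Claim("Launch will slip to Q3", "observed"))
```

Because every op maps existing claims to equal-or-weaker claims, the worst the audit can do is remove signal, which is the intended failure direction.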
The rendered output has 8 sections:
- Context — topic, meeting_type, decision_state, participants
- What Happened — observed facts only
- DS Read — five methodological checks (claim validity, metric validity, design quality, data quality, decision sufficiency)
- Business Read — observed implications, inferred implications, missing context
- Strategic Options — gated; only produced when alternatives were actually discussed; explicitly handles "options selected in the room without formal evaluation"
- Recommended Path — allowed to abstain ("no recommendation yet"); supports `partial_direction` for meetings that confirmed direction but deferred execution
- Commitments and Next Steps — committed-in-meeting kept strictly separate from analyst recommendations
- Open Questions — questions raised in the room, plus questions the analyst thinks should have been raised
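Because rendering is deterministic, the labeling rule is a plain string transform rather than a prompt. A minimal sketch of that rule (assumed, not the repo's actual code):

```python
# Inferred claims carry an explicit [INFERRED] tag in the rendered Markdown;
# observed claims and recommendations render without it.
def render_claim(text: str, label: str) -> str:
    prefix = "[INFERRED] " if label == "inferred" else ""
    return f"- {prefix}{text}"

line = render_claim("Churn risk may rise next quarter", "inferred")
# line == "- [INFERRED] Churn risk may rise next quarter"
```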
The fastest way to understand the system is to open the notebook:
📓 `walkthrough.ipynb` — walks one transcript through every stage, then runs the eval harness across all fixtures.
Or run the CLI:
```bash
# 1. Install
pip install -r requirements.txt

# 2. Configure keys
cp .env.example .env
# edit .env with your OpenAI and Anthropic API keys

# 3. Run on a fixture
python cli.py transcripts/01_decision_meeting.txt
# Output: 01_decision_meeting_analysis.md
```

CLI options:

```bash
python cli.py transcripts/02_working_session.txt --output report.md
python cli.py transcripts/03_thin_transcript.txt --json full_result.json
python cli.py transcripts/01_decision_meeting.txt --skip-audit      # debug
python cli.py transcripts/01_decision_meeting.txt --context my_context.txt
```

A demo isn't an evaluation. The harness runs all fixtures and reports metrics that matter for trust in an LLM analysis system:
| Metric | What it measures | What "good" looks like |
|---|---|---|
| `abstention_rate` | Fraction of section slots that came back empty | Should match the transcript — high on thin transcripts, low on rich ones |
| `audit_operations_count` | How many fabrications the audit caught | Lower is better, but non-zero is honest |
| `fabricated_commitments` | Commitments the audit had to move out of `committed_in_meeting` | Should be zero. This is the most dangerous failure mode |
| `ungrounded_quantities` | Audit operations that flagged invented numbers/dates | Should be zero |
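As one illustration, a metric like `abstention_rate` reduces to counting empty section slots. The section names below are hypothetical; the real implementation lives in `eval_harness.py`:

```python
# Fraction of section slots the model left empty -- on a thin transcript a
# high value is correct behavior, not a failure.
def abstention_rate(sections: dict[str, list]) -> float:
    empty = sum(1 for claims in sections.values() if not claims)
    return empty / len(sections)

rate = abstention_rate({
    "what_happened":     [{"text": "..."}],
    "recommended_path":  [],   # model abstained: a valid output
    "strategic_options": [],   # gated off: no alternatives discussed
    "open_questions":    [{"text": "..."}],
})
# rate == 0.5
```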
```bash
python eval_harness.py
python eval_harness.py --fixture decision       # filter by name
python eval_harness.py --skip-audit             # see raw synthesis quality
python eval_harness.py --out eval_results.json  # full results to JSON
```

Three transcripts ship with the repo, each designed to stress a different pipeline behavior:
| Fixture | Tests |
|---|---|
| `01_decision_meeting.txt` | A clear decision among real alternatives. Should populate `Strategic Options.selected` and Recommended Path. |
| `02_working_session.txt` | Productive problem-solving with no decision. Should abstain on Recommended Path while still producing a useful DS Read. |
| `03_thin_transcript.txt` | Sparse, low-signal content. Most sections should come back empty. The system showing restraint is the right answer. |
Add new fixtures by dropping `.txt` files into `transcripts/` and listing them in `eval_harness.py:FIXTURES`.
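The registry is just a list of filenames. An assumed sketch of its shape (check the actual `FIXTURES` definition in `eval_harness.py` before relying on this):

```python
# Fixture registry: each entry is a filename under transcripts/.
FIXTURES = [
    "01_decision_meeting.txt",
    "02_working_session.txt",
    "03_thin_transcript.txt",
    "04_my_new_fixture.txt",   # hypothetical new fixture you dropped in
]
```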
```
transcript-analysis-pipeline/
├── prompts.py          # the three core prompts (extract, synthesize, audit)
├── agents.py           # LLM call wrappers with fallback logic
├── engine.py           # 3-stage orchestrator
├── renderer.py         # deterministic claim objects → prose-ready dict
├── exporter.py         # rendered dict → Markdown
├── cli.py              # CLI entry point
├── eval_harness.py     # fixture-based evaluation
├── walkthrough.ipynb   # end-to-end demo notebook
├── transcripts/        # eval fixtures
├── examples/           # frozen example outputs
├── requirements.txt
├── .env.example
└── LICENSE
```
- Not a transcription tool. Bring your own transcript (Otter, Granola, Plaud, manual notes — anything that produces text).
- Not a meeting management tool. No scheduling, no follow-up tracking, no integration with calendars.
- Not Notion-integrated. The output is Markdown. You can paste it into any tool that takes Markdown.
- Not optimized for cost. It runs three LLM calls per analysis. A typical run is a few cents, but this is the wrong tool if you need to process thousands of transcripts a day.
MIT — see LICENSE.