Transcript Analysis Pipeline

A three-stage LLM pipeline for turning meeting transcripts into structured, decision-grade analysis — with explicit grounding, observed-vs-inferred labeling, and an audit pass that catches fabrication before it reaches the reader.

Most LLM-powered analysis tools optimize for output that looks confident. This one optimizes for output you can act on — which means letting the system say "not enough signal here" and giving the audit stage no power to invent.

Why this exists

LLMs are good at producing prose that reads like analysis. They are not, by default, good at producing analysis that holds up under scrutiny. The failure modes are well-known: fabricated quantities, manufactured commitments, plausible-sounding but ungrounded business framing, recommendations padded with "align with stakeholders" filler.

This repo is one bet about how to do better: explicit grounding plus a bounded audit beats trying to make an LLM be careful in prose.

Architecture

transcript + (optional) business context
        │
        ▼
┌───────────────┐
│  1. EXTRACT   │  high-fidelity structured extraction
│   (OpenAI)    │  meeting_type, decision_state, evidence_map, ...
└───────┬───────┘
        ▼
┌───────────────┐
│ 2. SYNTHESIZE │  structured claim objects in 8 sections
│  (Anthropic)  │  every claim labeled observed | inferred | recommendation
└───────┬───────┘
        ▼
┌───────────────┐
│   3. AUDIT    │  five mutation ops only:
│  (Anthropic)  │  delete, downgrade, move, replace, collapse
└───────┬───────┘
        ▼
┌───────────────┐
│   4. RENDER   │  deterministic Python; claim objects → prose
│   (no LLM)    │
└───────┬───────┘
        ▼
   Markdown report

Three LLM stages plus a deterministic Python renderer. Each stage has a narrow job and constrained authority.
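
To make the orchestration concrete, here is a minimal sketch of how the stages hand off to each other. The function names and signatures are illustrative assumptions, not the actual API in engine.py; the stage callables are passed in so the sketch stays self-contained.

# Illustrative sketch only; names and signatures are assumptions, not engine.py's real API.
def run_pipeline(transcript, business_context, extract, synthesize, audit, render):
    extraction = extract(transcript, business_context)  # stage 1 (OpenAI): high-fidelity structured extraction
    claims = synthesize(extraction)                      # stage 2 (Anthropic): claim objects in 8 sections, each labeled
    audited = audit(claims, extraction)                  # stage 3 (Anthropic): may only weaken or remove claims
    return render(audited)                               # stage 4 (no LLM): deterministic claim objects -> Markdown

The shape matters more than the names: the audit stage only returns mutations of existing claims, never new ones.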

Design principles

These are the calls that make the pipeline different from a generic "summarize this meeting" tool:

  • Sparse is good. Empty is allowed. Sections may legitimately be short or empty. The model has explicit permission to abstain. "No recommendation yet, here is what we need first" is a valid output.
  • Observed vs inferred is always labeled. Every claim is a structured object that declares what it is and where it came from (see the sketch after this list). Inferred claims show as [INFERRED] in the rendered output.
  • No quantity unless grounded. Numbers, percentages, and timelines must trace to the transcript or the business context packet. Invented precision (e.g. "audit the past 60-90 days") is what audit strips.
  • Business framing is source-bounded. Only the transcript and the provided business context packet may be used for the Business Read. No generic SaaS framing, no default exec vocabulary.
  • Audit cannot create. It can only delete, downgrade, move, replace_with_insufficient_evidence, or collapse_section. Its failure modes are weakening or removal — never fabrication.
  • Commitments are sacred. A commitment requires someone in the meeting to have committed. Analyst suggestions live in a separate section.
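
To illustrate what "structured claim object" and "bounded audit" mean in practice, here is a hedged sketch. Field names and example values are assumptions made for this README; the actual shapes live in the repo (prompts.py, renderer.py).

# Hypothetical claim object; field names are illustrative, not the repo's exact schema.
claim = {
    "text": "Ship the pilot to the first customer cohort",
    "label": "observed",            # observed | inferred | recommendation
    "source": "transcript",         # transcript | business_context
    "evidence": ["Speaker B: 'let's ship the pilot to cohort one first'"],
}

# The audit stage emits operations from a closed set; there is no op that adds content.
audit_op = {
    "op": "downgrade",              # delete | downgrade | move | replace_with_insufficient_evidence | collapse_section
    "target": "business_read.inferred_implications[2]",
    "reason": "quantity not grounded in transcript or business context packet",
}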

Output structure

The rendered output has 8 sections:

  1. Context — topic, meeting_type, decision_state, participants
  2. What Happened — observed facts only
  3. DS Read — five methodological checks (claim validity, metric validity, design quality, data quality, decision sufficiency)
  4. Business Read — observed implications, inferred implications, missing context
  5. Strategic Options — gated; only produced when alternatives were actually discussed; explicitly handles "options selected in the room without formal evaluation"
  6. Recommended Path — allowed to abstain ("no recommendation yet"); supports partial_direction for meetings that confirmed direction but deferred execution
  7. Commitments and Next Steps — committed-in-meeting kept strictly separate from analyst recommendations
  8. Open Questions — questions raised in the room, plus questions the analyst thinks should have been raised
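
As a rough mental model, the synthesized result is a container keyed by these sections, where an empty list is a legitimate answer rather than a failure. The key names below are assumptions, not the repo's exact schema.

# Illustrative shape of a synthesized analysis; key names are assumptions.
# Empty lists are valid: abstention is an allowed output, not an error.
analysis = {
    "context": {},                     # topic, meeting_type, decision_state, participants
    "what_happened": [],               # observed facts only
    "ds_read": [],                     # the five methodological checks
    "business_read": [],               # observed + inferred implications, missing context
    "strategic_options": [],           # gated: only populated when alternatives were discussed
    "recommended_path": [],            # may abstain, or carry partial_direction
    "commitments_and_next_steps": [],  # committed-in-meeting kept separate from analyst suggestions
    "open_questions": [],              # raised in the room, plus what should have been raised
}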

Try it

The fastest way to understand the system is to open the notebook:

📓 walkthrough.ipynb — walks one transcript through every stage, then runs the eval harness across all fixtures.

Or run the CLI:

# 1. Install
pip install -r requirements.txt

# 2. Configure keys
cp .env.example .env
# edit .env with your OpenAI and Anthropic API keys

# 3. Run on a fixture
python cli.py transcripts/01_decision_meeting.txt

# Output: 01_decision_meeting_analysis.md

CLI options:

python cli.py transcripts/02_working_session.txt --output report.md
python cli.py transcripts/03_thin_transcript.txt --json full_result.json
python cli.py transcripts/01_decision_meeting.txt --skip-audit  # debug
python cli.py transcripts/01_decision_meeting.txt --context my_context.txt

Trust metrics: the eval harness

A demo isn't an evaluation. The harness runs all fixtures and reports metrics that matter for trust in an LLM analysis system:

For each metric: what it measures, and what "good" looks like.

  • abstention_rate: fraction of section slots that came back empty. Good: matches the transcript (high on thin transcripts, low on rich ones).
  • audit_operations_count: how many fabrications the audit caught. Good: lower is better, but non-zero is honest.
  • fabricated_commitments: commitments the audit had to move out of committed_in_meeting. Good: zero. This is the most dangerous failure mode.
  • ungrounded_quantities: audit operations that flagged invented numbers/dates. Good: zero.

Run the harness:
python eval_harness.py
python eval_harness.py --fixture decision        # filter by name
python eval_harness.py --skip-audit              # see raw synthesis quality
python eval_harness.py --out eval_results.json   # full results to JSON
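
For intuition, abstention_rate is just a ratio over section slots. A minimal sketch of the idea, assuming each result maps section names to claim lists (the harness's actual bookkeeping may differ):

# Sketch: fraction of section slots that came back empty, across all fixture results.
# Assumes each result is a dict of section name -> list of claims; not the harness's real code.
def abstention_rate(results):
    slots = [claims for result in results for claims in result.values()]
    return sum(1 for s in slots if not s) / len(slots) if slots else 0.0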

Fixtures

Three transcripts ship with the repo, each designed to stress a different pipeline behavior:

  • 01_decision_meeting.txt: a clear decision among real alternatives. Should populate Strategic Options.selected and Recommended Path.
  • 02_working_session.txt: productive problem-solving with no decision. Should abstain on Recommended Path while still producing a useful DS Read.
  • 03_thin_transcript.txt: sparse, low-signal content. Most sections should come back empty; the system showing restraint is the right answer.

Add new fixtures by dropping .txt files into transcripts/ and listing them in eval_harness.py:FIXTURES.
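
For example, assuming FIXTURES is a plain list of filenames (check eval_harness.py for its actual shape):

# eval_harness.py: hypothetical shape of the fixture registry
FIXTURES = [
    "01_decision_meeting.txt",
    "02_working_session.txt",
    "03_thin_transcript.txt",
    "04_my_new_fixture.txt",   # newly added transcript in transcripts/
]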

Project structure

transcript-analysis-pipeline/
├── prompts.py              # the three core prompts (extract, synthesize, audit)
├── agents.py               # LLM call wrappers with fallback logic
├── engine.py               # 3-stage orchestrator
├── renderer.py             # deterministic claim objects → prose-ready dict
├── exporter.py             # rendered dict → Markdown
├── cli.py                  # CLI entry point
├── eval_harness.py         # fixture-based evaluation
├── walkthrough.ipynb       # end-to-end demo notebook
├── transcripts/            # eval fixtures
├── examples/               # frozen example outputs
├── requirements.txt
├── .env.example
└── LICENSE

What this is not

  • Not a transcription tool. Bring your own transcript (Otter, Granola, Plaud, manual notes — anything that produces text).
  • Not a meeting management tool. No scheduling, no follow-up tracking, no integration with calendars.
  • Not Notion-integrated. The output is Markdown. You can paste it into any tool that takes Markdown.
  • Not optimized for cost. It runs three LLM calls per analysis. A typical run is a few cents, but this is the wrong tool if you need to process thousands of transcripts a day.

License

MIT — see LICENSE.
