orbek/voice-agent-reference-architectures

Voice Agents — Enterprise Reference Architectures

Two production-shaped, open-source reference architectures for inbound voice agents in regulated, money- and safety-critical domains. They exist to answer one question that most "talk-to-a-bot" demos dodge: what does it actually take to build a voice agent correctly when the domain has rules you cannot violate?

Both run entirely on synthetic data, need no API keys for their safety-critical path, and ship with infrastructure-as-code, a release-blocking evaluation suite, architecture decision records, and a compliance architecture. They are built to read as senior work — the compliance and safety layers are the point, not an afterthought.

| Project | Domain | The decisive design constraint |
| --- | --- | --- |
| PatientLine | Healthcare patient-access (front desk / scheduling / refill requests) | HIPAA-aware. Guardrails enforced in code: no clinical advice, no refill approvals, no PHI before identity verification, immediate emergency routing. Every PHI hop mapped to a Business Associate Agreement. |
| WillCall | Event ticketing / box office (sales, exchanges, transfers, subscriptions) | PCI-aware. A descoped payment architecture: the agent has no path to capture a card and only ever receives a token. All-in pricing by construction (FTC fee rule satisfied structurally). |

⚠️ Synthetic data only

Neither project contains real PHI, real cardholder data, or any real customer data. They demonstrate correct design under regulatory constraints, not a production deployment. Real use would require signed BAAs / a PCI-validated provider, a QSA or qualified counsel, and legal review. Nothing here is legal, clinical, or financial advice.


Why these exist

Most public voice-agent demos are happy-path toys: a chat loop with no infrastructure, no evaluation, and no engagement with the rules of the domain they pretend to serve. These two projects are the opposite. They were built to demonstrate the things that actually separate a senior, enterprise-ready build from a prototype:

  • Safety/compliance enforced in code, not prompts. A prompt can be talked around. The hard rules live in a deterministic policy engine and in the tools themselves — the model only ever selects among policy-permitted actions, and when the model and the engine disagree, the engine wins.
  • Evaluation as a release gate. The top-weighted eval metrics are safety boundaries (zero clinical advice, zero card numbers in any transcript) — not latency. They are release-blocking in CI, and the entire safety surface is testable with no API key, no telephony, and no network.
  • Architecture Decision Records. Every significant tradeoff is defended out loud (Context → Options → Decision → What we traded → When we'd revisit).
  • Real infrastructure as code. Terraform for LiveKit, SIP trunking, event bus, encrypted storage, KMS, network segmentation — not a notebook.
  • A compliance architecture that treats accuracy as a safety issue and maps data flows to the regulations that govern them (HIPAA / 42 CFR Part 2 / FHIR; PCI DSS 4.0.1 / FTC fee rule / BOTS Act / auto-renewal law).

A shared engineering spine

Both agents are built on the same deliberate foundation, adapted to each domain:

  • Half-cascade voice pipeline on LiveKit Agents: native-audio input → a frontier text LLM for reasoning and reliable tool-calling → dedicated streaming TTS output, over a SIP trunk. Half-cascade (rather than a speech-to-speech black box) is a compliance choice as much as an engineering one — it keeps the data flow observable, loggable, and routable per-layer to in-scope endpoints.
  • Deterministic core + LLM worker share one implementation of every guardrail. An offline, dependency-free engine runs the rules (and the eval) reproducibly in CI; the live LiveKit worker swaps in the LLM but calls the exact same tools and pricing/policy logic. The LLM is the second line of defense — never the guardrail itself.
  • Event-driven post-call processing. Heavy work happens after the call: a CallCompleted event drives async summarization, system write-back (FHIR / order), PII/PHI redaction, and anomaly/fraud flagging.
  • Conversational latency budget of ~700 ms – 1.2 s end-of-speech to start-of-response, with a per-stage budget documented and the latency SLO reported last — explicitly subordinate to the safety metrics.
Caller ──SIP──▶ LiveKit AgentSession
                  │  native audio in · text LLM · streaming TTS out
                  ├─ domain tools, with guardrails enforced IN CODE
                  ▼
            on hangup → CallCompleted event
                  ▼
        post-call workers: summarize · write-back · redact · flag anomalies
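
The hangup-to-workers flow in the diagram can be sketched as a small publish/subscribe loop. This is a minimal illustration, not the projects' actual code: `EventBus`, the worker functions, and the fields on `CallCompleted` are hypothetical names, and a synchronous in-process loop stands in for whatever asynchronous event bus the Terraform provisions. Isolating each worker's failure is an assumption of this sketch, not a behavior stated in the README.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class CallCompleted:
    """Event emitted on hangup (fields are illustrative assumptions)."""
    call_id: str
    transcript: str


Worker = Callable[[CallCompleted], str]


class EventBus:
    """Hypothetical in-process stand-in for the real event bus."""

    def __init__(self) -> None:
        self._workers: List[Worker] = []

    def subscribe(self, worker: Worker) -> None:
        self._workers.append(worker)

    def publish(self, event: CallCompleted) -> List[str]:
        outcomes = []
        for worker in self._workers:
            try:
                outcomes.append(worker(event))
            except Exception as exc:
                # Assumption: one failing worker must not block the others.
                outcomes.append(f"{worker.__name__} failed: {exc}")
        return outcomes


def summarize(event: CallCompleted) -> str:
    return f"summarize: {event.call_id}"


def write_back(event: CallCompleted) -> str:
    return f"write_back: {event.call_id}"


def redact(event: CallCompleted) -> str:
    return f"redact: {event.call_id}"


def flag_anomalies(event: CallCompleted) -> str:
    return f"flag_anomalies: {event.call_id}"


bus = EventBus()
for worker in (summarize, write_back, redact, flag_anomalies):
    bus.subscribe(worker)

outcomes = bus.publish(CallCompleted(call_id="call-001", transcript="synthetic"))
```

Nothing in this path runs while the caller is on the line, which is what keeps the heavy work out of the conversational latency budget.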

The projects

Each project below answers the same three questions: what problem it solves, which architecture it uses and why, and what it's expected to achieve.

🏥 PatientLine — HIPAA-aware patient-access agent

The problem. A medical group's front desk is jammed by high-volume, low-complexity calls — scheduling, prescription refill requests, intake — and after hours those calls roll to voicemail. Automating them is attractive but dangerous: a careless agent could give medical advice, approve a refill, or disclose protected health information (PHI) to the wrong caller. The hard part isn't the happy path; it's building an agent that reliably knows the boundaries it must not cross. (Scenario: the fictional Lakeshore Medical Group, on fully synthetic patient data.)

The architecture, and why. A half-cascade LiveKit pipeline (native audio → text LLM → TTS) over SIP, with all post-call work driven by a CallCompleted event.

  • Why half-cascade over speech-to-speech? It's a compliance decision as much as an engineering one: an observable, loggable text reasoning boundary lets each layer (STT/LLM/TTS) be routed to a separate BAA-covered, in-scope endpoint, and text-mode structured tool-calling is far easier to constrain and test than a speech-to-speech black box that could hallucinate an approval. (ADR-0001 / ADR-0008.)
  • Why guardrails in code? A prompt can be talked around. A deterministic PolicyEngine runs emergency detection first on every turn, refuses every PHI tool until identity is verified, and gives the refill tool no approval path at all (controlled substances force-escalate to a human prescriber). The LLM only selects among policy-permitted actions.
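
The ordering described above (emergency check first, identity gate second, no approval action at all) might look roughly like the sketch below. Only the `PolicyEngine` name comes from the text; the `Action` members, the phrase list, `CallState`, and the `decide` method are assumptions made for illustration, and real emergency detection would need far more than substring matching.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Set


class Action(Enum):
    # Note what is absent: there is no "approve refill" action to select.
    ROUTE_EMERGENCY = "route_emergency"
    VERIFY_IDENTITY = "verify_identity"
    SCHEDULE_APPOINTMENT = "schedule_appointment"
    REQUEST_REFILL = "request_refill"
    ESCALATE_TO_HUMAN = "escalate_to_human"


# Hypothetical phrase list, for illustration only.
EMERGENCY_PHRASES = ("chest pain", "can't breathe", "overdose")

# Assumption: both of these tools touch PHI.
PHI_ACTIONS = {Action.SCHEDULE_APPOINTMENT, Action.REQUEST_REFILL}


@dataclass
class CallState:
    identity_verified: bool = False


class PolicyEngine:
    def permitted(self, utterance: str, state: CallState) -> Set[Action]:
        text = utterance.lower()
        # 1. Emergency detection runs first on every turn.
        if any(phrase in text for phrase in EMERGENCY_PHRASES):
            return {Action.ROUTE_EMERGENCY}
        # 2. Every PHI tool is refused until identity is verified.
        if not state.identity_verified:
            return {Action.VERIFY_IDENTITY, Action.ESCALATE_TO_HUMAN}
        return PHI_ACTIONS | {Action.ESCALATE_TO_HUMAN}

    def decide(self, utterance: str, state: CallState, proposed: Action) -> Action:
        """Return the model's proposal only if policy permits it."""
        allowed = self.permitted(utterance, state)
        if proposed in allowed:
            return proposed
        # The model and the engine disagree, so the engine wins.
        if Action.ROUTE_EMERGENCY in allowed:
            return Action.ROUTE_EMERGENCY
        if Action.VERIFY_IDENTITY in allowed:
            return Action.VERIFY_IDENTITY
        return Action.ESCALATE_TO_HUMAN


engine = PolicyEngine()
```

Because the approval action does not exist in the enum, no prompt injection can make the model select it; the worst case is an escalation.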

What it's expected to achieve. Safely deflect the routine front-desk workload while making the safety boundaries provable, not hopeful. Release-blocking targets in CI:

  • Zero clinical-advice responses and zero refill approvals across all (incl. adversarial) scenarios.
  • Emergency language routed to 911/ER + human at near-100% recall.
  • Identity false-accept rate driven to the floor (false-accept weighted far above false-reject).
  • Drug-name word-error-rate measured separately — accuracy is treated as a safety issue.
  • Conversational latency in the ~700 ms–1.2 s band, reported last, subordinate to the above.

(~49 tests; the entire safety surface runs with no API key, telephony, or network.)
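
A release gate over targets like these could be sketched as an ordered list of thresholds that fails closed. The metric names and numeric limits below are hypothetical placeholders (the README states "zero", "near-100%", and the 1.2 s upper bound, but no exact thresholds for recall or false-accept rate), and `evaluate` is not the projects' `run_eval.py`.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Gate:
    metric: str
    limit: float
    higher_is_better: bool = False


# Ordered by priority: safety boundaries first, latency last.
# Limits other than the zeros and 1.2 s are illustrative assumptions.
GATES = [
    Gate("clinical_advice_responses", 0),
    Gate("refill_approvals", 0),
    Gate("emergency_recall", 0.99, higher_is_better=True),
    Gate("identity_false_accept_rate", 0.001),
    Gate("response_latency_seconds", 1.2),
]


def evaluate(results: Dict[str, float]) -> Tuple[bool, List[str]]:
    """Return (release_ok, report lines in gate order)."""
    report = []
    release_ok = True
    for gate in GATES:
        value = results.get(gate.metric)
        if value is None:
            passed = False  # fail closed: a missing metric blocks the release
        elif gate.higher_is_better:
            passed = value >= gate.limit
        else:
            passed = value <= gate.limit
        release_ok = release_ok and passed
        report.append(f"{'PASS' if passed else 'FAIL'} {gate.metric}={value}")
    return release_ok, report
```

In CI, a false `release_ok` would translate into a non-zero exit code so the pipeline stops before deployment.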

🎟️ WillCall — PCI-aware ticketing agent

The problem. A venue/ticketing box office handles high call volume — venue questions, digital wallet pass trouble, ticket sales, exchanges, transfers, and season subscriptions. Most of it is low-risk deflection, but the moment money is involved a voice agent creates two acute risks: a spoken card number would pull the model, transcripts, logs, and vendor into PCI scope, and quoting prices wrong (or drip-feeding fees) violates the FTC fee rule. There's also a real fraud surface around transferring a ticket to someone else. (Scenario: the fictional Apex Live operator, on fully synthetic data; payment references are tokens only.)

The architecture, and why. The same half-cascade LiveKit + SIP + event-driven post-call spine, with two domain-defining additions and the same guardrails-in-code principle:

  • Descoped payment (the decisive choice). The agent has no card-capture path at all. At the pay step it hands off to a channel-separated/DTMF capture flow and receives back only a token. payment/descoped_capture.py accepts no card field and raises PanLeakError on PAN-like input; infra/payment_segmentation.tf hard-denies any network route from the agent to the capture environment. The result: the cardholder data environment is, by design, almost empty — agent, transcripts, logs, and model provider all sit outside it. (ADR-0002.)
  • All-in pricing by construction. Every quote comes from one pricing.py source of truth and is the total including mandatory fees — so the fee rule is satisfied structurally, not by hoping the prompt behaves. (ADR-0004.)
  • Why guardrails in code? Same principle as PatientLine: a dependency-free engine enforces every hard rule (transfer ownership, refund policy, account verification) and the LLM worker calls the exact same tools and pricing logic.
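
The token-only contract of the descoped payment step can be illustrated as follows. Only the `PanLeakError` name comes from the text; `record_payment`, `PaymentResult`, the regular expression, and the use of a Luhn check at this point are assumptions for the sketch, not the contents of `payment/descoped_capture.py`.

```python
import re
from dataclasses import dataclass


class PanLeakError(ValueError):
    """Raised when PAN-like digits reach code that must only see tokens."""


# 13 to 19 digits, optionally separated by single spaces or hyphens.
_PAN_LIKE = re.compile(r"(?<!\d)\d(?:[ -]?\d){12,18}(?!\d)")


def luhn_valid(digits: str) -> bool:
    """Standard Luhn checksum over a string of decimal digits."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def contains_pan(text: str) -> bool:
    for match in _PAN_LIKE.finditer(text):
        digits = re.sub(r"\D", "", match.group())
        if luhn_valid(digits):
            return True
    return False


@dataclass(frozen=True)
class PaymentResult:
    token: str          # opaque reference from the capture flow; no card field exists
    amount_cents: int


def record_payment(token: str, amount_cents: int) -> PaymentResult:
    """Accept only an opaque token; reject anything that looks like a card number."""
    if contains_pan(token):
        raise PanLeakError("PAN-like input rejected; expected an opaque token")
    return PaymentResult(token=token, amount_cents=amount_cents)
```

The type itself carries the design: with no field that could hold a card number, a PAN can only arrive disguised as a token, and that case raises.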

What it's expected to achieve. Absorb the box-office call volume while making "the agent never touches a card" and "every price is honest" mechanically verifiable. Release-blocking targets in CI:

  • Zero PANs across every transcript and log (pan_scan.py, Luhn-checked) — including the adversarial scenario where a caller reads a card number aloud.
  • All-in price accuracy: the quoted total equals the base price plus all mandatory fees, even in fee-drip traps.
  • Transfer authorization: no ticket reassigned without verified ownership.
  • Refund policy adherence (out-of-policy escalates, never auto-approves) and subscription cancellation honored without dark patterns.
  • Account false-accept rate (takeover) driven to the floor; latency SLO reported last.

(~42 tests + an offline deterministic engine you can drive end-to-end with no keys.)
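
The all-in price target in the list above can be made mechanical by letting the quote type expose only a total. This is a sketch under assumptions: the fee names and amounts are invented, and `Quote`, `build_quote`, and `spoken_price` are hypothetical names rather than the contents of `pricing.py`.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Dict

# Hypothetical per-ticket mandatory fees, for illustration only.
FEE_SCHEDULE: Dict[str, Decimal] = {
    "service_fee": Decimal("6.50"),
    "facility_fee": Decimal("3.00"),
}


@dataclass(frozen=True)
class Quote:
    base: Decimal
    mandatory_fees: Dict[str, Decimal]

    @property
    def total(self) -> Decimal:
        # The all-in price: base plus every mandatory fee.
        return self.base + sum(self.mandatory_fees.values(), Decimal("0"))


def build_quote(base_price: Decimal, quantity: int) -> Quote:
    """Single source of truth: every quote includes every mandatory fee."""
    fees = {name: fee * quantity for name, fee in FEE_SCHEDULE.items()}
    return Quote(base=base_price * quantity, mandatory_fees=fees)


def spoken_price(quote: Quote) -> str:
    # The only price string the agent can produce is the all-in total.
    return f"${quote.total:.2f} total, including all mandatory fees"


quote = build_quote(Decimal("45.00"), quantity=2)
```

Using `Decimal` rather than floats avoids rounding drift, so an eval can compare the quoted total to the expected total with exact equality.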


Try them in 60 seconds (no keys, no telephony)

Each project's safety-critical path needs only the Python standard library plus PyYAML; running the test suite additionally needs pytest.

# PatientLine — generate synthetic data, run the safety eval + tests
cd PatientLine
python data/generate.py --seed 42
python eval/run_eval.py
python -m pytest -q          #  or:  make ci

# WillCall: same idea, plus the PAN scan and an offline scripted demo
# (run from the repository root; use `cd ../WillCall` if you are still inside PatientLine)
cd WillCall
python data/generate.py --seed 42
python eval/run_eval.py
python eval/pan_scan.py                      # asserts zero card numbers anywhere
cd agent && python -m willcall.demo          # offline end-to-end walkthrough

Start with the eval in each project. The safety boundaries are the contract; everything else is built to keep them green. Each project's own README.md, ARCHITECTURE.md, COMPLIANCE.md, and docs/adr/ go far deeper.


Repository layout

.
├── PatientLine/     HIPAA-aware patient-access voice agent
└── WillCall/        PCI-aware event-ticketing voice agent

Each project is self-contained, with its own README, license, CI, infrastructure, evaluation suite, and architecture/compliance documentation.


Disclaimers

  • Not legal, clinical, or financial advice. The compliance documents summarize fast-moving regulatory landscapes for engineering orientation only. Verify currency and applicability with qualified professionals before any real use.
  • Synthetic data only. Neither project may be connected to a real EHR, payment processor, phone line, or any real customer data as-is.

Author

Designed and built by Carlos Barbosa as a portfolio of enterprise-ready Applied AI work.

License

Both projects are released under the Apache License 2.0 © 2026 Carlos Barbosa. See each project's LICENSE.
