Memory Audit Layer

Memory is becoming a critical subsystem of AI agents—but it's largely unobservable.

Agents can silently forget relevant information, retrieve incorrect memories, or generate responses that contradict prior interactions. Traditional monitoring tools won't catch these failures.

Memory Audit Layer provides observability for agent memory, automatically detecting omissions, fabrications, and inconsistencies across conversations and calls.

If logging tells you what happened, Memory Audit Layer tells you whether your agent remembered correctly.

Demo

[Screenshot: Overall Observability]

[Screenshot: Contradictions detected by agent]

[Screenshot: Hallucination]

What It Does

Three things silently go wrong in AI agent memory — and nobody notices:

Silent Failure	What Happens
Forgetting	Agent fails to recall what the user stated
Hallucination	Agent states facts not in memory or transcript
Contradiction	Agent says something different from a previous call

After every call, this system produces four scores:

Score	Meaning	Healthy
P1 Retention	Did agent recall what was stated?	Higher = better
P2 Hallucination	Did agent fabricate anything?	100% = nothing made up
P3 Contradiction	Is memory consistent across calls?	100% = no conflicts
Memory Health	Average of P1, P2, and P3	100% = all good

When a contradiction is found, it tells you why:

HALLUCINATION — agent invented a value not stated by the user. Fix the prompt or lower temperature.
POLICY_CHANGE — user explicitly stated a new value. Real change. Update downstream.
MEMORY_STALE — memory not refreshed after a known change.

Works on any domain. The LLM extracts whatever facts exist — clinical, support, education, finance, general conversation.

Quick Start

Requires Python 3.11+ and Ollama with the llama3 model pulled (run ollama pull llama3 once; the download is about 4.7 GB).

# Clone and install
git clone <repo-url> memory-audit && cd memory-audit
python -m venv .venv
source .venv/bin/activate      # Windows: .venv\Scripts\activate
pip install -e ".[dev]"
cp .env.example .env

# Start (two terminals)
ollama serve                   # Terminal 1 — keep open
memory-audit serve             # Terminal 2 — keep open

# Open dashboard
# http://localhost:8000/dashboard

Example Test Scenarios

Scenario 1 — Normal Call (All Scores Healthy)

What it tests: Agent correctly recalls everything the user stated. Expect all scores near 100%.

Via dashboard — paste as transcript, User ID: user_demo:

Customer: I am on the Pro plan at $49 per month. My refund window is 30 days.
Agent: Confirmed — Pro plan, $49/month, 30-day refund window.

Via voice:

python voice_pipeline.py --call-id demo_1 --user user_demo --duration 20

Say: "I am on the Pro plan at forty-nine dollars per month. My refund window is thirty days. Agent confirmed Pro plan, forty-nine dollars, thirty day refund."

Expected:

P1 Retention:     ~90%   ✓ agent recalled what was stated
P2 Hallucination: 100%   ✓ nothing fabricated
P3 Contradiction: 100%   ✓ no conflicts in memory
Health:           ~97%
Status:           OK

Scenario 2 — Contradiction Detected

What it tests: A second call states a different value for the same attribute. P3 drops. Contradiction classified and stored.

Step 1 — run Scenario 1 first to seed memory for user_demo.

Step 2 — same User ID, new Call ID. Via dashboard:

Agent: Your refund window is now 14 days from date of purchase.

Via voice (same user, new call):

python voice_pipeline.py --call-id demo_2 --user user_demo --duration 15

Say: "Your refund window is now fourteen days from date of purchase."

Expected:

P3 Contradiction: 0%     ✗ conflict detected
Contradiction kind: POLICY_CHANGE
Attribute:   refund_days
Old value:   30  ← stored from Call 1
New value:   14  ← stated in Call 2
Memory:      old record SUPERSEDED, new value ACTIVE
Status:      CONTRADICTION

Check the Contradictions tab and Memory State tab in the dashboard.

Scenario 3 — Hallucination Detected

What it tests: Agent states a value not grounded in memory or transcript. P3 drops. Kind = HALLUCINATION (not POLICY_CHANGE).

Step 1 — seed memory, User ID: user_hall:

python voice_pipeline.py --call-id hall_1 --user user_hall --duration 15

Say: "My refund window is thirty days."

Step 2 — agent fabricates a different value:

python voice_pipeline.py --call-id hall_2 --user user_hall --duration 15

Say: "Your refund window is fourteen days from date of purchase."

The word "Your" signals agent speech → tagged agent_inference. The user never said 14 days. That combination = hallucination.

Expected:

P3 Contradiction: 0%     ✗ conflict detected
Contradiction kind: HALLUCINATION
Reason: Agent stated '14' via inference — user never said this
Action: Review agent system prompt, lower temperature
Status: HALLUCINATION  (purple badge, not red)

Why HALLUCINATION and not POLICY_CHANGE? The system checks whether the new value was stated by the user. Here it was not: only the agent said it. The agent cannot use its own statement as evidence that it was right, so the contradiction is classified as a hallucination.

How to Submit Transcripts

Browser — paste any conversation, upload .txt, or upload audio:

http://localhost:8000/dashboard

CLI — single file:

memory-audit process --call-id c001 --user alice --transcript call.txt

CLI — batch directory:

memory-audit batch --dir ./calls/ --user alice

Voice — mic:

pip install -e ".[voice]"
python voice_pipeline.py --call-id c001 --user alice --duration 30

API:

curl -X POST http://localhost:8000/calls \
  -H "Content-Type: application/json" \
  -d '{"call_id":"c001","user_id":"alice","transcript":"..."}'
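
The same request can be built from Python. This is a minimal sketch using only the standard library, assuming the server from Quick Start is listening on localhost:8000. The helper name build_call_request is hypothetical, and because the response shape is not described here, the sketch builds the request and leaves the actual send commented out.

```python
import json
import urllib.request


def build_call_request(call_id: str, user_id: str, transcript: str,
                       base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build (but do not send) a POST /calls request with a JSON body."""
    payload = {"call_id": call_id, "user_id": user_id, "transcript": transcript}
    return urllib.request.Request(
        url=f"{base_url}/calls",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_call_request("c001", "alice", "Customer: I am on the Pro plan.")

# To send it, the server must be running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```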

Inspect Memory · Reset · Alerts

# See what is stored for a user
memory-audit memory --user alice

# Reset one user
memory-audit reset --user alice --confirm

# Reset everything
memory-audit reset --all --confirm

Configure Slack alerts for critical attributes in .env:

SLACK_WEBHOOK_URL=https://hooks.slack.com/services/...
ALERT_CRITICAL_KEYS=allergy,dosage        # clinical
# ALERT_CRITICAL_KEYS=refund_days,price   # support
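
To illustrate how a comma-separated ALERT_CRITICAL_KEYS value could gate alerts, here is a small sketch. The function names and the exact, case-insensitive matching rule are assumptions for illustration; the project's own alert logic may match keys differently.

```python
def parse_critical_keys(raw: str) -> set[str]:
    """Split a comma-separated ALERT_CRITICAL_KEYS value into a set of attribute names."""
    return {key.strip().lower() for key in raw.split(",") if key.strip()}


def should_alert(attribute: str, critical_keys: set[str]) -> bool:
    """Alert only when the contradicted attribute is listed as critical (assumed rule)."""
    return attribute.strip().lower() in critical_keys


keys = parse_critical_keys("allergy,dosage")
print(should_alert("dosage", keys))       # True
print(should_alert("refund_days", keys))  # False
```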

Run Tests

# Automated (no Ollama needed)
python -m pytest tests/ -v

# 6 scenario text tests
python text_tests.py --reset

# Benchmark — 10 ground-truth contradiction scenarios
python -m benchmark.runner --scenarios benchmark/scenarios.jsonl --verbose

The Pipeline — Step by Step

Once text arrives (from dashboard, CLI, file, or voice), here is exactly what happens in order:

TEXT IN
  │
  ▼
┌─────────────────────────────────────────────────────┐
│ STEP 1: FACT EXTRACTION                             │
│                                                     │
│ Three layers run in sequence:                       │
│  • Regex — price, plan, email, dates (0ms, no LLM) │
│  • LLM per sentence — one small call per sentence   │
│  • Key=value fallback — simplest format, any model  │
│                                                     │
│ Speaker detection: "Your..." → agent_inference      │
│                    "My..."   → user_statement        │
└───────────────────────┬─────────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────────┐
│ STEP 2: DEDUPLICATION                               │
│                                                     │
│ "thirty days" and "30 days" → same fact, merge      │
│ Uses sentence-transformers embeddings (local, free) │
└───────────────────────┬─────────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────────┐
│ STEP 3: CONTRADICTION DETECTION (3-stage funnel)    │
│                                                     │
│  Stage 0: exact value match?    → skip (0ms)        │
│  Stage 1: embedding similarity? → skip (10ms)       │
│  Stage 2: LLM classifier        → yes/no (2–5s)     │
│                                                     │
│  Old record → SUPERSEDED                            │
│  New record → ACTIVE                                │
└───────────────────────┬─────────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────────┐
│ STEP 4: MEMORY QUERY                                │
│                                                     │
│ Ask backend: what do we know about this user?       │
│ Returns all ACTIVE facts for the user's attributes  │
└───────────────────────┬─────────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────────┐
│ STEP 5: DIFF ENGINE                                 │
│                                                     │
│ Stored memory vs agent claims:                      │
│  • Retained — agent mentioned it correctly          │
│  • Missed   — in memory but agent didn't say it     │
│  • Wrong    — agent said a different value          │
│  • Hallucinated — agent said something not in memory│
└───────────────────────┬─────────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────────┐
│ STEP 6: SCORING — 4 PILLARS                         │
│                                                     │
│  P1 = retained / stored                             │
│  P2 = 1 − (hallucinated / total)                   │
│  P3 = clean_facts / total                           │
│  Health = avg(P1, P2, P3)                           │
│  Kind = HALLUCINATION | POLICY_CHANGE | MEMORY_STALE│
└───────────────────────┬─────────────────────────────┘
                        │
                        ▼
              Persist to SQLite
              Alert if critical
              Dashboard updates

How Memory Works

Memory is a per-user filing cabinet stored in SQLite. Every fact has a lifecycle state:

User: alice
┌──────────────────────────────────────────────────────┐
│ ACTIVE (current truth — returned by query())         │
│   plan         = "Pro"        ← from call_001        │
│   price        = "49"         ← from call_001        │
│   refund_days  = "14"         ← updated in call_002  │
├──────────────────────────────────────────────────────┤
│ SUPERSEDED (archived — never deleted)                │
│   refund_days  = "30"         ← replaced by "14"     │
└──────────────────────────────────────────────────────┘

Lifecycle transitions:

New fact arrives
       │
       ├─ Same value already ACTIVE?
       │         └─ Skip write. No contradiction.
       │
       ├─ Different value, high confidence (≥ 0.75)?
       │         └─ Old → SUPERSEDED
       │            New → ACTIVE
       │
       ├─ Different value, low confidence (< 0.75)?
       │         └─ Old → DISPUTED
       │            New → DISPUTED  (manual review needed)
       │
       └─ No existing fact?
                 └─ New → ACTIVE

Old records are never deleted. History is always preserved for audit.

Plug In Your Own Memory Backend

Four methods. Any storage system works.

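from memory_audit import MemoryAuditClient

class MyBackend:
    # Return the active facts stored for this entity and attribute.
    def query(self, entity, attribute): ...
    # Persist a new record.
    def write(self, record): ...
    # Return all records for an entity.
    def list_records(self, entity): ...
    # Change a record's lifecycle state (for example, to superseded).
    def update_lifecycle(self, record_id, lifecycle, superseded_by=None): ...

# Pass the backend to the client; the rest of the pipeline is unchanged.
client = MemoryAuditClient(user_id="alice", memory_backend=MyBackend())
result = client.process_call(call_id="c001", transcript="...")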
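
For a concrete starting point, here is a toy in-memory backend with the same four methods. It is a sketch only: it stores records as plain dicts with entity, attribute, value, and lifecycle keys, which is an assumed shape for illustration. The project's real record type lives in its models module and may differ, so this class is not guaranteed to work with MemoryAuditClient as written.

```python
import itertools


class InMemoryBackend:
    """Toy backend: records are plain dicts held in a list (hypothetical record shape)."""

    def __init__(self):
        self._records = []
        self._ids = itertools.count(1)

    def query(self, entity, attribute):
        # Only active records count as current truth.
        return [r for r in self._records
                if r["entity"] == entity and r["attribute"] == attribute
                and r["lifecycle"] == "active"]

    def write(self, record):
        stored = {**record, "id": next(self._ids),
                  "lifecycle": record.get("lifecycle", "active")}
        self._records.append(stored)
        return stored["id"]

    def list_records(self, entity):
        # Includes superseded records, so history stays auditable.
        return [r for r in self._records if r["entity"] == entity]

    def update_lifecycle(self, record_id, lifecycle, superseded_by=None):
        for r in self._records:
            if r["id"] == record_id:
                r["lifecycle"] = lifecycle
                r["superseded_by"] = superseded_by


backend = InMemoryBackend()
old_id = backend.write({"entity": "alice", "attribute": "refund_days", "value": "30"})
new_id = backend.write({"entity": "alice", "attribute": "refund_days", "value": "14"})
backend.update_lifecycle(old_id, "superseded", superseded_by=new_id)
print([r["value"] for r in backend.query("alice", "refund_days")])  # ['14']
```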

1-Week Learning Tutorial

Learn AI observability by building it yourself.

Day 1 — Run It, Then Read the Orchestrator

Run the project first. Follow Quick Start above. Submit Scenario 2 (contradiction) from the dashboard. Watch P3 drop to 0%. You now have a target — everything this week explains how that happened.

Then read: memory_audit/client.py → process_call()

This is the conductor. It calls every other module in sequence. The comments label Steps 1–9. Your task: map each step to a file name. The answers are in the import statements at the top of the file.

Step 1 → extractor.py      Step 6 → diff.py
Step 2 → dedup.py          Step 7 → scorer.py
Step 3 → contradiction.py  Step 8 → sqlite.py
Step 4 → contradiction.py  Step 9 → alerts.py
Step 5 → sqlite_persistent.py

Key design decision: client.py never imports from adapters/mem0.py or any specific backend directly. It only uses the MemoryBackend protocol from adapters/base.py. This is dependency inversion — swap the backend, nothing else changes.

Task: In a Python shell, import MemoryAuditClient and call process_call() with a one-sentence transcript. Print result.retention_score. Confirm it runs end to end before Day 2.

Day 2 — Fact Extraction

Read: memory_audit/core/extractor.py

The extractor turns raw conversation into structured data the rest of the pipeline can reason about. It runs three layers in sequence, merging results:

Layer 1 — Regex (_REGEX_PATTERNS, _regex_extract()): Pattern-matches price ($49), plan (Pro plan), email, dates, refund windows without any LLM call. Runs in under 1ms. Deterministic — same input always gives same output. This is why fact extraction works even when the LLM fails.

Layer 2 — LLM per sentence (_llm_extract()): Splits the transcript into sentences using _split_sentences(), then asks the LLM about each sentence individually. One small prompt per sentence instead of one large prompt for everything. Small models fail on "extract 10 facts as a JSON array" but reliably answer "does this one sentence contain a fact?" Smaller task = more reliable result.

Layer 3 — Key=value fallback (_extract_batch_fallback()): If both layers return nothing, asks for attribute=value lines — the simplest possible format any model can produce.
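
To make Layer 1 concrete, here is a minimal sketch of deterministic regex extraction. The patterns and the name regex_extract are hypothetical and written for this example; the project's _REGEX_PATTERNS may cover more cases or use different expressions.

```python
import re

# Hypothetical patterns for illustration; the project's _REGEX_PATTERNS may differ.
PATTERNS = {
    "price": re.compile(r"\$(\d+(?:\.\d{2})?)"),
    "plan": re.compile(r"\b([A-Z][a-z]+) plan\b"),
    "refund_days": re.compile(r"refund window is (\d+) days?", re.IGNORECASE),
    "email": re.compile(r"\b([\w.+-]+@[\w-]+(?:\.[\w-]+)+)"),
}


def regex_extract(text: str) -> dict[str, str]:
    """Return the first match for each attribute pattern found in the text."""
    facts = {}
    for attribute, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            facts[attribute] = match.group(1)
    return facts


facts = regex_extract("I am on the Pro plan at $49 per month. My refund window is 30 days.")
print(facts)  # {'price': '49', 'plan': 'Pro', 'refund_days': '30'}
```

Because no model is involved, running the function twice on the same transcript always yields the same facts.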

Speaker detection (_AGENT_PREFIXES): Sentences starting with "Your", "You are", "You have", "You've" are tagged agent_inference. Everything else is user_statement. This single rule is what makes hallucination detection possible — the system knows who said what.
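
The prefix rule is simple enough to sketch in a few lines. The prefixes below are the ones listed above; the function name tag_speaker is hypothetical, and the real extractor may strip speaker labels or handle case differently.

```python
AGENT_PREFIXES = ("Your", "You are", "You have", "You've")


def tag_speaker(sentence: str) -> str:
    """Tag a sentence as agent_inference if it starts with an agent prefix."""
    stripped = sentence.strip()
    # str.startswith accepts a tuple and returns True if any prefix matches.
    return "agent_inference" if stripped.startswith(AGENT_PREFIXES) else "user_statement"


print(tag_speaker("Your refund window is 14 days."))  # agent_inference
print(tag_speaker("My refund window is 14 days."))    # user_statement
```

One consequence of a rule this simple: a user sentence such as "Your website says 30 days" would also be tagged agent_inference, so transcripts with unusual phrasing are worth spot-checking.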

Task: Open a Python shell. Import FactExtractor and get_llm. Extract facts from "Your refund window is 14 days." Print each fact's source attribute. Confirm it shows agent_inference. Then try "My refund window is 14 days." — confirm it shows user_statement. The difference matters for Day 5.

Day 3 — Memory Storage and Lifecycle

Read first: memory_audit/adapters/base.py

This is the interface contract. Just 4 methods:

query(entity, attribute) → return active facts
write(record) → persist a new record
list_records(entity) → return all records for an entity
update_lifecycle(record_id, lifecycle, superseded_by) → change a record's state

Every memory backend — SQLite, Qdrant, Mem0, or your own — must implement exactly these 4 methods. The rest of the codebase never calls anything else. This is the Protocol pattern in Python: define the shape, not the implementation.
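
As a rough illustration of the Protocol pattern, the contract could be declared as below. The method names come from the list above; the type annotations, the runtime_checkable decorator, and the NullBackend class are assumptions added for this sketch and may not match what base.py actually declares.

```python
from typing import Any, Optional, Protocol, runtime_checkable


@runtime_checkable
class MemoryBackend(Protocol):
    """Structural contract: any class with these four methods qualifies."""

    def query(self, entity: str, attribute: str) -> list[Any]: ...
    def write(self, record: Any) -> Any: ...
    def list_records(self, entity: str) -> list[Any]: ...
    def update_lifecycle(self, record_id: Any, lifecycle: str,
                         superseded_by: Optional[Any] = None) -> None: ...


class NullBackend:
    """Stores nothing, but has the right shape, so it satisfies the protocol."""

    def query(self, entity, attribute): return []
    def write(self, record): return None
    def list_records(self, entity): return []
    def update_lifecycle(self, record_id, lifecycle, superseded_by=None): return None


print(isinstance(NullBackend(), MemoryBackend))  # True
print(isinstance(object(), MemoryBackend))       # False
```

Note that NullBackend never inherits from MemoryBackend. With runtime_checkable, isinstance only checks that the methods exist, not that their signatures match.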

Read second: memory_audit/adapters/sqlite_persistent.py

Memory is scoped by user_id. Alice and Bob share one .db file but have completely separate memory. Facts have a lifecycle:

active      → current truth, returned by query()
superseded  → replaced by newer value, kept for history audit
disputed    → conflict detected but confidence too low to decide
deprecated  → manually retired

When a contradiction is confirmed at high confidence (≥ 0.75), the old record moves to superseded and the new one becomes active. The old record is never deleted — you can always audit what was true when.

Task: Run Scenario 2 (contradiction). Then inspect the database directly:

import sqlite3
conn = sqlite3.connect("memory_audit.db")
rows = conn.execute(
    "SELECT user_id, attribute, value, lifecycle FROM persistent_memory ORDER BY updated_at DESC LIMIT 10"
).fetchall()
for r in rows: print(r)

You should see two rows for refund_days — one active (14 days) and one superseded (30 days). This is the lifecycle in action.

Day 4 — Contradiction Detection

Read: memory_audit/core/contradiction.py

Naively calling the LLM for every new fact against every stored fact is unusable: on CPU, each LLM call takes 2–5 seconds, so 10 new facts × 50 stored facts = 500 LLM calls, or roughly 17 to 42 minutes per transcript (25 minutes at 3 seconds per call). The solution is a three-stage funnel:

Stage 0 — Exact-value guard (_values_equal()): Normalises both values before comparing. "thirty" becomes "30", "$49" becomes "49", "30 days" becomes "30". If normalised values match → same fact, no LLM call, skip. This eliminates the most common case: user restates a fact they already stated.

Stage 1 — Embedding cosine filter (_cosine_filter()): Converts both fact texts to embedding vectors using sentence-transformers. Measures cosine similarity (angle between vectors). If similarity < 0.70 → the facts are about different things, skip. Fast (10ms), no LLM. Eliminates unrelated facts — you don't need the LLM to know "refund_days=30" and "email=alice@example.com" don't conflict.

Stage 2, the LLM classifier (_classify()): Runs only if neither Stage 0 nor Stage 1 skipped the pair. Sends a short prompt: "Do these two values conflict? Return JSON." The confidence returned decides the outcome: below 0.6 reports no contradiction; from 0.6 up to (but not including) 0.75 marks both records disputed; 0.75 or above supersedes the old record.
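
The funnel can be sketched end to end with a stubbed classifier. Everything here is illustrative: the number-word table is deliberately tiny, the vectors are hand-made stand-ins for sentence-transformers embeddings, and the names normalise, cosine, and funnel are hypothetical rather than the project's own helpers.

```python
import math
import re

WORD_NUMBERS = {"fourteen": "14", "thirty": "30", "forty-nine": "49"}  # tiny illustrative table


def normalise(value: str) -> str:
    """Lower-case, map number words to digits, and keep the first number if there is one."""
    text = value.strip().lower()
    for word, digits in WORD_NUMBERS.items():
        text = text.replace(word, digits)
    match = re.search(r"\d+(?:\.\d+)?", text)
    return match.group(0) if match else text


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def funnel(old_value, new_value, old_vec, new_vec, classify, min_similarity=0.70):
    """Return 'skip' or the classifier's verdict; the classifier runs only as a last resort."""
    if normalise(old_value) == normalise(new_value):
        return "skip"                       # Stage 0: same value after normalisation
    if cosine(old_vec, new_vec) < min_similarity:
        return "skip"                       # Stage 1: facts are about different things
    return classify(old_value, new_value)   # Stage 2: the expensive call


llm_calls = []

def fake_classify(old, new):
    """Stand-in for the LLM classifier; records how often it is reached."""
    llm_calls.append((old, new))
    return "conflict"


print(funnel("thirty days", "30 days", [1.0, 0.0], [1.0, 0.0], fake_classify))  # skip
print(funnel("30", "alice@example.com", [1.0, 0.0], [0.0, 1.0], fake_classify))  # skip
print(funnel("30 days", "14 days", [1.0, 0.1], [1.0, 0.2], fake_classify))       # conflict
print(len(llm_calls))                                                            # 1
```

Of the three comparisons, only the last reaches the classifier, which is the point of the funnel.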

Why this architecture matters: Each stage is independently testable, independently replaceable. You could swap Stage 2 for a local rule-based classifier and nothing else changes. This is the strategy pattern applied to a detection pipeline.

Task: Comment out the Stage 0 check in contradiction.py (the if any(self._values_equal(...)) block). Run the contradiction test. Notice false positives appear — "30 days" conflicts with "30 days". Put it back. The Stage 0 guard is cheap insurance against LLM unreliability.

Day 5 — Scoring and Kind Classification

Read: memory_audit/core/scorer.py

The scorer takes raw detections and produces four numbers.

P1 Retention — retained_facts / stored_facts. The diff engine (diff.py) compares stored memory vs what the agent claimed. agent_claims is extracted separately from agent-tagged sentences. If the agent mentioned a stored attribute, it counts as retained. If agent claims are unavailable, the score is neutral (1.0) rather than misleadingly 0%.

P2 Hallucination — 1 - (hallucinated_facts / total_facts). Two sources feed this: the diff engine's hallucinated_facts list AND contradictions where new_source == agent_inference. The second source is the critical one — a contradiction caused by the agent saying something the user never stated IS a hallucination, even if the diff engine didn't catch it separately.

P3 Contradiction — clean_facts / total_facts where clean = facts not involved in any detected contradiction.
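
The three ratios and their average can be checked by hand with a short sketch. The function name score_call and its count-based signature are hypothetical, and returning a neutral 1.0 when a denominator is zero is an assumption extended from the P1 behaviour described above.

```python
def score_call(retained: int, stored: int, hallucinated: int,
               contradicted: int, total: int) -> dict[str, float]:
    """Compute P1, P2, P3 and their average from raw counts."""
    p1 = retained / stored if stored else 1.0             # retention
    p2 = 1 - hallucinated / total if total else 1.0       # 1.0 means nothing fabricated
    p3 = (total - contradicted) / total if total else 1.0  # share of facts with no conflict
    health = (p1 + p2 + p3) / 3
    return {"P1": p1, "P2": p2, "P3": p3, "health": health}


# 9 of 10 stored facts recalled, nothing fabricated, no conflicts.
scores = score_call(retained=9, stored=10, hallucinated=0, contradicted=0, total=10)
print({name: round(value * 100) for name, value in scores.items()})
```

With 90% retention and clean P2 and P3, the average works out to about 97%, in line with the expected output of Scenario 1.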

_classify_contradiction() — this is the most important method in the file. The classification logic:

agent_inference source + value NOT in user_stated_values → HALLUCINATION
user_statement source  + value in transcript             → POLICY_CHANGE  
user_statement source  + days_gap > 0                   → MEMORY_STALE

The critical guard: user_stated_values is built from user_statement facts only — not agent facts. The agent cannot use its own statement as evidence it was not hallucinating. This is an explicit design choice, not an accident.
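
A small sketch shows why the source filter matters. The rule order follows the three lines above; treating "value in transcript" as membership in user_stated_values, the UNKNOWN fallback, and the dict-shaped facts are assumptions made for this illustration, not the exact code in scorer.py.

```python
def classify_contradiction(new_source: str, new_value: str,
                           user_stated_values: set[str], days_gap: int = 0) -> str:
    """Apply the classification rules in order; fall back to UNKNOWN."""
    if new_source == "agent_inference" and new_value not in user_stated_values:
        return "HALLUCINATION"
    if new_source == "user_statement" and new_value in user_stated_values:
        return "POLICY_CHANGE"
    if new_source == "user_statement" and days_gap > 0:
        return "MEMORY_STALE"
    return "UNKNOWN"


# Facts from a call where only the agent mentions 14 days.
facts = [{"source": "agent_inference", "value": "14"}]

# With the guard: only user statements count as evidence.
guarded = {f["value"] for f in facts if f["source"] == "user_statement"}
print(classify_contradiction("agent_inference", "14", guarded))    # HALLUCINATION

# Without the guard: the agent's own statement vouches for itself.
unguarded = {f["value"] for f in facts}
print(classify_contradiction("agent_inference", "14", unguarded))  # UNKNOWN
```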

Task: Find the line in _classify_contradiction() that builds user_stated_values. Change it to use stated_facts instead of filtering by source. Re-run Scenario 3 (hallucination). Watch the kind change from HALLUCINATION to UNKNOWN. Change it back. Now you understand why that filter exists.

Day 6 — API and Dashboard

Read: memory_audit/api/server.py

API design: Every route reads from SQLite on each request. The server holds zero in-memory state beyond the _clients dict (which is a cache of MemoryAuditClient instances per user). This means the CLI, voice pipeline, and API all write to the same database and results appear in the same dashboard regardless of which input method was used.

Multi-user scoping: get_client(user_id) returns a cached MemoryAuditClient with a SQLitePersistentAdapter scoped to that user. Different user_id values = completely separate memory stores, even within one server session.

Dashboard architecture: The entire dashboard is a single HTML string returned by GET /dashboard. It contains embedded JavaScript that polls three endpoints every 5 seconds:

GET /metrics → updates the 4 score cards
GET /calls?limit=20 → updates the recent calls table
GET /calls/{id} → fetches contradiction details per call (async, Promise.all)

The calls table uses async/await with Promise.all so all per-call detail fetches are issued concurrently. Fetching detail for 20 calls therefore takes far less time than 20 sequential requests.
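
The dashboard does this in JavaScript, but the same idea can be shown in Python with asyncio.gather, which plays the role of Promise.all. This is an analogy only: fetch_detail is a hypothetical stand-in that sleeps instead of calling GET /calls/{id}.

```python
import asyncio
import time


async def fetch_detail(call_id: str, delay: float = 0.05) -> dict:
    """Stand-in for GET /calls/{id}; sleeps instead of doing network I/O."""
    await asyncio.sleep(delay)
    return {"call_id": call_id}


async def fetch_all(call_ids: list[str]) -> list[dict]:
    # gather() schedules all coroutines concurrently and returns results in input order.
    return await asyncio.gather(*(fetch_detail(c) for c in call_ids))


start = time.perf_counter()
details = asyncio.run(fetch_all([f"c{i:03d}" for i in range(20)]))
elapsed = time.perf_counter() - start
print(len(details), f"{elapsed:.2f}s")
```

Run one after another, twenty 50 ms waits would take about a second; run concurrently, the total stays close to a single wait.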

Task: Start the server. Open http://localhost:8000/docs (auto-generated Swagger UI). Call POST /calls manually. Then call GET /calls/{call_id} with the ID you used. Read the full JSON response shape. This is what the dashboard JavaScript reads every 5 seconds.

Day 7 — Voice Input and End to End

Read: voice_pipeline.py

Voice adds exactly one step before the existing pipeline: mic → Whisper → text. Everything after transcription is identical to the text path.

Whisper runs locally on CPU. No API key. No internet after first model download. The small model (default) takes roughly 2× real-time on CPU — a 30-second recording transcribes in about 1 minute. Use --whisper-model tiny for faster but less accurate transcription.

Two modes:

full (default): record the entire conversation including both user and agent turns. Speaker roles detected from sentence prefixes.
two-turn (--mode two-turn): records user and agent separately with a pause in between. More precise P1 measurement because agent response is explicit and unambiguous.

The reason speaker detection matters at the transcript level: In a real call centre, you often receive one transcript with both speakers interleaved. The system must determine who said what from context — this is what _AGENT_PREFIXES solves in the extractor.

End-to-end validation task:

# Reset everything
memory-audit reset --all --confirm

# Run all 6 text scenarios
python text_tests.py --reset

# Start server and watch dashboard
memory-audit serve
# Open http://localhost:8000/dashboard

After text_tests.py completes, the dashboard should show:

At least 1 call with Status = HALLUCINATION
At least 1 call with Status = CONTRADICTION
At least 1 POLICY_CHANGE in the Contradictions tab
Varying P1 scores (some low-retention calls)

If you can look at any call result and explain why each score has that value — what was extracted, what was stored, what the agent claimed, what conflicted — you have understood the system.

Files by concept:

Concept	File
Data shapes (Fact, MemoryRecord, CallResult)	`memory_audit/core/models.py`
Pipeline orchestration	`memory_audit/client.py`
Fact extraction (3-layer)	`memory_audit/core/extractor.py`
Semantic deduplication	`memory_audit/core/dedup.py`
Contradiction detection (3-stage)	`memory_audit/core/contradiction.py`
Retention diff engine	`memory_audit/core/diff.py`
P1/P2/P3 scoring + kind	`memory_audit/core/scorer.py`
Memory backend contract	`memory_audit/adapters/base.py`
SQLite persistent memory	`memory_audit/adapters/sqlite_persistent.py`
API routes + dashboard	`memory_audit/api/server.py`
CLI commands	`memory_audit/cli/main.py`
Voice input	`voice_pipeline.py`

License

MIT — free to use, modify, and distribute.
