Memory is becoming a critical subsystem of AI agents—but it's largely unobservable.
Agents can silently forget relevant information, retrieve incorrect memories, or generate responses that contradict prior interactions. Traditional monitoring tools won't catch these failures.
Memory Audit Layer provides observability for agent memory, automatically detecting omissions, fabrications, and inconsistencies across conversations and calls.
If logging tells you what happened, Memory Audit Layer tells you whether your agent remembered correctly.
Three things silently go wrong in AI agent memory — and nobody notices:
| Silent Failure | What Happens |
|---|---|
| Forgetting | Agent fails to recall what the user stated |
| Hallucination | Agent states facts not in memory or transcript |
| Contradiction | Agent says something different from a previous call |
After every call, this system produces four scores:
| Score | Meaning | Healthy |
|---|---|---|
| P1 Retention | Did agent recall what was stated? | Higher = better |
| P2 Hallucination | Did agent fabricate anything? | 100% = nothing made up |
| P3 Contradiction | Is memory consistent across calls? | 100% = no conflicts |
| Memory Health | Composite of all three | 100% = all good |
When a contradiction is found, it tells you why:
- HALLUCINATION: agent invented a value not stated by the user. Fix the prompt or lower temperature.
- POLICY_CHANGE: user explicitly stated a new value. Real change. Update downstream.
- MEMORY_STALE: memory not refreshed after a known change.
Works on any domain. The LLM extracts whatever facts exist — clinical, support, education, finance, general conversation.
Requires Python 3.11+, Ollama, and the llama3 model (ollama pull llama3, ~4.7 GB, one-time download).
# Clone and install
git clone <repo-url> memory-audit && cd memory-audit
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -e ".[dev]"
cp .env.example .env
# Start (two terminals)
ollama serve # Terminal 1 — keep open
memory-audit serve # Terminal 2 — keep open
# Open dashboard
# http://localhost:8000/dashboard

Scenario 1. What it tests: Agent correctly recalls everything the user stated. Expect all scores near 100%.
Via dashboard — paste as transcript, User ID: user_demo:
Customer: I am on the Pro plan at $49 per month. My refund window is 30 days.
Agent: Confirmed — Pro plan, $49/month, 30-day refund window.
Via voice:
python voice_pipeline.py --call-id demo_1 --user user_demo --duration 20

Say: "I am on the Pro plan at forty-nine dollars per month. My refund window is thirty days. Agent confirmed Pro plan, forty-nine dollars, thirty day refund."
Expected:
P1 Retention: ~90% ✓ agent recalled what was stated
P2 Hallucination: 100% ✓ nothing fabricated
P3 Contradiction: 100% ✓ no conflicts in memory
Health: ~97%
Status: OK
Scenario 2 (contradiction). What it tests: A second call states a different value for the same attribute. P3 drops. The contradiction is classified and stored.
Step 1 — run Scenario 1 first to seed memory for user_demo.
Step 2 — same User ID, new Call ID. Via dashboard:
Agent: Your refund window is now 14 days from date of purchase.
Via voice (same user, new call):
python voice_pipeline.py --call-id demo_2 --user user_demo --duration 15

Say: "Your refund window is now fourteen days from date of purchase."
Expected:
P3 Contradiction: 0% ✗ conflict detected
Contradiction kind: POLICY_CHANGE
Attribute: refund_days
Old value: 30 ← stored from Call 1
New value: 14 ← stated in Call 2
Memory: old record SUPERSEDED, new value ACTIVE
Status: CONTRADICTION
Check the Contradictions tab and Memory State tab in the dashboard.
Scenario 3 (hallucination). What it tests: Agent states a value not grounded in memory or transcript. P3 drops. Kind = HALLUCINATION (not POLICY_CHANGE).
Step 1 — seed memory, User ID: user_hall:
python voice_pipeline.py --call-id hall_1 --user user_hall --duration 15

Say: "My refund window is thirty days."
Step 2 — agent fabricates a different value:
python voice_pipeline.py --call-id hall_2 --user user_hall --duration 15

Say: "Your refund window is fourteen days from date of purchase."
The word "Your" signals agent speech → tagged agent_inference. The user never said 14 days. That combination = hallucination.
Expected:
P3 Contradiction: 0% ✗ conflict detected
Contradiction kind: HALLUCINATION
Reason: Agent stated '14' via inference — user never said this
Action: Review agent system prompt, lower temperature
Status: HALLUCINATION (purple badge, not red)
Why HALLUCINATION not POLICY_CHANGE? The system checks: was the new value stated by the user? No — only the agent said it. Agent cannot use its own statement as evidence it was right. That makes it a hallucination.
Browser (paste any conversation, upload .txt, or upload audio):
http://localhost:8000/dashboard

CLI (single file):
memory-audit process --call-id c001 --user alice --transcript call.txt

CLI (batch directory):
memory-audit batch --dir ./calls/ --user alice

Voice (mic):
pip install -e ".[voice]"
python voice_pipeline.py --call-id c001 --user alice --duration 30

API:
curl -X POST http://localhost:8000/calls \
-H "Content-Type: application/json" \
-d '{"call_id":"c001","user_id":"alice","transcript":"..."}'

# See what is stored for a user
memory-audit memory --user alice
# Reset one user
memory-audit reset --user alice --confirm
# Reset everything
memory-audit reset --all --confirm

Configure Slack alerts for critical attributes in .env:
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/...
ALERT_CRITICAL_KEYS=allergy,dosage # clinical
# ALERT_CRITICAL_KEYS=refund_days,price # support

# Automated (no Ollama needed)
python -m pytest tests/ -v
# 6 scenario text tests
python text_tests.py --reset
# Benchmark — 10 ground-truth contradiction scenarios
python -m benchmark.runner --scenarios benchmark/scenarios.jsonl --verbose

Once text arrives (from dashboard, CLI, file, or voice), here is exactly what happens in order:
TEXT IN
│
▼
┌─────────────────────────────────────────────────────┐
│ STEP 1: FACT EXTRACTION │
│ │
│ Three layers run in sequence: │
│ • Regex — price, plan, email, dates (0ms, no LLM) │
│ • LLM per sentence — one small call per sentence │
│ • Key=value fallback — simplest format, any model │
│ │
│ Speaker detection: "Your..." → agent_inference │
│ "My..." → user_statement │
└───────────────────────┬─────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────┐
│ STEP 2: DEDUPLICATION │
│ │
│ "thirty days" and "30 days" → same fact, merge │
│ Uses sentence-transformers embeddings (local, free) │
└───────────────────────┬─────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────┐
│ STEP 3: CONTRADICTION DETECTION (3-stage funnel) │
│ │
│ Stage 0: exact value match? → skip (0ms) │
│ Stage 1: embedding similarity? → skip (10ms) │
│ Stage 2: LLM classifier → yes/no (2–5s) │
│ │
│ Old record → SUPERSEDED │
│ New record → ACTIVE │
└───────────────────────┬─────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────┐
│ STEP 4: MEMORY QUERY │
│ │
│ Ask backend: what do we know about this user? │
│ Returns all ACTIVE facts for the user's attributes │
└───────────────────────┬─────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────┐
│ STEP 5: DIFF ENGINE │
│ │
│ Stored memory vs agent claims: │
│ • Retained — agent mentioned it correctly │
│ • Missed — in memory but agent didn't say it │
│ • Wrong — agent said a different value │
│ • Hallucinated — agent said something not in memory│
└───────────────────────┬─────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────┐
│ STEP 6: SCORING — 4 PILLARS │
│ │
│ P1 = retained / stored │
│ P2 = 1 − (hallucinated / total) │
│ P3 = clean_facts / total │
│ Health = avg(P1, P2, P3) │
│ Kind = HALLUCINATION | POLICY_CHANGE | MEMORY_STALE│
└───────────────────────┬─────────────────────────────┘
│
▼
Persist to SQLite
Alert if critical
Dashboard updates
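Step 5 of the pipeline can be pictured as a dictionary comparison. The sketch below is not the project's diff.py: the function name diff_memory, the DiffResult class, and the plain-dict inputs are assumptions made for illustration, and the "discount" attribute is invented sample data.

```python
from dataclasses import dataclass, field


@dataclass
class DiffResult:
    retained: list = field(default_factory=list)      # agent repeated the stored value
    missed: list = field(default_factory=list)        # stored, but agent never mentioned it
    wrong: list = field(default_factory=list)         # agent gave a different value
    hallucinated: list = field(default_factory=list)  # agent claim with no stored counterpart


def diff_memory(stored: dict, agent_claims: dict) -> DiffResult:
    """Compare stored ACTIVE facts with agent claims (hypothetical helper)."""
    result = DiffResult()
    for attribute, value in stored.items():
        if attribute not in agent_claims:
            result.missed.append(attribute)
        elif agent_claims[attribute] == value:
            result.retained.append(attribute)
        else:
            result.wrong.append(attribute)
    for attribute in agent_claims:
        if attribute not in stored:
            result.hallucinated.append(attribute)
    return result


stored = {"plan": "Pro", "price": "49", "refund_days": "30"}
claims = {"plan": "Pro", "refund_days": "14", "discount": "10"}
result = diff_memory(stored, claims)
print(result)
```

Each attribute lands in exactly one bucket, which is what lets the scoring step count them without double-counting.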
Memory is a per-user filing cabinet stored in SQLite. Every fact has a lifecycle state:
User: alice
┌──────────────────────────────────────────────────────┐
│ ACTIVE (current truth — returned by query()) │
│ plan = "Pro" ← from call_001 │
│ price = "49" ← from call_001 │
│ refund_days = "14" ← updated in call_002 │
├──────────────────────────────────────────────────────┤
│ SUPERSEDED (archived — never deleted) │
│ refund_days = "30" ← replaced by "14" │
└──────────────────────────────────────────────────────┘
Lifecycle transitions:
New fact arrives
│
├─ Same value already ACTIVE?
│ └─ Skip write. No contradiction.
│
├─ Different value, high confidence (≥ 0.75)?
│ └─ Old → SUPERSEDED
│ New → ACTIVE
│
├─ Different value, low confidence (< 0.75)?
│ └─ Old → DISPUTED
│ New → DISPUTED (manual review needed)
│
└─ No existing fact?
└─ New → ACTIVE
Old records are never deleted. History is always preserved for audit.
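The transitions above reduce to a small decision function. This is a minimal sketch under the assumption that a single confidence number accompanies each new fact; the name next_states and its return shape are hypothetical, and only the 0.75 threshold comes from the diagram.

```python
from typing import Optional, Tuple

CONFIDENCE_THRESHOLD = 0.75  # threshold taken from the transitions above


def next_states(
    existing_value: Optional[str], new_value: str, confidence: float
) -> Tuple[Optional[str], Optional[str]]:
    """Return (state of old record, state of new record).

    None for the old record means there was none; None for the new
    record means the write is skipped.
    """
    if existing_value is None:
        return (None, "ACTIVE")             # no existing fact
    if existing_value == new_value:
        return ("ACTIVE", None)             # same value: skip write, no contradiction
    if confidence >= CONFIDENCE_THRESHOLD:
        return ("SUPERSEDED", "ACTIVE")     # confident change
    return ("DISPUTED", "DISPUTED")         # needs manual review


print(next_states("30", "14", 0.9))
print(next_states("30", "14", 0.6))
```

No branch deletes anything, which matches the rule that history is always preserved.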
Four methods. Any storage system works.
from memory_audit import MemoryAuditClient


class MyBackend:
    def query(self, entity, attribute): ...
    def write(self, record): ...
    def list_records(self, entity): ...
    def update_lifecycle(self, record_id, lifecycle, superseded_by=None): ...


client = MemoryAuditClient(user_id="alice", memory_backend=MyBackend())
result = client.process_call(call_id="c001", transcript="...")

Learn AI observability by building it yourself.
Run the project first. Follow Quick Start above. Submit Scenario 2 (contradiction) from the dashboard. Watch P3 drop to 0%. You now have a target — everything this week explains how that happened.
Then read: memory_audit/client.py → process_call()
This is the conductor. It calls every other module in sequence. The comments label Steps 1–9. Your task: map each step to a file name. The answers are in the import statements at the top of the file.
Step 1 → extractor.py Step 6 → diff.py
Step 2 → dedup.py Step 7 → scorer.py
Step 3 → contradiction.py Step 8 → sqlite.py
Step 4 → contradiction.py Step 9 → alerts.py
Step 5 → sqlite_persistent.py
Key design decision: client.py never imports from adapters/mem0.py or any specific backend directly. It only uses the MemoryBackend protocol from adapters/base.py. This is dependency inversion — swap the backend, nothing else changes.
Task: In a Python shell, import MemoryAuditClient and call process_call() with a one-sentence transcript. Print result.retention_score. Confirm it runs end to end before Day 2.
Read: memory_audit/core/extractor.py
The extractor turns raw conversation into structured data the rest of the pipeline can reason about. It runs three layers in sequence, merging results:
Layer 1 — Regex (_REGEX_PATTERNS, _regex_extract()): Pattern-matches price ($49), plan (Pro plan), email, dates, refund windows without any LLM call. Runs in under 1ms. Deterministic — same input always gives same output. This is why fact extraction works even when the LLM fails.
Layer 2 — LLM per sentence (_llm_extract()): Splits the transcript into sentences using _split_sentences(), then asks the LLM about each sentence individually. One small prompt per sentence instead of one large prompt for everything. Small models fail on "extract 10 facts as a JSON array" but reliably answer "does this one sentence contain a fact?" Smaller task = more reliable result.
Layer 3 — Key=value fallback (_extract_batch_fallback()): If both layers return nothing, asks for attribute=value lines — the simplest possible format any model can produce.
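To make the regex layer concrete, here is a minimal sketch of deterministic extraction. The three patterns and the function name regex_extract are assumptions written for this example; the project's _REGEX_PATTERNS may look quite different.

```python
import re

# Hypothetical patterns for illustration only.
PATTERNS = {
    "price": re.compile(r"\$(\d+(?:\.\d{2})?)"),
    "plan": re.compile(r"\b([A-Z][a-z]+) plan\b"),
    "refund_days": re.compile(r"refund window is (\d+)[ -]days?", re.IGNORECASE),
}


def regex_extract(text: str) -> dict:
    """Return the first match per attribute; no LLM involved."""
    facts = {}
    for attribute, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            facts[attribute] = match.group(1)
    return facts


facts = regex_extract(
    "I am on the Pro plan at $49 per month. My refund window is 30 days."
)
print(facts)
```

Because the function has no randomness and no network call, the same transcript always yields the same facts, which is the property the regex layer is relied on for.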
Speaker detection (_AGENT_PREFIXES): Sentences starting with "Your", "You are", "You have", "You've" are tagged agent_inference. Everything else is user_statement. This single rule is what makes hallucination detection possible — the system knows who said what.
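The prefix rule is small enough to sketch in a few lines. The tuple below copies the four prefixes listed above; the function name tag_source is an assumption, not the extractor's actual API.

```python
AGENT_PREFIXES = ("Your", "You are", "You have", "You've")  # prefixes listed above


def tag_source(sentence: str) -> str:
    """Tag a sentence by who most likely said it (hypothetical helper)."""
    # str.startswith accepts a tuple and returns True if any prefix matches.
    if sentence.strip().startswith(AGENT_PREFIXES):
        return "agent_inference"
    return "user_statement"


print(tag_source("Your refund window is 14 days."))
print(tag_source("My refund window is 14 days."))
```

Note that a plain prefix match like this one is case-sensitive and would also tag a sentence starting with "Yourself"; a production rule would likely want a word-boundary check.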
Task: Open a Python shell. Import FactExtractor and get_llm. Extract facts from "Your refund window is 14 days." Print each fact's source attribute. Confirm it shows agent_inference. Then try "My refund window is 14 days." — confirm it shows user_statement. The difference matters for Day 5.
Read first: memory_audit/adapters/base.py
This is the interface contract. Just 4 methods:
- query(entity, attribute) → return active facts
- write(record) → persist a new record
- list_records(entity) → return all records for an entity
- update_lifecycle(record_id, lifecycle, superseded_by) → change a record's state
Every memory backend — SQLite, Qdrant, Mem0, or your own — must implement exactly these 4 methods. The rest of the codebase never calls anything else. This is the Protocol pattern in Python: define the shape, not the implementation.
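As a rough illustration of the Protocol pattern, the sketch below declares the four methods with typing.Protocol and satisfies them with a toy in-memory backend. The Protocol here is a stand-in written for this example, not a copy of adapters/base.py, and the dict-shaped records (with "id", "entity", "attribute", "value", "lifecycle" keys) are an assumed shape rather than the project's MemoryRecord.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class MemoryBackend(Protocol):
    def query(self, entity, attribute): ...
    def write(self, record): ...
    def list_records(self, entity): ...
    def update_lifecycle(self, record_id, lifecycle, superseded_by=None): ...


class InMemoryBackend:
    """Toy backend: no inheritance needed, only matching method names."""

    def __init__(self):
        self._records = {}

    def query(self, entity, attribute):
        return [
            r for r in self._records.values()
            if r["entity"] == entity
            and r["attribute"] == attribute
            and r["lifecycle"] == "active"
        ]

    def write(self, record):
        self._records[record["id"]] = dict(record)

    def list_records(self, entity):
        return [r for r in self._records.values() if r["entity"] == entity]

    def update_lifecycle(self, record_id, lifecycle, superseded_by=None):
        record = self._records[record_id]
        record["lifecycle"] = lifecycle
        record["superseded_by"] = superseded_by


backend = InMemoryBackend()
backend.write({"id": "r1", "entity": "alice", "attribute": "refund_days",
               "value": "30", "lifecycle": "active"})
backend.write({"id": "r2", "entity": "alice", "attribute": "refund_days",
               "value": "14", "lifecycle": "active"})
backend.update_lifecycle("r1", "superseded", superseded_by="r2")

active_values = [r["value"] for r in backend.query("alice", "refund_days")]
print(active_values, isinstance(backend, MemoryBackend))
```

InMemoryBackend never mentions MemoryBackend, yet the isinstance check succeeds because runtime_checkable protocols test for the presence of the methods rather than for inheritance.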
Read second: memory_audit/adapters/sqlite_persistent.py
Memory is scoped by user_id. Alice and Bob share one .db file but have completely separate memory. Facts have a lifecycle:
active → current truth, returned by query()
superseded → replaced by newer value, kept for history audit
disputed → conflict detected but confidence too low to decide
deprecated → manually retired
When a contradiction is confirmed at high confidence (≥ 0.75), the old record moves to superseded and the new one becomes active. The old record is never deleted — you can always audit what was true when.
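The supersede step can be tried in isolation with an in-memory SQLite database. This is a sketch under assumptions: the table name persistent_memory and the user_id, attribute, value, and lifecycle columns come from this README, but the reduced schema, the id column, and the write_fact helper are invented for the example, and the confidence check is assumed to have already passed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed minimal schema; the real table has more columns (e.g. updated_at).
conn.execute(
    "CREATE TABLE persistent_memory ("
    "id INTEGER PRIMARY KEY, user_id TEXT, attribute TEXT, "
    "value TEXT, lifecycle TEXT)"
)


def write_fact(user_id, attribute, value):
    """Archive the current active row (if any), then insert the new value."""
    with conn:  # one transaction: both statements commit or neither does
        conn.execute(
            "UPDATE persistent_memory SET lifecycle = 'superseded' "
            "WHERE user_id = ? AND attribute = ? AND lifecycle = 'active'",
            (user_id, attribute),
        )
        conn.execute(
            "INSERT INTO persistent_memory (user_id, attribute, value, lifecycle) "
            "VALUES (?, ?, ?, 'active')",
            (user_id, attribute, value),
        )


write_fact("alice", "refund_days", "30")
write_fact("alice", "refund_days", "14")
write_fact("bob", "refund_days", "60")  # same file, separate user scope

alice_rows = conn.execute(
    "SELECT value, lifecycle FROM persistent_memory "
    "WHERE user_id = 'alice' ORDER BY id"
).fetchall()
print(alice_rows)
```

The old row is updated rather than deleted, and every query filters on user_id, so Bob's record is untouched by Alice's change.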
Task: Run Scenario 2 (contradiction). Then inspect the database directly:
import sqlite3

conn = sqlite3.connect("memory_audit.db")
rows = conn.execute(
    "SELECT user_id, attribute, value, lifecycle FROM persistent_memory "
    "ORDER BY updated_at DESC LIMIT 10"
).fetchall()
for r in rows:
    print(r)
conn.close()

You should see two rows for refund_days: one active (14 days) and one superseded (30 days). This is the lifecycle in action.
Read: memory_audit/core/contradiction.py
Naively calling the LLM for every new fact against every stored fact is unusable. On CPU, each LLM call takes 2–5 seconds, so 10 new facts × 50 stored facts = 500 LLM calls, which is roughly 17 to 42 minutes per transcript (about 25 minutes at 3 seconds per call). The solution is a three-stage funnel:
Stage 0 (exact-value guard, _values_equal()): Normalises both values before comparing. "thirty" becomes "30", "$49" becomes "49", "30 days" becomes "30". If normalised values match → same fact, no LLM call, skip. This eliminates the most common case: the user restates a fact they already stated.
Stage 1 (embedding cosine filter, _cosine_filter()): Converts both fact texts to embedding vectors using sentence-transformers and measures cosine similarity (the cosine of the angle between the vectors). If similarity < 0.70 → the facts are about different things, skip. Fast (about 10 ms), no LLM. This eliminates unrelated facts: you don't need the LLM to know "refund_days=30" and "email=alice@example.com" don't conflict.
Stage 2 (LLM classifier, _classify()): Only runs if neither Stage 0 nor Stage 1 skipped the pair. Sends a short prompt: "Do these two values conflict? Return JSON." The confidence threshold matters: below 0.6 returns no contradiction; from 0.6 up to 0.75 marks both records disputed; at or above 0.75 supersedes the old record.
Why this architecture matters: Each stage is independently testable, independently replaceable. You could swap Stage 2 for a local rule-based classifier and nothing else changes. This is the strategy pattern applied to a detection pipeline.
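The funnel can be sketched end to end with a stub in place of the LLM. Everything named here (normalise, cosine, detect, the tiny number-word table, and the hand-made two-dimensional "embeddings") is an assumption for illustration; the real stages use _values_equal(), sentence-transformers vectors, and an LLM call. Only the 0.70 similarity floor is taken from the text.

```python
import math
import re

# Tiny illustrative lookup; a real normaliser would cover far more number words.
WORD_NUMBERS = {"fourteen": "14", "thirty": "30", "forty-nine": "49"}


def normalise(value: str) -> str:
    """'thirty' -> '30', '$49' -> '49', '30 days' -> '30'; other text is lowercased."""
    text = value.strip().lower()
    for word, digits in WORD_NUMBERS.items():
        text = re.sub(rf"\b{re.escape(word)}\b", digits, text)
    match = re.search(r"\d+(?:\.\d+)?", text)
    return match.group(0) if match else text


def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def detect(old, new, classify, similarity_floor=0.70):
    """old/new: dicts with 'value' and 'embedding'; classify: the Stage 2 callable."""
    if normalise(old["value"]) == normalise(new["value"]):
        return "skip: same value"   # Stage 0
    if cosine(old["embedding"], new["embedding"]) < similarity_floor:
        return "skip: unrelated"    # Stage 1
    return "conflict" if classify(old, new) else "no conflict"  # Stage 2


calls = []  # records every time the expensive stage is reached


def stub_classifier(old, new):
    calls.append((old["value"], new["value"]))
    return True


refund_30 = {"value": "30 days", "embedding": [1.0, 0.0]}
refund_thirty = {"value": "thirty", "embedding": [1.0, 0.0]}
refund_14 = {"value": "14 days", "embedding": [0.9, 0.1]}
email = {"value": "alice@example.com", "embedding": [0.0, 1.0]}

results = [
    detect(refund_30, refund_thirty, stub_classifier),
    detect(refund_30, email, stub_classifier),
    detect(refund_30, refund_14, stub_classifier),
]
print(results, len(calls))
```

Of the three comparisons, only one reaches the classifier. Because the classifier is passed in as a plain callable, swapping it for a rule-based function requires no change to detect.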
Task: Comment out the Stage 0 check in contradiction.py (the if any(self._values_equal(...)) block). Run the contradiction test. Notice false positives appear — "30 days" conflicts with "30 days". Put it back. The Stage 0 guard is cheap insurance against LLM unreliability.
Read: memory_audit/core/scorer.py
The scorer takes raw detections and produces four numbers.
P1 Retention — retained_facts / stored_facts. The diff engine (diff.py) compares stored memory vs what the agent claimed. agent_claims is extracted separately from agent-tagged sentences. If the agent mentioned a stored attribute, it counts as retained. If agent claims are unavailable, the score is neutral (1.0) rather than misleadingly 0%.
P2 Hallucination — 1 - (hallucinated_facts / total_facts). Two sources feed this: the diff engine's hallucinated_facts list AND contradictions where new_source == agent_inference. The second source is the critical one — a contradiction caused by the agent saying something the user never stated IS a hallucination, even if the diff engine didn't catch it separately.
P3 Contradiction — clean_facts / total_facts where clean = facts not involved in any detected contradiction.
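The three ratios and the composite can be written out directly. This is a sketch, not scorer.py: the function name score and its count-based arguments are assumptions, and treating every empty denominator as a neutral 1.0 generalises the neutral-P1 rule described above. The sample counts are invented to reproduce the roughly 90% retention and 97% health shown for Scenario 1.

```python
def score(stored: int, retained: int, hallucinated: int, total: int, clean: int) -> dict:
    """Compute P1, P2, P3 and their average from raw counts (hypothetical helper)."""
    p1 = retained / stored if stored else 1.0            # retention
    p2 = 1 - (hallucinated / total) if total else 1.0    # hallucination (higher = better)
    p3 = clean / total if total else 1.0                 # contradiction (higher = better)
    health = (p1 + p2 + p3) / 3
    return {"p1": p1, "p2": p2, "p3": p3, "health": health}


scores = score(stored=10, retained=9, hallucinated=0, total=10, clean=10)
print({name: round(value, 3) for name, value in scores.items()})
```

Because Health is a plain average, a single pillar at 0% caps it at about 67%, so one confirmed contradiction is hard to miss on the dashboard.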
_classify_contradiction() — this is the most important method in the file. The classification logic:
agent_inference source + value NOT in user_stated_values → HALLUCINATION
user_statement source + value in transcript → POLICY_CHANGE
user_statement source + days_gap > 0 → MEMORY_STALE
The critical guard: user_stated_values is built from user_statement facts only — not agent facts. The agent cannot use its own statement as evidence it was not hallucinating. This is an explicit design choice, not an accident.
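The rules and the guard fit in one short function. The sketch below is an approximation, not the real _classify_contradiction(): facts are assumed to be dicts with "value" and "source" keys, "value in transcript" is approximated as "value among user-stated values", and the rule order and the UNKNOWN fallback are assumptions.

```python
def classify_contradiction(new_fact: dict, stated_facts: list, days_gap: int = 0) -> str:
    """Return the contradiction kind for new_fact (hypothetical helper)."""
    # The guard: only user statements count as evidence, never agent statements.
    user_stated_values = {
        f["value"] for f in stated_facts if f["source"] == "user_statement"
    }
    source, value = new_fact["source"], new_fact["value"]
    if source == "agent_inference" and value not in user_stated_values:
        return "HALLUCINATION"
    if source == "user_statement" and value in user_stated_values:
        return "POLICY_CHANGE"
    if source == "user_statement" and days_gap > 0:
        return "MEMORY_STALE"
    return "UNKNOWN"


agent_14 = {"value": "14", "source": "agent_inference"}
user_30 = {"value": "30", "source": "user_statement"}
user_14 = {"value": "14", "source": "user_statement"}

print(classify_contradiction(agent_14, [user_30, agent_14]))
print(classify_contradiction(user_14, [user_14]))
```

In the first call the agent's own "14" is present in stated_facts but is filtered out of user_stated_values, so it cannot vouch for itself.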
Task: Find the line in _classify_contradiction() that builds user_stated_values. Change it to use stated_facts instead of filtering by source. Re-run Scenario 3 (hallucination). Watch the kind change from HALLUCINATION to UNKNOWN. Change it back. Now you understand why that filter exists.
Read: memory_audit/api/server.py
API design: Every route reads from SQLite on each request. The server holds zero in-memory state beyond the _clients dict (which is a cache of MemoryAuditClient instances per user). This means the CLI, voice pipeline, and API all write to the same database and results appear in the same dashboard regardless of which input method was used.
Multi-user scoping: get_client(user_id) returns a cached MemoryAuditClient with a SQLitePersistentAdapter scoped to that user. Different user_id values = completely separate memory stores, even within one server session.
Dashboard architecture: The entire dashboard is a single HTML string returned by GET /dashboard. It contains embedded JavaScript that polls three endpoints every 5 seconds:
- GET /metrics → updates the 4 score cards
- GET /calls?limit=20 → updates the recent calls table
- GET /calls/{id} → fetches contradiction details per call (async, Promise.all)
The calls table uses async/await with Promise.all so all per-call detail fetches run in parallel: fetching detail for 20 calls takes roughly as long as the slowest single fetch, not 20 times as long.
Task: Start the server. Open http://localhost:8000/docs (auto-generated Swagger UI). Call POST /calls manually. Then call GET /calls/{call_id} with the ID you used. Read the full JSON response shape. This is what the dashboard JavaScript reads every 5 seconds.
Read: voice_pipeline.py
Voice adds exactly one step before the existing pipeline: mic → Whisper → text. Everything after transcription is identical to the text path.
Whisper runs locally on CPU. No API key. No internet after first model download. The small model (default) takes roughly 2× real-time on CPU — a 30-second recording transcribes in about 1 minute. Use --whisper-model tiny for faster but less accurate transcription.
Two modes:
- full (default): records the entire conversation including both user and agent turns. Speaker roles are detected from sentence prefixes.
- two-turn (--mode two-turn): records user and agent separately with a pause in between. More precise P1 measurement because the agent response is explicit and unambiguous.
The reason speaker detection matters at the transcript level: In a real call centre, you often receive one transcript with both speakers interleaved. The system must determine who said what from context — this is what _AGENT_PREFIXES solves in the extractor.
End-to-end validation task:
# Reset everything
memory-audit reset --all --confirm
# Run all 6 text scenarios
python text_tests.py --reset
# Start server and watch dashboard
memory-audit serve
# Open http://localhost:8000/dashboard

After text_tests.py completes, the dashboard should show:
- At least 1 call with Status = HALLUCINATION
- At least 1 call with Status = CONTRADICTION
- At least 1 POLICY_CHANGE in the Contradictions tab
- Varying P1 scores (some low-retention calls)
If you can look at any call result and explain why each score has that value — what was extracted, what was stored, what the agent claimed, what conflicted — you have understood the system.
Files by concept:
| Concept | File |
|---|---|
| Data shapes (Fact, MemoryRecord, CallResult) | memory_audit/core/models.py |
| Pipeline orchestration | memory_audit/client.py |
| Fact extraction (3-layer) | memory_audit/core/extractor.py |
| Semantic deduplication | memory_audit/core/dedup.py |
| Contradiction detection (3-stage) | memory_audit/core/contradiction.py |
| Retention diff engine | memory_audit/core/diff.py |
| P1/P2/P3 scoring + kind | memory_audit/core/scorer.py |
| Memory backend contract | memory_audit/adapters/base.py |
| SQLite persistent memory | memory_audit/adapters/sqlite_persistent.py |
| API routes + dashboard | memory_audit/api/server.py |
| CLI commands | memory_audit/cli/main.py |
| Voice input | voice_pipeline.py |
MIT — free to use, modify, and distribute.


