Skip to content

prathamesh-6099/Memory-management-System-

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

14 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Long-Form Memory System - Phases 1-5

A production-grade memory system for AI agents that enables accurate recall across 1,000+ conversation turns.

What This Is

This is Phases 1-5 of a 6-phase implementation plan for a complete long-form memory system.

Phase 1 Features:

  • ✅ Flat file storage for Core Memory (always-injected user identity)
  • ✅ Redis storage for Long-Term Memory (persistent across sessions)
  • ✅ Two-stage extraction pipeline (heuristic filter + pattern-based classifier)
  • ✅ Basic retrieval (type-priority + recency-based)
  • ✅ Automated deduplication
  • ✅ Full memory pipeline: Extract → Store → Retrieve → Inject

Phase 2 Features:

  • ✅ Vector store (Qdrant) for semantic search
  • ✅ Embedding generation with sentence-transformers
  • ✅ Semantic similarity search
  • ✅ Multi-signal ranking (semantic + type + recency)

Phase 3 Features:

  • ✅ Stage 3 LLM-based extraction for complex cases
  • ✅ Multi-provider support (OpenAI, Anthropic, Groq)
  • ✅ Semantic deduplication using vector similarity
  • ✅ Memory updates and superseding
  • ✅ Confidence scoring with certainty modifiers
  • ✅ Confidence boosting for repeated mentions

Phase 4 Features:

  • ✅ Background consolidation worker
  • ✅ Memory decay for old/unused memories
  • ✅ Memory merging for similar content
  • ✅ Promotion to Core Memory
  • ✅ 5-signal ranking (semantic + type + recency + frequency + confidence)
  • ✅ Access tracking for frequency scoring

Phase 5 Features:

  • ✅ RAGAS-based evaluation framework
  • ✅ Synthetic conversation generator (200 test samples)
  • ✅ Extraction accuracy metrics (Precision, Recall, F1)
  • ✅ Retrieval quality metrics (Context Precision/Recall, MRR)
  • ✅ Distance sweep tests (10-1000 turn recall)
  • ✅ Consolidation quality evaluation
  • ✅ Automated test runner with comprehensive reporting

Recent Improvements (February 2026):

  • Hybrid Retrieval System: Dual-branch architecture (semantic + recency) achieving 100% long-term recall at all distances (10-1000 turns)
  • Payment Domain Support: Extended extraction pipeline with 13 payment-specific regex patterns (account numbers, amounts, due dates, payment status)
  • Multi-Key API Rotation: Support for 4 simultaneous Groq API keys with automatic rotation on rate limits (400k tokens/day total capacity)
  • Optimized Extraction: Reduced Stage 3 LLM calls from ~100% to 13.3% through enhanced Phase 1/2 patterns
  • 1000-Turn Validation: Comprehensive latency testing showing 575ms mean processing, 294ms mean retrieval across 1000 conversation turns
  • 5-Signal Ranking Optimization: Rebalanced weights (30/40/10/5/15) for improved context recall (80.1%)

Status: All 5 phases verified working with production-grade performance as of February 2026


📚 Documentation


Quick Start

1. Prerequisites

  • Python 3.8+
  • Docker and Docker Compose (for Redis and Qdrant)

2. Setup

# Clone/navigate to the project directory
cd memory-system

# Install Python dependencies
pip install -r requirements.txt

# Start Redis and Qdrant
docker-compose up -d

# Verify services are running
docker-compose ps

3. Run the Demo

# Set your LLM API key(s) first:
# For Groq (fastest, recommended): 
#   Single key: set GROQ_API_KEY=your_key_here
#   Multiple keys for rate limit rotation: 
#     set GROQ_API_KEY_1=your_first_key
#     set GROQ_API_KEY_2=your_second_key
# For OpenAI: set OPENAI_API_KEY=your_key_here
# For Anthropic: set ANTHROPIC_API_KEY=your_key_here

# Phase 4 demo (consolidation & 5-signal ranking)
python demo_phase4.py

# Comprehensive test (all phases, 120+ turns, 3 consolidation cycles)
python test_all_phases.py

# Full conversation test (60-turn customer service scenario with active memory tracking)
python test_customer_conversation.py

# Active memory tracking demo (10-turn demonstration)
python demo_active_memories.py

# Simple active memory example (5 turns)
python example_active_memories.py

# Phase 5: Run comprehensive evaluation (200 test conversations)
pip install -r requirements_evaluation.txt
python run_evaluation.py

# 1000-turn latency test (production validation)
python test_1000_turn_latency.py

# Diagnostic tool for extraction phases
python diagnostic_extraction_phases.py

📊 See RESULTS_FEBRUARY_2026.md for detailed optimization results and performance analysis

The demos demonstrate:

demo_phase4.py:

  1. Full 4-phase system with LLM extraction
  2. Semantic search and 5-signal ranking
  3. Background consolidation triggering
  4. Memory decay, merging, and promotion

test_all_phases.py (comprehensive):

  1. Process 120+ conversation turns with 60+ extractable memories
  2. Trigger automatic consolidation 3 times (at turns 50, 100, 150)
  3. Demonstrate all 4 phases working together
  4. Show memory decay, merging, and promotion to core memory
  5. Test 5-signal ranking with various query types
  6. Multi-API key rotation (avoids rate limits)

test_customer_conversation.py:

  1. Realistic 60-turn customer service conversation
  2. Extract customer information, preferences, and transaction details
  3. Demonstrate active memory tracking at each turn
  4. Show which memories influenced each response with full metadata

test_1000_turn_latency.py (production validation):

  1. 1000 conversation turns in payment reminder domain
  2. Comprehensive latency measurement (processing + retrieval)
  3. Multi-key API rotation demonstration
  4. Performance validation across extended conversations
  5. Detailed statistics: mean, median, P95, P99, throughput
  6. Results saved to latency_results_1000_turns.txt

diagnostic_extraction_phases.py (debugging):

  1. Test extraction phases independently (Phase 1, 2, 3)
  2. 15 payment domain test messages
  3. Phase-by-phase pass rates and extraction counts
  4. Identify bottlenecks in extraction pipeline

demo_active_memories.py:

  1. 10-turn conversation showing memory tracking
  2. Exposes which memories influenced each response
  3. Shows memory evolution: origin_turn, last_used_turn, access_count
  4. Demonstrates memory persistence across turns

Project Structure

memory-system-phase1/
├── docker-compose.yml          # Redis + Qdrant setup
├── redis.conf                  # Redis configuration (AOF persistence)
├── requirements.txt            # Python dependencies
├── requirements_evaluation.txt # Phase 5 evaluation dependencies
├── .env.example                # Environment variable template
├── .gitignore                  # Git ignore patterns
├── README.md                   # This file
├── RESULTS_FEBRUARY_2026.md    # Optimization results & benchmarks (NEW)
├── demo_phase4.py              # Phase 4 demo (consolidation)
├── test_all_phases.py          # Comprehensive test (120+ turns)
├── test_customer_conversation.py  # Customer service test (60 turns)
├── test_1000_turn_latency.py   # 1000-turn latency validation (NEW)
├── diagnostic_extraction_phases.py # Extraction phase diagnostics (NEW)
├── demo_active_memories.py     # Active memory tracking demo
├── example_active_memories.py  # Simple active memory example
├── run_evaluation.py           # Phase 5 evaluation runner
├── memory/                     # Flat file storage
│   ├── user_1/                 # Per-user directory
│   │   ├── CORE.md             # Core identity (always injected)
│   │   ├── PREFERENCES.md      # User preferences
│   │   ├── INSTRUCTIONS.md     # Behavioral instructions
│   │   └── CONSTRAINTS.md      # Hard constraints
│   └── jennifer_martinez/      # Another user example
└── src/                        # Source code
    ├── __init__.py
    ├── config.py               # Configuration & tunable parameters
    ├── flat_file_store.py      # Flat file storage layer
    ├── redis_store.py          # Redis storage layer (with superseding)
    ├── extractor.py            # Memory extraction (Stage 1, 2 & 3) + payment patterns
    ├── llm_extractor.py        # Phase 3: LLM extraction (multi-key rotation)
    ├── retriever.py            # Memory retrieval (hybrid + 5-signal ranking)
    ├── memory_system.py        # Main orchestrator (with active memory tracking)
    ├── embedding_service.py    # Phase 2: Embedding generation
    ├── vector_store.py         # Phase 2: Qdrant vector store
    └── consolidation_worker.py # Phase 4: Background consolidation

Usage

Basic Usage

from src import MemorySystem

# Initialize for a user
memory = MemorySystem(user_id="alice")

# Process each conversation turn
for user_message in conversation:
    memory_context, stats = memory.process_turn(user_message)
    
    # Access active memories that influenced this response
    active_memories = stats.get('active_memories', [])
    for mem in active_memories:
        print(f"Memory {mem['memory_id']} influenced response:")
        print(f"  Content: {mem['content']}")
        print(f"  Origin: Turn {mem['origin_turn']}")
        print(f"  Last Used: Turn {mem['last_used_turn']}")
        print(f"  Confidence: {mem['confidence']:.2f}")
    
    # Inject memory_context into your LLM prompt
    prompt = f"""
    {memory_context}
    
    User: {user_message}
    Assistant: 
    """
    
    # Generate response with your LLM
    response = your_llm(prompt)

Active Memory Tracking

The system exposes which memories influenced each response:

# Process a turn
_, stats = memory.process_turn("What are my scheduling preferences?")

# Get active memories
active_memories = stats['active_memories']

# Example output:
# [
#   {
#     "memory_id": "mem_0142",
#     "content": "call_preference: after 11 AM",
#     "type": "preference",
#     "origin_turn": 1,
#     "last_used_turn": 412,
#     "confidence": 0.95,
#     "mention_count": 1,
#     "access_count": 15
#   }
# ]

This allows you to:

  • Track which memories influenced each response
  • Debug retrieval behavior
  • Audit memory usage over time
  • Validate memory relevance

Retrieval Only

# Get memory context without processing the turn
memory_context = memory.get_prompt_context(
    user_message="What's my manager's name?",
    priority_types=["entity", "preference"]
)

Update Core Memory

# Update a field in core identity
memory.update_core_memory(
    file="CORE.md",
    section="Identity", 
    field="Name",
    value="Alice"
)

Statistics

stats = memory.get_statistics()
print(f"Total memories: {stats['total_memories']}")
print(f"Memories by type: {stats['memories_by_type']}")

How It Works

Memory Layers

  1. Core Memory (Flat Files)

    • Always injected into every prompt
    • Contains: name, language, timezone, core preferences
    • Human-readable Markdown files
    • ~200-500 tokens
  2. Long-Term Memory (Redis)

    • Selectively retrieved based on current message
    • Contains: preferences, constraints, entities, commitments
    • Indexed by type and recency
    • ~500 tokens budget

Extraction Pipeline (Phase 1 & 3)

Stage 1: Sensory Filter (Heuristic)

  • Fast pattern matching
  • Filters out ~60% of turns (greetings, acknowledgments)
  • Weighted scoring: length, keywords, questions, specificity

Stage 2: Pattern-Based Classifier

  • Regex patterns for each memory type
  • Extracts key-value pairs
  • Assigns confidence scores
  • Types: preference, constraint, entity, instruction, commitment, fact

Domain-Specific Pattern Groups:

  • Personal Assistant: Name, preferences, locations, schedules (8 patterns)
  • Payment/Financial: Account numbers, amounts, due dates, payment status, arrangements (13 patterns)
  • General: Entities, facts, commitments (5+ patterns)

Payment Domain Examples:

  • Account numbers: account ending in 4567 → entity (confidence: 0.95)
  • Payment amounts: payment of $450 → fact (confidence: 0.90)
  • Due dates: due on February 5th → fact (confidence: 0.90)
  • Payment status: payment received 3 days ago → fact (confidence: 0.85)
  • Arrangements: payment extension of 10 days → commitment (confidence: 0.85)

Stage 3: LLM Extraction (Phase 3)

  • Uses OpenAI, Anthropic, or Groq for complex extraction
  • Escalates when Stage 2 confidence < 0.7 or no results
  • Structured JSON extraction with confidence scores
  • Detects memory updates and contradictions
  • Target latency: ~1-3s (Groq), ~200-500ms (standard APIs)

Retrieval Strategy

Phase 1 (Basic Retrieval)

  1. Always-On Types: constraint, instruction (always retrieved)
  2. Recency: Recent memories with exponential decay
  3. Priority Types: User-specified types to prioritize

Phase 2 (Semantic Search + Multi-Signal Ranking)

Phase 2 uses a multi-signal ranking formula that combines three signals:

final_score = w_semantic × semantic_score + w_type × type_priority + w_recency × recency_score
Signal Weight Description
Semantic 0.5 Cosine similarity between query and memory embeddings
Type Priority 0.25 Memory type importance (constraints > instructions > preferences)
Recency 0.25 Exponential decay based on turns since memory creation

Phase 5+ (Hybrid Retrieval)

Dual-Branch Architecture:

  1. Semantic Branch: Filtered by similarity threshold (MIN_SEMANTIC_SCORE: 0.3)
  2. Recency Branch: Unfiltered recent memories (last 100 turns)
  3. Merge & Deduplicate: Combines both branches for comprehensive coverage

5-Signal Ranking Formula (Phase 4 Enhanced):

final_score = w_semantic × semantic_score + w_type × type_priority + w_recency × recency_score 
              + w_frequency × access_frequency + w_confidence × confidence_score
Signal Weight Description
Semantic 0.30 Cosine similarity between query and memory embeddings
Type Priority 0.40 Memory type importance (constraints > instructions > preferences)
Recency 0.10 Exponential decay (rate: 0.001, max: 5000 turns)
Frequency 0.05 Access count normalized by logarithmic scaling
Confidence 0.15 Memory confidence score (0.6-0.95)

Results:

  • Long-term recall: 100% at all distances (10, 50, 100, 500, 1000 turns)
  • Context recall: 80.1% (comprehensive context injection)
  • Extraction F1: 89.5% (high precision and recall)

Type Priority Values:

  • constraint: 1.0 (highest - safety critical)
  • instruction: 0.95 (behavioral guidance)
  • commitment: 0.8 (time-sensitive)
  • preference: 0.7 (user experience)
  • entity: 0.6 (context)
  • fact: 0.5 (general knowledge)
  • event: 0.4 (lowest)

Deduplication (Phase 1 & 3)

Phase 1: Key-Based Deduplication

  • Redis dedup index: {type}:{key}memory_id
  • Prevents storing identical memories
  • Updates recency of existing memories instead

Phase 3: Semantic Deduplication

  • Uses vector similarity (cosine score > 0.92 = duplicate)
  • Catches near-duplicates with different wording:
    • "I prefer calls after 11 AM"
    • "Call me after 11 in the morning"
  • Boosts confidence when repeated
  • Supersedes old memories when updates detected

Configuration

All tunable parameters are in src/config.py:

Phase 1 Parameters

Parameter Default Description
SENSORY_FILTER_THRESHOLD 0.3 Minimum score to pass heuristic filter
EXTRACTION_CLASSIFIER_THRESHOLD 0.6 Minimum confidence to store memory
MAX_MEMORIES_TO_RETRIEVE 10 Top K memories to inject
MEMORY_TOKEN_BUDGET 500 Max tokens for retrieved memories
CORE_MEMORY_TOKEN_BUDGET 500 Max tokens for core memory

Phase 2 Parameters

Parameter Default Description
SEMANTIC_SEARCH_ENABLED True Enable/disable semantic search
SEMANTIC_SEARCH_LIMIT 20 Number of candidates from vector search
MIN_SEMANTIC_SCORE 0.3 Minimum similarity score threshold
QDRANT_HOST localhost Qdrant server host
QDRANT_PORT 6333 Qdrant server port
EMBEDDING_MODEL all-MiniLM-L6-v2 Sentence-transformers model
RANKING_WEIGHTS.semantic 0.5 Weight for semantic similarity
RANKING_WEIGHTS.type 0.25 Weight for type priority
RANKING_WEIGHTS.recency 0.25 Weight for recency score

Phase 3 Parameters (NEW)

Parameter Default Description
STAGE_3_ENABLED False Enable/disable LLM-based extraction
LLM_PROVIDER groq LLM provider: "openai", "anthropic", "groq"
LLM_EXTRACTION_MODEL llama-3.3-70b-versatile Model for Stage 3 extraction
STAGE_3_CONFIDENCE_THRESHOLD 0.7 Escalate to LLM if Stage 2 < this
STAGE_3_MAX_TOKENS 500 Max tokens for LLM response (prevents JSON cutoff)
STAGE_3_TEMPERATURE 0.1 Temperature for consistent extraction
GROQ_API_KEYS Array of keys Support for 4 simultaneous API keys (auto-rotation)
SEMANTIC_DEDUP_ENABLED True Enable semantic deduplication
SEMANTIC_DEDUP_THRESHOLD 0.92 Similarity score to consider duplicate
MIN_CONFIDENCE_TO_STORE 0.6 Discard memories below this confidence
CONFIDENCE_BOOST_PER_MENTION 0.1 Boost confidence when repeated
MAX_CONFIDENCE 0.95 Maximum confidence after boosts

Multi-Key API Rotation:

  • Configure up to 4 Groq API keys via environment variables: GROQ_API_KEY, GROQ_API_KEY_1, GROQ_API_KEY_2, GROQ_API_KEY_3
  • Automatic rotation on rate limit errors (429 status)
  • Total capacity: 400k tokens/day, 48k tokens/minute
  • Enables high-volume testing and production workloads

Phase 4 Parameters (NEW)

Parameter Default Description
CONSOLIDATION_ENABLED True Enable/disable background consolidation
CONSOLIDATION_INTERVAL_TURNS 50 Turns between consolidation runs
MEMORY_DECAY_ENABLED True Enable memory decay for old memories
MEMORY_MERGE_ENABLED True Enable merging of similar memories
PROMOTION_ENABLED True Enable promotion to Core Memory
DECAY_TURNS_THRESHOLD 100 Turns before decay starts
DECAY_FACTOR 0.95 Confidence multiplier per decay cycle
MERGE_SIMILARITY_THRESHOLD 0.88 Similarity to consider for merging
PROMOTION_CONFIDENCE_THRESHOLD 0.85 Min confidence for promotion
PROMOTION_MENTION_THRESHOLD 3 Min mentions for promotion
PROMOTION_ACCESS_THRESHOLD 5 Min accesses for promotion
RANKING_WEIGHTS_5_SIGNAL.semantic 0.30 Weight for semantic similarity (optimized)
RANKING_WEIGHTS_5_SIGNAL.type 0.40 Weight for type priority (optimized)
RANKING_WEIGHTS_5_SIGNAL.recency 0.10 Weight for recency score (optimized)
RANKING_WEIGHTS_5_SIGNAL.frequency 0.05 Weight for access frequency (optimized)
RANKING_WEIGHTS_5_SIGNAL.confidence 0.15 Weight for confidence score (optimized)

Phase 5+ Parameters (Hybrid Retrieval)

Parameter Default Description
HYBRID_RETRIEVAL_ENABLED True Enable dual-branch hybrid retrieval
RECENCY_DECAY_RATE 0.001 Exponential decay rate for recency scoring
RECENCY_DECAY_MAX_TURNS 5000 Maximum turns before recency score floors

What's Implemented

Phase 1

  • ✅ Flat file storage with human-editable Markdown
  • ✅ Redis storage with AOF persistence
  • ✅ Two-stage extraction (heuristic + pattern classifier)
  • ✅ Type-based and recency-based retrieval
  • ✅ Deduplication
  • ✅ Memory indices (type, recency)
  • ✅ Full pipeline orchestration
  • ✅ Statistics and monitoring

Phase 2

  • ✅ Vector store (Qdrant) for semantic embeddings
  • ✅ Embedding generation with sentence-transformers (all-MiniLM-L6-v2)
  • ✅ Semantic similarity search
  • ✅ Multi-signal ranking (semantic + type + recency)
  • ✅ Configurable ranking weights
  • ✅ Graceful fallback to Phase 1 if Qdrant unavailable

Phase 3

  • ✅ Stage 3 LLM-based extraction (OpenAI, Anthropic, Groq)
  • ✅ Escalation logic (low confidence → LLM)
  • ✅ Semantic deduplication using vector similarity
  • ✅ Memory superseding and update detection
  • ✅ Confidence modifiers (certainty words)
  • ✅ Confidence boosting for repeated mentions
  • ✅ Superseded memory filtering in retrieval

Phase 4

  • ✅ Background consolidation worker
  • ✅ Memory decay for old/unused memories
  • ✅ Memory merging for semantically similar content
  • ✅ Promotion to Core Memory files
  • ✅ 5-signal ranking (semantic + type + recency + frequency + confidence)
  • ✅ Access tracking and frequency scoring
  • ✅ Configurable consolidation intervals

Phase 5

  • ✅ RAGAS-based evaluation framework
  • ✅ Synthetic conversation generator (200 test samples)
  • ✅ Extraction accuracy metrics (Precision, Recall, F1)
  • ✅ Retrieval quality metrics (Context Precision/Recall, MRR)
  • ✅ Distance sweep tests (10-1000 turn recall)
  • ✅ Consolidation quality evaluation
  • ✅ Automated test runner with comprehensive reporting

What's Coming Next

Phase 6 (Future)

  • Parameter tuning (grid search for optimal ranking weights)
  • Production monitoring and alerts
  • A/B testing framework for parameter optimization
  • Performance optimization and caching strategies
  • Documentation and deployment guides

Potential Enhancements

  • Additional Domain Support: Healthcare, legal, customer service patterns
  • Multi-language Support: Extend patterns for non-English conversations
  • Parallel Retrieval: Run semantic + recency branches in parallel (50% latency reduction)
  • Memory Summarization: Compress old memories for token efficiency
  • Adaptive Thresholds: ML-based threshold tuning per user
  • Memory Graphs: Network relationships between memories
  • Explainability Interface: "Why was this memory retrieved?" debugging

Comprehensive Test

The test_all_phases.py runs a full system test:

python test_all_phases.py

It validates:

  • All 4 phases working together (120+ turns)
  • Automatic consolidation triggering (3 cycles at turns 50, 100, 150)
  • 60+ memory extractions from realistic dialogue
  • 5-signal ranking effectiveness
  • Memory decay, merging, and promotion to core memory
  • Multi-API key rotation (prevents rate limit issues)
  • Performance metrics across extended conversations

Customer Service Test

The test_customer_conversation.py runs a realistic scenario:

python test_customer_conversation.py

It validates:

  • 60-turn customer service conversation
  • Extraction of entities, facts, preferences, and constraints
  • Active memory tracking (which memories influenced each response)
  • Memory persistence and access count tracking
  • JSON output format for integration

Active Memory Demos

These demonstrate the active memory tracking feature:

# 10-turn comprehensive demo
python demo_active_memories.py

# 5-turn simple example
python example_active_memories.py

They show:

  • Which memories influenced each response
  • Memory metadata: origin_turn, last_used_turn, access_count, confidence
  • Memory evolution across conversation turns
  • JSON output format for debugging and auditing

Troubleshooting

Redis Connection Error

redis.exceptions.ConnectionError: Error connecting to Redis

Solution: Make sure Redis is running:

docker-compose up -d
docker-compose ps  # Should show redis as "Up"

Qdrant Connection Error (Phase 2)

Failed to connect to Qdrant

Solution: Make sure Qdrant is running:

docker-compose up -d
docker-compose ps  # Should show qdrant as "Up"

If Qdrant is not available, the system will automatically fall back to Phase 1 mode (non-semantic retrieval).

Slow First Query (Phase 2)

The first query after startup may take a few seconds as the embedding model (all-MiniLM-L6-v2) is loaded. Subsequent queries will be much faster.

No Memories Extracted

If the demo shows 0 memories extracted, check:

  1. Extraction thresholds in config.py
  2. Pattern matching in extractor.py
  3. Enable DEBUG logging to see filtering decisions

Memory Not Retrieved

If memories are stored but not retrieved:

  1. Check retrieval strategy in retriever.py
  2. Verify memory types match priority types
  3. Check token budget limits
  4. (Phase 2) Check MIN_SEMANTIC_SCORE threshold

Architecture Notes

This implementation follows the spec in LONG_FORM_MEMORY_SYSTEM_Version2.md:

  • Context ≠ Memory: Context is ephemeral (current window), Memory is persistent
  • Five-stage memory pipeline: Sensory → Encoding → Storage → Retrieval → Forgetting
  • Human-like memory model: Inspired by cognitive science research
  • Forgetting is essential: Phase 4 will add decay and consolidation

Performance

1000-Turn Production Validation (February 2026)

Test Configuration:

  • 1000 conversation turns (payment reminder domain)
  • 4 Groq API keys with rotation
  • Full extraction + retrieval pipeline
  • Comprehensive latency measurement

Latency Results:

Metric Mean Median P95 P99 Min Max
Processing (Injection) 575ms 350ms 1333ms 2042ms 25ms 25764ms
Retrieval (Extraction) 294ms 296ms 379ms 428ms 63ms 575ms

Throughput:

  • Total time: 9.58 minutes (574.6 seconds)
  • Throughput: 1.74 turns/second
  • Memories stored: 40 (0.04 per turn)
  • API rotation: Smooth distribution across all 4 keys

Extraction Pipeline Efficiency:

  • Phase 1 (Sensory Filter): 73.3% pass rate
  • Phase 2 (Pattern Matching): 46.7% extraction rate
  • Phase 3 (LLM Fallback): 13.3% escalation rate ✅
  • Result: 87% of memories extracted without LLM (cost-efficient)

Long-Term Recall (Distance Sweep):

Distance Recall Status
10 turns 100%
50 turns 100%
100 turns 100%
500 turns 100%
1000 turns 100%

Context Quality:

  • Context Recall: 80.1%
  • Context Precision: High (filtered by 5-signal ranking)
  • Extraction F1: 89.5%

Component Performance (All 5 Phases)

Operation Target Actual (Measured)
Retrieval <50ms 24-52ms (cold), 294ms (1000-turn mean) ✅
Storage <10ms 130-233ms (with vector indexing)
LLM Extraction (Groq) 50-200ms 1.1-3.2s (Groq API latency)
Embedding Model Load - ~16s (one-time cold start)
Semantic Search - 24-35ms ✅
Hybrid Retrieval - 294ms mean (includes semantic + recency branches) ✅

Phase-Specific Performance

Phase 1 (Basic):

  • Extraction: ~1-2ms per turn (heuristic + pattern matching)
  • Storage: <10ms per operation

Phase 2 (Semantic Search):

  • First query: ~16-26s (embedding model cold start)
  • Subsequent queries: 24-52ms
  • Vector indexing adds ~20-50ms to storage

Phase 3 (LLM Extraction):

  • Groq API latency: 1.1-3.2s per extraction
  • Includes retry logic for JSON parsing errors
  • Semantic deduplication: <50ms

License

This is a reference implementation based on the memory system specification.

Evaluation (Phase 5)

Running Evaluation

# Install evaluation dependencies
pip install -r requirements_evaluation.txt

# Run full evaluation suite (generates 200 test conversations)
python run_evaluation.py

Evaluation Metrics

Extraction Metrics:

  • Precision, Recall, F1 Score
  • Tests accuracy of memory extraction (Phases 1-3)

Retrieval Metrics (RAGAS-style):

  • Context Precision: Relevance of retrieved memories
  • Context Recall: Coverage of ground truth memories
  • MRR (Mean Reciprocal Rank): Ranking quality
  • Top-K accuracy: Performance at different K values

Distance Sweep:

  • Recall at 10, 50, 100, 500, 1000 turns
  • Critical test for long-form memory capability

Consolidation Metrics:

  • Decay appropriateness
  • Merge quality (duplicate reduction)
  • Promotion success rate

Evaluation Results

Results are saved to evaluation/results/evaluation_results.json with:

  • Per-conversation detailed metrics
  • Aggregated statistics across 200 conversations
  • Performance benchmarks for each phase

Contributing

All 5 phases are now implemented and verified working. Contributions welcome for:

  • Bug fixes
  • Performance improvements
  • Documentation
  • Test cases
  • Phase 6 implementation (production monitoring, parameter tuning)

See "What's Coming Next" section for planned features.


Key Resources & Documentation

📚 Documentation Files

🧪 Test & Diagnostic Files

🎯 Quick Performance Reference

Metric Value Status
Long-term Recall (1000 turns) 100% ✅ Production-ready
Context Recall 80.1% ✅ High quality
Extraction F1 89.5% ✅ High precision/recall
Mean Processing Latency 575ms ✅ Sub-second
Mean Retrieval Latency 294ms ✅ Consistent
Throughput 1.74 turns/sec ✅ High-volume capable
LLM Call Reduction 87% ✅ Cost-efficient
API Scalability 4-key rotation ✅ 400k tokens/day

🚀 Recent Improvements Summary

  • Hybrid Retrieval: Dual-branch architecture (semantic + recency) → 100% long-term recall
  • 5-Signal Ranking: Rebalanced weights → 80.1% context recall (up from 68.3%)
  • Multi-Key Rotation: 4 API keys → 400k tokens/day capacity (4x scalability)
  • 1000-Turn Validation: Production testing → consistent sub-second latency at scale

📊 For detailed analysis, benchmarks, and before/after comparisons, see RESULTS_FEBRUARY_2026.md


The memory system is validated and ready for production deployment with:

  • ✅ 100% long-term recall (10-1000 turn validation)
  • ✅ Sub-second latency (575ms mean processing)
  • ✅ High throughput (1.74 turns/second sustained)
  • ✅ Cost efficiency (87% reduction in LLM calls)
  • ✅ Scalability (multi-key rotation, 400k tokens/day)
  • ✅ Production testing (1000-turn validation completed)
  • ✅ Comprehensive monitoring (latency, recall, extraction metrics)

Last Updated: February 13, 2026

About

A production-grade memory system for AI agents that enables accurate recall across 1,000+ conversation turns.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages