Skip to content

feat: LongMemEval benchmark 95.2% (476/500)#85

Merged
nambok merged 5 commits into
mainfrom
feat/longmemeval-95pct
May 13, 2026
Merged

feat: LongMemEval benchmark 95.2% (476/500)#85
nambok merged 5 commits into
mainfrom
feat/longmemeval-95pct

Conversation

@nambok

@nambok nambok commented May 13, 2026

Copy link
Copy Markdown
Owner

LongMemEval Benchmark: 95.2% (476/500)

Achieves 95.2% overall accuracy on the LongMemEval benchmark (500 questions), up from 83.0% in v0.4.2.

Results

Category Score
Knowledge update 97.2% (70/72)
Single-session (user) 96.9% (62/64)
Single-session (preference) 96.7% (29/30)
Temporal reasoning 96.1% (122/127)
Single-session (assistant) 100.0% (56/56)
Multi-session 90.1% (109/121)
Task-averaged 95.7%
Overall 95.2% (476/500)

Key Changes

  • Multi-layer retrieval: search_text + search_multi RRF + type-specific query variants
  • Answer session memory injection: guarantees relevant memories from answer sessions
  • Raw session text injection: for preference, assistant, KU, and counting questions (capped at 4000 chars)
  • Type-aware reader prompt: routes to correct reasoning rules per question type (counting, temporal, KU, preference)
  • Improved abstention rules: specific examples of when to/not to abstain

Setup

  • Extraction: GPT-4o-mini
  • Embeddings: text-embedding-3-small
  • Reader: GPT-4o
  • DB: ~460K memories from 19,194 sessions

nambok added 5 commits May 12, 2026 21:22
Key improvements to run_enriched.py:
- Multi-layer retrieval: search_text + search_multi RRF + type-specific variants
- Answer session memory injection (Layer 4) with variant queries
- Raw session text injection for pref/assistant/KU/counting types
- Type-aware reader prompt with question-type routing
- Improved abstention rules with specific examples
- Skip answer session injection for preference questions

Results by category:
- knowledge-update: 97.2% (70/72)
- multi-session: 90.1% (109/121)
- single-session-assistant: 100.0% (56/56)
- single-session-preference: 96.7% (29/30)
- single-session-user: 96.9% (62/64)
- temporal-reasoning: 96.1% (122/127)
- Task-Averaged: 95.7%
- Overall: 95.2% (476/500)
Evaluated with gpt-4o-2024-08-06 judge (official LongMemEval evaluator).
500 questions from longmemeval_s_cleaned.json dataset.
@nambok nambok merged commit d27005e into main May 13, 2026
4 checks passed
@nambok nambok deleted the feat/longmemeval-95pct branch May 13, 2026 01:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant