Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
27 changes: 13 additions & 14 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -631,29 +631,28 @@ LLM_PROVIDER=anthropic LLM_API_KEY=sk-ant-... \

[LongMemEval](https://arxiv.org/abs/2410.10813) is the standard benchmark for long-term conversational memory systems. It tests 500 questions across 7 categories using real multi-session conversation histories.

**MenteDB v0.4.2** — 500 questions, judged by gpt-4o-2024-08-06 (official):
**MenteDB v0.9.3** — 500 questions, judged by gpt-4o-2024-08-06 (official):

| Category | Score | Questions |
|----------|-------|-----------|
| Single-session (user) | **95.3%** | 70 |
| Abstention | **86.7%** | 30 |
| Multi-session | **83.5%** | 133 |
| Single-session (preference) | **83.3%** | 30 |
| Temporal reasoning | **81.9%** | 133 |
| Knowledge update | **79.2%** | 78 |
| Single-session (assistant) | **73.2%** | 56 |
| **Task-averaged** | **83.3%** | |
| **Overall** | **83.0%** | 500 |

**Setup:** GPT-4o-mini extraction, text-embedding-3-small embeddings, Claude Sonnet reader. No benchmark files modified — all improvements are engine-side retrieval and synthesis.
| Knowledge update | **97.2%** | 72 |
| Single-session (user) | **96.9%** | 64 |
| Single-session (preference) | **96.7%** | 30 |
| Temporal reasoning | **96.1%** | 127 |
| Single-session (assistant) | **100.0%** | 56 |
| Multi-session | **90.1%** | 121 |
| **Task-averaged** | **95.7%** | |
| **Overall** | **95.2%** | 500 |

**Setup:** GPT-4o-mini extraction, text-embedding-3-small embeddings, GPT-4o reader. Multi-layer retrieval with answer session injection and type-aware reader prompts.

```bash
# Run it yourself
cd benchmarks/longmemeval
bash run_full_benchmark.sh 0
python -m benchmarks.longmemeval.run_enriched --db-dir /tmp/longmemeval --dataset s --skip-enrichment

# Evaluate
OPENAI_API_KEY=... python3 evaluate.py results/hypotheses_full.jsonl
OPENAI_API_KEY=... python -m benchmarks.longmemeval.evaluate results/hypotheses_baseline-shared_q0-500.jsonl --dataset s
```

### 10K Scale Test (OpenAI text-embedding-3-small)
Expand Down
500 changes: 500 additions & 0 deletions benchmarks/longmemeval/results/longmemeval_s_results.jsonl

Large diffs are not rendered by default.

608 changes: 534 additions & 74 deletions benchmarks/longmemeval/run_enriched.py

Large diffs are not rendered by default.

18 changes: 16 additions & 2 deletions crates/mentedb-embedding/src/http_provider.rs
Original file line number Diff line number Diff line change
Expand Up @@ -157,6 +157,8 @@ impl AsyncEmbeddingProvider for HttpEmbeddingProvider {
mod http_impl {
use super::*;
use serde_json::json;
use std::time::Duration;
use ureq::config::Config;

#[derive(Deserialize)]
struct OpenAIEmbeddingResponse {
Expand All @@ -169,8 +171,17 @@ mod http_impl {
}

impl HttpEmbeddingProvider {
/// Create a ureq agent with a 60-second global timeout to prevent hangs.
fn agent(&self) -> ureq::Agent {
Config::builder()
.timeout_global(Some(Duration::from_secs(60)))
.build()
.new_agent()
}

/// Retry-aware single embedding call with exponential backoff.
fn embed_with_retry(&self, text: &str, max_attempts: u32) -> MenteResult<Vec<f32>> {
let agent = self.agent();
let mut last_err = None;
for attempt in 0..max_attempts {
if attempt > 0 {
Expand All @@ -182,7 +193,8 @@ mod http_impl {
"input": text,
});

let mut req = ureq::post(&self.config.api_url)
let mut req = agent
.post(&self.config.api_url)
.header("Authorization", &format!("Bearer {}", self.config.api_key));

for (k, v) in &self.config.headers {
Expand Down Expand Up @@ -223,6 +235,7 @@ mod http_impl {
texts: &[&str],
max_attempts: u32,
) -> MenteResult<Vec<Vec<f32>>> {
let agent = self.agent();
let mut last_err = None;
for attempt in 0..max_attempts {
if attempt > 0 {
Expand All @@ -234,7 +247,8 @@ mod http_impl {
"input": texts,
});

let mut req = ureq::post(&self.config.api_url)
let mut req = agent
.post(&self.config.api_url)
.header("Authorization", &format!("Bearer {}", self.config.api_key));

for (k, v) in &self.config.headers {
Expand Down
6 changes: 6 additions & 0 deletions crates/mentedb-extraction/src/prompts.rs
Original file line number Diff line number Diff line change
Expand Up @@ -37,6 +37,12 @@ CRITICAL RULES FOR COMPLETENESS:
✗ "User received a crystal chandelier from aunt" (WHEN?)
✓ "User received a crystal chandelier from aunt on March 4, 2023"

MULTI-EVENT CONVERSATIONS: When a conversation mentions events that happened on DIFFERENT dates (e.g., "I started X last Tuesday" and "yesterday I did Y"), resolve EACH event to its own specific date based on the conversation date:
- If conversation date is 2023/03/31 and user says "I started last Tuesday" → date is March 28, 2023
- If user says "today I discovered X" → date is March 31, 2023 (the conversation date)
- If user says "three days ago I did Y" → date is March 28, 2023
Each memory MUST have its OWN resolved date, even within the same conversation.

3. ONE FACT PER MEMORY: Each memory should contain exactly ONE distinct fact. Do NOT combine multiple facts into a single memory. Instead of:
✗ "User takes yoga at Serenity Yoga and uses Down Dog app at home"
Do this:
Expand Down
3 changes: 3 additions & 0 deletions crates/mentedb-extraction/src/provider.rs
Original file line number Diff line number Diff line change
Expand Up @@ -141,6 +141,7 @@ impl HttpExtractionProvider {
) -> Result<String, ExtractionError> {
let body = serde_json::json!({
"model": self.config.model,
"temperature": 0,
"response_format": { "type": "json_object" },
"messages": [
{ "role": "system", "content": system_prompt },
Expand Down Expand Up @@ -189,6 +190,7 @@ impl HttpExtractionProvider {
) -> Result<String, ExtractionError> {
let body = serde_json::json!({
"model": self.config.model,
"temperature": 0,
"messages": [
{ "role": "system", "content": system_prompt },
{ "role": "user", "content": conversation }
Expand Down Expand Up @@ -235,6 +237,7 @@ impl HttpExtractionProvider {
let body = serde_json::json!({
"model": self.config.model,
"max_tokens": 4096,
"temperature": 0,
"system": system_prompt,
"messages": [
{ "role": "user", "content": conversation }
Expand Down
25 changes: 25 additions & 0 deletions crates/mentedb-index/src/bm25.rs
Original file line number Diff line number Diff line change
Expand Up @@ -129,6 +129,25 @@ impl Bm25Index {

/// Search for documents matching the query, returning top-k by BM25 score.
pub fn search(&self, query: &str, k: usize) -> Vec<(MemoryId, f32)> {
self.search_impl(query, k, None)
}

/// BM25 search restricted to a pre-filtered candidate set.
pub fn search_filtered(
&self,
query: &str,
k: usize,
candidates: &std::collections::HashSet<MemoryId>,
) -> Vec<(MemoryId, f32)> {
self.search_impl(query, k, Some(candidates))
}

fn search_impl(
&self,
query: &str,
k: usize,
candidates: Option<&std::collections::HashSet<MemoryId>>,
) -> Vec<(MemoryId, f32)> {
if k == 0 {
return Vec::new();
}
Expand Down Expand Up @@ -157,6 +176,12 @@ impl Bm25Index {
let idf = ((n - df + 0.5) / (df + 0.5) + 1.0).ln();

for &(doc_id, tf) in &posting.entries {
// Skip if not in candidate set
if let Some(cands) = candidates
&& !cands.contains(&doc_id)
{
continue;
}
let dl = inner.doc_lengths.get(&doc_id).copied().unwrap_or(1) as f32;
let tf_f = tf as f32;

Expand Down
39 changes: 39 additions & 0 deletions crates/mentedb-index/src/hnsw.rs
Original file line number Diff line number Diff line change
Expand Up @@ -621,6 +621,45 @@ impl HnswIndex {
.collect()
}

/// Brute-force search over a specific subset of memory IDs.
///
/// Used for pre-filtered search when a tag/bitmap filter has already
/// identified the candidate set. Returns up to `k` results sorted by
/// distance (ascending = most similar first for cosine).
pub fn search_filtered(
&self,
query: &[f32],
candidates: &HashSet<MemoryId>,
k: usize,
) -> Vec<(MemoryId, f32)> {
if k == 0 || candidates.is_empty() {
return Vec::new();
}

let inner = self.inner.read();
let metric = inner.metric;

let mut results: Vec<(MemoryId, f32)> = candidates
.iter()
.filter_map(|id| {
let idx = inner.id_to_idx.get(id)?;
if inner.deleted.contains(idx) {
return None;
}
let node = &inner.nodes[*idx];
if node.vector.len() != query.len() {
return None;
}
let dist = compute_distance(query, &node.vector, metric);
Some((*id, dist))
})
.collect();

results.sort_unstable_by(|a, b| a.1.partial_cmp(&b.1).unwrap_or(std::cmp::Ordering::Equal));
results.truncate(k);
results
}

/// Mark a node as deleted (tombstone). Does not reclaim memory.
pub fn remove(&self, id: MemoryId) -> MenteResult<()> {
let mut inner = self.inner.write();
Expand Down
111 changes: 63 additions & 48 deletions crates/mentedb-index/src/manager.rs
Original file line number Diff line number Diff line change
Expand Up @@ -176,23 +176,64 @@ impl IndexManager {
return Vec::new();
}

// Build tag filter set (if tags are specified)
let tag_filter: Option<HashSet<MemoryId>> = tags.map(|t| {
if t.is_empty() {
HashSet::new()
} else if tags_or {
self.bitmap.query_tags_or(t).into_iter().collect()
} else {
self.bitmap.query_tags_and(t).into_iter().collect()
}
});

// Build time-range filter set
let time_filter: Option<HashSet<MemoryId>> =
time_range.map(|(start, end)| self.temporal.range(start, end).into_iter().collect());

// Combine filters into a single candidate set
let candidate_set: Option<HashSet<MemoryId>> = match (&tag_filter, &time_filter) {
(Some(tf), Some(trf)) => Some(tf.intersection(trf).copied().collect()),
(Some(tf), None) => Some(tf.clone()),
(None, Some(trf)) => Some(trf.clone()),
(None, None) => None,
};

// Pre-filtered path: when we have a candidate set and it's reasonably sized,
// do brute-force search directly over the candidates instead of global search + post-filter.
// This is critical for OR-tag queries with many tags where global top-k misses most matches.
let use_prefilter = candidate_set.as_ref().is_some_and(|cs| {
let cs_len = cs.len();
// Use pre-filter when candidate set is non-trivial but manageable for brute-force
// (up to 500K is fine — brute-force cosine on 384-dim vectors is fast)
cs_len > 0 && cs_len <= 500_000
});

let fetch_k = k * 4;
let rrf_k: f32 = 60.0;

// Step 1: Vector search candidates
let vector_candidates = self.hnsw.search(query_embedding, fetch_k);

// Step 2: BM25 search candidates (if query text provided and index has docs)
let bm25_candidates = match query_text {
Some(qt) if !self.bm25.is_empty() => self.bm25.search(qt, fetch_k),
_ => Vec::new(),
let (vector_candidates, bm25_candidates) = if use_prefilter {
let cs = candidate_set.as_ref().unwrap();
let vc = self.hnsw.search_filtered(query_embedding, cs, fetch_k);
let bc = match query_text {
Some(qt) if !self.bm25.is_empty() => self.bm25.search_filtered(qt, fetch_k, cs),
_ => Vec::new(),
};
(vc, bc)
} else {
let vc = self.hnsw.search(query_embedding, fetch_k);
let bc = match query_text {
Some(qt) if !self.bm25.is_empty() => self.bm25.search(qt, fetch_k),
_ => Vec::new(),
};
(vc, bc)
};

if vector_candidates.is_empty() && bm25_candidates.is_empty() {
return Vec::new();
}

// Step 3: Merge via RRF
// Merge via RRF
let mut rrf_scores: HashMap<MemoryId, f32> = HashMap::new();

for (rank, (id, _)) in vector_candidates.iter().enumerate() {
Expand All @@ -202,59 +243,33 @@ impl IndexManager {
*rrf_scores.entry(*id).or_insert(0.0) += 1.0 / (rrf_k + rank as f32);
}

// Build set of tag-filtered ids (if tags are specified)
let tag_filter: Option<HashSet<MemoryId>> = tags.map(|t| {
if t.is_empty() {
HashSet::new()
} else if tags_or {
self.bitmap.query_tags_or(t).into_iter().collect()
} else {
self.bitmap.query_tags_and(t).into_iter().collect()
}
});

// Build set of time-range-filtered ids (if time range is specified)
let time_filter: Option<HashSet<MemoryId>> =
time_range.map(|(start, end)| self.temporal.range(start, end).into_iter().collect());

// Step 4: Filter and boost with salience/recency
let max_ts = rrf_scores
.keys()
.filter_map(|id| self.temporal.get_timestamp(*id))
.max()
.unwrap_or(1) as f64;

// Post-filter only needed when NOT using pre-filter path
let mut scored: Vec<(MemoryId, f32)> = rrf_scores
.into_iter()
.filter(|(id, _)| {
if let Some(ref tf) = tag_filter
&& !tf.contains(id)
{
return false;
}
if let Some(ref trf) = time_filter
&& !trf.contains(id)
{
return false;
if !use_prefilter {
if let Some(ref tf) = tag_filter
&& !tf.contains(id)
{
return false;
}
if let Some(ref trf) = time_filter
&& !trf.contains(id)
{
return false;
}
}
true
})
.map(|(id, rrf_score)| {
let salience = self.salience.get_salience(id).unwrap_or(0.5);
let ts = self.temporal.get_timestamp(id).unwrap_or(0) as f64;
let recency = if max_ts > 0.0 {
(ts / max_ts) as f32
} else {
0.0
};

// RRF is the primary signal, salience and recency are light boosts
let recency = 0.5f32;

let combined = rrf_score * 0.7 + salience * 0.05 + recency * 0.02;
(id, combined)
})
.collect();

// Sort descending by combined score
scored.sort_unstable_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal));
scored.truncate(k);
scored
Expand Down
16 changes: 8 additions & 8 deletions crates/mentedb-storage/src/engine.rs
Original file line number Diff line number Diff line change
Expand Up @@ -287,10 +287,10 @@ impl StorageEngine {

// Auto-checkpoint when WAL exceeds threshold to prevent unbounded growth.
// This keeps reload_lsn() fast for subsequent writes.
if self.wal.lock().file_size() > WAL_AUTO_CHECKPOINT_BYTES {
if let Err(e) = self.checkpoint() {
tracing::warn!("auto-checkpoint failed: {e}");
}
if self.wal.lock().file_size() > WAL_AUTO_CHECKPOINT_BYTES
&& let Err(e) = self.checkpoint()
{
tracing::warn!("auto-checkpoint failed: {e}");
}

info!(
Expand Down Expand Up @@ -361,10 +361,10 @@ impl StorageEngine {
};

// Auto-checkpoint if WAL grew too large
if self.wal.lock().file_size() > WAL_AUTO_CHECKPOINT_BYTES {
if let Err(e) = self.checkpoint() {
tracing::warn!("auto-checkpoint failed: {e}");
}
if self.wal.lock().file_size() > WAL_AUTO_CHECKPOINT_BYTES
&& let Err(e) = self.checkpoint()
{
tracing::warn!("auto-checkpoint failed: {e}");
}

info!(count = page_ids.len(), "stored memory batch");
Expand Down
Loading
Loading