MemSLM is a local, stage-auditable long-conversation QA system for studying how structured memory processing improves answer quality over model-only and naive RAG baselines under local 8B constraints.
Under these constraints, the current mainline improves judged answer accuracy from 15% to 45% on the LongMemEval Diagnostic Split and from 0% to 30% on the LongMemEval Held-Out Matched Split.
The active mainline pipeline is:
```
mid retrieval -> evidence filter -> claims -> light graph -> toolkit -> final 8B answer
```
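Each stage emits its own artifact, which is what makes stage-level auditing possible. As a rough sketch of this layout (the function names, signatures, and `StageArtifacts` container below are hypothetical, not the actual `llm_long_memory/memory/` API):

```python
# Hypothetical sketch of the stage layout; names and signatures are
# illustrative, not the actual llm_long_memory/memory/ API.
from dataclasses import dataclass, field

@dataclass
class StageArtifacts:
    """One artifact per stage, so signal loss can be localized per hop."""
    retrieved: list = field(default_factory=list)  # mid retrieval (recall-oriented)
    filtered: list = field(default_factory=list)   # evidence filter (conservative)
    claims: list = field(default_factory=list)     # grounded claims
    graph: list = field(default_factory=list)      # light-graph edges

def run_pipeline(question, sessions, stages):
    """Run the stages in order, keeping every intermediate output."""
    art = StageArtifacts()
    art.retrieved = stages["retrieve"](question, sessions)
    art.filtered = stages["filter"](question, art.retrieved)
    art.claims = stages["claims"](art.filtered)
    art.graph = stages["light_graph"](art.claims)
    # The toolkit stage consumes only the graph, never raw retrieval output.
    answer = stages["toolkit_answer"](question, art.graph)
    return answer, art
```

The dict-of-stages shape in the sketch is only to make the point that any stage can be swapped or disabled for an ablation while the per-stage artifacts stay comparable.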
This repository is organized as a research-grade engineering codebase:
- one active runtime path
- explicit stage artifacts
- reproducible evaluation runners
- stage-wise audit and visualization support
- exploratory ideas isolated under `future_work/`
- shared experiment launch helpers and stable CLI entrypoints
Capabilities:
- Local 8B end-to-end QA with stage-level inspection
- Four-way evaluation protocol: `model-only`, `naive rag`, `memslm`, `filter-only ablation`
- Two thesis splits for current mainline reporting: LongMemEval Diagnostic Split and LongMemEval Held-Out Matched Split
- Stage-wise answerability, latency, and noise-density analysis
- Combined light-graph visualization across all questions in a split
Design principles:
- retrieval is recall-oriented
- filtering is conservative (see the sketch after this list)
- claims preserve grounded support structure
- the light graph is an organizer, not an answer oracle
- toolkit reasoning consumes only the graph output
- evaluation should expose where signal is lost, not just whether the final answer is wrong
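To make "filtering is conservative" concrete, here is a minimal sketch; the lexical-overlap scoring and threshold are invented for illustration and are not the codebase's actual filter:

```python
# Illustrative conservative filter: drop a passage only when it is clearly
# irrelevant, and never return an empty evidence set. Scoring is invented.
def conservative_filter(question: str, passages: list[str],
                        keep_threshold: float = 0.1) -> list[str]:
    q_tokens = set(question.lower().split())
    kept = []
    for p in passages:
        overlap = len(q_tokens & set(p.lower().split())) / max(len(q_tokens), 1)
        if overlap >= keep_threshold:
            kept.append(p)
    # Conservative fallback: losing an answer-bearing passage is worse than
    # carrying noise, since later stages can still discard irrelevant claims.
    return kept or passages
```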
More detail:
All `memslm` runs below use:
- answer model: `qwen3:8b`
- judge model: `deepseek-r1:8b`
LongMemEval Diagnostic Split:

| Method | Accuracy | Avg Latency (s) | Answer Density | Noise Density |
|---|---|---|---|---|
| `model-only` | 0.15 | 17.28 | 0.0795 | 0.9205 |
| `naive rag` | 0.10 | 17.51 | 0.0795 | 0.9205 |
| `memslm` | 0.45 | 31.29 | 0.0977 | 0.9023 |
| `filter-only ablation` | 0.15 | 6.74 | 0.1011 | 0.8989 |
LongMemEval Held-Out Matched Split:

| Method | Accuracy | Avg Latency (s) | Answer Density | Noise Density |
|---|---|---|---|---|
| `model-only` | 0.00 | 21.76 | 0.0256 | 0.9744 |
| `naive rag` | 0.10 | 20.76 | 0.0404 | 0.9596 |
| `memslm` | 0.30 | 35.52 | 0.0593 | 0.9407 |
| `filter-only ablation` | 0.15 | 6.90 | 0.0561 | 0.9439 |
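One reading aid: in every row of both tables, answer density and noise density sum to 1.0, so noise density is simply the complement of answer density (consistent with the two columns partitioning retrieved context into answer-bearing and non-answer-bearing mass). A quick check:

```python
# Spot-check the complement relation using rows from the tables above.
rows = {
    "diagnostic/model-only": (0.0795, 0.9205),
    "diagnostic/memslm": (0.0977, 0.9023),
    "held-out/model-only": (0.0256, 0.9744),
    "held-out/memslm": (0.0593, 0.9407),
}
for name, (answer_density, noise_density) in rows.items():
    assert abs(answer_density + noise_density - 1.0) < 1e-9, name
print("noise density = 1 - answer density in all sampled rows")
```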
Interpretation:
- `memslm` is the strongest system on both splits
- `filter-only ablation` is the fastest mainline-compatible ablation
- the held-out split is materially harder than the diagnostic split
Detailed reports:
- Diagnostic comparison report
- Held-out comparison report
- Extended results index, including per-type tables
These runs are not part of the core two-split, four-way comparison grid above. They are intended as external-validity checks for robustness under changed evaluation conditions.
Setup:
- answer model: `deepseek-r1:8b`
- judge model: `qwen3:8b`
- split: LongMemEval Held-Out Matched Split
Result:
- `final_answer_acc = 0.30`
- `avg_latency_sec = 36.08`
Artifacts:
Interpretation:
- the framework still runs coherently when the answer model and judge model are exchanged
- this check supports robustness of the evaluation setup, but does not outperform the main `qwen3:8b -> deepseek-r1:8b` configuration
Setup:
- split: LoCoMo Matched-Distribution 20-QA Subset
- answer model: `qwen3:8b`
- judge model: `deepseek-r1:8b`
Results:
| Method | Accuracy | Avg Latency (s) |
|---|---|---|
| `model-only` | 0.05 | 21.34 |
| `memslm` | 0.15 | 42.04 |
Artifacts:
Interpretation:
- the pipeline transfers across dataset format and domain without code-path replacement
- `memslm` still improves over `model-only` on the external dataset
- performance is materially lower than on LongMemEval, so this run should be read as a stress test for external generalization rather than a headline benchmark
These heatmaps give the quickest type-level comparison across the four evaluation settings:
- `memslm` is strongest on `knowledge-update` and `single-session-user` in the diagnostic split
- the held-out split remains harder overall, but `memslm` still improves on `multi-session`, `single-session-preference`, and `single-session-user`
These figures are the most useful stage-wise diagnostics in the current thesis workflow because they show both:
- where answer-bearing signal survives across stages
- where the runtime cost concentrates across question types
The light graph is strongest as:
- an intermediate structural representation
- a debugging surface
- a compact summary of support relations across questions
It should not be interpreted as a standalone answer engine.
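For intuition, a light graph can be pictured as a small claim-level structure along these lines (a hypothetical shape with invented field names, not the repository's actual schema):

```python
# Hypothetical light-graph shape: nodes are grounded claims, edges record
# support relations. Field names are invented for illustration.
light_graph = {
    "nodes": [
        {"id": "c1", "claim": "User adopted a dog in March.", "session": 3},
        {"id": "c2", "claim": "The dog is named Biscuit.", "session": 7},
    ],
    "edges": [
        {"src": "c2", "dst": "c1", "relation": "elaborates"},
    ],
}

# As a debugging surface, walking the edges shows which claims support which.
for e in light_graph["edges"]:
    print(f'{e["src"]} -[{e["relation"]}]-> {e["dst"]}')
```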
```
llm_long_memory/
  config/        Runtime and evaluation configuration
  evaluation/    Eval loops, metrics, reporting, SQLite persistence
  experiments/   Main experiment runners and exporters
  future_work/   Isolated exploratory prototypes
  llm/           Local LLM wrappers
  memory/        Active mainline runtime path
  scripts/       Audit and utility entrypoints
  tests/         Unit and integration-style tests
  utils/         Shared helpers
docs/
  assets/        Stable figures referenced by repository documentation
```
Important boundary:
- `llm_long_memory/memory/` is the active runtime
- `llm_long_memory/experiments/` is the active evaluation/reporting surface
- `llm_long_memory/future_work/` is intentionally isolated from the mainline
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

Mainline experiments assume local Ollama-compatible models such as:

- `qwen3:8b`
- `deepseek-r1:8b`
- `nomic-embed-text`
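Before running experiments, it can help to confirm these models are actually pulled. An optional sanity check, assuming a default local Ollama server on port 11434 (its `/api/tags` endpoint lists locally available models):

```python
# Optional: verify required Ollama models are pulled locally.
import json
import urllib.request

REQUIRED = {"qwen3:8b", "deepseek-r1:8b", "nomic-embed-text"}

with urllib.request.urlopen("http://localhost:11434/api/tags") as resp:
    local = {m["name"] for m in json.load(resp)["models"]}

# Prefix match, since Ollama may report names with a ":latest" tag suffix.
missing = {r for r in REQUIRED if not any(n.startswith(r) for n in local)}
if missing:
    print("Missing models; run `ollama pull <name>` for:", sorted(missing))
```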
Runtime configuration:
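The configuration keys themselves live in `llm_long_memory/config/config.yaml`; a quick way to inspect them without assuming their names (requires PyYAML):

```python
# Print the top-level keys of the runtime config consumed by the runners.
import yaml

with open("llm_long_memory/config/config.yaml") as f:
    cfg = yaml.safe_load(f)

print(sorted(cfg))
```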
For everyday repository maintenance:

```bash
make compile
make test
make eval-memslm SPLIT=longmemeval_diagnostic
```

The complete paper-facing reproduction protocol is documented in:
MemSLM mainline evaluation:

```bash
python3 -m llm_long_memory.experiments.run_thesis_eval \
  --config llm_long_memory/config/config.yaml \
  --split longmemeval_diagnostic \
  --model qwen3:8b \
  --judge \
  --judge-model deepseek-r1:8b
```

Model-only baseline:

```bash
python3 -m llm_long_memory.experiments.run_model_only_eval \
  --config llm_long_memory/config/config.yaml \
  --split longmemeval_diagnostic \
  --model qwen3:8b
```

Naive RAG baseline:

```bash
python3 -m llm_long_memory.experiments.run_naive_rag_eval \
  --config llm_long_memory/config/config.yaml \
  --split longmemeval_diagnostic \
  --model qwen3:8b
```

Filter-only ablation:

```bash
python3 -m llm_long_memory.experiments.run_ablation_eval \
  --config llm_long_memory/config/config.yaml \
  --split longmemeval_diagnostic \
  --model qwen3:8b
```

Four-way comparison report:

```bash
python3 -m llm_long_memory.experiments.run_thesis_compare \
  --config llm_long_memory/config/config.yaml \
  --split longmemeval_diagnostic \
  --judge \
  --judge-model deepseek-r1:8b \
  --model-only-run-id <run_id> \
  --naive-rag-run-id <run_id> \
  --memslm-run-id <run_id> \
  --ablation-run-id <run_id>
```

Answer-source audit:

```bash
PYTHONPATH=. python3 llm_long_memory/scripts/run_answer_source_audit.py \
  --config llm_long_memory/config/config.yaml \
  --dataset llm_long_memory/data/raw/LongMemEval/longmemeval_ragdebug10_rebuilt.json \
  --output-dir llm_long_memory/data/processed/thesis_reports_debug_analysis \
  --output-prefix answer_source_audit_longmemeval_diagnostic_memslm \
  --enable-evidence-filter \
  --enable-evidence-claims \
  --enable-evidence-light-graph
```

Light-graph export:

```bash
python3 -m llm_long_memory.experiments.export_graph \
  --audit-json <audit_json_path> \
  --output-dir llm_long_memory/data/graphs_thesis_debug_analysis \
  --artifact-prefix longmemeval_diagnostic_memslm_light_graph
```

More experiment entry points:
For paper-ready tables, figures, captions, and source paths:
Two additional experiments are intentionally kept outside the current two-split main comparison:
- swapping the answer model and judge model roles between `qwen3:8b` and `deepseek-r1:8b`
- evaluating the framework on LoCoMo
These are intended as generalization checks rather than part of the main thesis comparison grid.
MemSLM should be read as:
- a local-memory research platform
- a stage-auditable retrieval-and-structure system
- a thesis-grade codebase focused on reproducibility and diagnosis
It should not be read as:
- a production assistant
- a fully self-improving learning system
- a proof that graph structure always dominates filtered retrieval
Implementation note: the current system is a research prototype. Some heuristic lexical resources are still maintained independently across modules to support rapid experiments and local tuning. A future engineering direction is to abstract a unified lexical-resource layer for stop words, intent labels, and the question-type schema, reducing maintenance cost.
Exploratory modules and negative-result directions are preserved under `llm_long_memory/future_work/`.
This keeps the mainline stable while preserving research continuity.
