Memento

Paper (PDF) | OpenMementos Dataset

Memento extends the effective output length of large language models by splitting chain-of-thought reasoning into blocks and summaries (memento). After each reasoning block, the model generates a short summary, then the block content is evicted from the KV cache. The model continues from the summary with a shorter context, enabling more reasoning within a fixed context window.

Special tokens

Token	Purpose
`<think>` / `</think>`	Reasoning wrapper
`<\|block_start\|>` / `<\|block_end\|>`	Reasoning block boundaries
`<\|summary_start\|>` / `<\|summary_end\|>`	Summary (memento block) boundaries

Repository layout

Directory	Description
`data/`	Data pipeline — converts raw CoT traces into the Memento format (block boundaries + summaries) for SFT training
`vllm/`	vLLM overlay — adds KV cache block masking to stock vLLM for efficient Memento inference

Quick start

Data pipeline

Convert chain-of-thought traces into Memento training data:

pip install -r data/requirements.txt
export OPENAI_API_KEY=sk-...   # or any OpenAI-compatible provider

cd data/pipeline
python run_full_pipeline.py \
    --input ../examples/example_trace.jsonl \
    --output-dir output/ \
    --model gpt-4o \
    --limit 1

See data/README.md for full documentation.

vLLM inference with block masking

Step 1: Set Up the Environment Build a customized vllm with block masking support:
```
pip install vllm==0.13.0
cd vllm
bash install_overlay.sh
```

Step 2: Serve a Memento Model with KV Cache Compaction To expose the model through an API-compatible server, run:

python -m vllm.entrypoints.openai.api_server \
    --model /path/to/memento-checkpoint \
    --served-model-name memento \
    --port 8010 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code \
    --chat-template chat_templates/memento_nosys.jinja \
    --block-masking-config '{
        "enable": true,
        "keep_last_n_blocks": 0,
        "mask_delimiters": false,
        "compact_on_summary_end": true,
        "require_assistant_section": true,
        "debug": true
    }'

See vllm/README.md for full documentation, including API usage and alternative setup options.

License

This project is licensed under the MIT License.

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
.github/workflows		.github/workflows
blogpost		blogpost
data		data
docs		docs
vllm		vllm
LICENSE		LICENSE
README.md		README.md
SECURITY.md		SECURITY.md
memento.pdf		memento.pdf

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Memento

Special tokens

Repository layout

Quick start

Data pipeline

vLLM inference with block masking

License

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Memento

Special tokens

Repository layout

Quick start

Data pipeline

vLLM inference with block masking

License

About

Resources

License

Code of conduct

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages