Add context management and compression by bubbltaco · Pull Request #34 · p-e-w/waidrin

bubbltaco · 2025-08-10T04:06:26Z

Here's a first draft of my changes for context management and context compression. These changes should mitigate the problem of maxing out a model's context length as the story progresses. To manage context, we adopt a recursive summarization strategy: we first try replacing each individual event with a summary that strips out extraneous prose and dialogue. Then, as the context fills up further, we replace entire scenes with a summary.

I've tested the summarization a bit and it generally seems to be working, but I've not extensively tested the context compression yet. I want to get some validation of my general approach before I spend time with more thorough testing.

Summary of changes:

- Update NarrationEvent to store tokens, a summary, and summary tokens.
- Update LocationChangeEvent to store tokens, a summary, and summary tokens.
- Backends now need to implement a getContextLength method that returns the maximum context size we can use with that backend.
  - For DefaultBackend, we try the OpenAI-compatible /models endpoint and fall back to a default context length (64k for now) if it is not found. We cache the value as an instance variable in the DefaultBackend class so it doesn't need to be fetched each time.
- After every narration event, we summarize the event. After every location change, we summarize all events from the previous location change to the current one.
- We incorporate a token budget into prompts that require event-history context, and we use the context generation algorithm described below to generate and compress that context.
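The DefaultBackend bullet describes a context-length lookup with a fallback, cached per instance. A minimal TypeScript sketch under stated assumptions, not the PR's code: each `/models` entry is assumed to carry an optional `context_length` field (some providers add one; it is not part of OpenAI's own schema), the request is injected through a hypothetical `ModelsFetcher` so no network is needed, and "64k" is read as 64 × 1024 tokens.

```typescript
// Assumed response shape for an OpenAI-compatible GET /models call.
// `context_length` is NOT part of OpenAI's own schema; some providers add it.
interface ModelsResponse {
  data: { id: string; context_length?: number }[];
}

// Hypothetical injection point so the lookup can run without a network.
type ModelsFetcher = () => Promise<ModelsResponse>;

// "64k for now" per the PR description; 64 * 1024 is an assumed reading.
const DEFAULT_CONTEXT_LENGTH = 64 * 1024;

class DefaultBackend {
  private readonly model: string;
  private readonly fetchModels: ModelsFetcher;
  // Cached after the first completed lookup, so later calls skip /models.
  private contextLength: number | undefined = undefined;

  constructor(model: string, fetchModels: ModelsFetcher) {
    this.model = model;
    this.fetchModels = fetchModels;
  }

  async getContextLength(): Promise<number> {
    if (this.contextLength !== undefined) {
      return this.contextLength;
    }
    let length = DEFAULT_CONTEXT_LENGTH;
    try {
      const response = await this.fetchModels();
      const found = response.data.find((m) => m.id === this.model)?.context_length;
      if (typeof found === "number") {
        length = found;
      }
    } catch {
      // Endpoint missing, unreachable, or returning an error: keep the default.
    }
    this.contextLength = length; // the default is cached too
    return length;
  }
}
```

A failing request (for example, a server that answers "Model not found") simply yields the default, which is then cached like a real value.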

Context generation algorithm:

To make use of the summaries, we have an algorithm that constructs the context given a token budget. There are a few ways we could go about this. The simplest would be to progressively replace events in the context with their event summaries and, once everything in the context has been replaced by an event summary, start replacing the oldest scenes with their scene summaries. However, I think keeping the full text of newer events for as long as possible will lead to higher quality, so I came up with an algorithm that switches between event summaries and scene summaries as more and more compression is needed.

This is how it’s set up now:
At first, it replaces the oldest events with their event summaries. Once 50% of events have been replaced by event summaries, it goes back and starts replacing the oldest scenes with their scene summaries. Once the oldest 25% of events have been replaced by scene summaries, it switches back to event summaries and continues until up to 80% of the oldest events have been replaced. Then it switches back to scene summaries, and so on. If at any point the context fits in the token budget, it stops and returns that context. All of these switching thresholds can be easily modified.
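The alternating schedule can be sketched as follows. This is a TypeScript sketch, not the PR's code; the `Scene`/`StoryEvent` shapes, the `Plan` representation, and the reading of the thresholds are assumptions. Event thresholds count every event covered by any summary, scene thresholds count events covered by scene summaries, only completed scenes (all but the last) get scene summaries because those are produced on location change, and the last `SCHEDULE` entry stands in for the unspecified "and so on".

```typescript
interface StoryEvent {
  tokens: number; // full narration text
  summaryTokens: number; // event summary
}

interface Scene {
  events: StoryEvent[];
  summaryTokens: number; // scene summary (unused for the current, last scene)
}

// One schedule step: which summary type to apply, and the fraction of all
// events at which to stop and move on to the next step.
interface Phase {
  kind: "event" | "scene";
  upTo: number;
}

// First three thresholds from the PR description; the final step is assumed.
const SCHEDULE: Phase[] = [
  { kind: "event", upTo: 0.5 },
  { kind: "scene", upTo: 0.25 },
  { kind: "event", upTo: 0.8 },
  { kind: "scene", upTo: 1.0 },
];

// Per scene: "scene" (scene summary) or one flag per event (true = event summary).
type Plan = ("scene" | boolean[])[];

function alternatingCompress(scenes: Scene[], budget: number, schedule: Phase[] = SCHEDULE): Plan | null {
  const plan: Plan = scenes.map((s) => s.events.map(() => false));
  const totalEvents = scenes.reduce((n, s) => n + s.events.length, 0);
  const completed = scenes.length - 1; // the current scene has no scene summary yet

  const tokens = (): number => {
    let total = 0;
    for (let i = 0; i < scenes.length; i++) {
      const p = plan[i];
      if (p === "scene") {
        total += scenes[i].summaryTokens;
        continue;
      }
      for (let j = 0; j < p.length; j++) {
        const e = scenes[i].events[j];
        total += p[j] ? e.summaryTokens : e.tokens;
      }
    }
    return total;
  };

  // Share of events covered by any summary ("event") or by scene summaries ("scene").
  const covered = (kind: "event" | "scene"): number => {
    let count = 0;
    for (let i = 0; i < scenes.length; i++) {
      const p = plan[i];
      if (p === "scene") count += scenes[i].events.length;
      else if (kind === "event") count += p.filter(Boolean).length;
    }
    return totalEvents === 0 ? 1 : count / totalEvents;
  };

  if (tokens() <= budget) return plan;

  // All events, oldest first, as [sceneIndex, eventIndex] pairs.
  const order: [number, number][] = [];
  for (let i = 0; i < scenes.length; i++) {
    for (let j = 0; j < scenes[i].events.length; j++) order.push([i, j]);
  }

  for (const phase of schedule) {
    if (phase.kind === "event") {
      for (const [i, j] of order) {
        if (covered("event") >= phase.upTo) break;
        const p = plan[i];
        if (p === "scene" || p[j]) continue;
        p[j] = true;
        if (tokens() <= budget) return plan;
      }
    } else {
      for (let i = 0; i < completed && covered("scene") < phase.upTo; i++) {
        if (plan[i] === "scene") continue;
        plan[i] = "scene";
        if (tokens() <= budget) return plan;
      }
    }
  }
  return null; // nothing in the schedule made the context fit
}
```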

Possible extensions to summarization:

- Use a separate model, or at least separate parameters, for summarization.
- When summarizing a scene (delimited by location changes), a particularly long scene might have a large context by itself, so we could look into compressing context while generating the scene summary.

There might be some edge cases where summarization breaks down. For example, the user might stay in the same location for so long that this scene alone maxes out the context window.

p-e-w

Thanks for the PR! I'm a big fan of the overall code and comment quality 👍

Waidrin is a complex project and will only become more complex in the future; to keep things manageable, it's important to avoid unnecessary complexity wherever possible. For this reason, it's imperative that we build complexity bottom up, not top down. In other words, we should start with a simple solution, and add more complexity only when practical experience has shown that it is unavoidable for solving a specific problem.

Having multiple "compression strategies" is an interesting idea, but very hard to reason about. For now, please pare this PR down to the absolute minimum complexity required to make context management work at all. This is a major change, and even in its most basic form it will be difficult to verify its correctness.

bubbltaco · 2025-08-10T19:18:14Z

Thanks for the detailed review! I'm going to make the changes and also replace the compression algorithm with the simpler approach. I also had some questions; please look through those when you get the chance and let me know your thoughts.

bubbltaco · 2025-08-10T21:13:26Z

I've just replaced the context management algorithm with something much simpler. This one works in 4 steps:

Step 1: just try all events without any summarization

Step 2: replace oldest events with their summaries one by one until context fits under budget.

Step 3: replace oldest scenes with their summaries one by one until context fits under budget.

Step 4: If none of the above works, start removing the oldest scenes. This is just a last resort: by some back-of-the-napkin calculations, most scene summaries will be at most 300-400 tokens, so with a 50k context length the story would have to reach over a hundred scenes before Step 4 becomes relevant.

In the future we can replace step 4 with a less lossy alternative like summarizing multiple scenes or something.
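The four steps can be sketched in TypeScript as follows. The types are hypothetical stand-ins for the stored token counts, and two details are assumptions: scene summaries exist only for completed scenes (all but the last, since they are produced on location change), so Steps 3 and 4 never touch the current scene; and `null` marks the case where nothing fits.

```typescript
interface StoryEvent {
  tokens: number; // full narration text
  summaryTokens: number; // event summary made after the narration event
}

interface Scene {
  events: StoryEvent[];
  summaryTokens: number; // scene summary made on location change
}

// Per scene: dropped, replaced by its scene summary, or kept event by event
// (true = event summary, false = full text).
type SceneChoice = "removed" | "summary" | boolean[];

function sceneTokens(scene: Scene, choice: SceneChoice): number {
  if (choice === "removed") return 0;
  if (choice === "summary") return scene.summaryTokens;
  let total = 0;
  for (let j = 0; j < scene.events.length; j++) {
    const event = scene.events[j];
    total += choice[j] ? event.summaryTokens : event.tokens;
  }
  return total;
}

function fourStepContext(scenes: Scene[], budget: number): SceneChoice[] | null {
  const plan: SceneChoice[] = scenes.map((s) => s.events.map(() => false));
  const fits = () => scenes.reduce((sum, s, i) => sum + sceneTokens(s, plan[i]), 0) <= budget;
  const completed = scenes.length - 1; // scenes that already have a scene summary

  // Step 1: all events in full.
  if (fits()) return plan;
  // Step 2: replace the oldest events with their event summaries, one at a time.
  for (const choice of plan) {
    if (!Array.isArray(choice)) continue;
    for (let j = 0; j < choice.length; j++) {
      choice[j] = true;
      if (fits()) return plan;
    }
  }
  // Step 3: replace the oldest completed scenes with their scene summaries.
  for (let i = 0; i < completed; i++) {
    plan[i] = "summary";
    if (fits()) return plan;
  }
  // Step 4 (last resort): drop the oldest completed scenes.
  for (let i = 0; i < completed; i++) {
    plan[i] = "removed";
    if (fits()) return plan;
  }
  return null; // even the current scene, fully summarized, exceeds the budget
}
```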

p-e-w

I looked a bit deeper into the code today, and I must admit I fear the complexity of this system. This isn't necessarily a critique of your approach; it's the problem itself that is extremely complex. Add to that the fact that we're working with imperfect information (token counts) and that prompt building is tricky, and we're dealing with something that I foresee will be very difficult to test and reason about.

After thinking about this some more, I believe we should limit ourselves to compressing scenes (all events in one location) rather than scenes + narration events. The two-level approach is responsible for much of the complexity, even though event-level compression provides only a modest degree of compaction.

In practice, I expect that most players will be using a context length between 32k and 128k. If we assume that a typical scene is 5k-10k tokens, but compresses to 3-5 paragraphs (about 500 tokens), we can always keep at least 2-3 complete scenes in the context, while having space for dozens more in compressed form. That's equivalent to multiple novels' worth of narrative before we ever have to remove scenes, and we can always bring them back contextually, depending on, e.g., the characters involved.
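The arithmetic behind this estimate is easy to check with a small TypeScript helper using the figures above. The helper name is hypothetical, "k" is read as 1,000 for round numbers, and the system prompt, instructions, and any tokens reserved for output are ignored, so real capacity will be somewhat lower.

```typescript
// Rough capacity estimate: how many scene summaries fit alongside
// `fullScenes` uncompressed scenes? Ignores the system prompt, instructions,
// and any tokens reserved for the model's output.
function summaryCapacity(
  contextLength: number,
  fullScenes: number,
  sceneTokens: number,
  summaryTokens: number,
): number {
  const remaining = contextLength - fullScenes * sceneTokens;
  return remaining <= 0 ? 0 : Math.floor(remaining / summaryTokens);
}

// Figures from the comment, reading "k" as 1,000 for round numbers.
const atLowEnd = summaryCapacity(32_000, 2, 10_000, 500); // 24
const atHighEnd = summaryCapacity(128_000, 3, 10_000, 500); // 196
console.log(atLowEnd, atHighEnd);
```

Even at the low end, two full 10k-token scenes leave room for 24 summaries of 500 tokens. The same helper reproduces the earlier Step 4 estimate: `summaryCapacity(50_000, 0, 0, 400)` is 125 scenes.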

The algorithm could be similar to your current approach:

1. Put everything into the context. If it fits, that's the context used.
2. Replace scenes before the current one with their summaries, starting with the oldest one, until the result fits the context.
3. If that isn't enough, remove scene summaries, starting with the oldest one, until the result fits the context.
4. If that still isn't enough (i.e., the current scene doesn't fit the context even by itself), throw an error. We need to see this happen in practice to be able to decide what to do about it.
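A minimal TypeScript sketch of these four steps, assuming hypothetical `PastScene` records that hold the token counts of each completed scene's full text and summary, and that the current scene is always kept in full. Names and the error message are illustrative only.

```typescript
// A completed scene: token counts of its full text and of its summary.
interface PastScene {
  tokens: number; // introduction + all narration, in full
  summaryTokens: number; // summary produced on location change
}

type Choice = "full" | "summary" | "omitted";

function contextTokens(past: PastScene[], choices: Choice[], currentTokens: number): number {
  let total = currentTokens; // the current scene is always included in full
  for (let i = 0; i < past.length; i++) {
    if (choices[i] === "full") total += past[i].tokens;
    else if (choices[i] === "summary") total += past[i].summaryTokens;
  }
  return total;
}

// Returns one choice per past scene (oldest first).
function chooseScenes(past: PastScene[], currentTokens: number, budget: number): Choice[] {
  const choices: Choice[] = past.map((): Choice => "full");
  const fits = () => contextTokens(past, choices, currentTokens) <= budget;

  // 1. Everything in full.
  if (fits()) return choices;
  // 2. Summarize earlier scenes, oldest first.
  for (let i = 0; i < past.length; i++) {
    choices[i] = "summary";
    if (fits()) return choices;
  }
  // 3. Remove summaries, oldest first.
  for (let i = 0; i < past.length; i++) {
    choices[i] = "omitted";
    if (fits()) return choices;
  }
  // 4. The current scene does not fit by itself.
  throw new Error(`Current scene (${currentTokens} tokens) exceeds the budget of ${budget} tokens`);
}
```

Each step only ever changes the oldest remaining scene, so the most recent history stays uncompressed for as long as the budget allows.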

Another important benefit of this approach is that summarization only needs to be done on location change, when there's a bunch of other processing happening anyway. This is a much smoother experience for the player than having to wait for summarization to complete after each narration event, when they're itching to do something.

Please tell me what you think. This is indeed a very difficult problem to solve.

bubbltaco · 2025-08-11T14:50:22Z

I agree with your points, especially that summarizing each event adds to the wait time and might not even make a significant difference. Since we want to build complexity bottom up, we can start with just scene summaries and see how things play out. In the future we'll probably make significant changes to these systems as we see usage patterns and experiment, so it totally makes sense not to overcomplicate things right off the bat.

The algorithm you proposed will be very easy to implement from what we already have, so I can quickly make those changes.

I'll also make changes for all the comments you mentioned regarding DRY, error handling, etc.

There are just two unresolved threads from your comments; please let me know your thoughts there.

bubbltaco · 2025-08-17T20:36:37Z

I've just updated the summarization and made the other changes you mentioned regarding DRY and error handling. Please take a look.

p-e-w

I just can't wrap my head around what's going on here. What are the three types of "context unit" for? My understanding of the correct algorithm is this:

1. Break the entire context into scenes (everything that happens in one location). Each scene has a text (introduction + all of its narration) and a summary (obtained on location change).
2. For each scene, keep either the text, the summary, or nothing.

So what are the different types doing? Isn't there only one type?
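Under this model there is a single unit type with a three-way choice per scene. A TypeScript sketch of assembling the prompt text from those choices, where the `Scene` shape, the empty summary for the current scene, and the blank-line separator are assumptions:

```typescript
// One unit type: every scene has a full text and (once complete) a summary.
interface Scene {
  text: string; // introduction + all of its narration
  summary: string; // obtained on location change; assumed "" for the current scene
}

// The only per-scene decision the context builder makes.
type Choice = "text" | "summary" | "nothing";

function renderContext(scenes: Scene[], choices: Choice[]): string {
  const parts: string[] = [];
  for (let i = 0; i < scenes.length; i++) {
    if (choices[i] === "text") parts.push(scenes[i].text);
    else if (choices[i] === "summary") parts.push(scenes[i].summary);
    // "nothing": the scene is left out entirely.
  }
  return parts.join("\n\n");
}
```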

bubbltaco · 2025-08-19T05:05:33Z

Yeah, I realize I can simplify this further. All the context unit stuff is left over from when we had two types of summaries and had to keep track of everything. I've just made some changes that should make everything clearer and closer to what you described. Please let me know if this is better.

p-e-w · 2025-08-25T06:46:15Z

Ok, good. I think we now have something mergeable. I'm going to make a few architectural adjustments soon, but those are easier to just do than explain, so I'll just take care of them myself.

Please run npx @biomejs/biome check --write to fix the CI errors.

bubbltaco · 2025-08-26T13:40:13Z

Just fixed all biome errors

p-e-w · 2025-08-27T09:55:59Z

Thanks! I'm now doing intensive testing, and then this is going in.

p-e-w · 2025-08-28T02:08:30Z

The implementation of getContextLength fails with the llama.cpp server. It throws the error "Model not found". I don't know what the best solution is here, but llama.cpp is the core local engine I intend to support, so unfortunately, this cannot be merged without a fix.

p-e-w · 2025-08-28T02:13:47Z

The llama.cpp docs claim OpenAI compatibility, which unfortunately seems to include not providing the context length over the API, even though that information is easily available to the engine.

This is just an absolute shitshow. It's a terrible user experience to have to provide the context length manually, but there doesn't seem to be any choice.

Ref #34

p-e-w · 2025-08-28T04:58:46Z

I've looked into some more docs, and it appears that regrettably, automatic context length determination for an abstract "OpenAI compatible" backend just isn't going to happen. I desperately wanted Waidrin to not require the user to provide that parameter, which many users will have no understanding of, but that doesn't seem to be possible in the current ecosystem.

I therefore decided to bite the bullet and add a context length field to the configuration screen, which I just pushed. The state object now contains contextLength and inputLength fields. Only contextLength is set by the user; inputLength is calculated from it with a simple heuristic that takes GPT-5 constraints into account.

You can therefore remove getContextLength, as well as the tokenBudget parameters everywhere. state.inputLength is the token budget, and can be directly obtained from the global state object wherever needed, eliminating the requirement to pass it around.

Your implementation of getContextLength might find future use in custom backend plugins, such as a dedicated OpenRouter backend (see #28), where we know that the specific backend supports that interface.

bubbltaco · 2025-08-31T17:05:36Z

Given our constraints, that makes sense to me. I've just updated the PR to remove getContextLength, as well as the tokenBudget parameters that were being passed around.

p-e-w · 2025-09-06T11:26:32Z

Merged! I appreciate your patience in seeing this through.

Ref p-e-w#34

* Add content management and compression
* changes based on feedback
* replace context management algorithm
* remove MAX_PAGES_TO_SEARCH from getContextLength
* simplify summarization and fix DRY issues and error handling
* rewrite context algorithm + small fixes
* fix CI errors
* remove getContextLength and tokenBudget

p-e-w requested changes Aug 10, 2025


p-e-w requested changes Aug 11, 2025


Comment thread lib/backend.ts Outdated

Comment thread lib/context.ts

Comment thread lib/prompts.ts

Comment thread lib/prompts.ts Outdated

Comment thread lib/prompts.ts Outdated

Comment thread lib/engine.ts Outdated

p-e-w reviewed Aug 19, 2025


Comment thread lib/backend.ts Outdated

Comment thread lib/context.ts Outdated

Comment thread lib/context.ts Outdated

Comment thread lib/prompts.ts Outdated

Comment thread lib/prompts.ts Outdated

p-e-w added a commit that referenced this pull request Aug 28, 2025

Make context length configurable

10dff44

Ref #34

bubbltaco added 7 commits August 31, 2025 11:52

Add content management and compression

5341139

changes based on feedback

1ecaff4

replace context management algorithm

97fac17

remove MAX_PAGES_TO_SEARCH from getContextLength

a725f02

simplify summarization and fix DRY issues and error handling

e4d511b

rewrite context algorithm + small fixes

7671e93

fix CI errors

7f8b22d

bubbltaco force-pushed the context-management-compression branch from 566f577 to 7f8b22d August 31, 2025 16:52

remove getContextLength and tokenBudget

ec9e68b

p-e-w merged commit 58005b1 into p-e-w:master Sep 6, 2025
1 check passed

chengkaichee pushed a commit to chengkaichee/wRPG-Plugin that referenced this pull request Sep 15, 2025

Make context length configurable

9f55457

Ref p-e-w#34


2 participants