
Add context management and compression #34

Merged
p-e-w merged 8 commits into p-e-w:master from bubbltaco:context-management-compression
Sep 6, 2025

Conversation

@bubbltaco

Contributor

Here's a first draft of my changes for context management and context compression. These changes should mitigate the problem of maxing out a model's context length as the story progresses. To manage context we adopt a recursive summarization strategy: we first try replacing each individual event with a summary that strips out extraneous prose/dialogue. Then, as the context fills up even more, we replace entire scenes with a summary.

I've tested the summarization a bit and it generally seems to be working, but I've not extensively tested the context compression yet. I want to get some validation of my general approach before I spend time with more thorough testing.

Summary of changes:

  • Update NarrationEvent to store tokens, summary, summary tokens.
  • Update LocationChangeEvent to store tokens, summary, summary tokens.
  • Backends now need to implement a getContextLength method that allows us to get the max context size we can use with that backend.
    • For DefaultBackend we try to use the OpenAI-compatible /models endpoint and fall back to a default context length (64k for now) if none is found. We cache the value as an instance variable in the DefaultBackend class so it doesn't need to be fetched each time.
  • After every narration event, we summarize the event. After every location change, we summarize all events from last location change to current location change.
  • We incorporate a token budget into prompts that require event history context, and we use the context generation algorithm described below to generate/compress the context.
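
For illustration, the lookup-with-fallback-and-cache idea from the getContextLength bullet could be sketched roughly as follows. This is not the PR's code: `ModelInfo`, its `context_length` field, `pickContextLength`, `CachedContextLength`, and the injected `listModels` function are all hypothetical names, and many /models responses do not report a context length at all. Reading "64k" as 64,000 is also an assumption.

```typescript
// Hypothetical shape of one /models entry. The `context_length` field is an
// assumption; a given backend may not report it.
interface ModelInfo {
  id: string;
  context_length?: number;
}

// Fallback used when the backend reports nothing ("64k", read here as 64,000).
const DEFAULT_CONTEXT_LENGTH = 64_000;

// Pure helper: use the reported value if it looks valid, else the fallback.
function pickContextLength(
  models: ModelInfo[],
  modelId: string,
  fallback: number = DEFAULT_CONTEXT_LENGTH,
): number {
  const reported = models.find((m) => m.id === modelId)?.context_length;
  return typeof reported === "number" && reported > 0 ? reported : fallback;
}

// Caches the resolved value so the model list is fetched at most once.
class CachedContextLength {
  private cached: number | undefined;
  private readonly modelId: string;
  private readonly listModels: () => Promise<ModelInfo[]>;

  constructor(modelId: string, listModels: () => Promise<ModelInfo[]>) {
    this.modelId = modelId;
    this.listModels = listModels;
  }

  async get(): Promise<number> {
    if (this.cached === undefined) {
      let models: ModelInfo[] = [];
      try {
        models = await this.listModels();
      } catch {
        // Endpoint missing or failing: treat as "not found" and use the fallback.
      }
      this.cached = pickContextLength(models, this.modelId);
    }
    return this.cached;
  }
}
```

Injecting `listModels` keeps the sketch independent of any particular HTTP client or response format.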

Context generation algorithm:

To make use of the summaries, we have an algorithm that constructs the context given a token budget. There are a few ways we could go about this. The simplest would be to progressively replace events in the context with event summaries, and once everything in the context has been replaced with an event summary, start replacing the oldest scenes with their scene summaries. However, I think keeping the full text of newer events for as long as possible will lead to higher quality. So I decided to come up with an algorithm that switches between event summaries and scene summaries as we need more and more compression.

This is how it’s set up now:
At first it replaces the oldest events with their event summaries. Once 50% of events have been replaced by an event summary, then it goes back and starts replacing the oldest scenes with their scene summaries. Once 25% of oldest events have been replaced by their scene summaries, it switches back to event summaries again and starts replacing up to 80% of the oldest events with event summaries. Then switches to replacing with scene summaries and so on. If at any point the context fits in the token budget it stops and returns that context. All of these thresholds for switching between summary types can be easily modified.

Possible extensions to summarization:

  • Have separate model to handle summarization, maybe separate params to handle summarization.
  • For summarizing each scene (indicated by location change), if the scene was particularly massive it might have large context by itself. So we could look into compression of context when generating scene summary.

There might be some edge cases where summarization breaks down, e.g. if the user stays in the same location for so long that that scene by itself maxes out the context window.

@p-e-w p-e-w left a comment

Owner

Thanks for the PR! I'm a big fan of the overall code and comment quality 👍

Waidrin is a complex project and will only become more complex in the future; to keep things manageable, it's important to avoid unnecessary complexity wherever possible. For this reason, it's imperative that we build complexity bottom up, not top down. In other words, we should start with a simple solution, and add more complexity only when practical experience has shown that it is unavoidable for solving a specific problem.

Having multiple "compression strategies" is an interesting idea, but very hard to reason about. For now, please pare this PR down to the absolute minimum complexity required to make context management work at all. This is a major change, and even in its most basic form it will be difficult to verify its correctness.

Comment thread lib/backend.ts Outdated
Comment thread lib/backend.ts Outdated
Comment thread lib/backend.ts Outdated
Comment thread lib/context.ts Outdated
Comment thread lib/context.ts Outdated
Comment thread lib/context.ts Outdated
Comment thread lib/schemas.ts Outdated
Comment thread lib/prompts.ts Outdated
@bubbltaco

Contributor Author

Thanks for the detailed review! I'm going to make the changes and also replace the compression algorithm with the simpler approach. I also had some questions; please look through those when you get the chance and let me know your thoughts.

@bubbltaco

Contributor Author

I've just replaced the context management algorithm with something much simpler. This one works in 4 steps:

Step 1: just try all events without any summarization

Step 2: replace oldest events with their summaries one by one until context fits under budget.

Step 3: replace oldest scenes with their summaries one by one until context fits under budget.

Step 4: If none of the above worked, then start removing the oldest scenes. This is just a last resort: by a back-of-the-napkin calculation, most scene summaries will be at most 300-400 tokens, so with a 50k context length the story would have to reach over a hundred scenes before Step 4 becomes relevant.

In the future we can replace step 4 with a less lossy alternative like summarizing multiple scenes or something.
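
A rough sketch of these four steps, for illustration only. `StoryEvent`, `StoryScene`, and `buildContextTwoLevel` are hypothetical names, the token counter is passed in by the caller, and leaving the current (last) scene out of Steps 3 and 4 is an assumption that the description above does not spell out.

```typescript
interface StoryEvent {
  text: string;
  summary: string;
}

interface StoryScene {
  events: StoryEvent[];
  summary: string; // scene-level summary
}

function buildContextTwoLevel(
  scenes: StoryScene[],
  budget: number,
  countTokens: (s: string) => number,
): string {
  // Step 1: start from the full text of every event.
  const parts: string[][] = scenes.map((s) => s.events.map((e) => e.text));
  const render = (): string => ([] as string[]).concat(...parts).join("\n");
  const fits = (): boolean => countTokens(render()) <= budget;
  if (fits()) return render();

  // Step 2: replace events with their summaries, oldest first.
  for (let i = 0; i < scenes.length; i++) {
    for (let j = 0; j < scenes[i].events.length; j++) {
      parts[i][j] = scenes[i].events[j].summary;
      if (fits()) return render();
    }
  }

  const last = scenes.length - 1;
  // Step 3: replace whole scenes before the current one with scene summaries.
  for (let i = 0; i < last; i++) {
    parts[i] = [scenes[i].summary];
    if (fits()) return render();
  }

  // Step 4 (last resort): drop the oldest scenes entirely.
  for (let i = 0; i < last; i++) {
    parts[i] = [];
    if (fits()) return render();
  }

  // Nothing fit: return the smallest context we could build (assumption;
  // the caller would have to decide how to handle this case).
  return render();
}
```

Each step only runs once the previous one has been exhausted, so recent events keep their full text for as long as the budget allows.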

@p-e-w p-e-w left a comment

Owner

I looked a bit deeper into the code today, and I must admit I fear the complexity of this system. This isn't necessarily a critique of your approach; it's the problem itself that is extremely complex. Add to that the fact that we're working with imperfect information (token counts) and that prompt building is tricky, and we're dealing with something that I foresee will be very difficult to test and reason about.

After thinking about this some more, I believe we should limit ourselves to compressing scenes (all events in one location) rather than scenes + narration events. The two-level approach is responsible for much of the complexity, even though event-level compression provides only a modest degree of compaction.

In practice, I expect that most players will be using a context length between 32k and 128k. If we assume that a typical scene is 5k-10k tokens, but compresses to 3-5 paragraphs (about 500 tokens), we can always keep at least 2-3 complete scenes in the context, while having space for dozens more in compressed form. That's equivalent to multiple novels worth of narrative before we ever have to remove scenes, and we can always bring them back contextually depending e.g. on the characters involved.
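
The estimate above can be checked with a few lines of arithmetic. This is only a worked example under the stated assumptions (scene and summary sizes as quoted, "32k" and "128k" read as 32,000 and 128,000 tokens); `summarizedSceneCapacity` is a hypothetical helper, and it ignores prompt overhead and any tokens reserved for the model's output.

```typescript
// How many scene summaries fit next to `fullScenes` uncompressed scenes?
function summarizedSceneCapacity(
  contextLength: number,
  fullScenes: number,
  sceneTokens: number,
  summaryTokens: number,
): number {
  const remaining = contextLength - fullScenes * sceneTokens;
  return remaining <= 0 ? 0 : Math.floor(remaining / summaryTokens);
}

// Worst case from the comment: 32k context, two full 10k-token scenes.
console.log(summarizedSceneCapacity(32_000, 2, 10_000, 500)); // 24

// Roomier case: 128k context, three full 10k-token scenes.
console.log(summarizedSceneCapacity(128_000, 3, 10_000, 500)); // 196
```

Even in the tight case, about two dozen summarized scenes fit alongside two complete ones, which matches the "dozens more" estimate.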

The algorithm could be similar to your current approach:

  1. Put everything into the context. If it fits, that's the context used.
  2. Replace scenes before the current one with their summaries, starting with the oldest one, until the result fits the context.
  3. If that isn't enough, remove scene summaries, starting with the oldest one, until the result fits the context.
  4. If that still isn't enough (i.e. the current scene doesn't fit the context even by itself), throw an error. We need to see this happen in practice to be able to decide what to do about it.
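
A minimal sketch of the four steps above, for illustration only. The `Scene` shape, `buildContext`, and the character-based `estimateTokens` stand-in are hypothetical and not taken from the PR's code; a real implementation would count tokens for the actual prompt rather than for a plain join of scene texts.

```typescript
interface Scene {
  text: string; // everything that happened in one location
  summary: string; // produced on location change; unused for the current scene
}

// Crude tokenizer stand-in (assumption): roughly 4 characters per token.
const estimateTokens = (s: string): number => Math.ceil(s.length / 4);

function buildContext(
  scenes: Scene[],
  budget: number,
  countTokens: (s: string) => number = estimateTokens,
): string {
  // Step 1: put everything into the context.
  const parts: (string | null)[] = scenes.map((s) => s.text);
  const render = (): string =>
    parts.filter((p): p is string => p !== null).join("\n\n");
  const fits = (): boolean => countTokens(render()) <= budget;
  if (fits()) return render();

  const current = scenes.length - 1;

  // Step 2: replace scenes before the current one with summaries, oldest first.
  for (let i = 0; i < current; i++) {
    parts[i] = scenes[i].summary;
    if (fits()) return render();
  }

  // Step 3: remove scene summaries, oldest first.
  for (let i = 0; i < current; i++) {
    parts[i] = null;
    if (fits()) return render();
  }

  // Step 4: the current scene does not fit even by itself.
  throw new Error("Current scene does not fit into the token budget");
}
```

Because every scene is in exactly one of three states (full text, summary, or removed), the whole context can be described by a single array, which keeps the algorithm easy to reason about.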

Another important benefit of this approach is that summarization only needs to be done on location change, when there's a bunch of other processing happening anyway. This is a much smoother experience for the player than having to wait for summarization to complete after each narration event, when they're itching to do something.

Please tell me what you think. This is indeed a very difficult problem to solve.

Comment thread lib/backend.ts Outdated
Comment thread lib/context.ts
Comment thread lib/prompts.ts
Comment thread lib/prompts.ts Outdated
Comment thread lib/prompts.ts Outdated
Comment thread lib/engine.ts Outdated
@bubbltaco

Contributor Author

I agree with your points, especially that summarizing each event adds to the wait time and might not even make a significant difference. Since we want to build complexity bottom up, we can start with just scene summaries and see how things play out. In the future we'll probably be making significant changes to these systems as we see usage patterns and experiment, so it totally makes sense not to overcomplicate things right off the bat.

The algorithm you proposed will be very easy to implement from what we already have, so I can quickly make those changes.

I'll also make changes for all the comments you mentioned regarding DRY, error handling, etc.

There are just two unresolved threads from your comments; please let me know your thoughts there.

@bubbltaco

Contributor Author

I've just updated the summarization and made the other changes you mentioned regarding DRY and error handling. Please take a look.

@p-e-w p-e-w left a comment

Owner

I just can't wrap my head around what's going on here. What are the three types of "context unit" for? My understanding of the correct algorithm is this:

  1. Break the entire context into scenes (everything that happens in one location). Each scene has a text (introduction + all of its narration), and a summary (which was obtained on location change).
  2. For each scene, keep either the text, or the summary, or nothing.

So what are the different types doing? Isn't there only one type?

Comment thread lib/backend.ts Outdated
Comment thread lib/context.ts Outdated
Comment thread lib/context.ts Outdated
Comment thread lib/prompts.ts Outdated
Comment thread lib/prompts.ts Outdated
@bubbltaco

Contributor Author

Yeah I realize I can simplify this further. All the context unit stuff is leftover from before when we had two types of summaries and had to keep track of everything. I've just made some changes that should make everything clearer and closer to what you just described. Please let me know if this is better.

@p-e-w

p-e-w commented Aug 25, 2025

Owner

Ok, good. I think we now have something mergeable. I'm going to make a few architectural adjustments soon, but those are easier to just do than explain, so I'll just take care of them myself.

Please run npx @biomejs/biome check --write to fix the CI errors.

@bubbltaco

Contributor Author

Just fixed all biome errors

@p-e-w

p-e-w commented Aug 27, 2025

Owner

Thanks! I'm now doing intensive testing, and then this is going in.

@p-e-w

p-e-w commented Aug 28, 2025

Owner

The implementation of getContextLength fails with the llama.cpp server. It throws the error "Model not found". I don't know what the best solution is here, but llama.cpp is the core local engine I intend to support, so unfortunately, this cannot be merged without a fix.

@p-e-w

p-e-w commented Aug 28, 2025

Owner

The llama.cpp docs claim OpenAI compatibility, which unfortunately seems to include not providing the context length over the API, even though that information is easily available to the engine.

This is just an absolute shitshow. It's a terrible user experience to have to provide the context length manually, but there doesn't seem to be any choice.

p-e-w added a commit that referenced this pull request Aug 28, 2025
@p-e-w

p-e-w commented Aug 28, 2025

Owner

I've looked into some more docs, and it appears that regrettably, automatic context length determination for an abstract "OpenAI compatible" backend just isn't going to happen. I desperately wanted Waidrin to not require the user to provide that parameter, which many users will have no understanding of, but that doesn't seem to be possible in the current ecosystem.

I therefore decided to bite the bullet and add a context length field to the configuration screen, which I just pushed. The state object now contains contextLength and inputLength fields. Only contextLength is set by the user; inputLength is calculated from it with a simple heuristic that takes GPT-5 constraints into account.

You can therefore remove getContextLength, as well as the tokenBudget parameters everywhere. state.inputLength is the token budget, and can be directly obtained from the global state object wherever needed, eliminating the requirement to pass it around.

Your implementation of getContextLength might find future use in custom backend plugins, such as a dedicated OpenRouter backend (see #28), where we know that the specific backend supports that interface.

@bubbltaco force-pushed the context-management-compression branch from 566f577 to 7f8b22d on August 31, 2025 16:52
@bubbltaco

Contributor Author

Given our constraints, that makes sense to me. I've just updated the PR to get rid of getContextLength, as well as the tokenBudget parameters that were being passed around.

@p-e-w p-e-w merged commit 58005b1 into p-e-w:master Sep 6, 2025
1 check passed
@p-e-w

p-e-w commented Sep 6, 2025

Owner

Merged! I appreciate your patience in seeing this through.

chengkaichee pushed a commit to chengkaichee/wRPG-Plugin that referenced this pull request Sep 15, 2025
* Add content management and compression

* changes based on feedback

* replace context management algorithm

* remove MAX_PAGES_TO_SEARCH from getContextLength

* simplify summarization and fix DRY issues and error handling

* rewrite context algorithm + small fixes

* fix CI errors

* remove getContextLength and tokenBudget