Add context management and compression #34
p-e-w
left a comment
Thanks for the PR! I'm a big fan of the overall code and comment quality 👍
Waidrin is a complex project and will only become more complex in the future; to keep things manageable, it's important to avoid unnecessary complexity wherever possible. For this reason, it's imperative that we build complexity bottom up, not top down. In other words, we should start with a simple solution, and add more complexity only when practical experience has shown that it is unavoidable for solving a specific problem.
Having multiple "compression strategies" is an interesting idea, but very hard to reason about. For now, please pare this PR down to the absolute minimum complexity required to make context management work at all. This is a major change, and even in its most basic form it will be difficult to verify its correctness.
|
Thanks for the detailed review! I'm going to make the changes and also replace the compression algorithm with the simpler approach. I also had some questions; please look through those when you get the chance and let me know your thoughts.
|
I've just replaced the context management algorithm with something much simpler. This one works in 4 steps:
- Step 1: Try all events without any summarization.
- Step 2: Replace the oldest events with their summaries one by one until the context fits under budget.
- Step 3: Replace the oldest scenes with their summaries one by one until the context fits under budget.
- Step 4: If none of the above worked, start removing the oldest scenes.
Step 4 is just a last resort. By a back-of-the-napkin calculation, most scene summaries will be up to 300-400 tokens, so with a 50k context length the story would have to exceed a hundred scenes before Step 4 becomes relevant. In the future we can replace Step 4 with a less lossy alternative, such as summarizing multiple scenes together.
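As a rough illustration of the four steps described in this comment, here is a minimal sketch. It is not the code from this PR: every name (StoryEvent, Scene, buildContext, estimateTokens) is hypothetical, the token count is approximated as 4 characters per token, and Step 3 is assumed to cover every scene, including the current one.

```typescript
// Illustrative sketch only: every name here is hypothetical, not taken from the PR.
interface StoryEvent {
  text: string;
  summary: string;
}

interface Scene {
  events: StoryEvent[];
  summary: string;
}

type SceneMode = "events" | "summary" | "removed";

// Crude stand-in for a real tokenizer (assumes roughly 4 characters per token).
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

function buildContext(scenes: Scene[], budget: number): string[] {
  const modes: SceneMode[] = scenes.map(() => "events");
  const totalEvents = scenes.reduce((n, scene) => n + scene.events.length, 0);
  let summarizedEvents = 0; // how many of the oldest events use their summary

  // Render the context for the current compression state, oldest part first.
  const render = (): string[] => {
    const parts: string[] = [];
    let eventIndex = 0;
    scenes.forEach((scene, i) => {
      if (modes[i] === "summary") parts.push(scene.summary);
      if (modes[i] === "events") {
        scene.events.forEach((event, j) => {
          parts.push(eventIndex + j < summarizedEvents ? event.summary : event.text);
        });
      }
      eventIndex += scene.events.length;
    });
    return parts;
  };
  const fits = (): boolean =>
    render().reduce((sum, part) => sum + estimateTokens(part), 0) <= budget;

  // Step 1: try everything in full.
  if (fits()) return render();
  // Step 2: summarize events one by one, oldest first.
  for (summarizedEvents = 1; summarizedEvents <= totalEvents; summarizedEvents++) {
    if (fits()) return render();
  }
  // Step 3: collapse whole scenes into their summaries, oldest first.
  for (let i = 0; i < scenes.length; i++) {
    modes[i] = "summary";
    if (fits()) return render();
  }
  // Step 4 (last resort): remove scenes entirely, oldest first.
  for (let i = 0; i < scenes.length; i++) {
    modes[i] = "removed";
    if (fits()) return render();
  }
  return []; // only reached if the budget is negative
}
```

Each step only ever increases compression, so the first state that fits is also the least lossy one this ordering can produce.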
p-e-w
left a comment
I looked a bit deeper into the code today, and I must admit I fear the complexity of this system. This isn't necessarily a critique of your approach; it's the problem itself that is extremely complex. Add to that the fact that we're working with imperfect information (token counts) and that prompt building is tricky, and we're dealing with something that I foresee will be very difficult to test and reason about.
After thinking about this some more, I believe we should limit ourselves to compressing scenes (all events in one location) rather than scenes + narration events. The two-level approach is responsible for much of the complexity, even though event-level compression provides only a modest degree of compaction.
In practice, I expect that most players will be using a context length between 32k and 128k. If we assume that a typical scene is 5k-10k tokens, but compresses to 3-5 paragraphs (about 500 tokens), we can always keep at least 2-3 complete scenes in the context, while having space for dozens more in compressed form. That's equivalent to multiple novels worth of narrative before we ever have to remove scenes, and we can always bring them back contextually depending e.g. on the characters involved.
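The estimate above can be checked with a few lines of arithmetic. This is only a sketch under the stated assumptions (10k tokens per full scene, about 500 tokens per summary); the function name is made up for illustration, and it ignores whatever space the system prompt and the model's response need.

```typescript
// Back-of-the-envelope check; all figures are rough assumptions, not measurements.
function compressedSceneCapacity(
  contextLength: number,   // total context window in tokens
  fullScenesKept: number,  // scenes kept as full text
  fullSceneTokens: number, // assumed size of one full scene
  summaryTokens: number,   // assumed size of one scene summary
): number {
  const remaining = contextLength - fullScenesKept * fullSceneTokens;
  return Math.max(0, Math.floor(remaining / summaryTokens));
}

// 32k context, 2 full scenes at 10k tokens each, 500-token summaries.
console.log(compressedSceneCapacity(32 * 1024, 2, 10_000, 500));
// 128k context, 3 full scenes at 10k tokens each, 500-token summaries.
console.log(compressedSceneCapacity(128 * 1024, 3, 10_000, 500));
```

Under these assumptions a 32k context has room for about 25 summarized scenes next to 2 full ones, and a 128k context for about 200 next to 3 full ones.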
The algorithm could be similar to your current approach:
- Put everything into the context. If it fits, that's the context used.
- Replace scenes before the current one with their summaries, starting with the oldest one, until the result fits the context.
- If that isn't enough, remove scene summaries, starting with the oldest one, until the result fits the context.
- If that still isn't enough (i.e. the current scene doesn't fit the context even by itself), throw an error. We need to see this happen in practice to be able to decide what to do about it.
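The four bullets above could look roughly like this in code. This is a sketch under assumptions, not the implementation that was merged: Scene, buildSceneContext and estimateTokens are hypothetical names, and the token count is a crude 4-characters-per-token estimate rather than a real tokenizer.

```typescript
// Illustrative sketch only: names and the token estimate are assumptions.
interface Scene {
  text: string;    // introduction plus all narration in one location
  summary: string; // summary produced on location change
}

// Crude stand-in for a real tokenizer (assumes roughly 4 characters per token).
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

function buildSceneContext(scenes: Scene[], budget: number): string[] {
  if (scenes.length === 0) return [];
  const current = scenes[scenes.length - 1];
  const earlier = scenes.slice(0, -1);

  // What each earlier scene contributes: full text, summary, or null (removed).
  const parts: (string | null)[] = earlier.map((scene) => scene.text);
  const render = (): string[] => [
    ...parts.filter((part): part is string => part !== null),
    current.text, // the current scene is always kept in full
  ];
  const fits = (): boolean =>
    render().reduce((sum, part) => sum + estimateTokens(part), 0) <= budget;

  // 1. Put everything into the context.
  if (fits()) return render();
  // 2. Replace earlier scenes with their summaries, oldest first.
  for (let i = 0; i < earlier.length; i++) {
    parts[i] = earlier[i].summary;
    if (fits()) return render();
  }
  // 3. Remove scene summaries, oldest first.
  for (let i = 0; i < earlier.length; i++) {
    parts[i] = null;
    if (fits()) return render();
  }
  // 4. The current scene does not fit even by itself.
  throw new Error("Current scene exceeds the context budget");
}
```

Because the current scene is never touched, no summary is needed until the location changes, which matches the point about summarizing only on location change.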
Another important benefit of this approach is that summarization only needs to be done on location change, when there's a bunch of other processing happening anyway. This is a much smoother experience for the player than having to wait for summarization to complete after each narration event, when they're itching to do something.
Please tell me what you think. This is indeed a very difficult problem to solve.
|
I agree with your points, especially that summarizing each event adds to the wait time and might not even make a significant difference. Since we want to build complexity bottom up, we can start with just scene summaries and see how things play out. In the future we'll probably be making significant changes to these systems as we see usage patterns and experiment, so it totally makes sense to not overcomplicate things right off the bat. The algorithm you proposed will be very easy to implement from what we already have, so I can quickly make those changes. I'll also make changes for all the comments you mentioned regarding DRY, error handling, etc. There are just two unresolved threads from your comments; please let me know your thoughts there.
|
I've just updated the summarization and made the other changes you mentioned regarding DRY and error handling. Please take a look. |
p-e-w
left a comment
I just can't wrap my head around what's going on here. What are the three types of "context unit" for? My understanding of the correct algorithm is this:
- Break the entire context into scenes (everything that happens in one location). Each scene has a text (introduction + all of its narration), and a summary (which was obtained on location change).
- For each scene, keep either the text, or the summary, or nothing.
So what are the different types doing? Isn't there only one type?
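To make the "only one type" model concrete, here is a small sketch. The names (Scene, SceneRendering, renderScene, renderContext) are hypothetical and are not the types used in this PR; the point is only that a single scene type plus a three-way choice per scene is enough.

```typescript
// Illustrative sketch only: hypothetical names, not the PR's actual types.
interface Scene {
  text: string;    // introduction plus all of the scene's narration
  summary: string; // obtained on location change
}

// For each scene, keep either the text, or the summary, or nothing.
type SceneRendering = "text" | "summary" | "omitted";

function renderScene(scene: Scene, rendering: SceneRendering): string | null {
  switch (rendering) {
    case "text":
      return scene.text;
    case "summary":
      return scene.summary;
    case "omitted":
      return null;
  }
}

// Build the context from one rendering choice per scene, skipping omitted scenes.
function renderContext(scenes: Scene[], renderings: SceneRendering[]): string[] {
  return scenes
    .map((scene, i) => renderScene(scene, renderings[i] ?? "omitted"))
    .filter((part): part is string => part !== null);
}
```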
|
Yeah I realize I can simplify this further. All the context unit stuff is leftover from before when we had two types of summaries and had to keep track of everything. I've just made some changes that should make everything clearer and closer to what you just described. Please let me know if this is better. |
|
Ok, good. I think we now have something mergeable. I'm going to make a few architectural adjustments soon, but those are easier to just do than explain, so I'll just take care of them myself. Please run |
|
Just fixed all biome errors |
|
Thanks! I'm now doing intensive testing, and then this is going in. |
|
The implementation of |
|
The llama.cpp docs claim OpenAI compatibility, which unfortunately seems to include not providing the context length over the API, even though that information is easily available to the engine. This is just an absolute shitshow. It's a terrible user experience to have to provide the context length manually, but there doesn't seem to be any choice. |
|
I've looked into some more docs, and it appears that regrettably, automatic context length determination for an abstract "OpenAI compatible" backend just isn't going to happen. I desperately wanted Waidrin to not require the user to provide that parameter, which many users will have no understanding of, but that doesn't seem to be possible in the current ecosystem. I therefore decided to bite the bullet and add a context length field to the configuration screen, which I just pushed. The state object now contains You can therefore remove Your implementation of |
566f577 to 7f8b22d
|
Given our constraints, that makes sense to me. I've just updated to get rid of |
|
Merged! I appreciate your patience in seeing this through. |
* Add content management and compression
* changes based on feedback
* replace context management algorithm
* remove MAX_PAGES_TO_SEARCH from getContextLength
* simplify summarization and fix DRY issues and error handling
* rewrite context algorithm + small fixes
* fix CI errors
* remove getContextLength and tokenBudget
Here's a first draft of my changes regarding context management and compression of context. These changes should mitigate issues with maxing out the context length of a model as the story progresses. To manage context we adopt a recursive summarization strategy where we first try replacing each individual event with a summary that strips out extraneous prose/dialogue. Then, as the context fills up even more, we replace entire scenes with a summary.
I've tested the summarization a bit and it generally seems to be working, but I've not extensively tested the context compression yet. I want to get some validation of my general approach before I spend time with more thorough testing.
Summary of changes:
getContextLength method that allows us to get the max context size we can use with that backend. /models endpoint and fall back to a default (64k for now) context length if not found. We cache the value as an instance variable in DefaultBackend class so it doesn’t need to fetch each time.
Context generation algorithm:
To make use of the summarizations, we have an algorithm that constructs the context given a token budget. There are a few ways we could go about this. The simplest would be to progressively replace events in context with event summaries, and once everything in context has been replaced with an event summary, start replacing the oldest scenes with their scene summaries. However, I think keeping the full text for newer events as much as possible will lead to higher quality. So I decided to come up with an algorithm that switches between using event summaries and scene summaries as we need more and more compression.
This is how it’s set up now:
At first it replaces the oldest events with their event summaries. Once 50% of events have been replaced by an event summary, it goes back and starts replacing the oldest scenes with their scene summaries. Once 25% of the oldest events have been replaced by their scene summaries, it switches back to event summaries and replaces up to 80% of the oldest events with event summaries. Then it switches back to scene summaries, and so on. If at any point the context fits in the token budget, it stops and returns that context. All of these thresholds for switching between summary types can be easily modified.
Possible extensions to summarization:
There might be some edge cases where summarization breaks down. For example, if the user stays in the same location for so long that the scene by itself maxes out the context window.