11 changes: 11 additions & 0 deletions .changeset/compaction-counter-calibration.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
---
"agents": minor
---

feat(experimental/memory): one consistent token accounting for compaction.

The Session now derives a single authoritative context size in model tokens — from the `compactAfter()` counter when configured, otherwise from model-reported usage on assistant message metadata (`metadata.usage` / `metadata.totalUsage`, e.g. set via the AI SDK's `messageMetadata`) — and uses that one number for both the fire decision and `createCompactFunction`'s boundary walk (via the new `CompactContext.contextTokens`), where it calibrates the built-in heuristic to the model's token scale.

This fixes #1593: on tool-heavy histories the heuristic under-counts, so `tailTokenBudget` silently degraded to `minTailMessages`. It also makes compaction zero-config when assistant messages carry usage metadata.

The boundary walk no longer accepts a user counter at all: `CompactOptions.tokenCounter`, the `CompactTokenCounter` type, and `CompactContext.tokenCounter` are removed (the per-message counter was the #1593 footgun — a whole-prompt counter passed there silently degraded the budget to `minTailMessages`). Whole-prompt totals belong on `compactAfter()` or in usage metadata. New utils: `calculateContextTokens`, `getAssistantUsage`, `estimateContextTokensFromUsage`; `SessionMessage` gained `metadata`.
36 changes: 25 additions & 11 deletions docs/sessions.md
@@ -373,8 +373,7 @@ const session = Session.create(this)
generateText({ model: myModel, prompt }).then((r) => r.text),
protectHead: 3, // Keep first 3 messages (default: 3)
tailTokenBudget: 20000, // Protect ~20K tokens at the tail (default: 20000)
minTailMessages: 2, // Always keep at least 2 tail messages (default: 2)
tokenCounter: async (messages) => estimateWithYourTokenizer({ messages })
minTailMessages: 2 // Always keep at least 2 tail messages (default: 2)
})
)
.compactAfter(100_000); // Auto-compact at 100K estimated tokens
@@ -408,18 +407,27 @@ When `.compactAfter(threshold)` is set, `appendMessage()` checks the estimated t

> Auto-compaction is checked **between turns** (on each `appendMessage()`), not within a turn. A single long, tool-heavy turn can grow past the model's context window mid-flight, before the next check. `@cloudflare/think` adds opt-in mid-turn recovery on top of this — see [Context-window overflow recovery](./think/index.md#context-window-overflow-recovery).

By default, the estimate includes stored message parts plus the Session-managed frozen system prompt. That means context blocks and cached prompts managed by `Session` contribute to the threshold. The estimate does not include framework-specific prompt additions or tool schema serialization that happen outside `Session`, such as Think's final capability prompt and tool catalog.
### Token counting

There are two token-counting decisions:
The Session resolves one authoritative context size, in priority order:

- `.compactAfter(threshold, { tokenCounter })` controls when automatic compaction is triggered after writes. It can include the frozen system prompt and context blocks.
- `createCompactFunction({ tokenCounter })` controls which tail messages are protected from summarization. Use this when tool-heavy histories are much larger than the Workers-safe heuristic can estimate.
1. **`.compactAfter(threshold, { tokenCounter })`** — a whole-prompt counter. It receives the messages, the frozen system prompt, and context blocks; returning model-reported usage that ignores the input (e.g. `() => lastUsage.inputTokens`) is fine.
2. **Usage metadata on assistant messages** — when a message carries model-reported usage in `metadata.usage` or `metadata.totalUsage` (e.g. attached via the AI SDK's `messageMetadata`), the Session reads it automatically. The estimate is the last reported usage plus a heuristic for any newer messages. No configuration needed:

You usually only need to configure one counter. The `.compactAfter()` counter now also flows into `createCompactFunction`'s boundary walk (via `CompactContext`) when no explicit `createCompactFunction({ tokenCounter })` is given, so a single counter drives both "should we compact?" and "what should we compact?". Without this, a fire-only counter plus the under-counting heuristic could make compaction trigger every turn but silently no-op on tool-heavy histories (fixed in [#1593](https://github.com/cloudflare/agents/issues/1593)).
```typescript
await session.appendMessage({
id: `assistant-${crypto.randomUUID()}`,
role: "assistant",
parts,
metadata: { usage: { totalTokens: 84_213 } } // model-reported
});
```

Caveat: the flowed counter is invoked **per message** during the boundary walk. A tokenizer-style counter budgets accurately; a usage-only counter that returns a fixed whole-prompt total (e.g. `usage.inputTokens` regardless of which messages are passed) degrades the tail budget to `minTailMessages` — compaction still runs and context stays bounded, but the byte budget is effectively ignored. Pass an explicit per-message `createCompactFunction({ tokenCounter })` for precise tail budgeting.
3. **The Workers-safe heuristic**, otherwise: stored message parts plus the Session-managed frozen system prompt. The heuristic does not include framework-specific prompt additions or tool schema serialization that happen outside `Session`, such as Think's final capability prompt and tool catalog.
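The priority order above can be sketched as a small resolver. This is an illustrative sketch, not the Session's actual code: `sketchResolveContextTokens`, `SketchResolution`, and the input field names are hypothetical, and the three candidate values are assumed to have been computed already. The clamp on the counter result (non-finite becomes 0, otherwise rounded up and floored at 0) follows how the Session sanitizes a counter's return value.

```typescript
// Hypothetical names throughout; a sketch of the resolution order only.
type SketchSource = "counter" | "usage" | "heuristic";

interface SketchResolution {
  tokens: number;
  source: SketchSource;
}

function sketchResolveContextTokens(inputs: {
  counterResult?: number; // what a compactAfter() counter returned, if one is configured
  usageTokens?: number; // estimate derived from assistant usage metadata, if any
  heuristicTokens: number; // fallback heuristic, always available
}): SketchResolution {
  // 1. A configured counter wins, after sanitizing its return value.
  if (inputs.counterResult !== undefined) {
    const raw = inputs.counterResult;
    const tokens = Number.isFinite(raw) ? Math.max(0, Math.ceil(raw)) : 0;
    return { tokens, source: "counter" };
  }
  // 2. Otherwise use model-reported usage when a message carried it.
  if (inputs.usageTokens !== undefined) {
    return { tokens: inputs.usageTokens, source: "usage" };
  }
  // 3. Otherwise fall back to the heuristic.
  return { tokens: inputs.heuristicTokens, source: "heuristic" };
}
```

Returning the source alongside the number is a choice made for this sketch only; it makes it easy to log which tier produced the estimate.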

Use a custom counter when you have model-reported usage or your own tokenizer:
That single number decides when compaction fires, and it is also passed to the compaction function as `CompactContext.contextTokens`, where `createCompactFunction` uses it to express `tailTokenBudget` at the model's token scale. Tool-heavy histories that the heuristic under-counts therefore still honor the tail budget instead of silently no-opping (fixed in [#1593](https://github.com/cloudflare/agents/issues/1593)). The boundary walk never invokes a user counter per message.
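One way to picture the calibration is as a scale factor applied to the per-message heuristic during the tail walk. The sketch below is an assumption about the general idea, not the implementation inside `createCompactFunction`: `sketchTailStart` and its parameters are hypothetical, and it simplifies by scaling against the heuristic total of the messages alone (ignoring the system prompt).

```typescript
// Hypothetical sketch: find the index where the protected tail begins.
// perMessageHeuristic[i] is the heuristic token estimate of message i.
function sketchTailStart(
  perMessageHeuristic: number[],
  opts: { tailTokenBudget: number; minTailMessages: number; contextTokens?: number }
): number {
  const heuristicTotal = perMessageHeuristic.reduce((sum, n) => sum + n, 0);
  // Assumed calibration: stretch heuristic estimates so they add up to the
  // authoritative total. With no known total, the scale stays at 1.
  const scale =
    opts.contextTokens !== undefined && heuristicTotal > 0
      ? opts.contextTokens / heuristicTotal
      : 1;

  let used = 0;
  let start = perMessageHeuristic.length;
  // Walk backwards from the newest message.
  for (let i = perMessageHeuristic.length - 1; i >= 0; i--) {
    const cost = perMessageHeuristic[i] * scale;
    const kept = perMessageHeuristic.length - start;
    // Stop once the minimum tail is satisfied and the budget would be exceeded.
    if (kept >= opts.minTailMessages && used + cost > opts.tailTokenBudget) break;
    used += cost;
    start = i;
  }
  return start;
}
```

In this sketch, four messages estimated at 100 heuristic tokens each with a 250-token budget protect two tail messages when uncalibrated; if the model reports the context is really 800 tokens, each message is costed at 200 and only one is protected.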

Configure a `.compactAfter()` counter when you have model-reported usage or your own tokenizer and your messages do not carry usage metadata:

```typescript
const session = Session.create(this)
@@ -438,7 +446,7 @@ const session = Session.create(this)
});
```

> **Note:** The default token estimation is heuristic (not tiktoken). It uses `max(chars/4, words*1.3)` with 4 tokens per-message overhead, and also applies the string heuristic to the Session-managed system prompt. This is intentional — tiktoken would add 80-120MB heap overhead, which exceeds Cloudflare Workers' 128MB limit.
> **Note:** The fallback token estimation is heuristic (not tiktoken). It uses `max(chars/4, words*1.3)` with 4 tokens per-message overhead, and also applies the string heuristic to the Session-managed system prompt. This is intentional — tiktoken would add 80-120MB heap overhead, which exceeds Cloudflare Workers' 128MB limit. Attach usage metadata to assistant messages to get exact, model-reported accounting instead.
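For intuition, the formula in the note can be written out as below. This is a hypothetical re-implementation for illustration, not the library's source: the `sketch*` names are made up, rounding up with `Math.ceil` is an assumption, and the sketch counts only text parts because the note does not say how other part types (such as tool calls) are measured.

```typescript
// Hypothetical sketch of the documented heuristic: max(chars/4, words*1.3).
function sketchEstimateStringTokens(text: string): number {
  if (text.length === 0) return 0;
  const chars = text.length;
  const words = text.trim().split(/\s+/).filter(Boolean).length;
  // Rounding up is an assumption of this sketch.
  return Math.ceil(Math.max(chars / 4, words * 1.3));
}

interface SketchTextMessage {
  parts: Array<{ type: string; text?: string }>;
}

// The note states 4 tokens of per-message overhead.
const SKETCH_PER_MESSAGE_OVERHEAD = 4;

function sketchEstimateMessageTokens(messages: SketchTextMessage[]): number {
  let total = 0;
  for (const message of messages) {
    total += SKETCH_PER_MESSAGE_OVERHEAD;
    for (const part of message.parts) {
      // Only text is counted here; non-text parts are out of scope for the sketch.
      if (typeof part.text === "string") {
        total += sketchEstimateStringTokens(part.text);
      }
    }
  }
  return total;
}
```

Under these assumptions, `"Hello world"` has 11 characters and 2 words, so the estimate is `ceil(max(2.75, 2.6)) = 3`, and a single message containing it comes to 7.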

> **Gotcha:** Compaction is iterative but single-overlay. Each new compaction extends from the earliest existing compaction's `fromMessageId` to the new end. So you always have at most one active compaction overlay per session, and it keeps growing. The previous compaction rows remain in the database but are superseded by the latest one (which covers a wider range). `getCompactions()` returns all of them, but `getHistory()` applies the latest one.

@@ -891,11 +899,17 @@ Exported from `agents/experimental/memory/utils`:
```typescript
import {
estimateStringTokens,
estimateMessageTokens
estimateMessageTokens,
calculateContextTokens,
getAssistantUsage,
estimateContextTokensFromUsage
} from "agents/experimental/memory/utils/tokens";

estimateStringTokens("Hello world"); // heuristic: max(chars/4, words*1.3)
estimateMessageTokens(messages); // sum with 4 tokens per-message overhead
calculateContextTokens(usage); // totalTokens, or the sum of its components
getAssistantUsage(message); // usage from assistant message metadata, if any
estimateContextTokensFromUsage(messages); // last usage + heuristic for newer messages
```
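To make the comments on the three usage helpers concrete, here is a rough sketch of the behavior they describe. Everything named `sketch*` is hypothetical and stands in for the real utilities, whose exact signatures are not shown here. Two assumptions are baked in: the "components" of usage are taken to be `inputTokens` and `outputTokens`, and `metadata.usage` is checked before `metadata.totalUsage`.

```typescript
interface SketchUsage {
  inputTokens?: number;
  outputTokens?: number;
  totalTokens?: number;
}

interface SketchMessage {
  role: string;
  parts: Array<{ type: string; text?: string }>;
  metadata?: unknown;
}

// totalTokens when present, otherwise the sum of the components (assumed fields).
function sketchContextTokens(usage: SketchUsage): number | undefined {
  if (typeof usage.totalTokens === "number") return usage.totalTokens;
  if (usage.inputTokens === undefined && usage.outputTokens === undefined) {
    return undefined;
  }
  return (usage.inputTokens ?? 0) + (usage.outputTokens ?? 0);
}

// Usage from assistant message metadata, if any.
function sketchAssistantUsage(message: SketchMessage): SketchUsage | undefined {
  if (message.role !== "assistant") return undefined;
  const meta = message.metadata;
  if (typeof meta !== "object" || meta === null) return undefined;
  const record = meta as Record<string, unknown>;
  const candidate = record.usage ?? record.totalUsage;
  return typeof candidate === "object" && candidate !== null
    ? (candidate as SketchUsage)
    : undefined;
}

// Last reported usage plus a heuristic for every message that came after it.
function sketchEstimateFromUsage(
  messages: SketchMessage[],
  heuristic: (message: SketchMessage) => number
): { tokens: number } | undefined {
  for (let i = messages.length - 1; i >= 0; i--) {
    const usage = sketchAssistantUsage(messages[i]);
    const base = usage ? sketchContextTokens(usage) : undefined;
    if (base === undefined) continue;
    let newer = 0;
    for (let j = i + 1; j < messages.length; j++) newer += heuristic(messages[j]);
    return { tokens: base + newer };
  }
  return undefined; // no usage metadata anywhere in the history
}
```

Returning `undefined` when no message carries usage lets a caller fall through to the plain heuristic, which matches how the Session treats a missing estimate.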

### Compaction Helpers
2 changes: 1 addition & 1 deletion docs/think/index.md
@@ -865,7 +865,7 @@ export class MyAgent extends Think<Env> {

Use either layer alone, or both together: the proactive guard avoids most overflows, and the reactive backstop catches any that still slip through (for example, a turn that starts already over budget, or a single tool result so large that compaction cannot help — in which case it terminalizes cleanly). Both apply to every turn entry path (WebSocket, sub-agent `chat()`, and programmatic `saveMessages()` / `submitMessages()`), and both emit a `chat:context:compacted` [observability event](../observability.md#chat-context-events).

> A no-op compaction cannot rescue an over-budget turn, so recovery is only as effective as your compaction configuration. For tool-heavy histories, configure a `tokenCounter` on `compactAfter()` (see [Sessions](../sessions.md#auto-compaction)).
> A no-op compaction cannot rescue an over-budget turn, so recovery is only as effective as your compaction configuration. For tool-heavy histories, attach model-reported usage to assistant message metadata or configure a `tokenCounter` on `compactAfter()` (see [Sessions](../sessions.md#token-counting)).

For a runnable demo against a real Workers AI model, see [`examples/context-overflow-recovery`](../../examples/context-overflow-recovery).

2 changes: 2 additions & 0 deletions experimental/session-memory/README.md
@@ -209,6 +209,8 @@ Session.create(agent).onCompaction(compactFn).compactAfter(20000); // token thre

After every `appendMessage`, the session estimates tokens in the history. If it exceeds the threshold, `compact()` runs automatically. Failures are non-fatal — the message is always persisted.

Token accounting resolves in priority order: a `compactAfter(threshold, { tokenCounter })` counter (whole-prompt; returning model-reported usage is fine), then model-reported usage attached to assistant message metadata (`metadata.usage` / `metadata.totalUsage`, e.g. via the AI SDK's `messageMetadata`), then a Workers-safe heuristic. The same number flows to `createCompactFunction` so the tail budget is honored at the model's token scale even when the heuristic under-counts tool-heavy histories.

### Manual Compaction

```typescript
40 changes: 30 additions & 10 deletions packages/agents/src/experimental/memory/session/session.ts
@@ -21,7 +21,11 @@ import {
import { AgentSessionProvider, type SqlProvider } from "./providers/agent";
import { AgentContextProvider } from "./providers/agent-context";
import type { CompactResult } from "../utils/compaction-helpers";
import { estimateMessageTokens, estimateStringTokens } from "../utils/tokens";
import {
estimateContextTokensFromUsage,
estimateMessageTokens,
estimateStringTokens
} from "../utils/tokens";
import { MessageType } from "../../../types";

export type SessionContextOptions = Omit<ContextConfig, "label">;
@@ -435,14 +439,23 @@ export class Session {
);
}

private async _estimateTokenCount(): Promise<number> {
/**
* Authoritative whole-prompt size in model tokens, when one is available:
* the configured `compactAfter()` counter, otherwise model-reported usage
* from assistant message metadata (which already covers the system prompt;
* only newer messages need the heuristic). Undefined when only the
* heuristic is left. This single number drives both the fire decision and
* the boundary calibration (`CompactContext.contextTokens`), so the two
* decisions can never disagree.
*/
private async _knownContextTokens(): Promise<number | undefined> {
const messages = await this.getHistory();
const systemPrompt = await this.context.getSystemPromptForEstimate();

if (this._tokenCounter) {
if (!this.context.isLoaded()) {
await this.context.load();
}
const systemPrompt = await this.context.getSystemPromptForEstimate();
const contextBlocks = this.context.getBlocks();
const estimate = await this._tokenCounter({
messages,
@@ -452,6 +465,15 @@ export class Session {
return Number.isFinite(estimate) ? Math.max(0, Math.ceil(estimate)) : 0;
}

return estimateContextTokensFromUsage(messages)?.tokens;
}

private async _estimateTokenCount(): Promise<number> {
const known = await this._knownContextTokens();
if (known !== undefined) return known;

const messages = await this.getHistory();
const systemPrompt = await this.context.getSystemPromptForEstimate();
return estimateMessageTokens(messages) + estimateStringTokens(systemPrompt);
}

@@ -551,8 +573,8 @@ export class Session {
console.warn(
`[Session] Auto-compaction fired (~${tokenEstimate} tokens > ${this._tokenThreshold}) but the compaction function returned null, so history was not shortened. ` +
(this._tokenCounter
? `A tokenCounter is configured and now flows to the boundary logic, but it is invoked per-message there — a whole-prompt/usage counter (e.g. returning a fixed usage.inputTokens regardless of which messages are passed) degrades the tail budget to minTailMessages and can still no-op. Pass a per-message CompactOptions.tokenCounter for precise tail budgeting.`
: `If your history is tool-heavy, configure a tokenCounter on compactAfter() — it flows to createCompactFunction's boundary logic automatically.`)
? `The tokenCounter's total calibrates the boundary heuristic automatically, so a null result usually means the protected head/tail already cover the whole history — check protectHead, minTailMessages and tailTokenBudget against your history length, or that summarize() returned a non-empty string.`
: `If your history is tool-heavy, attach model-reported usage to assistant message metadata (metadata.usage / metadata.totalUsage, e.g. via the AI SDK's messageMetadata) or configure a tokenCounter on compactAfter() — both calibrate createCompactFunction's boundary logic automatically.`)
);
} else if (compacted) {
// Re-arm the one-time warning so a later regression is surfaced again.
@@ -629,12 +651,10 @@ export class Session {

let result: CompactResult | null;
try {
// Pass the Session's authoritative token counter so the compaction
// function's boundary logic can use the same accounting as the
// fire/no-fire decision (see CompactContext). The function still wins if
// it was given its own explicit counter.
// Hand the boundary logic the same authoritative total the fire
// decision used (see CompactContext), so the two can never disagree.
result = await this._compactionFn(await this.getHistory(), {
tokenCounter: this._tokenCounter
contextTokens: await this._knownContextTokens()
});
} catch (err) {
this._emitError(err instanceof Error ? err.message : String(err));
37 changes: 28 additions & 9 deletions packages/agents/src/experimental/memory/session/types.ts
@@ -43,24 +43,35 @@ export interface CompactAfterOptions {
/**
* Override the token estimate used by auto-compaction and status broadcasts.
*
* The default is a Workers-safe heuristic over message parts plus the
* Session-managed frozen system prompt. Callers that have model-reported
* usage or a tokenizer can provide a more precise counter here.
* Usually unnecessary: when assistant messages carry model-reported usage
* in their metadata (`metadata.usage` / `metadata.totalUsage`), the Session
* uses it automatically — last reported usage plus a heuristic for newer
* messages. Without usage metadata the default is a Workers-safe heuristic
* over message parts plus the Session-managed frozen system prompt.
*
* The counter is whole-prompt scoped by signature; ignoring the input and
* returning a model-reported total (e.g. `() => lastUsage.inputTokens`) is
* legal — the boundary logic auto-calibrates around it.
*/
tokenCounter?: SessionTokenCounter;
}

/**
* Context the Session passes to the registered compaction function. Lets the
* same authoritative token accounting drive BOTH the "should we compact?"
* (`compactAfter`) and "what should we compact?" (boundary) decisions, so a
* consumer that wires a `tokenCounter` once doesn't hit the failure mode where
* compaction fires every turn but silently no-ops because the boundary logic
* used a different (under-counting) estimate.
* (`compactAfter`) and "what should we compact?" (boundary) decisions, so the
* two never disagree — compaction can't fire every turn yet silently no-op
* because the boundary logic used a different (under-counting) estimate.
*/
export interface CompactContext {
/** The Session's token counter (from `compactAfter`/options), if configured. */
tokenCounter?: SessionTokenCounter;
/**
* Best-known size of the current context in model tokens — from the
* `compactAfter()` counter if configured, otherwise from usage metadata on
* assistant messages (last reported usage plus the heuristic for newer
* messages). Undefined when only the heuristic is available. The boundary
* walk uses it to calibrate the built-in heuristic to the model's scale.
*/
contextTokens?: number;
}

export type CompactionErrorHandler = (error: unknown) => void | Promise<void>;
@@ -75,6 +86,14 @@ export interface SessionMessage {
role: string;
parts: SessionMessagePart[];
createdAt?: Date;
/**
* Arbitrary message metadata (AI SDK `UIMessage.metadata` is structurally
* compatible). When an assistant message carries model-reported usage here
* (`metadata.usage` or `metadata.totalUsage`, e.g. from the AI SDK's
* `messageMetadata` callback), the Session uses it for token accounting —
* no `tokenCounter` configuration needed.
*/
metadata?: unknown;
}

/**