Add context guardrails for LLM conversation management #902

Open
JSv4 wants to merge 5 commits into main from claude/context-guardrails-compaction-z08pI

Conversation

@JSv4
Collaborator

@JSv4 JSv4 commented Feb 19, 2026

Summary

Implements a comprehensive context management system to prevent LLM context overflow on long-running conversations. Inspired by Claude Code's approach, this adds three complementary safeguards: fast heuristic token estimation, automatic conversation compaction via summarization, and hard limits on tool output sizes.

Key Changes

  • New module: opencontractserver/llms/context_guardrails.py

    • estimate_token_count(): Fast heuristic token counter (~3.5 chars/token) that avoids heavyweight tokenizer imports
    • get_context_window_for_model(): Model-aware context window lookup with exact and prefix matching
    • truncate_tool_output(): Hard ceiling on individual tool outputs before they enter conversation history
    • compact_message_history(): Main compaction algorithm that replaces old messages with concise summaries while preserving recent turns verbatim
    • should_compact(): Decision heuristic based on context window utilization threshold
    • _deterministic_summary(): Fallback non-LLM summary builder for fast/synchronous paths
    • messages_to_proxies(): Bridge between Django ORM ChatMessage objects and lightweight _MessageProxy for testing
    • CompactionConfig: Per-agent configuration dataclass for tuning compaction behavior
  • New module: opencontractserver/constants/context_guardrails.py

    • Centralized configuration for model context windows (OpenAI, Anthropic, Google models)
    • Compaction thresholds: COMPACTION_THRESHOLD_RATIO (0.75), MIN_RECENT_MESSAGES (4), MAX_RECENT_MESSAGES (20)
    • Tool output guardrails: MAX_TOOL_OUTPUT_CHARS (50K), truncation notice template
    • Summary budget: COMPACTION_SUMMARY_TARGET_TOKENS (300)
    • Token estimation: CHARS_PER_TOKEN_ESTIMATE (3.5)
  • Integration into Pydantic-AI agent adapter (opencontractserver/llms/agents/pydantic_ai_agents.py)

    • _get_message_history() now applies compaction when enabled
    • Injects compaction summary as system message so LLM understands context from compacted portion
    • Logs compaction events with token savings metrics
  • Tool output truncation (opencontractserver/llms/tools/pydantic_ai_tools.py)

    • Tool wrapper now applies truncate_tool_output() to all return values before insertion into history
  • Comprehensive test suite (opencontractserver/tests/test_context_guardrails.py)

    • 433 lines of pure-unit tests (no Django DB required) covering all public APIs
    • Tests for token estimation, model lookup, truncation, compaction logic, summary generation, and configuration
  • Updated exports (opencontractserver/constants/__init__.py)

    • Exports all context guardrails constants for public use

Notable Implementation Details

  • Conservative token estimation: Intentionally over-counts tokens (~10% margin) so compaction triggers earlier rather than risking hard overflow
  • Longest-prefix matching: Model lookup prioritizes specificity (e.g., "gpt-4o-mini" over "gpt-4o")
  • Deterministic fallback: Non-LLM summary builder extracts key lines (first/last human messages, first sentences of LLM responses) for fast synchronous compaction
  • Lightweight proxies: _MessageProxy decouples compaction logic from Django ORM, enabling fast pure-unit testing
  • Configurable per-agent: CompactionConfig allows operators to tune thresholds without code changes
  • Graceful degradation: If compaction would remove all but one message, keeps only the last message to guarantee some context
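
The pieces above compose roughly as follows. This is a standalone illustration of the approach, not the PR's actual code — the real functions live in opencontractserver/llms/context_guardrails.py and their exact signatures may differ; the constants are the ones listed under Key Changes, and the truncation-notice wording is invented for the sketch.

# Standalone sketch of the three guardrails (illustrative, not the PR's code).
CHARS_PER_TOKEN_ESTIMATE = 3.5
COMPACTION_THRESHOLD_RATIO = 0.75
MAX_TOOL_OUTPUT_CHARS = 50_000
TRUNCATION_NOTICE = "\n\n[Tool output truncated at {limit} characters; use range parameters to page through the rest.]"

def estimate_tokens(text: str) -> int:
    # Heuristic counter (~3.5 chars/token), rounded up so it over-counts slightly.
    return int(len(text) / CHARS_PER_TOKEN_ESTIMATE) + 1

def needs_compaction(history_tokens: int, context_window: int) -> bool:
    # Compact once the history crosses 75% of the model's context window.
    return history_tokens > context_window * COMPACTION_THRESHOLD_RATIO

def truncate_output(output: str, max_chars: int = MAX_TOOL_OUTPUT_CHARS) -> str:
    # Hard ceiling applied to each tool result before it enters the history.
    if len(output) <= max_chars:
        return output
    notice = TRUNCATION_NOTICE.format(limit=max_chars)
    return output[: max(0, max_chars - len(notice))] + notice

history = ["user: summarise the indemnification clause"] * 2_000
if needs_compaction(sum(estimate_tokens(m) for m in history), context_window=128_000):
    print("compaction would trigger for this history")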

https://claude.ai/code/session_01QuMdF6v5xWrMrrMdRHvVv2

… overflow

Implement three complementary safeguards against LLM context window
overflow in long-running conversations and large document analysis:

1. Token estimation - fast heuristic counter (~3.5 chars/token) that
   intentionally over-estimates to trigger compaction conservatively.

2. Conversation compaction - when message history exceeds 75% of the
   model's context window, older messages are replaced by a concise
   summary while preserving recent turns (4-20 messages) verbatim.
   Supports pluggable summary functions for LLM-based summarization.

3. Tool output truncation - string outputs from tools are automatically
   capped at 50K characters at the PydanticAI wrapper level, with a
   notice directing the LLM to use range parameters.

Includes model context window registry for 20+ models (OpenAI,
Anthropic, Google), per-agent CompactionConfig on AgentConfig, and
30+ unit tests using SimpleTestCase for fast parallel execution.

Closes #898

https://claude.ai/code/session_01QuMdF6v5xWrMrrMdRHvVv2
@claude

claude bot commented Feb 19, 2026

PR Review: Context Guardrails for LLM Conversation Management

Overall this is a well-structured, well-documented addition. The module separation is clean, the test suite is solid, and the architectural choices (lightweight proxies for testability, deterministic fallback, per-agent config) are all sensible. A few real bugs and a handful of design notes below.


Bugs

1. CompactionConfig.max_tool_output_chars is never used

CompactionConfig exposes a max_tool_output_chars field and the tests verify it can be set, but pydantic_ai_tools.py always calls truncate_tool_output(result) without a max_chars argument — it uses the module-level MAX_TOOL_OUTPUT_CHARS default regardless of what compaction.max_tool_output_chars is set to. The per-agent override silently does nothing.

Fix: thread self.config.compaction.max_tool_output_chars through to the tool wrapper, or at minimum remove the field from CompactionConfig until it is wired up, to avoid a false promise.

# pydantic_ai_tools.py — the wrapper does not receive compaction_cfg at all
result = truncate_tool_output(result)   # always uses module default

2. recent_count expansion loop ignores summary tokens in the post-compaction budget

The loop that decides how many recent messages to keep only tests whether system_prompt_tokens + recent_tokens > threshold. It does not include the eventual summary tokens. This means the selected recent_count can produce a post-compaction total (system_prompt + summary + recent) that still exceeds the threshold. The loop should account for the summary budget:

summary_budget = COMPACTION_SUMMARY_TARGET_TOKENS  # or a safe over-estimate
for candidate in range(min_recent, min(max_recent, len(messages)) + 1):
    recent_tokens = sum(m.token_estimate for m in messages[-candidate:])
    if system_prompt_tokens + summary_budget + recent_tokens > threshold:
        break
    recent_count = candidate

Design issues

3. _build_summary_prompt is dead code

The function is defined but never called anywhere — the only path to LLM-based summarization is through an externally provided summary_fn. Since _get_message_history never supplies one, the agent always falls back to _deterministic_summary. Either wire it up or remove it (CLAUDE.md: "no dead code").

4. __all__ is incomplete

CompactionConfig and messages_to_proxies are imported externally (core_agents.py, pydantic_ai_agents.py) but are absent from __all__. They are still importable, but __all__ should accurately reflect the module public API:

__all__ = [
    "estimate_token_count",
    "get_context_window_for_model",
    "truncate_tool_output",
    "CompactionResult",
    "compact_message_history",
    "should_compact",
    "CompactionConfig",        # missing
    "messages_to_proxies",     # missing
]

5. Misleading comment in constants/context_guardrails.py

The docstring says:

prefix matching (e.g. "gpt-4o-mini" matches the "gpt-4o" entry)

This is wrong — "gpt-4o-mini" is an exact key in the dict so it never falls through to prefix matching. The example should use a version-suffixed name like "gpt-4o-mini-2024-07-18".


Minor issues

6. _MessageProxy.__post_init__ treats token_estimate=0 as unset

if not self.token_estimate:
    self.token_estimate = estimate_token_count(self.content)

A caller passing token_estimate=0 explicitly will have it silently overwritten. Use if self.token_estimate == 0 or make the field Optional[int] = None for clarity.

7. No integration test for the agent adapter path

The test suite thoroughly covers the pure functions in context_guardrails.py, but there are no tests verifying that _get_message_history() actually triggers compaction, that the summary is injected as a SystemPromptPart, or that raw_messages is correctly trimmed. A test mocking get_conversation_messages() with a large history and asserting on the returned ModelMessage list would close this gap.
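
A rough shape such a test might take — the class name, fixture helpers, and patch target below are illustrative, not the project's actual API:

from unittest import IsolatedAsyncioTestCase
from unittest.mock import AsyncMock, patch

class TestMessageHistoryCompaction(IsolatedAsyncioTestCase):
    async def test_large_history_triggers_compaction(self):
        agent = build_test_agent(model="gpt-4o-mini")                        # hypothetical fixture
        big_history = [make_chat_message("x" * 4_000) for _ in range(200)]   # hypothetical helper

        with patch.object(
            agent.conversation_manager,
            "get_conversation_messages",
            new=AsyncMock(return_value=big_history),
        ):
            history = await agent._get_message_history()

        # Expect the summary injected up front and only the recent tail preserved verbatim.
        self.assertLess(len(history), len(big_history))
        self.assertIn("summary", str(history[0]).lower())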


Nits

  • summary_fn: object = None should be summary_fn: Optional[Callable[[list[_MessageProxy]], str]] = None for accurate typing.
  • test_output_exceeding_limit_is_truncated asserts len(result) <= MAX_TOOL_OUTPUT_CHARS + 200, but the implementation guarantees exactly MAX_TOOL_OUTPUT_CHARS characters. The +200 slack is not needed and may hide a future regression; assertEqual(len(result), MAX_TOOL_OUTPUT_CHARS) would be more precise.
  • test_longest_prefix_wins verifies that a prefix match returns the same value as "gpt-4o" (both happen to be 128K), so equal values mean the test does not actually validate which prefix won. Consider using a test model pair where the two candidates have different window sizes.

Instead of recomputing compaction from scratch on every
_get_message_history() call, the system now persists the result:

- Added compaction_summary (TextField) and
  compacted_before_message_id (BigIntegerField) to Conversation model
- get_conversation_messages() filters with id__gt when a bookmark
  exists, so old messages are never loaded from the DB
- _get_message_history() reads the stored summary and only runs a
  fresh compaction pass when the remaining messages still exceed the
  context window threshold
- persist_compaction() on CoreConversationManager writes the bookmark
  atomically via asave(update_fields=...)
- Summaries merge incrementally: each compaction appends to the
  existing summary rather than replacing it

This turns a per-call O(n) token estimation + summary generation into
a one-time write followed by cheap filtered reads.
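
In outline, the persisted-bookmark pattern described above looks something like this — field names follow the changelog, but the related name and the surrounding model definitions are simplified sketches, not the project's real models:

from django.db import models

class Conversation(models.Model):
    # New columns added by this PR (other fields omitted; sketch only).
    compaction_summary = models.TextField(default="", blank=True)
    compacted_before_message_id = models.BigIntegerField(null=True, blank=True)

def visible_messages(conversation):
    # Cheap read path: skip everything at or before the compaction bookmark.
    qs = conversation.chat_messages.order_by("id")   # "chat_messages" related name is illustrative
    if conversation.compacted_before_message_id:
        qs = qs.filter(id__gt=conversation.compacted_before_message_id)
    return qs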

https://claude.ai/code/session_01QuMdF6v5xWrMrrMdRHvVv2
@claude

claude bot commented Feb 19, 2026

Code Review — Context Guardrails for LLM Conversation Management

Overall this is well-structured work: clear constants separation, sensible defaults, lightweight proxy pattern for testability, and the compaction algorithm is easy to follow. The issues below range from a genuine bug to a few code quality nits.


Bug: raw_messages trimmed even when persist_compaction fails

In pydantic_ai_agents.py, the in-memory list is trimmed outside the try/except block:

try:
    await self.conversation_manager.persist_compaction(...)
    stored_summary = merged_summary   # ← only set on success
except Exception:
    logger.exception("Failed to persist compaction bookmark")

# This runs regardless of whether persist succeeded ↓
raw_messages = raw_messages[-result.preserved_count:]

If the DB write fails, stored_summary still holds the old value (no merged content covering the just-removed messages) but the recent-messages slice has already been applied. The LLM receives a trimmed history with a stale summary — effectively losing context for that single call.

The fix is straightforward: only trim when persist succeeds, or move the trim into the try block after stored_summary = merged_summary.
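
Applied to the snippet above, the fix might look like this (argument list elided as in the original; a sketch, not the final implementation):

try:
    await self.conversation_manager.persist_compaction(...)
    stored_summary = merged_summary
    # Trim only after the bookmark is safely persisted, so a failed write
    # leaves the full history available for this call.
    raw_messages = raw_messages[-result.preserved_count:]
except Exception:
    logger.exception("Failed to persist compaction bookmark")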


Bug: CompactionConfig.max_tool_output_chars is never honoured

CompactionConfig exposes a per-agent max_tool_output_chars override, but pydantic_ai_tools.py always calls truncate_tool_output(result) without passing max_chars:

# pydantic_ai_tools.py
result = truncate_tool_output(result)   # always uses global default

The per-agent override is silently ignored. Either pass it through:

result = truncate_tool_output(result, max_chars=self.config.compaction.max_tool_output_chars)

…or remove the field from CompactionConfig to avoid the false API promise.


Race condition in compaction persistence

persist_compaction uses a plain asave(update_fields=[...]) with no locking. Two concurrent requests that both see an un-compacted (or lightly-compacted) conversation could both run the compaction algorithm and both attempt to write. The second write would silently overwrite the first with a stale cutoff_message_id.

Consider wrapping the read-then-write in select_for_update(), or add an optimistic-lock check:

# Only update if the bookmark hasn't moved since we read it
Conversation.objects.filter(
    pk=self.conversation.pk,
    compacted_before_message_id=old_cutoff,
).aupdate(...)

The existing design degrades gracefully (no data loss, just redundant work), but the changelog comment "writes the bookmark atomically" is inaccurate.


_build_summary_prompt is dead code

The function is defined at module level but never called anywhere in the module — compact_message_history either uses the caller-supplied summary_fn or falls through to _deterministic_summary. If it's intended as a utility for callers building async LLM-based summary_fn implementations, add a docstring note and export it; otherwise remove it.


CompactionConfig missing from __all__

__all__ = [
    "estimate_token_count",
    "get_context_window_for_model",
    "truncate_tool_output",
    "CompactionResult",
    "compact_message_history",
    "should_compact",
]

CompactionConfig is publicly imported by core_agents.py but absent from __all__. Add it (and _MessageProxy is already used in tests, so worth considering too).


Minor: misleading comment on AgentConfig.compaction field

# ``None`` uses sensible defaults from CompactionConfig.
compaction: CompactionConfig = field(default_factory=CompactionConfig)

The comment says "None uses…" but the default is a CompactionConfig() instance, never None. The annotation CompactionConfig (not Optional[CompactionConfig]) confirms this. Update the comment to reflect reality.


Performance nit: O(n²) token summation in the recent-message sizing loop

for candidate in range(min_recent, min(max_recent, len(messages)) + 1):
    recent_tokens = sum(m.token_estimate for m in messages[-candidate:])

Each iteration re-sums from scratch. For the default max_recent = 20 this is negligible, but a simple running-total going backwards would be cleaner and O(n):

cumulative = 0
for candidate, msg in enumerate(reversed(messages[-max_recent:]), start=1):
    cumulative += msg.token_estimate
    if system_prompt_tokens + cumulative > threshold:
        break
    recent_count = candidate

Test coverage gaps

  1. Persist failure path — no test verifying that context is preserved when persist_compaction raises.
  2. Concurrent compaction — no test for double-write behaviour.
  3. max_tool_output_chars override — no test exercising the per-agent override end-to-end (which would catch the bug noted above).
  4. _build_summary_prompt — no tests if it's intended to be part of the public API.

Summary

Severity  Issue
--------  -----
Bug       raw_messages trimmed even when persist fails → context gap
Bug       CompactionConfig.max_tool_output_chars ignored in tool wrapper
Bug       Changelog says persist is atomic; it is not
Cleanup   _build_summary_prompt is dead code
Cleanup   CompactionConfig absent from __all__
Nit       Misleading AgentConfig.compaction comment
Nit       O(n²) summation loop (trivial at current bounds)

The core design is solid. Fixing the two data-correctness bugs before merge would be my main ask.

…_tool_output_chars, race condition

- Move raw_messages trimming inside try block so that when
  persist_compaction fails the full message list is preserved for the
  current LLM call (previously context was silently lost)

- Wire CompactionConfig.max_tool_output_chars through
  PydanticAIDependencies so per-agent tool output limits are actually
  honoured by the tool wrapper truncation calls

- Replace plain asave() in persist_compaction with optimistic-lock
  pattern (filter on old cutoff before updating) to prevent concurrent
  requests from silently overwriting each other's bookmarks

- Remove dead _build_summary_prompt function (never called anywhere)

- Add CompactionConfig to __all__ exports, fix misleading comment on
  AgentConfig.compaction field, and correct changelog atomicity claim

- Replace O(n²) recent-message token summation loop with O(n) running
  total (verified equivalent across 1000 randomized trials)

- Add tests for persist failure path, max_tool_output_chars override,
  optimistic locking, and running-total loop invariants

https://claude.ai/code/session_01QuMdF6v5xWrMrrMdRHvVv2
@claude

claude bot commented Feb 19, 2026

PR Review: Context Guardrails for LLM Conversation Management

This is a well-structured addition — the problem is real, the approach (compact-once, bookmark-persisted) is sound, and the test coverage is solid. A few bugs and design issues worth addressing before merge.


Bugs

1. Dead-code edge case with misleading comment (context_guardrails.py, compact_message_history)

The comment above the if recent_count < 1 block says "if even min_recent messages exceed the threshold, keep only the very last message", but the code initialises recent_count = min(min_recent, len(messages)) — always ≥ 1 for any normal min_recent. The guard if recent_count < 1: recent_count = 1 never fires under standard configurations (it only triggers when min_recent == 0). The documented fallback behaviour (collapse to 1 message when even the minimum block overflows) never actually executes. Either implement the stated intent or correct the comment to match the actual behaviour.

2. Silent misbehaviour when min_recent > max_recent

If a caller passes min_recent=10, max_recent=5, the loop runs up to candidate 5 but if candidate < min_recent: continue skips every iteration, so recent_count never gets updated from its initial value (min_recent). max_recent is silently ignored. This should raise ValueError at the top of compact_message_history and in CompactionConfig.__post_init__.
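
A minimal guard, assuming the CompactionConfig field names mentioned elsewhere in this review thread (min_recent_messages, max_recent_messages, threshold_ratio):

def __post_init__(self) -> None:
    # Reject configurations that would silently disable the expansion loop.
    if self.min_recent_messages > self.max_recent_messages:
        raise ValueError(
            f"min_recent_messages ({self.min_recent_messages}) must not exceed "
            f"max_recent_messages ({self.max_recent_messages})"
        )
    if not 0 < self.threshold_ratio < 1:
        raise ValueError("threshold_ratio must be strictly between 0 and 1")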

3. _MessageProxy.__post_init__ falsy check

if not self.token_estimate:   # True when token_estimate=0 is passed explicitly
    self.token_estimate = estimate_token_count(self.content)

Explicitly passing token_estimate=0 for non-empty content would re-estimate rather than respect the caller's value. The condition should be if self.token_estimate == 0: (same runtime behaviour for defaults, but semantically correct) or use None as the sentinel default.


Code Quality

4. should_compact exported but unused in the integration path

compact_message_history repeats the same threshold check internally, so callers only need to invoke the one function. should_compact is exercised only in tests and listed in __all__. Two implementations of the same threshold logic create a divergence risk. Consider either:

  • Making should_compact the single source of truth, called inside compact_message_history, or
  • Removing it from __all__ and treating it as a private helper until there is a concrete external consumer.

5. Unbounded growth of stored_summary across multiple compaction cycles

Each compaction appends result.summary (which already contains COMPACTION_SUMMARY_PREFIX) to stored_summary. After N compaction cycles the stored summary is roughly N × ~300 tokens and the prefix string appears N times. At 75% utilisation on 128K-token models this won't matter for most conversations, but for very long-lived sessions the accumulation is measurable. A cap — re-summarising the stored summary when it exceeds a configurable budget — would bound the growth.

6. Deprecated asyncio.get_event_loop().run_until_complete() in tests

TestPersistCompactionOptimisticLock calls asyncio.get_event_loop().run_until_complete(...) twice. In Python 3.10+ this emits DeprecationWarning when there is no running loop. Prefer asyncio.run(...) or Django's async_to_sync helper.


Minor Issues

7. Constants-module docstring example is backwards

The comment says "gpt-4o-mini" matches the "gpt-4o" entry — but gpt-4o-mini is itself a key (exact match). The example should be something like "gpt-4o-mini-2024-07-18" matches the "gpt-4o-mini" entry.

8. Missing model entry — gpt-4-32k

gpt-4-32k (32 768 tokens) would prefix-match gpt-4 (8 192 tokens), triggering compaction roughly 4× too early. Since gpt-4-32k is deprecated the practical impact is low, but the pattern is a latent trap for any future model whose name starts with a shorter registered prefix that has a much smaller context window.

9. truncate_tool_output when max_chars < len(notice)

If a caller passes a very small max_chars (smaller than the truncation notice itself), output[:max_chars - len(notice)] slices with a negative stop index — which drops characters from the end rather than capping the length, and yields an empty string for short outputs — so the result can be just the notice, which is already longer than max_chars. A max(max_chars, len(notice) + 1) clamp or an early ValueError would prevent surprising output. Unlikely with defaults (50K) but reachable via CompactionConfig.


Positive Highlights

  • The optimistic-locking pattern in persist_compaction (conditional aupdate against the previously-read cutoff) is exactly right for concurrent agents.
  • The DB-level cutoff filter (id__gt=cutoff) means long conversations incur zero read overhead after the first compaction — that is the critical performance win.
  • The try/except fallback in _get_message_history that keeps the full message list when persistence fails correctly prioritises safety over efficiency.
  • SimpleTestCase for the pure-unit guardrail tests is the right call — fast and trivially parallelism-friendly with -n auto.
  • Keeping _MessageProxy decoupled from Django ORM makes the core algorithm testable without TransactionTestCase overhead.

The test used MagicMock(spec=AgentConfig) without setting cfg.compaction
or stubbing get_conversation_messages. When _get_message_history tries
arithmetic with MagicMock values (from cfg.compaction.*), the results are
unpredictable and the code never reaches the LLM call that the test is
actually exercising.

Fix: provide a real CompactionConfig and stub get_conversation_messages
with an empty list, matching the pattern from the passing
test_pydantic_ai_error_handling.
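
In code, the corrected setup amounts to something like the following — the agent construction is elided, the prompt value is arbitrary, and only the two changed pieces are shown:

from unittest.mock import AsyncMock, MagicMock

cfg = MagicMock(spec=AgentConfig)
cfg.compaction = CompactionConfig()   # real dataclass, so the threshold arithmetic behaves
cfg.system_prompt = "test prompt"

# Stub the history source so the test exercises the LLM call path, not compaction.
agent.conversation_manager.get_conversation_messages = AsyncMock(return_value=[])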

https://claude.ai/code/session_01QuMdF6v5xWrMrrMdRHvVv2
@claude

claude bot commented Feb 19, 2026

PR Review: Context Guardrails for LLM Conversation Management

This is a well-structured PR that solves a real problem (context overflow) with a thoughtful design. The compact-once/read-cheaply approach is sound, and the test coverage is solid. Here are my findings, ordered by severity.


Bugs

1. truncate_tool_output produces incorrect output for small max_chars values

notice = TOOL_OUTPUT_TRUNCATION_NOTICE.format(limit=max_chars)
truncated = output[: max_chars - len(notice)] + notice

TOOL_OUTPUT_TRUNCATION_NOTICE is ~100 characters. If max_chars < len(notice), then max_chars - len(notice) is negative, and a negative stop index makes the slice drop that many characters from the end of the string instead of capping it at max_chars. The result is wrong in both directions: the output is no longer capped at max_chars (only its tail is dropped, or everything for short outputs), and the final string — at minimum the bare notice — exceeds max_chars.

Fix:

char_budget = max(0, max_chars - len(notice))
truncated = output[:char_budget] + notice

The test test_custom_max_chars uses max_chars=100 (which barely avoids triggering this since the notice is ~90 chars after formatting), and test_truncation_notice_contains_limit uses max_chars=50 — which does trigger the bug. Both tests only assert that the result contains 'truncated' and is shorter than the original, so they pass despite the incorrect behavior.


2. messages_to_proxies is missing from __all__

context_guardrails.py exports via:

__all__ = [
    "estimate_token_count",
    "get_context_window_for_model",
    "truncate_tool_output",
    "CompactionConfig",
    "CompactionResult",
    "compact_message_history",
    "should_compact",
]

But messages_to_proxies is used directly in pydantic_ai_agents.py and in the test file. It should either be added to __all__ or noted explicitly as internal-only (in which case the tests importing it directly are fine, but it should not be imported via explicit name in production code without acknowledgement of the omission).


Design Concerns

3. Double-counting stored_summary tokens in system_prompt_tokens

In _get_message_history:

system_prompt_tokens = estimate_token_count(
    (self.config.system_prompt or "") + stored_summary
)

The stored_summary is counted in system_prompt_tokens, then also injected into the message history as a ModelRequest. When compact_message_history computes total_after:

total_after = system_prompt_tokens + summary_tokens + recent_tokens

...it adds the new summary_tokens on top. After persistence, the next call will count the merged summary (old + new) in system_prompt_tokens and still include new summary tokens. Over many compaction cycles this causes token accounting drift — the system may compact earlier than intended. Consider passing stored_summary as a separate parameter to compact_message_history so the accounting doesn't double-count.
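
One possible shape for that change — the extra parameter and keyword names are illustrative, not the existing signature:

result = compact_message_history(
    proxies,
    system_prompt_tokens=estimate_token_count(self.config.system_prompt or ""),
    stored_summary_tokens=estimate_token_count(stored_summary),  # counted exactly once
)
# Inside compact_message_history, the post-compaction budget would then be
# system_prompt_tokens + stored_summary_tokens + summary_tokens + recent_tokens.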

4. No ORM-level integrity check for compacted_before_message_id

Using BigIntegerField instead of a ForeignKey is a deliberate tradeoff (avoids CASCADE complications), but deleting a ChatMessage whose ID equals the cutoff would silently make the filter inconsistent. If soft-delete is not enforced for ChatMessage, it's worth adding a note in the field's help_text or a comment in get_conversation_messages explaining why __gt (not __gte) is safe.

5. COMPACTION_SUMMARY_PREFIX repeats on every compaction pass

After the second compaction, the persisted summary looks like:

[Conversation summary — earlier messages were compacted to save context]
...first summary...

[Conversation summary — earlier messages were compacted to save context]
...second summary...

The duplicate prefix is harmless but slightly noisy. Consider stripping it from subsequent merges or using a single header:

merged_summary = stored_summary.rstrip() + "\n\n--- (continued) ---\n" + summary_text

Code Quality

6. Unreachable edge case

# Edge case: if even min_recent messages exceed the threshold,
# keep only the very last message to guarantee *some* context.
if recent_count < 1:
    recent_count = 1

recent_count is initialized to min(min_recent, len(messages)). With min_recent >= 1 (enforced by the constants and config validation) and len(messages) >= 1 (the early-return guard above handles empty messages), this branch is never reachable. If you want to keep it as a safety net, a comment explaining it's defensive-only would clarify intent.

7. asyncio.get_event_loop().run_until_complete() is deprecated (Python 3.10+)

TestPersistCompactionOptimisticLock uses:

asyncio.get_event_loop().run_until_complete(manager.persist_compaction(...))

In Python 3.10+ this emits a DeprecationWarning when called outside an async context. The fix is straightforward:

asyncio.run(manager.persist_compaction(...))

Or, better, make the test method async def and use Django's async test runner / IsolatedAsyncioTestCase.

8. _deterministic_summary first-sentence heuristic is fragile

sentence = m.content.split(".")[0].strip()

LLM responses frequently use markdown, bullet points, or \n as a primary separator. A response like "- The document covers: \n1. Indemnification\n2. Term" yields "- The document covers: \n1" as the first "sentence". The 150-char truncation mitigates the worst cases, but splitting on re.split(r'[.!?\n]', ...)[0] would be meaningfully more robust.


Testing

9. test_truncation_notice_contains_limit exercises the buggy path

truncate_tool_output("y" * 200, max_chars=50) — with the notice being ~90 chars, max_chars - len(notice) is negative. The test asserts the result contains "50" and is shorter than 200 chars, both of which pass even with incorrect behavior. A stronger assertion would be:

self.assertLessEqual(len(result), 50 + len(notice) + some_slack)
self.assertTrue(result.startswith("y"))  # content from beginning, not end

10. No test for get_conversation_messages with a real DB cutoff

The optimistic-lock test is good, but there's no test verifying that get_conversation_messages actually filters by id__gt=compacted_before_message_id. Since this is the DB-level primitive that prevents stale context from being served, a TestCase (with DB) covering this path would add real confidence. I understand the tradeoff for SimpleTestCase speed — this would just be one extra TestCase class.
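
A sketch of the missing DB-backed test — fixture helpers are hypothetical, and the primitive name follows the review text (await it if the manager method is async):

from django.test import TestCase

class TestCompactionCutoffFilter(TestCase):
    def test_messages_before_bookmark_are_excluded(self):
        convo = make_conversation()                                   # hypothetical fixture
        older = [add_message(convo, f"old {i}") for i in range(5)]    # hypothetical helper
        newer = [add_message(convo, f"new {i}") for i in range(3)]

        convo.compacted_before_message_id = older[-1].id
        convo.save(update_fields=["compacted_before_message_id"])

        visible = get_conversation_messages(convo)                    # the primitive under test
        self.assertEqual([m.id for m in visible], [m.id for m in newer])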


Minor

  • The recent_count expansion loop is a bit hard to follow at first read. An inline comment like # Expand window right-to-left (newest→oldest) until threshold is hit would help future readers.
  • CHARS_PER_TOKEN_ESTIMATE = 3.5 but the docstring says "English text averages ~4 chars/token". The comment is inconsistent with the constant — if 3.5 is intentionally conservative (over-counting tokens), clarify that it diverges from the 4 char/token average by design.

Overall this is a solid addition and the architecture (compact-once, persist bookmark, cheap reads) is well thought out. The bugs above are all fixable with small changes; the truncation one in particular should be addressed before merge.

@JSv4 JSv4 linked an issue Feb 19, 2026 that may be closed by this pull request
@codecov

codecov bot commented Feb 19, 2026

Signed-off-by: JSIV <5049984+JSv4@users.noreply.github.com>
@claude

claude bot commented Feb 19, 2026

Code Review: Context Guardrails & Conversation Compaction (#902)

Overall this is a well-structured addition. The three-layer approach (token estimation → compaction → tool output truncation), the dataclass-based config, optimistic locking for concurrent writes, and the proxy pattern for pure-unit testing are all solid design choices. The test suite is thorough. A few issues worth addressing before merge:


Bugs / Correctness

1. truncate_tool_output can exceed max_chars for very small limits

# context_guardrails.py
truncated = output[: max_chars - len(notice)] + notice

If max_chars < len(notice) (e.g. max_chars=10), max_chars - len(notice) is negative, so the slice drops characters from the end instead of capping the length (returning "" for outputs shorter than the deficit), and the final string — at minimum the bare notice — is longer than max_chars. The production default (50K) makes this a non-issue, but the function contract says "at most max_chars characters" and the test suite only exercises the default. A simple guard like max(0, max_chars - len(notice)) or an early return when max_chars <= len(notice) would make the contract airtight.

2. Misleading comment on the recent_count < 1 guard

# compact_message_history
recent_count = min(min_recent, len(messages))  # ← starts at min_recent
...
# Edge case: if even min_recent messages exceed the threshold,
# keep only the very last message to guarantee *some* context.
if recent_count < 1:
    recent_count = 1

When len(messages) >= min_recent, recent_count is initialised to min_recent (≥ 1 by default) and can only increase in the loop, so this guard is unreachable in normal usage. The check is actually the safety net for min_recent=0 (caller-supplied) combined with a threshold exceeded at one message. The comment should describe that case accurately; as written, it implies the algorithm can reduce the count below min_recent, which it doesn't.


Design / Architecture

3. Summary grows unboundedly across compaction cycles
Each new compaction merges the full previous summary with the new one:

merged_summary = stored_summary.rstrip() + "\n\n" + result.summary

After N compaction rounds, compaction_summary approaches N × COMPACTION_SUMMARY_TARGET_TOKENS. There is currently no cap. For very long-lived conversations this column will keep growing and eventually become a meaningful fraction of the context window itself. Consider capping the stored summary to COMPACTION_SUMMARY_TARGET_TOKENS (or a small multiple) by truncating or re-summarising before merging.
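
For instance (constants come from the PR; the keep-the-tail clipping and the budget multiplier are just one option — re-summarising via summary_fn is another):

MAX_STORED_SUMMARY_TOKENS = 4 * COMPACTION_SUMMARY_TARGET_TOKENS   # hypothetical budget

merged_summary = stored_summary.rstrip() + "\n\n" + result.summary
if estimate_token_count(merged_summary) > MAX_STORED_SUMMARY_TOKENS:
    max_chars = int(MAX_STORED_SUMMARY_TOKENS * CHARS_PER_TOKEN_ESTIMATE)
    merged_summary = merged_summary[-max_chars:]   # keep the most recent portion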

4. Late import inside _get_message_history

async def _get_message_history(self) -> ...:
    from opencontractserver.llms.context_guardrails import (
        compact_message_history, estimate_token_count, messages_to_proxies,
    )

This import runs on every call. If the motivation is to break a circular import, that's fine but worth a brief comment. If there's no circular import concern, moving it to the module level avoids the repeated overhead and makes the dependency graph explicit.

5. Uncovered PydanticAIDependencies construction sites
The PR explicitly wires max_tool_output_chars=config.compaction.max_tool_output_chars in two places. Other construction sites outside the diff rely on the default constant, meaning per-agent override never reaches them. This is probably intentional (the default is sane), but worth a note in the PR or a TODO comment so future callers know to propagate the config.


Code Quality

6. _deterministic_summary sentence heuristic is fragile

sentence = m.content.split(".")[0].strip()

Splits on the first literal period, so "Dr. Smith said…" → "Dr", "1.5 million tokens" → "1", and any URL truncates at the first dot. Even a simple regex like re.split(r'(?<=[.!?])\s', content)[0] would behave better. Given the summary is only a fallback that operators can replace with an LLM-based summary_fn, this is low-risk, but it can produce confusing one-to-two character "sentences".

7. Deprecated async pattern in test

asyncio.get_event_loop().run_until_complete(manager.persist_compaction(...))

asyncio.get_event_loop() is deprecated in Python 3.10+ and emits a DeprecationWarning in 3.12. Prefer asyncio.run() or `from asgiref.sync import async_to_sync; async_to_sync(manager.persist_compaction)(...)`, which is the Django-idiomatic form.

8. No validation in CompactionConfig
CompactionConfig doesn't enforce min_recent_messages <= max_recent_messages or that threshold_ratio ∈ (0, 1). A misconfigured agent (e.g. min_recent=20, max_recent=4) would silently misbehave. Adding a __post_init__ validator or at least a runtime assertion in compact_message_history would catch operator mistakes early.


Minor / Observations

  • The PR description says "433 lines of pure-unit tests" but the diff adds 648 lines to the test file — worth correcting for changelog accuracy.
  • _MessageProxy and _deterministic_summary are underscore-prefixed (semi-private) but imported directly in the test. Either add them to __all__ with a note that they are testable internals, or document in the test why private internals are being tested directly.
  • MODEL_CONTEXT_WINDOWS has no __all__ guard but is wildcard-imported in constants/__init__.py. This is consistent with the existing pattern in the codebase, just confirming it's intentional.

What's done well

  • The optimistic locking in persist_compaction via aupdate with a conditional filter is correct and race-safe.
  • The try/except around persistence in _get_message_history correctly defers trimming the in-memory list until after successful persistence — no context is silently lost on a DB write failure.
  • Using SimpleTestCase (no DB) for all compaction unit tests is the right call; these tests will be extremely fast in CI.
  • The _MessageProxy decoupling cleanly separates ORM concerns from the compaction algorithm, enabling the pure-unit test approach.
  • The migration correctly uses null=True, blank=True for the cutoff ID field (nullable foreign-key-like column) and a sensible default="" for the text summary.


Development

Successfully merging this pull request may close these issues.

Implement conversation compaction to prevent context overflow
