Skip to content

feat: deduplicate model event inputs and call messages into shared pools#20

Draft
rasmusfaber wants to merge 482 commits intomainfrom
faber/message-pool-dedup
Draft

feat: deduplicate model event inputs and call messages into shared pools#20
rasmusfaber wants to merge 482 commits intomainfrom
faber/message-pool-dedup

Conversation

@rasmusfaber
Copy link
Copy Markdown

This PR contains:

  • New features
  • Changes to dev-tools e.g. CI config / github tooling
  • Docs
  • Bug fixes
  • Code refactor

What is the current behavior? (You can also link to an open issue here)

Every ModelEvent.input stores the full list of ChatMessage objects, even though most messages repeat across turns in an agentic loop. Similarly, ModelCall.request stores the full provider wire-format messages. For long agent runs, this causes significant log file bloat — the same system prompt and conversation history is stored N times for N model calls.

What is the new behavior?

Message pool deduplication: Repeated messages are stored once in EvalSample.message_pool (a list of ChatMessage) and EvalSample.call_pool (a list of JSON values). ModelEvent.input is replaced by input_refs — range-encoded index references (list[list[int]], each [start, end)) into the pool. Similarly, ModelCall.request messages are replaced by call_refs + call_key.

On read, pools are transparently resolved back to full inputs, so downstream consumers see no change.

Key changes:

  • New src/inspect_ai/log/_pool.py module with all pool condensing/resolving logic
  • _condense.py orchestrates attachment condensation + pool dedup via _pool.py
  • Schema version bumped to 3 (backward-compatible: old readers ignore new fields)
  • TypeScript viewer updated to resolve pools client-side
  • repair_duplicate_message_ids() fixes legacy files where content was mutated without changing the message ID

Design decisions worth discussing

Identity-based dedup (msg.id), not content equality: Messages are deduplicated by their .id field, not by deep content comparison. This is O(1) per message vs O(n) for deep equality on large content lists. The trade-off is that mutation sites (reasoning stripping, tool model_input, trim_messages) must assign a new uuid() when they modify content. This PR adds those assignments and documents the invariant in _pool.py's module docstring. An alternative would be deep equality or content hashing, but the performance cost on large messages (especially those containing images) is prohibitive.

Range encoding for refs: Rather than storing raw index arrays ([0, 1, 2, 3, 4]), refs use half-open ranges ([[0, 5]]). This compresses well for the common agentic case where each turn's input is a contiguous prefix of the pool plus a few new messages.

No nested event recursion: Pool resolution only operates on top-level events. SubtaskEvent.events and ToolEvent.events are deprecated — modern evals record all events flat. The repair_duplicate_message_ids function (for legacy file conversion) also only processes top-level events for the same reason.

Pool exclusion when events excluded: read_eval_log_sample_async now automatically excludes message_pool/call_pool when events is in exclude_fields (pools are useless without events), and forces their inclusion when events are being read (events need pools to resolve).

Does this PR introduce a breaking change? (What changes might users need to make in their application due to this PR?)

No breaking changes for consumers. The schema change is additive — old fields remain populated when reading (pools are resolved transparently). The log format version is bumped to 3, but the reader handles both v2 and v3.

Code that directly mutates ChatMessage.content on messages that will appear in ModelEvent.input must assign a new ID afterward. All known mutation sites in the codebase have been updated.

Other information:

The _pool.py module docstring documents the identity-based dedup invariant and explicitly warns against replacing it with deep equality without understanding the performance implications.

@rasmusfaber rasmusfaber force-pushed the faber/message-pool-dedup branch 4 times, most recently from f572c1b to fe891b1 Compare March 2, 2026 08:26
jjallaire and others added 7 commits March 3, 2026 08:06
… running in environments (e.g. AzureAI) where they are not supported
…e/openai-tokens-and-compact-fallback

OpenAI: Use fallback for token counting and compaction endpoints when running in environments (e.g. AzureAI) where they are not supported
* fix: don't clear selectedSampleHandle in prepareForSampleLoad

prepareForSampleLoad() was clearing selectedSampleHandle, which broke
the dependency chain for running samples. usePollSample depends on
logSelection.sample (from useSelectedSampleSummary, which depends on
selectedSampleHandle) to determine if a sample needs polling.

The stale-data prevention still works via loadGeneration counter and
sampleMatchesRequest — neither depends on clearing the handle.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: allow running samples to render in SampleDetailComponent

Restore sampleMatchesRequest to return true when sample is undefined
(the normal state for running samples, where data comes via
runningEvents not selectedSampleObject). Remove redundant sample&&
check from render gate.

The stale-data guard still works: when a completed sample is loaded
with wrong id/epoch, sampleMatchesRequest returns false.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add changelog

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Charles Teague <cteague@gmail.com>
…tBEIS#3338)

* feat(view): Add Cmd+F search to LogListGrid (tasks list)

AG Grid virtualizes rows, meaning only visible rows exist in the DOM.
This causes the browser's native Cmd+F/Ctrl+F to only search visible
elements, making it impossible to find items scrolled out of view.

This PR adds a custom find bar to LogListGrid that searches through
all AG Grid data regardless of scroll position.

Features:
- Intercepts Cmd+F / Ctrl+F to show custom find bar
- Searches through all AG Grid data (name, task, model, id)
- Prev/next navigation via ↑/↓ buttons or Enter/Shift+Enter
- Reuses existing FindBand.css for consistent styling
- Uses ApplicationIcons for buttons (consistency with rest of UI)
- Shows "X of Y" match counter with wrap-around navigation
- Red "No results" indicator when no matches
- Stores row IDs instead of IRowNode references (avoids memory leaks)
- Pre-computes searchable text for O(1) lookup performance

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Refactor shared find control

* Rename type

* add changelog

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Charles Teague <cteague@gmail.com>
The Google GenAI SDK uses aiohttp internally, which calls
asyncio.get_running_loop() and fails under trio. When the trio backend
is detected, pass an httpx.AsyncClient via HttpOptions to bypass
aiohttp, since httpx is anyio-native and works with both backends.
jjallaire and others added 5 commits March 3, 2026 09:44
require explicit creation by user in custom timelines
The Grok SDK uses gRPC which is asyncio-only. Add a PrerequisiteError
at provider init when trio is detected, matching the bedrock pattern.
Also bake skip_if_trio into skip_if_no_grok so all grok tests
automatically skip under trio.
jjallaire and others added 12 commits March 3, 2026 09:55
…e/timeline-branch-detection

timelines: don't attempt to automatically detect branches
…e/timeline-utility-agent-detection

timelines: improved automatic detection of utility agents
fix(grok): raise clear error under trio async backend
fix(google): use httpx under trio async backend
Support credential-free access to public S3 buckets by passing
botocore.UNSIGNED signature_version and an explicit region_name
to both sync (boto3) and async (aiobotocore) S3 clients.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
fix: enable test_read_s3_zip_member to run without AWS credentials
ransomr and others added 28 commits March 9, 2026 16:09
…mple-activity

Rationalize sample loading logic
Updated changelog with recent changes and bug fixes.
…n-indentation

Remove JSON indentation inside `.eval` archives for a 10-50% size reduction
…mputer

OpenAI: Migrate computer use from preview to GA
Removed duplicate entries for post-eval editing of tags and metadata, and the tags parameter in Task.
…g_edits

feat: log editing of tags and metadata
…ntBEIS#3446)

* catch ProcessLookupError in case bash session has crashed

* Update CHANGELOG with new features and fixes

---------

Co-authored-by: jjallaire <jj.allaire@gmail.com>
…dels

Tool results containing ContentImage (e.g. from tools returning screenshots
or photos) were not being sent to the Google API for non-computer-use models.
Images are now included as sibling Part objects in the Content alongside the
FunctionResponse, since most models don't support multimodal function responses.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ogle-tool-image-results

fix(google): include images from tool results for non-computer-use models
Condense repeated ChatMessage inputs and ModelCall messages across turns
into per-sample pool tables with range-encoded refs. On write, messages
are assigned stable IDs and deduplicated; on read, compact refs are
expanded back into full message lists.

Key changes:
- Add _pool.py with condense/expand helpers and range-encoded refs
- Extend ModelEvent with input_refs/call_refs and ModelCall with
  call_refs/call_key fields
- Add message ID generation to ChatMessage and stability fixes to
  resolve_tool_model_input, trim_messages, and reasoning history
- Update eval recorder to condense samples on completion
- Update TypeScript viewer to resolve pool refs in sampleUtils
- Add comprehensive Python and TypeScript test coverage

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add log_schema_version() that returns 2 by default, 3 when
INSPECT_LOG_FORMAT_V3 is set. Gate pool dedup in condense_sample()
and version numbers in recorders behind this function.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- expandRefs: use flatMap/slice (per @epatey)
- resolveEventRefs: eliminate let with ternary
- resolvePools: remove unnecessary optional chaining
- Rewrite sampleUtils tests as table-driven
- Restore _wrap model validator for in-memory ChatMessage dedup
- Rebuild lib

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Makes the fixed-size pair semantics explicit in the type system and
renames internal variables to end_exclusive for clarity.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace identity-based (msg.id) dedup with hash-based dedup using
mm3_hash of the full sorted-keys JSON serialization. This is
correct-by-construction: mutated messages with stale IDs get separate
pool entries.

Remove repair_duplicate_message_ids() and its callers since hash-based
dedup handles same-id-different-content naturally.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ations

Within a single condense call, the same Python object seen across
multiple events is guaranteed to have identical content, so id(obj)
is a safe cache key for the computed hash.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@rasmusfaber rasmusfaber force-pushed the faber/message-pool-dedup branch from 7985326 to cadcf58 Compare March 10, 2026 15:48
The rebase introduced a mismatch: main changed read_eval_log to always
use resolve_attachments="full", but our assertions still checked the
convert-time parameter. Restore reading with the convert-time parameter
and skip walking call_pool when only_core is set.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.