feat: deduplicate model event inputs and call messages into shared pools by rasmusfaber · Pull Request #20 · METR/inspect_ai

rasmusfaber · 2026-02-28T23:56:00Z

This PR contains:

What is the current behavior? (You can also link to an open issue here)

Every ModelEvent.input stores the full list of ChatMessage objects, even though most messages repeat across turns in an agentic loop. Similarly, ModelCall.request stores the full provider wire-format messages. For long agent runs, this causes significant log file bloat — the same system prompt and conversation history is stored N times for N model calls.

What is the new behavior?

Message pool deduplication: Repeated messages are stored once in EvalSample.message_pool (a list of ChatMessage) and EvalSample.call_pool (a list of JSON values). ModelEvent.input is replaced by input_refs — range-encoded index references (list[list[int]], each [start, end)) into the pool. Similarly, ModelCall.request messages are replaced by call_refs + call_key.

On read, pools are transparently resolved back to full inputs, so downstream consumers see no change.

Key changes:

New src/inspect_ai/log/_pool.py module with all pool condensing/resolving logic
_condense.py orchestrates attachment condensation + pool dedup via _pool.py
Schema version bumped to 3 (backward-compatible: old readers ignore new fields)
TypeScript viewer updated to resolve pools client-side
repair_duplicate_message_ids() fixes legacy files where content was mutated without changing the message ID

Design decisions worth discussing

Identity-based dedup (msg.id), not content equality: Messages are deduplicated by their .id field, not by deep content comparison. This is O(1) per message vs O(n) for deep equality on large content lists. The trade-off is that mutation sites (reasoning stripping, tool model_input, trim_messages) must assign a new uuid() when they modify content. This PR adds those assignments and documents the invariant in _pool.py's module docstring. An alternative would be deep equality or content hashing, but the performance cost on large messages (especially those containing images) is prohibitive.

Range encoding for refs: Rather than storing raw index arrays ([0, 1, 2, 3, 4]), refs use half-open ranges ([[0, 5]]). This compresses well for the common agentic case where each turn's input is a contiguous prefix of the pool plus a few new messages.

No nested event recursion: Pool resolution only operates on top-level events. SubtaskEvent.events and ToolEvent.events are deprecated — modern evals record all events flat. The repair_duplicate_message_ids function (for legacy file conversion) also only processes top-level events for the same reason.

Pool exclusion when events excluded: read_eval_log_sample_async now automatically excludes message_pool/call_pool when events is in exclude_fields (pools are useless without events), and forces their inclusion when events are being read (events need pools to resolve).

Does this PR introduce a breaking change? (What changes might users need to make in their application due to this PR?)

No breaking changes for consumers. The schema change is additive — old fields remain populated when reading (pools are resolved transparently). The log format version is bumped to 3, but the reader handles both v2 and v3.

Code that directly mutates ChatMessage.content on messages that will appear in ModelEvent.input must assign a new ID afterward. All known mutation sites in the codebase have been updated.

Other information:

The _pool.py module docstring documents the identity-based dedup invariant and explicitly warns against replacing it with deep equality without understanding the performance implications.

… running in environments (e.g. AzureAI) where they are not supported

…e/openai-tokens-and-compact-fallback OpenAI: Use fallback for token counting and compaction endpoints when running in environments (e.g. AzureAI) where they are not supported

* fix: don't clear selectedSampleHandle in prepareForSampleLoad prepareForSampleLoad() was clearing selectedSampleHandle, which broke the dependency chain for running samples. usePollSample depends on logSelection.sample (from useSelectedSampleSummary, which depends on selectedSampleHandle) to determine if a sample needs polling. The stale-data prevention still works via loadGeneration counter and sampleMatchesRequest — neither depends on clearing the handle. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: allow running samples to render in SampleDetailComponent Restore sampleMatchesRequest to return true when sample is undefined (the normal state for running samples, where data comes via runningEvents not selectedSampleObject). Remove redundant sample&& check from render gate. The stale-data guard still works: when a completed sample is loaded with wrong id/epoch, sampleMatchesRequest returns false. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Add changelog --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Charles Teague <cteague@gmail.com>

…tBEIS#3338) * feat(view): Add Cmd+F search to LogListGrid (tasks list) AG Grid virtualizes rows, meaning only visible rows exist in the DOM. This causes the browser's native Cmd+F/Ctrl+F to only search visible elements, making it impossible to find items scrolled out of view. This PR adds a custom find bar to LogListGrid that searches through all AG Grid data regardless of scroll position. Features: - Intercepts Cmd+F / Ctrl+F to show custom find bar - Searches through all AG Grid data (name, task, model, id) - Prev/next navigation via ↑/↓ buttons or Enter/Shift+Enter - Reuses existing FindBand.css for consistent styling - Uses ApplicationIcons for buttons (consistency with rest of UI) - Shows "X of Y" match counter with wrap-around navigation - Red "No results" indicator when no matches - Stores row IDs instead of IRowNode references (avoids memory leaks) - Pre-computes searchable text for O(1) lookup performance Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * Refactor shared find control * Rename type * add changelog --------- Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com> Co-authored-by: Charles Teague <cteague@gmail.com>

The Google GenAI SDK uses aiohttp internally, which calls asyncio.get_running_loop() and fails under trio. When the trio backend is detected, pass an httpx.AsyncClient via HttpOptions to bypass aiohttp, since httpx is anyio-native and works with both backends.

require explicit creation by user in custom timelines

The Grok SDK uses gRPC which is asyncio-only. Add a PrerequisiteError at provider init when trio is detected, matching the bedrock pattern. Also bake skip_if_trio into skip_if_no_grok so all grok tests automatically skip under trio.

…e/timeline-branch-detection timelines: don't attempt to automatically detect branches

…e/timeline-utility-agent-detection timelines: improved automatic detection of utility agents

fix(grok): raise clear error under trio async backend

fix(google): use httpx under trio async backend

Support credential-free access to public S3 buckets by passing botocore.UNSIGNED signature_version and an explicit region_name to both sync (boto3) and async (aiobotocore) S3 clients. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix: enable test_read_s3_zip_member to run without AWS credentials

…mple-activity Rationalize sample loading logic

Updated changelog with recent changes and bug fixes.

…n-indentation Remove JSON indentation inside `.eval` archives for a 10-50% size reduction

…mputer OpenAI: Migrate computer use from preview to GA

Removed duplicate entries for post-eval editing of tags and metadata, and the tags parameter in Task.

…g_edits feat: log editing of tags and metadata

…ntBEIS#3446) * catch ProcessLookupError in case bash session has crashed * Update CHANGELOG with new features and fixes --------- Co-authored-by: jjallaire <jj.allaire@gmail.com>

…dels Tool results containing ContentImage (e.g. from tools returning screenshots or photos) were not being sent to the Google API for non-computer-use models. Images are now included as sibling Part objects in the Content alongside the FunctionResponse, since most models don't support multimodal function responses. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…ogle-tool-image-results fix(google): include images from tool results for non-computer-use models

Condense repeated ChatMessage inputs and ModelCall messages across turns into per-sample pool tables with range-encoded refs. On write, messages are assigned stable IDs and deduplicated; on read, compact refs are expanded back into full message lists. Key changes: - Add _pool.py with condense/expand helpers and range-encoded refs - Extend ModelEvent with input_refs/call_refs and ModelCall with call_refs/call_key fields - Add message ID generation to ChatMessage and stability fixes to resolve_tool_model_input, trim_messages, and reasoning history - Update eval recorder to condense samples on completion - Update TypeScript viewer to resolve pool refs in sampleUtils - Add comprehensive Python and TypeScript test coverage Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add log_schema_version() that returns 2 by default, 3 when INSPECT_LOG_FORMAT_V3 is set. Gate pool dedup in condense_sample() and version numbers in recorders behind this function. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@epatey

- expandRefs: use flatMap/slice (per @epatey) - resolveEventRefs: eliminate let with ternary - resolvePools: remove unnecessary optional chaining - Rewrite sampleUtils tests as table-driven - Restore _wrap model validator for in-memory ChatMessage dedup - Rebuild lib Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Makes the fixed-size pair semantics explicit in the type system and renames internal variables to end_exclusive for clarity. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Replace identity-based (msg.id) dedup with hash-based dedup using mm3_hash of the full sorted-keys JSON serialization. This is correct-by-construction: mutated messages with stale IDs get separate pool entries. Remove repair_duplicate_message_ids() and its callers since hash-based dedup handles same-id-different-content naturally. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…ations Within a single condense call, the same Python object seen across multiple events is guaranteed to have identical content, so id(obj) is a safe cache key for the computed hash. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The rebase introduced a mismatch: main changed read_eval_log to always use resolve_attachments="full", but our assertions still checked the convert-time parameter. Restore reading with the convert-time parameter and skip walking call_pool when only_core is set. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

rasmusfaber force-pushed the faber/message-pool-dedup branch 4 times, most recently from f572c1b to fe891b1 Compare March 2, 2026 08:26

jjallaire and others added 7 commits March 3, 2026 08:06

add changelog entry

0dec59b

Merge branch 'saikpr-main'

f4be352

OpenAI: Use fallback for token counting and compaction endpoints when…

67f9e20

… running in environments (e.g. AzureAI) where they are not supported

Merge pull request UKGovernmentBEIS#3384 from UKGovernmentBEIS/featur…

53a74ce

…e/openai-tokens-and-compact-fallback OpenAI: Use fallback for token counting and compaction endpoints when running in environments (e.g. AzureAI) where they are not supported

QuantumLove force-pushed the main branch from 8860534 to a0525c3 Compare March 3, 2026 14:42

jjallaire and others added 5 commits March 3, 2026 09:44

timelines: don't attempt to automaticlaly detect branches

59ce7cc

require explicit creation by user in custom timelines

update changelog

5084c78

Update CHANGELOG.md

3fda0c7

Merge branch 'main' into feature/timeline-branch-detection

aca709e

fix(grok): raise clear error under trio async backend

39d278a

The Grok SDK uses gRPC which is asyncio-only. Add a PrerequisiteError at provider init when trio is detected, matching the bedrock pattern. Also bake skip_if_trio into skip_if_no_grok so all grok tests automatically skip under trio.

QuantumLove force-pushed the main branch from a0525c3 to 8860534 Compare March 3, 2026 14:54

jjallaire and others added 12 commits March 3, 2026 09:55

Merge pull request UKGovernmentBEIS#3386 from UKGovernmentBEIS/featur…

4564acf

…e/timeline-branch-detection timelines: don't attempt to automatically detect branches

Merge branch 'main' into google_trio

16e9f2b

timelines: improved automatic detection of utility agents.

9b58f1f

Merge pull request UKGovernmentBEIS#3388 from UKGovernmentBEIS/featur…

18c7fd7

…e/timeline-utility-agent-detection timelines: improved automatic detection of utility agents

Merge branch 'main' into grok_trio

b150edf

Merge pull request UKGovernmentBEIS#3387 from ransomr/grok_trio

f8732f9

fix(grok): raise clear error under trio async backend

Merge branch 'main' into google_trio

2d0986a

Merge pull request UKGovernmentBEIS#3385 from ransomr/google_trio

591aae3

fix(google): use httpx under trio async backend

Merge branch 'main' into main

3d0c0e6

fix: address PR feedback — PyPI packages, docker compatibility column

9ebe510

Merge pull request UKGovernmentBEIS#3389 from ransomr/main

f60eded

fix: enable test_read_s3_zip_member to run without AWS credentials

ransomr and others added 28 commits March 9, 2026 16:09

fix typescript checks

a4f748f

regenerate types

dfd98ee

back to Anthony's namingm

5cb774a

reduce changes

e601906

Merge branch 'main' into opencomputer

1e373d4

Merge branch 'main' into bug/sample-activity

defa4d9

Merge pull request UKGovernmentBEIS#3444 from UKGovernmentBEIS/bug/sa…

b734d57

…mple-activity Rationalize sample loading logic

Merge branch 'main' into remove-eval-json-indentation

4758e11

Update CHANGELOG with new features and bug fixes

151ddd8

Updated changelog with recent changes and bug fixes.

Merge pull request UKGovernmentBEIS#3445 from tadamcz/remove-eval-jso…

4ae13c1

…n-indentation Remove JSON indentation inside `.eval` archives for a 10-50% size reduction

Merge branch 'main' into opencomputer

184f65c

Update CHANGELOG with new computer tool support

a5092cb

Merge pull request UKGovernmentBEIS#3422 from UKGovernmentBEIS/openco…

d433043

…mputer OpenAI: Migrate computer use from preview to GA

Clean up CHANGELOG by removing duplicates

cb8de06

Removed duplicate entries for post-eval editing of tags and metadata, and the tags parameter in Task.

Merge branch 'main' into log_tag_edits

d68cad2

doc updates

5617c40

Merge pull request UKGovernmentBEIS#3443 from UKGovernmentBEIS/log_ta…

c91f9fc

…g_edits feat: log editing of tags and metadata

update docs

a2ff018

catch ProcessLookupError in case bash session has crashed (UKGovernme…

f2bad73

…ntBEIS#3446) * catch ProcessLookupError in case bash session has crashed * Update CHANGELOG with new features and fixes --------- Co-authored-by: jjallaire <jj.allaire@gmail.com>

Merge pull request UKGovernmentBEIS#3451 from UKGovernmentBEIS/fix/go…

023bbb9

…ogle-tool-image-results fix(google): include images from tool results for non-computer-use models

refactor: use tuple[int, int] for input_refs and call_refs range pairs

05f9e8b

Makes the fixed-size pair semantics explicit in the type system and renames internal variables to end_exclusive for clarity. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add design diagram

cadcf58

rasmusfaber force-pushed the faber/message-pool-dedup branch from 7985326 to cadcf58 Compare March 10, 2026 15:48

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: deduplicate model event inputs and call messages into shared pools#20

feat: deduplicate model event inputs and call messages into shared pools#20
rasmusfaber wants to merge 482 commits intomainfrom
faber/message-pool-dedup

rasmusfaber commented Feb 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

20 participants

Conversation

rasmusfaber commented Feb 28, 2026

This PR contains:

What is the current behavior? (You can also link to an open issue here)

What is the new behavior?

Design decisions worth discussing

Does this PR introduce a breaking change? (What changes might users need to make in their application due to this PR?)

Other information:

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

20 participants