feat: deduplicate model event inputs and call messages into shared pools #20
Draft
rasmusfaber wants to merge 482 commits into main
Force-pushed from f572c1b to fe891b1
… running in environments (e.g. AzureAI) where they are not supported
…e/openai-tokens-and-compact-fallback OpenAI: Use fallback for token counting and compaction endpoints when running in environments (e.g. AzureAI) where they are not supported
* fix: don't clear selectedSampleHandle in prepareForSampleLoad

  prepareForSampleLoad() was clearing selectedSampleHandle, which broke the dependency chain for running samples. usePollSample depends on logSelection.sample (from useSelectedSampleSummary, which depends on selectedSampleHandle) to determine if a sample needs polling. The stale-data prevention still works via loadGeneration counter and sampleMatchesRequest; neither depends on clearing the handle.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: allow running samples to render in SampleDetailComponent

  Restore sampleMatchesRequest to return true when sample is undefined (the normal state for running samples, where data comes via runningEvents not selectedSampleObject). Remove redundant sample && check from render gate. The stale-data guard still works: when a completed sample is loaded with wrong id/epoch, sampleMatchesRequest returns false.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add changelog

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Charles Teague <cteague@gmail.com>
…tBEIS#3338)

* feat(view): Add Cmd+F search to LogListGrid (tasks list)

  AG Grid virtualizes rows, meaning only visible rows exist in the DOM. This causes the browser's native Cmd+F/Ctrl+F to only search visible elements, making it impossible to find items scrolled out of view. This PR adds a custom find bar to LogListGrid that searches through all AG Grid data regardless of scroll position.

  Features:
  - Intercepts Cmd+F / Ctrl+F to show custom find bar
  - Searches through all AG Grid data (name, task, model, id)
  - Prev/next navigation via ↑/↓ buttons or Enter/Shift+Enter
  - Reuses existing FindBand.css for consistent styling
  - Uses ApplicationIcons for buttons (consistency with rest of UI)
  - Shows "X of Y" match counter with wrap-around navigation
  - Red "No results" indicator when no matches
  - Stores row IDs instead of IRowNode references (avoids memory leaks)
  - Pre-computes searchable text for O(1) lookup performance

  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Refactor shared find control
* Rename type
* add changelog

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Charles Teague <cteague@gmail.com>
The Google GenAI SDK uses aiohttp internally, which calls asyncio.get_running_loop() and fails under trio. When the trio backend is detected, pass an httpx.AsyncClient via HttpOptions to bypass aiohttp, since httpx is anyio-native and works with both backends.
require explicit creation by user in custom timelines
The Grok SDK uses gRPC which is asyncio-only. Add a PrerequisiteError at provider init when trio is detected, matching the bedrock pattern. Also bake skip_if_trio into skip_if_no_grok so all grok tests automatically skip under trio.
…e/timeline-branch-detection timelines: don't attempt to automatically detect branches
…e/timeline-utility-agent-detection timelines: improved automatic detection of utility agents
fix(grok): raise clear error under trio async backend
fix(google): use httpx under trio async backend
Support credential-free access to public S3 buckets by passing botocore.UNSIGNED signature_version and an explicit region_name to both sync (boto3) and async (aiobotocore) S3 clients. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
fix: enable test_read_s3_zip_member to run without AWS credentials
…mple-activity Rationalize sample loading logic
Updated changelog with recent changes and bug fixes.
…n-indentation Remove JSON indentation inside `.eval` archives for a 10-50% size reduction
…mputer OpenAI: Migrate computer use from preview to GA
Removed duplicate entries for post-eval editing of tags and metadata, and the tags parameter in Task.
…g_edits feat: log editing of tags and metadata
…ntBEIS#3446)

* catch ProcessLookupError in case bash session has crashed
* Update CHANGELOG with new features and fixes

---------

Co-authored-by: jjallaire <jj.allaire@gmail.com>
…dels Tool results containing ContentImage (e.g. from tools returning screenshots or photos) were not being sent to the Google API for non-computer-use models. Images are now included as sibling Part objects in the Content alongside the FunctionResponse, since most models don't support multimodal function responses. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ogle-tool-image-results fix(google): include images from tool results for non-computer-use models
Condense repeated ChatMessage inputs and ModelCall messages across turns into per-sample pool tables with range-encoded refs. On write, messages are assigned stable IDs and deduplicated; on read, compact refs are expanded back into full message lists.

Key changes:
- Add _pool.py with condense/expand helpers and range-encoded refs
- Extend ModelEvent with input_refs/call_refs and ModelCall with call_refs/call_key fields
- Add message ID generation to ChatMessage and stability fixes to resolve_tool_model_input, trim_messages, and reasoning history
- Update eval recorder to condense samples on completion
- Update TypeScript viewer to resolve pool refs in sampleUtils
- Add comprehensive Python and TypeScript test coverage

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add log_schema_version() that returns 2 by default, 3 when INSPECT_LOG_FORMAT_V3 is set. Gate pool dedup in condense_sample() and version numbers in recorders behind this function. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- expandRefs: use flatMap/slice (per @epatey)
- resolveEventRefs: eliminate let with ternary
- resolvePools: remove unnecessary optional chaining
- Rewrite sampleUtils tests as table-driven
- Restore _wrap model validator for in-memory ChatMessage dedup
- Rebuild lib

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Makes the fixed-size pair semantics explicit in the type system and renames internal variables to end_exclusive for clarity. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace identity-based (msg.id) dedup with hash-based dedup using mm3_hash of the full sorted-keys JSON serialization. This is correct-by-construction: mutated messages with stale IDs get separate pool entries. Remove repair_duplicate_message_ids() and its callers since hash-based dedup handles same-id-different-content naturally. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ations Within a single condense call, the same Python object seen across multiple events is guaranteed to have identical content, so id(obj) is a safe cache key for the computed hash. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-pushed from 7985326 to cadcf58
The rebase introduced a mismatch: main changed read_eval_log to always use resolve_attachments="full", but our assertions still checked the convert-time parameter. Restore reading with the convert-time parameter and skip walking call_pool when only_core is set. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This PR contains:
What is the current behavior? (You can also link to an open issue here)
Every `ModelEvent.input` stores the full list of `ChatMessage` objects, even though most messages repeat across turns in an agentic loop. Similarly, `ModelCall.request` stores the full provider wire-format messages. For long agent runs, this causes significant log file bloat: the same system prompt and conversation history are stored N times for N model calls.

What is the new behavior?
Message pool deduplication: Repeated messages are stored once in `EvalSample.message_pool` (a list of `ChatMessage`) and `EvalSample.call_pool` (a list of JSON values). `ModelEvent.input` is replaced by `input_refs`: range-encoded index references (`list[list[int]]`, each a half-open `[start, end)` pair) into the pool. Similarly, `ModelCall.request` messages are replaced by `call_refs` + `call_key`.

On read, pools are transparently resolved back to full inputs, so downstream consumers see no change.
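
To make the read path concrete, here is a minimal sketch of how range-encoded refs could be expanded against a pool. It is an illustration under stated assumptions, not the code in `_pool.py`: the helper name `expand_refs` and the string stand-ins for `ChatMessage` are hypothetical.

```python
from typing import TypeVar

T = TypeVar("T")


def expand_refs(refs: list[list[int]], pool: list[T]) -> list[T]:
    """Expand half-open [start, end) ranges into the pool entries they cover.

    Hypothetical helper: the name and signature are assumptions made for
    illustration, not the PR's actual API.
    """
    expanded: list[T] = []
    for start, end_exclusive in refs:
        # Each ref is a fixed-size pair; slicing honours the exclusive end.
        expanded.extend(pool[start:end_exclusive])
    return expanded


# Plain strings stand in for ChatMessage objects in this sketch.
message_pool = ["system", "user-1", "assistant-1", "tool-1", "assistant-2"]

# A first call that saw only the first two pool entries.
first_call = expand_refs([[0, 2]], message_pool)

# A later call whose input skips pool entry 2, so it needs two ranges.
later_call = expand_refs([[0, 2], [3, 5]], message_pool)
```

A contiguous input collapses to a single pair regardless of its length, which is why the encoding stays small when each turn extends the previous one.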
Key changes:
- `src/inspect_ai/log/_pool.py` module with all pool condensing/resolving logic
- `_condense.py` orchestrates attachment condensation + pool dedup via `_pool.py`
- `repair_duplicate_message_ids()` fixes legacy files where content was mutated without changing the message ID

Design decisions worth discussing
Identity-based dedup (`msg.id`), not content equality: Messages are deduplicated by their `.id` field, not by deep content comparison. This is O(1) per message vs O(n) for deep equality on large content lists. The trade-off is that mutation sites (reasoning stripping, tool model_input, trim_messages) must assign a new `uuid()` when they modify content. This PR adds those assignments and documents the invariant in `_pool.py`'s module docstring. An alternative would be deep equality or content hashing, but the performance cost on large messages (especially those containing images) is prohibitive.

Range encoding for refs: Rather than storing raw index arrays (`[0, 1, 2, 3, 4]`), refs use half-open ranges (`[[0, 5]]`). This compresses well for the common agentic case where each turn's input is a contiguous prefix of the pool plus a few new messages.

No nested event recursion: Pool resolution only operates on top-level events. `SubtaskEvent.events` and `ToolEvent.events` are deprecated; modern evals record all events flat. The `repair_duplicate_message_ids` function (for legacy file conversion) also only processes top-level events for the same reason.

Pool exclusion when events excluded: `read_eval_log_sample_async` now automatically excludes `message_pool`/`call_pool` when `events` is in `exclude_fields` (pools are useless without events), and forces their inclusion when events are being read (events need pools to resolve).

Does this PR introduce a breaking change? (What changes might users need to make in their application due to this PR?)
No breaking changes for consumers. The schema change is additive — old fields remain populated when reading (pools are resolved transparently). The log format version is bumped to 3, but the reader handles both v2 and v3.
Code that directly mutates `ChatMessage.content` on messages that will appear in `ModelEvent.input` must assign a new ID afterward. All known mutation sites in the codebase have been updated.

Other information:
The `_pool.py` module docstring documents the identity-based dedup invariant and explicitly warns against replacing it with deep equality without understanding the performance implications.
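
The identity-based dedup and range encoding described in this PR can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the contents of `_pool.py`: `Msg`, `to_ranges`, and `condense` are hypothetical names, and `Msg` carries only the two fields the sketch needs.

```python
from dataclasses import dataclass


@dataclass
class Msg:
    """Hypothetical stand-in for ChatMessage (assumption: only id and content)."""

    id: str
    content: str


def to_ranges(indices: list[int]) -> list[list[int]]:
    """Collapse pool indices into half-open [start, end) ranges."""
    ranges: list[list[int]] = []
    for i in indices:
        if ranges and ranges[-1][1] == i:
            ranges[-1][1] = i + 1  # extend the current contiguous run
        else:
            ranges.append([i, i + 1])  # start a new run
    return ranges


def condense(inputs: list[list[Msg]]) -> tuple[list[Msg], list[list[list[int]]]]:
    """Deduplicate messages by id across calls; return (pool, refs per call)."""
    pool: list[Msg] = []
    index_by_id: dict[str, int] = {}
    all_refs: list[list[list[int]]] = []
    for messages in inputs:
        indices: list[int] = []
        for msg in messages:
            # O(1) lookup on the id; message content is never compared.
            if msg.id not in index_by_id:
                index_by_id[msg.id] = len(pool)
                pool.append(msg)
            indices.append(index_by_id[msg.id])
        all_refs.append(to_ranges(indices))
    return pool, all_refs


system = Msg("m0", "You are a helpful agent.")
user = Msg("m1", "List the files.")
assistant = Msg("m2", "Calling ls.")
tool = Msg("m3", "a.txt b.txt")
# A mutation site changed the content, so it assigned a fresh id.
stripped = Msg("m4", "Calling ls. (reasoning stripped)")

pool, refs = condense(
    [
        [system, user],
        [system, user, assistant, tool],
        [system, user, stripped, tool],
    ]
)
```

Because the lookup is keyed on `id` alone, a message whose content changed while keeping its old id would silently resolve to the original pool entry, which is why the invariant requires mutation sites to assign a new ID.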