fix(sdk): snapshot LLM metrics via Metrics.get_snapshot() (Q2 of #3341)#3397
Conversation
Python API breakage checks — ✅ PASSEDResult: ✅ PASSED |
REST API breakage checks (OpenAPI) — ✅ PASSEDResult: ✅ PASSED |
all-hands-bot
left a comment
There was a problem hiding this comment.
Taste Rating: 🟢 Good taste
LGTM — reusing Metrics.get_snapshot() gives LLMResponse the canonical deep-copy snapshot semantics and removes duplicate field construction.
[RISK ASSESSMENT]
- [Overall PR]
⚠️ Risk Assessment: 🟢 LOW
One-line helper refactor; no dependency, release, or eval-risk behavior changes.
VERDICT: ✅ Worth merging
KEY INSIGHT: Single source of truth for metrics snapshots fixes the ownership bug cleanly.
This review was generated by an AI agent (OpenHands) on behalf of the user.
Was this automated review useful? React with 👍 or 👎 to this review to help us measure review quality.
Workflow run: https://github.com/OpenHands/software-agent-sdk/actions/runs/26502376345
The per-response MetricsSnapshot was hand-built and passed the live accumulated_token_usage by reference, so it aliased mutable LLM state instead of capturing it. Call the canonical Metrics.get_snapshot() (which deep-copies accumulated_token_usage) directly in the two LLMResponse builders and drop the redundant _current_metrics_snapshot wrapper.
all-hands-bot
left a comment
There was a problem hiding this comment.
✅ QA Report: PASS
Verified that LLMResponse metrics now behave as real snapshots instead of aliasing live LLM metrics.
Does this PR achieve its stated goal?
Yes. The PR set out to make the MetricsSnapshot attached to each LLMResponse independent from later mutations to LLM.metrics.accumulated_token_usage. On origin/main, a real LLM.completion(...) response shared the same token-usage object as the live metrics and changed after I mutated the live model name; on the PR branch, the response snapshot remained unchanged while the live metrics changed.
| Phase | Result |
|---|---|
| Environment Setup | ✅ make build completed successfully. |
| CI Status | 🟡 21 successful, 8 pending, 3 skipped, 0 failing at the time checked. |
| Functional Verification | ✅ Base branch reproduced aliasing; PR branch preserved the response snapshot after live metric mutation. |
Functional Verification
Test 1: Real SDK LLM.completion(...) response snapshot stays independent
Step 1 — Reproduce / establish baseline without the fix:
Checked out origin/main, then ran a short SDK script that creates an LLM, calls llm.completion(...) against the configured LLM proxy with a real prompt, and mutates llm.metrics.accumulated_token_usage.model after the response is returned.
Command shape:
git checkout --quiet origin/main
OPENHANDS_SUPPRESS_BANNER=1 uv run python - <<'SCRIPT'
import os
from openhands.sdk import LLM, Message, TextContent
llm = LLM(model=os.environ['LLM_MODEL'], api_key=os.environ['LLM_API_KEY'], base_url=os.environ['LLM_BASE_URL'], max_output_tokens=8)
response = llm.completion([Message(role='user', content=[TextContent(text='Reply with exactly: OK')])])
print('response usage exists:', response.metrics.accumulated_token_usage is not None)
print('snapshot and live share object:', response.metrics.accumulated_token_usage is llm.metrics.accumulated_token_usage)
print('response metric model before live mutation:', response.metrics.accumulated_token_usage.model)
llm.metrics.accumulated_token_usage.model = 'qa-mutated-live-model'
print('live metric model after mutation:', llm.metrics.accumulated_token_usage.model)
print('response metric model after live mutation:', response.metrics.accumulated_token_usage.model)
SCRIPTObserved output excerpt:
response usage exists: True
snapshot and live share object: True
response metric model before live mutation: litellm_proxy/openai/gpt-5.5
live metric model after mutation: qa-mutated-live-model
response metric model after live mutation: qa-mutated-live-model
This confirms the bug existed: the already-returned LLMResponse.metrics changed retroactively when the live LLM metrics object was mutated.
Step 2 — Apply the PR's changes:
Checked out PR commit d8ac40d4ec285e74fd4cda983b5269af05daec4e / branch fix/reuse-metrics-get-snapshot.
Step 3 — Re-run with the fix in place:
Ran the same SDK LLM completion flow on the PR branch.
Observed output excerpt:
response usage exists: True
snapshot and live share object: False
response metric model before live mutation: litellm_proxy/openai/gpt-5.5
live metric model after mutation: qa-mutated-live-model
response metric model after live mutation: litellm_proxy/openai/gpt-5.5
This confirms the fix works for the user-facing SDK path: the returned LLMResponse.metrics retains its original model after the live metrics object is mutated.
Test 2: Direct snapshot helper parity with canonical snapshot behavior
I also exercised the snapshot behavior without an external LLM call by constructing an SDK LLM, assigning a TokenUsage, calling _current_metrics_snapshot(), then mutating the live token usage. On origin/main, the snapshot and live object were identical and the snapshot changed; on the PR branch, they were distinct and metrics.get_snapshot() was also distinct.
Baseline excerpt:
snapshot and live share object before live mutation: True
snapshot model after live mutation: mutated-model
canonical get_snapshot shares object: False
PR excerpt:
snapshot and live share object before live mutation: False
snapshot model after live mutation: original-model
canonical get_snapshot shares object: False
Issues Found
None.
This review was created by an AI agent (OpenHands) on behalf of the user.
d8ac40d to
9fe1960
Compare
What
Q2 of #3341. Every
LLMResponsecarries aMetricsSnapshot. It was hand-built and passed the liveaccumulated_token_usageby reference, so the "snapshot" aliased mutable LLM state instead of capturing it. This snapshots via the canonicalMetrics.get_snapshot()directly in the twoLLMResponsebuilders (_build_completion_result,_build_responses_result):Why this is good
It makes the snapshot an actual snapshot.
get_snapshot()deepcopysaccumulated_token_usage; the hand-built version aliased the live object, so a returned snapshot could change after the fact. This is observable, not theoretical:ACPAgentmutates that object in place (acp_agent.py:736—...accumulated_token_usage.model = self.acp_model), which would retroactively rewrite themodelof anyLLMResponsesnapshot already handed out.Single source of truth.
get_snapshot()is the canonical snapshot method and is already whatConversationStatsuses (conversation_stats.py:51). The hand-built copy duplicated its field list and would silently drift — and re-introduce the aliasing — ifMetricsSnapshotever gains a field.No needless indirection. The interim
_current_metrics_snapshot()wrapper added nothing once its body was a one-line delegation (no logic, no extra typing, only 2 call sites, unreferenced by tests), so it's dropped in favour of callingget_snapshot()directly.MetricsSnapshotis no longer imported inllm.py.Why it's a small diff
The issue originally listed four hand-built
MetricsSnapshot(...)sites (llm.py:907/1032/1256/1450). #3355 / #3356 already consolidated all four into one helper while fixing the async error path, so this PR finishes Q2 by removing that helper and snapshotting directly.Testing
ruff format --checkandruff checkclean on the changed filetest_llm_metrics,test_llm_completion,test_llm_fallback,test_llm,test_stats_update_event_snapshotAgent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
eclipse-temurin:17-jdknikolaik/python-nodejs:python3.13-nodejs22-slimgolang:1.21-bookwormPull (multi-arch manifest)
# Each variant is a multi-arch manifest supporting both amd64 and arm64 docker pull ghcr.io/openhands/agent-server:9fe1960-pythonRun
All tags pushed for this build
About Multi-Architecture Support
9fe1960-python) is a multi-arch manifest supporting both amd64 and arm649fe1960-python-amd64) are also available if needed