fix(sdk): snapshot LLM metrics via Metrics.get_snapshot() (Q2 of #3341) by VascoSch92 · Pull Request #3397 · OpenHands/software-agent-sdk

VascoSch92 · 2026-05-27T09:16:59Z

What

Q2 of #3341. Every LLMResponse carries a MetricsSnapshot. It was hand-built and passed the live accumulated_token_usage by reference, so the "snapshot" aliased mutable LLM state instead of capturing it. This snapshots via the canonical Metrics.get_snapshot() directly in the two LLMResponse builders (_build_completion_result, _build_responses_result):

return LLMResponse(
    message=message,
    metrics=self.metrics.get_snapshot(),
    raw_response=resp,
)

Why this is good

It makes the snapshot an actual snapshot. get_snapshot() deepcopys accumulated_token_usage; the hand-built version aliased the live object, so a returned snapshot could change after the fact. This is observable, not theoretical: ACPAgent mutates that object in place (acp_agent.py:736 — ...accumulated_token_usage.model = self.acp_model), which would retroactively rewrite the model of any LLMResponse snapshot already handed out.
Single source of truth. get_snapshot() is the canonical snapshot method and is already what ConversationStats uses (conversation_stats.py:51). The hand-built copy duplicated its field list and would silently drift — and re-introduce the aliasing — if MetricsSnapshot ever gains a field.
No needless indirection. The interim _current_metrics_snapshot() wrapper added nothing once its body was a one-line delegation (no logic, no extra typing, only 2 call sites, unreferenced by tests), so it's dropped in favour of calling get_snapshot() directly. MetricsSnapshot is no longer imported in llm.py.

Why it's a small diff

The issue originally listed four hand-built MetricsSnapshot(...) sites (llm.py:907/1032/1256/1450). #3355 / #3356 already consolidated all four into one helper while fixing the async error path, so this PR finishes Q2 by removing that helper and snapshotting directly.

Testing

ruff format --check and ruff check clean on the changed file
141 targeted tests pass: test_llm_metrics, test_llm_completion, test_llm_fallback, test_llm, test_stats_update_event_snapshot

Agent Server images for this PR

• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant	Architectures	Base Image	Docs / Tags
java	amd64, arm64	`eclipse-temurin:17-jdk`	Link
python	amd64, arm64	`nikolaik/python-nodejs:python3.13-nodejs22-slim`	Link
golang	amd64, arm64	`golang:1.21-bookworm`	Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:9fe1960-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-9fe1960-python \
  ghcr.io/openhands/agent-server:9fe1960-python

All tags pushed for this build

ghcr.io/openhands/agent-server:9fe1960-golang-amd64
ghcr.io/openhands/agent-server:9fe1960a7c107190e4ed075cb0d440dc77ee5c29-golang-amd64
ghcr.io/openhands/agent-server:fix-reuse-metrics-get-snapshot-golang-amd64
ghcr.io/openhands/agent-server:9fe1960-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:9fe1960-golang-arm64
ghcr.io/openhands/agent-server:9fe1960a7c107190e4ed075cb0d440dc77ee5c29-golang-arm64
ghcr.io/openhands/agent-server:fix-reuse-metrics-get-snapshot-golang-arm64
ghcr.io/openhands/agent-server:9fe1960-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:9fe1960-java-amd64
ghcr.io/openhands/agent-server:9fe1960a7c107190e4ed075cb0d440dc77ee5c29-java-amd64
ghcr.io/openhands/agent-server:fix-reuse-metrics-get-snapshot-java-amd64
ghcr.io/openhands/agent-server:9fe1960-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:9fe1960-java-arm64
ghcr.io/openhands/agent-server:9fe1960a7c107190e4ed075cb0d440dc77ee5c29-java-arm64
ghcr.io/openhands/agent-server:fix-reuse-metrics-get-snapshot-java-arm64
ghcr.io/openhands/agent-server:9fe1960-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:9fe1960-python-amd64
ghcr.io/openhands/agent-server:9fe1960a7c107190e4ed075cb0d440dc77ee5c29-python-amd64
ghcr.io/openhands/agent-server:fix-reuse-metrics-get-snapshot-python-amd64
ghcr.io/openhands/agent-server:9fe1960-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-amd64
ghcr.io/openhands/agent-server:9fe1960-python-arm64
ghcr.io/openhands/agent-server:9fe1960a7c107190e4ed075cb0d440dc77ee5c29-python-arm64
ghcr.io/openhands/agent-server:fix-reuse-metrics-get-snapshot-python-arm64
ghcr.io/openhands/agent-server:9fe1960-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-arm64
ghcr.io/openhands/agent-server:9fe1960-golang
ghcr.io/openhands/agent-server:9fe1960a7c107190e4ed075cb0d440dc77ee5c29-golang
ghcr.io/openhands/agent-server:fix-reuse-metrics-get-snapshot-golang
ghcr.io/openhands/agent-server:9fe1960-golang_tag_1.21-bookworm
ghcr.io/openhands/agent-server:9fe1960-java
ghcr.io/openhands/agent-server:9fe1960a7c107190e4ed075cb0d440dc77ee5c29-java
ghcr.io/openhands/agent-server:fix-reuse-metrics-get-snapshot-java
ghcr.io/openhands/agent-server:9fe1960-eclipse-temurin_tag_17-jdk
ghcr.io/openhands/agent-server:9fe1960-python
ghcr.io/openhands/agent-server:9fe1960a7c107190e4ed075cb0d440dc77ee5c29-python
ghcr.io/openhands/agent-server:fix-reuse-metrics-get-snapshot-python
ghcr.io/openhands/agent-server:9fe1960-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim

About Multi-Architecture Support

Each variant tag (e.g., 9fe1960-python) is a multi-arch manifest supporting both amd64 and arm64
Docker automatically pulls the correct architecture for your platform
Individual architecture tags (e.g., 9fe1960-python-amd64) are also available if needed

github-actions · 2026-05-27T09:17:29Z

Python API breakage checks — ✅ PASSED

Result: ✅ PASSED

Action log

github-actions · 2026-05-27T09:17:39Z

REST API breakage checks (OpenAPI) — ✅ PASSED

Result: ✅ PASSED

Action log

all-hands-bot

Taste Rating: 🟢 Good taste

LGTM — reusing Metrics.get_snapshot() gives LLMResponse the canonical deep-copy snapshot semantics and removes duplicate field construction.

[RISK ASSESSMENT]

[Overall PR] ⚠️ Risk Assessment: 🟢 LOW
One-line helper refactor; no dependency, release, or eval-risk behavior changes.

VERDICT: ✅ Worth merging

KEY INSIGHT: Single source of truth for metrics snapshots fixes the ownership bug cleanly.

This review was generated by an AI agent (OpenHands) on behalf of the user.

Was this automated review useful? React with 👍 or 👎 to this review to help us measure review quality.
Workflow run: https://github.com/OpenHands/software-agent-sdk/actions/runs/26502376345

The per-response MetricsSnapshot was hand-built and passed the live accumulated_token_usage by reference, so it aliased mutable LLM state instead of capturing it. Call the canonical Metrics.get_snapshot() (which deep-copies accumulated_token_usage) directly in the two LLMResponse builders and drop the redundant _current_metrics_snapshot wrapper.

all-hands-bot

✅ QA Report: PASS

Verified that LLMResponse metrics now behave as real snapshots instead of aliasing live LLM metrics.

Does this PR achieve its stated goal?

Yes. The PR set out to make the MetricsSnapshot attached to each LLMResponse independent from later mutations to LLM.metrics.accumulated_token_usage. On origin/main, a real LLM.completion(...) response shared the same token-usage object as the live metrics and changed after I mutated the live model name; on the PR branch, the response snapshot remained unchanged while the live metrics changed.

Phase	Result
Environment Setup	✅ `make build` completed successfully.
CI Status	🟡 21 successful, 8 pending, 3 skipped, 0 failing at the time checked.
Functional Verification	✅ Base branch reproduced aliasing; PR branch preserved the response snapshot after live metric mutation.

Functional Verification

Test 1: Real SDK `LLM.completion(...)` response snapshot stays independent

Step 1 — Reproduce / establish baseline without the fix:
Checked out origin/main, then ran a short SDK script that creates an LLM, calls llm.completion(...) against the configured LLM proxy with a real prompt, and mutates llm.metrics.accumulated_token_usage.model after the response is returned.

Command shape:

git checkout --quiet origin/main
OPENHANDS_SUPPRESS_BANNER=1 uv run python - <<'SCRIPT'
import os
from openhands.sdk import LLM, Message, TextContent

llm = LLM(model=os.environ['LLM_MODEL'], api_key=os.environ['LLM_API_KEY'], base_url=os.environ['LLM_BASE_URL'], max_output_tokens=8)
response = llm.completion([Message(role='user', content=[TextContent(text='Reply with exactly: OK')])])
print('response usage exists:', response.metrics.accumulated_token_usage is not None)
print('snapshot and live share object:', response.metrics.accumulated_token_usage is llm.metrics.accumulated_token_usage)
print('response metric model before live mutation:', response.metrics.accumulated_token_usage.model)
llm.metrics.accumulated_token_usage.model = 'qa-mutated-live-model'
print('live metric model after mutation:', llm.metrics.accumulated_token_usage.model)
print('response metric model after live mutation:', response.metrics.accumulated_token_usage.model)
SCRIPT

Observed output excerpt:

response usage exists: True
snapshot and live share object: True
response metric model before live mutation: litellm_proxy/openai/gpt-5.5
live metric model after mutation: qa-mutated-live-model
response metric model after live mutation: qa-mutated-live-model

This confirms the bug existed: the already-returned LLMResponse.metrics changed retroactively when the live LLM metrics object was mutated.

Step 2 — Apply the PR's changes:
Checked out PR commit d8ac40d4ec285e74fd4cda983b5269af05daec4e / branch fix/reuse-metrics-get-snapshot.

Step 3 — Re-run with the fix in place:
Ran the same SDK LLM completion flow on the PR branch.

Observed output excerpt:

response usage exists: True
snapshot and live share object: False
response metric model before live mutation: litellm_proxy/openai/gpt-5.5
live metric model after mutation: qa-mutated-live-model
response metric model after live mutation: litellm_proxy/openai/gpt-5.5

This confirms the fix works for the user-facing SDK path: the returned LLMResponse.metrics retains its original model after the live metrics object is mutated.

Test 2: Direct snapshot helper parity with canonical snapshot behavior

I also exercised the snapshot behavior without an external LLM call by constructing an SDK LLM, assigning a TokenUsage, calling _current_metrics_snapshot(), then mutating the live token usage. On origin/main, the snapshot and live object were identical and the snapshot changed; on the PR branch, they were distinct and metrics.get_snapshot() was also distinct.

Baseline excerpt:

snapshot and live share object before live mutation: True
snapshot model after live mutation: mutated-model
canonical get_snapshot shares object: False

PR excerpt:

snapshot and live share object before live mutation: False
snapshot model after live mutation: original-model
canonical get_snapshot shares object: False

Issues Found

None.

This review was created by an AI agent (OpenHands) on behalf of the user.

malhotra5

LGTM thank you!

all-hands-bot approved these changes May 27, 2026

View reviewed changes

all-hands-bot reviewed May 27, 2026

View reviewed changes

VascoSch92 force-pushed the fix/reuse-metrics-get-snapshot branch from d8ac40d to 9fe1960 Compare May 27, 2026 09:24

VascoSch92 changed the title ~~fix(sdk): reuse Metrics.get_snapshot() for LLMResponse snapshot (Q2 of #3341)~~ fix(sdk): snapshot LLM metrics via Metrics.get_snapshot() (Q2 of #3341) May 27, 2026

VascoSch92 mentioned this pull request May 27, 2026

[Tracker] Async LLM support (#3284): correctness bugs + maintainability issues #3341

Open

10 tasks

VascoSch92 requested a review from malhotra5 May 27, 2026 09:25

malhotra5 approved these changes May 27, 2026

View reviewed changes

malhotra5 merged commit 5df78e8 into main May 27, 2026
36 checks passed

malhotra5 deleted the fix/reuse-metrics-get-snapshot branch May 27, 2026 15:16

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(sdk): snapshot LLM metrics via Metrics.get_snapshot() (Q2 of #3341)#3397

fix(sdk): snapshot LLM metrics via Metrics.get_snapshot() (Q2 of #3341)#3397
malhotra5 merged 1 commit into
mainfrom
fix/reuse-metrics-get-snapshot

VascoSch92 commented May 27, 2026 •

edited by github-actions Bot

Loading

Uh oh!

github-actions Bot commented May 27, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 27, 2026 •

edited

Loading

Uh oh!

all-hands-bot left a comment

Uh oh!

all-hands-bot left a comment

Uh oh!

malhotra5 left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

VascoSch92 commented May 27, 2026 • edited by github-actions Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

Why this is good

Why it's a small diff

Testing

Uh oh!

github-actions Bot commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Python API breakage checks — ✅ PASSED

Uh oh!

github-actions Bot commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

REST API breakage checks (OpenAPI) — ✅ PASSED

Uh oh!

all-hands-bot left a comment

Choose a reason for hiding this comment

Uh oh!

all-hands-bot left a comment

Choose a reason for hiding this comment

✅ QA Report: PASS

Does this PR achieve its stated goal?

Test 1: Real SDK LLM.completion(...) response snapshot stays independent

Test 2: Direct snapshot helper parity with canonical snapshot behavior

Issues Found

Uh oh!

malhotra5 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

VascoSch92 commented May 27, 2026 •

edited by github-actions Bot

Loading

github-actions Bot commented May 27, 2026 •

edited

Loading

github-actions Bot commented May 27, 2026 •

edited

Loading

Test 1: Real SDK `LLM.completion(...)` response snapshot stays independent