Cost tracking discrepancies: SDK vs LiteLLM proxy virtual key costs diverge by agent type #603

@simonrosenberg

Summary

During PR validation runs for OpenHands/software-agent-sdk#2656, we observed three distinct cost tracking failures, one per agent type, plus a token-count aggregation bug that affects all agent types. No single cost source works correctly across all agent types.

Evidence

Runs from 2026-04-02 validating pr/acp-node22-and-defer-init-main (SDK commit 217c454), eval_limit=5:

K8s Job                        Benchmark             Agent Type    Model
eval-23899965586-claude-son    swebench              acp-claude    claude-sonnet-4-5-20250929
eval-23899965217-claude-son    swebench              acp-gemini    claude-sonnet-4-5-20250929
eval-23899966017-claude-4-6    swebench              default       claude-4.6-opus
eval-23899971488-claude-son    swebenchmultimodal    acp-claude    claude-sonnet-4-5-20250929
eval-23899973560-claude-son    swebenchmultimodal    acp-gemini    claude-sonnet-4-5-20250929
eval-23899975488-claude-4-6    swebenchmultimodal    default       claude-4.6-opus

GCS results: gs://openhands-evaluation-results/{benchmark}/{model_slug}/{eval_run_id}/

Bug 1: acp-gemini — SDK reports $0 cost, only proxy tracks spend

Gemini CLI does not report cost or token usage back to the SDK. metrics.costs and metrics.token_usages are empty arrays. metrics.accumulated_cost is $0.00.

Meanwhile, the LiteLLM proxy virtual key correctly tracks roughly $7–$14 per instance via test_result.proxy_cost.

Impact: Without proxy tracking, Gemini costs are completely invisible. Also notable: by proxy cost, acp-gemini runs are ~50× more expensive than the acp-claude runs on the same instances (about 40–57× per instance).

Instance                                    SDK Cost    Proxy Cost
django__django-12155                        $0.0000     $7.08
django__django-13279                        $0.0000     $12.26
django__django-14434                        $0.0000     $10.18
scikit-learn__scikit-learn-13439            $0.0000     $7.49
scikit-learn__scikit-learn-25232            $0.0000     $13.55
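
The ~50× figure can be checked against the numbers in this issue. A quick sketch, comparing the acp-gemini proxy costs above with the acp-claude proxy costs from the Bug 3 table (the variable names are illustrative only, not part of the SDK or benchmarks code):

```python
# Proxy-tracked cost per instance in USD, copied from the tables in this issue.
gemini_proxy = {
    "django__django-12155": 7.08,
    "django__django-13279": 12.26,
    "django__django-14434": 10.18,
    "scikit-learn__scikit-learn-13439": 7.49,
    "scikit-learn__scikit-learn-25232": 13.55,
}
claude_proxy = {
    "django__django-12155": 0.1789,
    "django__django-13279": 0.2931,
    "django__django-14434": 0.1978,
    "scikit-learn__scikit-learn-13439": 0.1689,
    "scikit-learn__scikit-learn-25232": 0.2376,
}

# Per-instance ratio of acp-gemini to acp-claude proxy spend.
ratios = {inst: gemini_proxy[inst] / claude_proxy[inst] for inst in gemini_proxy}

# Ratio of total spend across the five instances.
aggregate = sum(gemini_proxy.values()) / sum(claude_proxy.values())

for inst, ratio in sorted(ratios.items()):
    print(f"{inst:<40} {ratio:5.1f}x")
print(f"{'aggregate':<40} {aggregate:5.1f}x")
```

On these five instances the per-instance ratio falls between roughly 40× and 57×, and the aggregate is about 47×.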

Bug 2: default (OpenHands agent) — Proxy reports $0, only SDK tracks cost

The default OpenHands agent has full per-turn SDK cost breakdowns (via LiteLLM response headers), but test_result.proxy_cost is $0.00 for every instance.

Root cause: Virtual keys are only created and injected for ACP agents (in benchmarks/utils/acp.py), not for the default agent path. The default agent uses the base API key directly, bypassing proxy spend tracking.

Instance                                    SDK Cost    Proxy Cost
django__django-12155                        $0.2047     $0.00
django__django-13279                        $0.5177     $0.00
django__django-14434                        $0.4271     $0.00
scikit-learn__scikit-learn-13439            $0.0988     $0.00
scikit-learn__scikit-learn-25232            $0.4557     $0.00
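
One way the default path could obtain a per-run virtual key is by calling the LiteLLM proxy's `/key/generate` endpoint before the run starts. The sketch below only builds the request and does not send it. It does not reproduce the logic in benchmarks/utils/acp.py; the function name, the `eval-` alias prefix, and the metadata field are hypothetical, and the payload should be checked against the deployed proxy version.

```python
import json
import urllib.request


def build_key_generate_request(
    proxy_base_url: str, master_key: str, eval_run_id: str
) -> urllib.request.Request:
    """Build (but do not send) a POST to the LiteLLM proxy's /key/generate.

    Assumption: the proxy is reachable at proxy_base_url and accepts the
    master key as a Bearer token. The alias/metadata values are illustrative.
    """
    payload = {
        "key_alias": f"eval-{eval_run_id}",          # hypothetical naming scheme
        "metadata": {"eval_run_id": eval_run_id},    # hypothetical tag for later lookup
    }
    return urllib.request.Request(
        url=f"{proxy_base_url.rstrip('/')}/key/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {master_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Example with placeholder values; nothing is sent over the network here.
req = build_key_generate_request("http://localhost:4000/", "sk-master-placeholder", "23899966017")
print(req.get_method(), req.full_url)
```

The key returned by the proxy would then be passed to the default agent in place of the base API key, so that its spend is attributed to the run.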

Bug 3: acp-claude — SDK overestimates cost by 5–38% vs proxy

Both sources report non-zero costs, but they diverge. SDK cost (from UsageUpdate.cost reported by claude-agent-acp) is consistently higher than the proxy-tracked cost.

Possible cause: claude-agent-acp reports list price while the proxy applies prompt caching discounts.

Instance                                    SDK Cost    Proxy Cost    Diff (proxy vs SDK)
django__django-14434                        $0.2106     $0.1978       -6.1%
django__django-13279                        $0.4767     $0.2931       -38.5%
scikit-learn__scikit-learn-13439            $0.1784     $0.1689       -5.3%
scikit-learn__scikit-learn-25232            $0.2592     $0.2376       -8.3%
django__django-12155                        $0.2011     $0.1789       -11.0%
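
The Diff column matches (proxy − SDK) / SDK, i.e. it is expressed relative to the SDK figure. A short sketch that recomputes it from the table, and also shows the same gap relative to the proxy figure (the function names are illustrative only):

```python
def proxy_vs_sdk_pct(sdk_cost: float, proxy_cost: float) -> float:
    """Proxy cost relative to SDK cost, in percent (negative: proxy is lower)."""
    return (proxy_cost - sdk_cost) / sdk_cost * 100.0


def sdk_vs_proxy_pct(sdk_cost: float, proxy_cost: float) -> float:
    """SDK cost relative to proxy cost, in percent (positive: SDK is higher)."""
    return (sdk_cost - proxy_cost) / proxy_cost * 100.0


# (instance, SDK cost, proxy cost) in USD, copied from the table above.
rows = [
    ("django__django-14434", 0.2106, 0.1978),
    ("django__django-13279", 0.4767, 0.2931),
    ("scikit-learn__scikit-learn-13439", 0.1784, 0.1689),
    ("scikit-learn__scikit-learn-25232", 0.2592, 0.2376),
    ("django__django-12155", 0.2011, 0.1789),
]

for instance, sdk, proxy in rows:
    print(
        f"{instance:<40} "
        f"proxy vs SDK: {proxy_vs_sdk_pct(sdk, proxy):+6.1f}%   "
        f"SDK vs proxy: {sdk_vs_proxy_pct(sdk, proxy):+6.1f}%"
    )
```

The choice of baseline matters for the outlier: django__django-13279 is −38.5% measured against the SDK cost, but the SDK figure is about 63% above the proxy figure.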

Bug 4: accumulated_token_counts is always zero

The top-level metrics.accumulated_token_counts field (with prompt, completion, cache_read keys) is 0 for ALL agent types. Actual token data exists in the per-turn metrics.token_usages array for default and acp-claude; for acp-gemini that array is empty as well.

This field appears to be stale or never aggregated, which means any downstream reporting that reads accumulated_token_counts will see zeros.
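
A minimal sketch of the missing aggregation, assuming each per-turn entry in token_usages is a dict. The per-turn key names used here (`prompt_tokens`, `completion_tokens`, `cache_read_tokens`) are assumptions for illustration; only the top-level keys (prompt, completion, cache_read) come from this issue.

```python
from typing import Dict, Iterable, Mapping

# Maps assumed per-turn key names to the top-level keys named in this issue.
# The left-hand names are hypothetical and must be checked against the SDK.
PER_TURN_TO_TOTAL = {
    "prompt_tokens": "prompt",
    "completion_tokens": "completion",
    "cache_read_tokens": "cache_read",
}


def aggregate_token_counts(token_usages: Iterable[Mapping[str, int]]) -> Dict[str, int]:
    """Sum per-turn token usage into one accumulated_token_counts-style dict."""
    totals = {name: 0 for name in PER_TURN_TO_TOTAL.values()}
    for usage in token_usages:
        for per_turn_key, total_key in PER_TURN_TO_TOTAL.items():
            # Treat missing or None values as zero so partial entries still sum.
            totals[total_key] += int(usage.get(per_turn_key) or 0)
    return totals


# Made-up sample data, for illustration only.
sample_usages = [
    {"prompt_tokens": 1200, "completion_tokens": 300, "cache_read_tokens": 800},
    {"prompt_tokens": 1500, "completion_tokens": 250},
]
print(aggregate_token_counts(sample_usages))
print(aggregate_token_counts([]))  # the acp-gemini case: empty array, all zeros
```

With an empty token_usages array, as seen for acp-gemini, the result is still all zeros, so this addresses the aggregation gap but not the missing Gemini data.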

Summary Table

Agent Type    SDK Cost                  Proxy Cost    Token Counts     Which source works?
acp-claude    ✓ (overestimates ~15%)    ✓             Per-turn only    Both (proxy more accurate)
acp-gemini    ✗ ($0)                    ✓             ✗ (empty)        Only proxy
default       ✓                         ✗ ($0)        Per-turn only    Only SDK
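
Until the underlying bugs are fixed, downstream reporting could pick whichever source is non-zero for each instance. The sketch below is one possible interim rule, not existing benchmark code: the function name is hypothetical, and preferring the proxy figure when both are present is an assumption that follows the "proxy more accurate" note in the table, which the acp-claude divergence investigation may overturn.

```python
from typing import Tuple


def effective_cost(sdk_cost: float, proxy_cost: float) -> Tuple[float, str]:
    """Return (cost, source) for one instance.

    Assumption: prefer the proxy figure when it is non-zero, fall back to
    the SDK figure otherwise, and flag the case where neither is available.
    """
    if proxy_cost > 0:
        return proxy_cost, "proxy"
    if sdk_cost > 0:
        return sdk_cost, "sdk"
    return 0.0, "none"


# One django__django-12155 row per agent type, taken from the tables in this issue.
print(effective_cost(0.2011, 0.1789))  # acp-claude: both sources report a cost
print(effective_cost(0.0, 7.08))       # acp-gemini: only the proxy reports a cost
print(effective_cost(0.2047, 0.0))     # default: only the SDK reports a cost
```

Recording the source alongside the cost keeps it visible which instances relied on a fallback.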

Suggested Fixes

  1. Enable virtual key tracking for default agent — create/inject virtual keys in the default evaluation path, not just ACP
  2. Fix gemini SDK cost reporting — either parse gemini CLI's cost output or estimate from token counts
  3. Aggregate accumulated_token_counts — sum token_usages array into the top-level field, or deprecate it
  4. Investigate acp-claude cost divergence — determine whether SDK or proxy is billing-accurate
