GEPA optimizer crashes with KeyError on failed eval cases #6004

@thoang3

🔴 Required Information

Describe the Bug

When using GEPA optimization (AgentOptimizer.optimize()) with evaluation sets that include failed cases (e.g., due to inference failures, user simulator errors, or API timeouts), the GEPA adapter crashes with a KeyError in gepa_root_agent_prompt_optimizer.py:150.

Failed eval cases don't populate score entries in the result.scores dictionary, but the GEPA adapter assumes all batch examples have corresponding scores. This causes the optimization loop to crash mid-evaluation.

This is Part 2 of a three-part error cascade:

  • Bug 1 (PR #5878, "fix(eval): handle failed inference results without invocations"): Failed inferences leave inference_result.inferences = None → TypeError when iterating
  • Bug 2 (This Issue): Failed cases missing from result.scores dict → KeyError when accessing scores
  • Bug 3: Failed cases have score = None → TypeError when rounding (will be mitigated by Bug 1 fix)

See related issues: #5876, #5115, #5403, and PR #5878.

Steps to Reproduce

  1. Create an eval set with some cases that will trigger inference failures
    • E.g., use conversation_scenario (user simulation) with edge cases that cause "LLM returned only thinking tokens"
    • Or configure cases that timeout or fail authentication
  2. Run AgentOptimizer.optimize() with this eval set
  3. Expected: Optimization completes with "X PASSED, Y FAILED" and continues
  4. Observed: With the Bug 1 fix applied, the initial eval completes, but the run crashes in iteration 2 with:
    KeyError: '<example_id>'
    File ".../gepa_root_agent_prompt_optimizer.py", line 150, in evaluate
      score = result.scores[example_id]  # ← CRASH
    

Expected Behavior

GEPA optimization should gracefully handle failed eval cases by:

  1. Assigning a default score (e.g., 0.0) to failed cases
  2. Continuing the optimization loop with successful cases
  3. Logging which cases failed and why

Observed Behavior

Traceback (most recent call last):
  ...
  File ".../google/adk/optimization/gepa_root_agent_prompt_optimizer.py", line 150, in evaluate
    score = result.scores[example_id]
            ~~~~~~~~~~~~~^^^^^^^^^^^^
KeyError: '151d97ad'

Optimization terminates prematurely, which makes GEPA unusable with any eval set that contains transient failures.

Environment Details

  • ADK Library Version: 1.31.1+
  • Desktop OS: macOS/Linux
  • Python Version: 3.14.0

Model Information

  • Are you using LiteLLM: Yes/No
  • Which model is being used: (e.g., gemini-2.5-pro)

🟡 Optional Information

Root Cause Analysis

File: google/adk/optimization/gepa_root_agent_prompt_optimizer.py:150

Current Code:

for example_id in batch:
  score = result.scores[example_id]  # ← CRASH if example_id not in result.scores
  scores.append(score)

  eval_data = result.data.get(example_id, {}) if result.data else {}
  outputs.append(eval_data)
  trajectories.append(eval_data)

The Problem:

  • Failed eval cases are excluded from result.scores
  • But they're still in the batch list being iterated
  • Direct dictionary access [example_id] raises KeyError

Note: This manifests AFTER Bug 1 (PR #5878) is fixed, because:

  1. Bug 1 fix → failed inferences don't crash during iteration
  2. Evaluation completes with some cases marked as FAILED
  3. Failed cases don't populate result.scores
  4. This code assumes all cases in batch have scores → KeyError
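
The failure mode described above can be reproduced without ADK. The following is a minimal sketch using plain Python stand-ins; `batch`, `result_scores`, and `lookup_scores` are hypothetical names chosen for illustration, not ADK types or functions.

```python
# Hypothetical stand-ins for the optimizer's data (plain Python, not ADK types).
batch = ["a1", "b2", "c3"]               # every sampled example id
result_scores = {"a1": 0.8, "c3": 0.5}   # "b2" failed, so it has no entry


def lookup_scores(batch, scores):
    """Mirrors the direct-indexing pattern quoted in this issue."""
    return [scores[example_id] for example_id in batch]


missing_id = None
try:
    lookup_scores(batch, result_scores)
except KeyError as exc:
    # The KeyError carries the id that had no score entry.
    missing_id = exc.args[0]
    print(f"KeyError for example id: {missing_id}")
```

The lookup succeeds for "a1" and fails on "b2", the first id in the batch without a score entry, so the loop never reaches "c3".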

Suggested Fix

Use .get() with a conservative default of 0.0 for failed cases:

for example_id in batch:
  score = result.scores.get(example_id, 0.0)  # Default to 0.0 for failed cases
  scores.append(score)

  eval_data = result.data.get(example_id, {}) if result.data else {}
  outputs.append(eval_data)
  trajectories.append(eval_data)

Rationale:

  • Graceful degradation: Optimization continues with successful cases
  • Conservative default: 0.0 treats failure as worst performance (penalizes failing prompts)
  • Semantic correctness: Failed inference IS a failure and should contribute negatively to score
  • Minimal change: One .get() call with default value
  • No breaking changes: Behavior identical when all cases succeed
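
The suggested fix can be combined with the logging requested under Expected Behavior. The sketch below is one possible shape, assuming scores live in a plain dict keyed by example id; `collect_scores` and `FAILED_CASE_SCORE` are hypothetical names, and the actual optimizer loop may be structured differently.

```python
import logging

logger = logging.getLogger(__name__)

# Assumption: 0.0 is the worst possible value on the metric's scale.
FAILED_CASE_SCORE = 0.0


def collect_scores(batch, scores, default=FAILED_CASE_SCORE):
    """Return one score per example id, substituting `default` for missing ids.

    Also returns the list of ids that had no score, so callers can report them.
    """
    collected = []
    missing = []
    for example_id in batch:
        if example_id in scores:
            collected.append(scores[example_id])
        else:
            missing.append(example_id)
            collected.append(default)
    if missing:
        logger.warning(
            "%d of %d eval cases had no score; defaulting to %s: %s",
            len(missing), len(batch), default, missing,
        )
    return collected, missing
```

Returning the missing ids alongside the scores keeps the default visible to the caller instead of silently folding failures into the aggregate.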

Relationship to Other Issues

Fix order:

  1. PR #5878, "fix(eval): handle failed inference results without invocations", merges (Bug 1) → failed cases no longer crash during iteration
  2. This issue's fix (Bug 2) → failed cases don't crash during score aggregation
  3. Bug 3 becomes a safety net (defensive None check before rounding)
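
For the Bug 3 safety net, a defensive None check before rounding might look like the sketch below. `format_score` is a hypothetical helper for illustration; where the rounding actually happens in ADK is not shown in this issue.

```python
from typing import Optional


def format_score(score: Optional[float], ndigits: int = 2) -> str:
    """Round a score for display, tolerating a missing (None) score."""
    if score is None:
        # round(None, 2) would raise TypeError, so handle it explicitly.
        return "n/a"
    return str(round(score, ndigits))
```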

How Often This Issue Occurs

Always (100%) when:

  • GEPA optimization is run with eval sets containing failed cases
  • AND any of those failed cases are sampled during the optimization loop
  • AND the initial baseline eval completes (that is, Bug 1 has been fixed or worked around, so it no longer aborts the baseline eval)

In production environments with transient failures (rate limits, timeouts, network errors), this is a regular occurrence.

Local Workaround

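Monkeypatch the optimizer to handle missing scores. Caveat: this wrapper only runs after evaluate returns, so it helps only if the failing lookup happens in a caller of the patched method. The traceback above places the lookup inside a function named evaluate; if that is the method patched here, the KeyError is raised before the wrapper can fill in the missing scores, and the patch must instead be applied to whatever produces result.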

# Monkey-patch before calling optimizer
original_evaluate = GepaRootAgentPromptOptimizer.evaluate

def patched_evaluate(self, batch):
  result = original_evaluate(self, batch)
  # Ensure all batch examples have scores
  for example_id in batch:
    if example_id not in result.scores:
      result.scores[example_id] = 0.0
  return result

GepaRootAgentPromptOptimizer.evaluate = patched_evaluate
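
Because the traceback places the failing lookup inside evaluate, an alternative is to make the scores mapping itself tolerate missing keys. The sketch below relies on Python's `__missing__` hook, which `dict` subclasses can define to handle `[]` access on absent keys. `DefaultScoreDict` is a hypothetical helper, and it assumes result.scores is a plain dict that can be replaced at the point where result is produced; this issue does not confirm where that is.

```python
import logging

logger = logging.getLogger(__name__)


class DefaultScoreDict(dict):
    """dict that returns a default score for missing keys under [] access."""

    def __init__(self, scores, default=0.0):
        super().__init__(scores)
        self.default = default

    def __missing__(self, key):
        # Called by dict.__getitem__ when `key` is absent; the key is not stored.
        logger.warning("No score for eval case %r; using default %s", key, self.default)
        return self.default


scores = DefaultScoreDict({"a1": 0.8})
print(scores["a1"], scores["b2"])
```

With this in place, the existing `result.scores[example_id]` lookup would return the default instead of raising. Membership tests such as `"b2" in scores` still report the case as missing, so failed cases remain distinguishable.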

Testing

The fix should be validated with:

  1. Unit test covering batch evaluation with missing score entry
  2. Integration test: GEPA optimization with mixed passing/failing eval cases
  3. Regression test: Verify that scoring is identical when all cases pass
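
The unit and regression tests in the list above could start from something like this sketch. `aggregate_scores` is a hypothetical stand-in for the patched loop, written here only so the tests are self-contained; real tests would exercise the optimizer's own code path.

```python
import unittest


def aggregate_scores(batch, scores, default=0.0):
    """Hypothetical stand-in for the patched score-collection loop."""
    return [scores.get(example_id, default) for example_id in batch]


class AggregateScoresTest(unittest.TestCase):

    def test_missing_score_defaults_to_zero(self):
        # "b2" has no score entry, as a failed eval case would.
        result = aggregate_scores(["a1", "b2"], {"a1": 0.9})
        self.assertEqual(result, [0.9, 0.0])

    def test_all_present_scores_unchanged(self):
        # Regression check: behavior matches direct indexing when nothing fails.
        scores = {"a1": 0.9, "b2": 0.4}
        batch = ["a1", "b2"]
        self.assertEqual(
            aggregate_scores(batch, scores),
            [scores[example_id] for example_id in batch],
        )
```

Saved as a test module, this can be run with `python -m unittest`.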

Additional Context

This bug is part of a broader resilience issue in GEPA optimization. The three cascading bugs prevent optimization from completing when ANY eval case fails, even transiently. Together they block production usage of GEPA with real-world eval sets.

The fixes are minimal (1-2 lines each) and maintain backward compatibility while enabling graceful degradation.

Metadata

Labels

eval: [Component] This issue is related to evaluation
