GEPA optimizer crashes with KeyError on failed eval cases #6004

## 🔴 Required Information

### Describe the Bug

When using GEPA optimization (`AgentOptimizer.optimize()`) with evaluation sets that include failed cases (e.g., due to inference failures, user simulator errors, or API timeouts), the GEPA adapter crashes with a `KeyError` in `gepa_root_agent_prompt_optimizer.py:150`.

Failed eval cases don't populate score entries in the `result.scores` dictionary, but the GEPA adapter assumes all batch examples have corresponding scores. This causes the optimization loop to crash mid-evaluation.

**This is Part 2 of a three-part error cascade:**
- Bug 1 (PR #5878): Failed inferences leave `inference_result.inferences = None` → TypeError when iterating
- **Bug 2 (This Issue)**: Failed cases missing from `result.scores` dict → KeyError when accessing scores
- Bug 3: Failed cases have `score = None` → TypeError when rounding (will be mitigated by Bug 1 fix)

See related issues: #5876, #5115, #5403, and PR #5878.

### Steps to Reproduce

1. Create an eval set with some cases that will trigger inference failures
   - E.g., use `conversation_scenario` (user simulation) with edge cases that cause "LLM returned only thinking tokens"
   - Or configure cases that timeout or fail authentication
2. Run `AgentOptimizer.optimize()` with this eval set
3. **Expected**: Optimization completes with "X PASSED, Y FAILED" and continues
4. **Observed**: Once Bug 1 is fixed and the initial eval completes, the optimizer crashes in iteration 2 with:
   ```
   KeyError: '<example_id>'
   File ".../gepa_root_agent_prompt_optimizer.py", line 150, in evaluate
     score = result.scores[example_id]  # ← CRASH
   ```

### Expected Behavior

GEPA optimization should gracefully handle failed eval cases by:
1. Assigning a default score (e.g., 0.0) to failed cases
2. Continuing the optimization loop with successful cases
3. Logging which cases failed and why

### Observed Behavior

```
Traceback (most recent call last):
  ...
  File ".../google/adk/optimization/gepa_root_agent_prompt_optimizer.py", line 150, in evaluate
    score = result.scores[example_id]
            ~~~~~~~~~~~~~^^^^^^^^^^^^
KeyError: '151d97ad'
```

Optimization terminates prematurely, blocking the ability to use GEPA with any eval sets containing transient failures.

### Environment Details

- ADK Library Version: 1.31.1+
- Desktop OS: macOS/Linux  
- Python Version: 3.14.0

### Model Information

- Are you using LiteLLM: Yes/No
- Which model is being used: (e.g., gemini-2.5-pro)

---

## 🟡 Optional Information

### Root Cause Analysis

**File**: `google/adk/optimization/gepa_root_agent_prompt_optimizer.py:150`

**Current Code**:
```python
for example_id in batch:
  score = result.scores[example_id]  # ← CRASH if example_id not in result.scores
  scores.append(score)

  eval_data = result.data.get(example_id, {}) if result.data else {}
  outputs.append(eval_data)
  trajectories.append(eval_data)
```

**The Problem**:
- Failed eval cases are excluded from `result.scores`
- But they're still in the `batch` list being iterated
- Direct dictionary access `[example_id]` raises KeyError

**Note**: This manifests AFTER Bug 1 (PR #5878) is fixed, because:
1. Bug 1 fix → failed inferences don't crash during iteration
2. Evaluation completes with some cases marked as FAILED
3. Failed cases don't populate `result.scores`
4. This code assumes all cases in batch have scores → KeyError
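
The failure mode can be reproduced without ADK at all. The sketch below uses plain lists and dicts as stand-ins for `batch` and `result.scores`; the ids, the values, and the helper name `collect_scores_strict` are made up for illustration and are not part of the library.

```python
# Stand-in data: plain containers, not the real ADK result object.
batch = ["a1", "b2", "c3"]       # example ids sampled for this iteration
scores = {"a1": 1.0, "c3": 0.5}  # "b2" failed, so it has no score entry


def collect_scores_strict(batch, scores):
  """Mirror the failing pattern: direct indexing for every id in the batch."""
  return [scores[example_id] for example_id in batch]


try:
  collect_scores_strict(batch, scores)
  raised = None
except KeyError as err:
  raised = err.args[0]  # the id that had no score entry

print(f"KeyError for example id: {raised}")
```

The lookup fails on the first id that is present in `batch` but absent from `scores`, which matches the traceback above.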

### Suggested Fix

Use `.get()` with a conservative default of `0.0` for failed cases:

```python
for example_id in batch:
  score = result.scores.get(example_id, 0.0)  # Default to 0.0 for failed cases
  scores.append(score)

  eval_data = result.data.get(example_id, {}) if result.data else {}
  outputs.append(eval_data)
  trajectories.append(eval_data)
```

**Rationale**:
- **Graceful degradation**: Optimization continues with successful cases
- **Conservative default**: 0.0 treats failure as worst performance (penalizes failing prompts)
- **Semantic correctness**: Failed inference IS a failure and should contribute negatively to score
- **Minimal change**: One `.get()` call with default value
- **No breaking changes**: Behavior identical when all cases succeed
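
The one-line change above does not cover the "log which cases failed" part of the expected behavior. A minimal sketch of how both could be combined follows, assuming `scores` behaves like a plain dict keyed by example id. The helper `collect_scores` and the constant `FAILED_CASE_SCORE` are hypothetical names, not existing ADK APIs.

```python
import logging

logger = logging.getLogger(__name__)

# Hypothetical constant: the score assigned to cases with no score entry.
FAILED_CASE_SCORE = 0.0


def collect_scores(batch, scores, default=FAILED_CASE_SCORE):
  """Return one score per example id, plus the ids that had no score."""
  collected = []
  missing = []
  for example_id in batch:
    if example_id in scores:
      collected.append(scores[example_id])
    else:
      # Failed case: record it and fall back to the default score.
      missing.append(example_id)
      collected.append(default)
  if missing:
    logger.warning(
        "No score for %d of %d eval cases; defaulting to %s: %s",
        len(missing), len(batch), default, missing,
    )
  return collected, missing
```

For example, `collect_scores(["a1", "b2", "c3"], {"a1": 1.0, "c3": 0.5})` returns `([1.0, 0.0, 0.5], ["b2"])` and emits one warning. Returning the missing ids alongside the scores lets the caller decide whether to surface them in the final report.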

### Relationship to Other Issues

- **PR #5878**: Fixes Bug 1 (`inference_result.inferences = None`) by returning early with a failed `EvalCaseResult`
- **Issue #5876**: Reports Bug 1; PR #5878 is the pending fix
- **Issues #5115, #5403**: Report Bug 3 (`round(None)`); a fix has likely not been merged yet

**Fix order**:
1. PR #5878 merges (Bug 1) → failed cases no longer crash during iteration
2. The fix for this issue (Bug 2) → failed cases don't crash during score aggregation
3. The Bug 3 fix becomes a safety net (defensive None check before rounding)
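
The defensive None check mentioned for Bug 3 could look roughly like the sketch below. The actual call site of `round()` is not shown in this issue, so `round_score` and its parameters are hypothetical and only illustrate the guard.

```python
def round_score(score, ndigits=2, default=0.0):
  """Round a score, substituting `default` when the score is None."""
  if score is None:
    # A failed case with no numeric score; avoid round(None) raising TypeError.
    return default
  return round(score, ndigits)
```

With this guard, `round_score(None)` returns `0.0` instead of raising, and numeric scores are rounded as before.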

### How Often This Issue Occurs

**Always (100%)** when:
- GEPA optimization is run with eval sets containing failed cases
- AND any of those failed cases are sampled during the optimization loop
- AND the initial baseline eval completes (that is, Bug 1 has been fixed or worked around so it no longer crashes the baseline eval)

In production environments with transient failures (rate limits, timeouts, network errors), this is a regular occurrence.

### Local Workaround

Monkeypatch the optimizer to handle missing scores:

```python
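# Monkey-patch before calling the optimizer.
# GepaRootAgentPromptOptimizer must already be imported in the calling module.
# Caveat: this only fills in scores after the wrapped call returns; a KeyError
# raised inside the wrapped call itself is not caught by this wrapper.
original_evaluate = GepaRootAgentPromptOptimizer.evaluate

def patched_evaluate(self, batch, *args, **kwargs):
  # Forward any extra arguments unchanged to the original method.
  result = original_evaluate(self, batch, *args, **kwargs)
  # Ensure every example in the batch has a score; failed cases get 0.0.
  for example_id in batch:
    if example_id not in result.scores:
      result.scores[example_id] = 0.0
  return result

GepaRootAgentPromptOptimizer.evaluate = patched_evaluate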
```

### Testing

The fix should be validated with:
1. Unit test covering batch evaluation with missing score entry
2. Integration test: GEPA optimization with mixed passing/failing eval cases
3. Regression test: Verify scoring identical when all cases pass
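
As a starting point for the unit and regression tests, here is a pytest-style sketch. It exercises a hypothetical stand-in, `aggregate_scores`, that mirrors the score loop after the suggested fix. Real tests would call the adapter itself, whose constructor and fixtures are not shown in this issue.

```python
def aggregate_scores(batch, scores, default=0.0):
  """Hypothetical stand-in for the adapter's score loop after the fix."""
  return [scores.get(example_id, default) for example_id in batch]


def test_missing_score_defaults_to_zero():
  # "b2" has no score entry, as a failed eval case would.
  assert aggregate_scores(["a1", "b2"], {"a1": 0.8}) == [0.8, 0.0]


def test_all_cases_pass_is_unchanged():
  # Regression check: the result matches direct indexing when nothing is missing.
  batch = ["a1", "b2"]
  scores = {"a1": 0.8, "b2": 0.4}
  assert aggregate_scores(batch, scores) == [scores[i] for i in batch]


def test_all_cases_fail():
  # An empty scores dict should yield the default for every example.
  assert aggregate_scores(["a1", "b2"], {}) == [0.0, 0.0]
```

The second test covers the "no breaking changes" claim: when every case has a score, the defaulting path is never taken.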

---

## Additional Context

This bug is part of a broader resilience issue in GEPA optimization. The three cascading bugs prevent optimization from completing when ANY eval case fails, even transiently. Together they block production usage of GEPA with real-world eval sets.

The fixes are minimal (1-2 lines each) and maintain backward compatibility while enabling graceful degradation.
