6 changes: 5 additions & 1 deletion sdks/python/Makefile
@@ -14,7 +14,7 @@ SETTINGS_DST := src/learning_commons_evaluators/settings

.PHONY: help build check-build test unit-test contract-test \
generate-settings check-generated sync-settings check-sync \
lint format format-check typecheck pip-check verify coverage
lint lint-fix format format-check typecheck pip-check verify coverage

help:
	@echo "Usage: make <target>"
@@ -23,6 +23,7 @@ help:
	@echo " check-build Verify build artifacts are up to date (use in CI)"
	@echo ""
	@echo " lint Ruff linter (src/, tests/, scripts/)"
	@echo " lint-fix Ruff linter with --fix (safe auto-fixes)"
	@echo " format Apply Ruff formatter"
	@echo " format-check Fail if Ruff would reformat any file"
	@echo " typecheck Mypy on src package + tests"
@@ -58,6 +59,9 @@ check-build: check-generated check-sync
lint:
	$(RUFF) check src tests scripts

lint-fix:
	$(RUFF) check --fix src tests scripts

format:
	$(RUFF) format src tests scripts

100 changes: 83 additions & 17 deletions sdks/python/README.md
@@ -302,32 +302,98 @@ config = create_config_no_telemetry(logger=create_silent_logger())

## Error handling

During a normal `evaluate()` / `evaluate_sync()` run, failures from evaluator input checks, configuration, LLM prompt steps, and output validation typically surface as subclasses of `EvaluatorError`. Failures inside LLM prompt steps are wrapped at the boundary so callers see a predictable, sanitized hierarchy instead of raw LangChain, OpenAI, Anthropic, or HTTP-client exceptions. A few documented paths still raise standard Python exceptions — for example `ValueError` from `execute_prompt_chain_step` when `json_dict_normalizer` is set without `parser_output_type`, and `RuntimeError` from `evaluate_sync()` when an asyncio event loop is already running on the current thread. Those are programmer errors, not evaluation failures.
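
As a rough illustration of the `RuntimeError` case, a synchronous wrapper can refuse to start when an event loop is already running on the current thread. The sketch below is an assumption about how such a guard might look, not the SDK's implementation; `run_sync`, `fake_evaluate`, and `caller_inside_loop` are hypothetical names.

```python
import asyncio


async def fake_evaluate(text: str) -> str:
    """Hypothetical stand-in for an async ``evaluate()`` call."""
    await asyncio.sleep(0)
    return f"evaluated: {text}"


def run_sync(text: str) -> str:
    """Run ``fake_evaluate`` to completion, refusing to nest event loops."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running on this thread, so it is safe to start one.
        return asyncio.run(fake_evaluate(text))
    raise RuntimeError("event loop already running; await the async API instead")


async def caller_inside_loop() -> str:
    """From async code the guard fires; awaiting the coroutine is the fix."""
    try:
        return run_sync("hello")
    except RuntimeError:
        return await fake_evaluate("hello")
```

From plain synchronous code `run_sync("a")` completes normally. Inside a coroutine the guard raises, and the caller falls back to awaiting the async function directly.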

### Hierarchy

```
EvaluatorError
├── ConfigurationError — bad config (missing provider, unknown model, malformed settings)
├── InputValidationError — caller-supplied input failed validation
└── APIError — failures originating in the LLM provider call
    ├── AuthenticationError — 401 / 403
    ├── RateLimitError — 429; carries `retry_after` (seconds)
    ├── NetworkError — connection refused, DNS failure, broken TLS
    ├── RequestTimeoutError — request exceeded the configured timeout
    └── OutputValidationError — LLM response failed to parse or didn't match the expected schema
```

`InputValidationError` is named that way deliberately to avoid collision with `pydantic.ValidationError`. `RequestTimeoutError` is named that way to avoid shadowing the builtin `TimeoutError`. There is **no** `ValidationError` or `EvaluatorTimeoutError` in the public API.

### Knowing when to retry

Every `EvaluatorError` exposes a boolean `retryable` attribute. This is the single signal callers should consult when wrapping `evaluate()` in retry logic — there is no separate marker class to check. Subclasses set sensible defaults:

- Retryable by default: `RateLimitError`, `NetworkError`, `RequestTimeoutError`, `OutputValidationError`, and any `APIError` with a 5xx status code.
- Not retryable: `ConfigurationError`, `InputValidationError`, `AuthenticationError`, and `APIError` with a 4xx status code.

`retryable` is also accepted as an `__init__` kwarg on `APIError` and `NetworkError` if you need to flag a specific instance differently (e.g. a permanently-bad hostname).

```python
from learning_commons_evaluators import (
ConfigurationError, # Missing/invalid config
ValidationError, # Invalid input
AuthenticationError, # Invalid API keys (401/403)
RateLimitError, # Rate limit exceeded (429) - has retry_after
NetworkError, # Network failures
EvaluatorTimeoutError, # Request timeout
APIError, # Other API errors
)
import time
from learning_commons_evaluators import EvaluatorError, RateLimitError

for attempt in range(3):
    try:
        result = evaluator.evaluate_sync(input)
        break
    except EvaluatorError as e:
        if not e.retryable or attempt == 2:
            raise
        delay = e.retry_after if isinstance(e, RateLimitError) and e.retry_after else 2 ** attempt
        time.sleep(delay)  # retry_after is in seconds
```
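
To make the default `retryable` policy concrete, here is a minimal sketch of how per-class defaults and the `__init__` override could be modeled. The classes below are hypothetical stand-ins that reuse the SDK's names for readability; the real constructors and attributes may differ.

```python
class EvaluatorError(Exception):
    """Stand-in base class: not retryable unless a subclass says otherwise."""

    retryable = False


class APIError(EvaluatorError):
    """Stand-in provider error: 5xx defaults to retryable, anything else does not."""

    def __init__(self, message="API error", *, status_code=None, retryable=None):
        super().__init__(message)
        self.status_code = status_code
        if retryable is None:
            # Assumed default policy: only server-side (5xx) failures are retried.
            retryable = status_code is not None and 500 <= status_code < 600
        self.retryable = retryable


class NetworkError(APIError):
    """Stand-in network failure: retryable unless the caller overrides it."""

    def __init__(self, message="Network error", *, retryable=True):
        super().__init__(message, retryable=retryable)
```

With these stand-ins, `APIError(status_code=503).retryable` is `True`, `APIError(status_code=404).retryable` is `False`, and `NetworkError(retryable=False)` flags a single instance as permanent without needing a new class.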

### Sanitization and debugging context

Error **messages** (the value returned by `str(err)`) are short and controlled. Raw provider strings — which may contain prompt echoes, user text, or fragments of API keys — are **not** interpolated into the SDK exception's message. Structured detail lives on attributes instead:

- `status_code` on `APIError` — HTTP status from the provider, when one was returned. Populated from the provider exception's `.status_code` attribute when present (preferred over message regex).
- `retry_after` on `RateLimitError` — suggested delay before retry, **in seconds**, or `None` if the provider didn't return a `Retry-After` header.
- `provider` on `APIError` — the `LLMProvider` being called when the failure occurred.
- `model` on `APIError` — the model ID requested.
- `response_body` on `APIError` — decoded response body. Opt-in for debugging; may contain echoed prompt content, so treat as sensitive.
- `request_id` on `APIError` — provider request ID, useful for support escalation.
- `validation_errors` on `OutputValidationError` — per-field entries from Pydantic's `errors()` API after `sanitize_pydantic_errors` (`loc`, `type`, optional `url`, and safe primitive `ctx` only — no `input` or `msg`, which can echo model output).

The original provider exception is preserved on `__cause__` (via `raise … from e`), so debuggers, tracebacks, and `logging.exception()` retain full detail even though `str(err)` is sanitized.
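
The `validation_errors` filtering described above can be sketched as follows. This is an illustration under stated assumptions, not the body of `sanitize_pydantic_errors`: the function name `sanitize_errors_sketch` is hypothetical, and treating only `int`, `float`, and `bool` as "safe primitive" `ctx` values is a guess at the policy.

```python
from typing import Any

# Keys copied through unchanged; "msg" and "input" are deliberately absent.
SAFE_KEYS = ("loc", "type", "url")
# Assumption: only these ctx value types are considered safe to keep.
SAFE_CTX_TYPES = (int, float, bool)


def sanitize_errors_sketch(errors: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Drop fields that may echo model output from Pydantic-style error dicts."""
    cleaned = []
    for err in errors:
        entry = {key: err[key] for key in SAFE_KEYS if key in err}
        ctx = {
            name: value
            for name, value in (err.get("ctx") or {}).items()
            if isinstance(value, SAFE_CTX_TYPES)
        }
        if ctx:
            entry["ctx"] = ctx
        cleaned.append(entry)
    return cleaned


raw_errors = [
    {
        "type": "int_parsing",
        "loc": ("score",),
        "msg": "Input should be a valid integer",
        "input": "text echoed from the model",
        "ctx": {"limit": 5, "error": ValueError("echoed text")},
    }
]
sanitized = sanitize_errors_sketch(raw_errors)
```

Here `sanitized` keeps `loc`, `type`, and the numeric `ctx` entry, while `msg`, `input`, and the exception object inside `ctx` are dropped.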

```python
import logging
import time

from learning_commons_evaluators import APIError, OutputValidationError, RateLimitError

log = logging.getLogger(__name__)
try:
    result = evaluator.evaluate_sync(input)
except ConfigurationError as e:
    print(f"Config issue: {e}")
except ValidationError as e:
    print(f"Invalid input: {e}")
except RateLimitError as e:
    print(f"Rate limited, retry after {e.retry_after}ms")
    time.sleep(e.retry_after or 30)  # seconds
except OutputValidationError as e:
    # Structured entries omit Pydantic msg/input (may echo LLM text); use __cause__ for full detail.
    log.warning("Bad LLM output: %s", e.validation_errors)
    # Original pydantic.ValidationError / OutputParserException available as e.__cause__
except APIError as e:
    print(f"API error (retryable={e.retryable}): {e}")
    log.error(
        "Provider call failed",
        extra={
            "provider": e.provider,
            "model": e.model,
            "status": e.status_code,
            "request_id": e.request_id,
        },
    )
    raise
```
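
The `raise … from e` pattern mentioned above can be seen in isolation in the sketch below. `SketchAPIError`, `call_provider`, and `call_with_wrapping` are hypothetical names used only for illustration; the SDK's own wrapping goes through `wrap_provider_error`, whose internals are not shown here.

```python
class SketchAPIError(Exception):
    """Hypothetical stand-in for an SDK error with a short, fixed message."""


def call_provider() -> None:
    """Hypothetical provider call whose error text echoes sensitive input."""
    raise ConnectionError("failed for prompt='student essay text' key=sk-abc123")


def call_with_wrapping() -> None:
    """Wrap the provider failure without copying its message."""
    try:
        call_provider()
    except ConnectionError as e:
        # The raw text is not interpolated into the new message;
        # "from e" stores the original exception on __cause__.
        raise SketchAPIError("Provider request failed") from e
```

Catching `SketchAPIError` and reading `str(err)` gives only the fixed message, while `err.__cause__` still holds the original `ConnectionError` for debugging.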

Failures inside LLM prompt steps are passed through `wrap_provider_error()` (see `learning_commons_evaluators.schemas.errors`) so you typically see `APIError` subclasses rather than raw LangChain or HTTP client exceptions. Use `EvaluatorTimeoutError` for timeouts (the package does not export a `TimeoutError` alias, to avoid shadowing the Python builtin).
### Metadata and telemetry

On evaluation failure, `evaluation_metadata.status` is set to `failed` and `evaluation_metadata.error_details` is populated before `evaluate()` / `evaluate_sync()` re-raises (no result object is returned). `error_details` is itself sanitized:

- SDK errors record `"ClassName: <sanitized message>"`.
- Any other exception that escapes records only `"Unexpected error: ClassName"` — the message is omitted because arbitrary `ValueError`/`AttributeError`/etc. messages may contain user data or field values that aren't safe for telemetry.

On evaluation failure, `metadata.status` and `error_details` are set on the in-memory metadata object for the run and appear on the evaluation end log line; `BaseEvaluator.evaluate` / `evaluate_sync` still re-raises and does not return a result object.
The same policy applies to per-step `StepMetadata.error_details`. Both fields are emitted on the evaluation end log line.
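
The two `error_details` formats listed above could be produced by a helper along these lines. This is a sketch of the described behavior, not the source of `format_error_for_metadata`; `format_error_sketch` and the stand-in classes are hypothetical.

```python
class EvaluatorError(Exception):
    """Hypothetical stand-in for the SDK base error."""


class RateLimitError(EvaluatorError):
    """Hypothetical stand-in for an SDK subclass."""


def format_error_sketch(exc: BaseException) -> str:
    """Return a telemetry-safe description of ``exc``."""
    if isinstance(exc, EvaluatorError):
        # SDK messages are already short and controlled, so they are kept.
        return f"{type(exc).__name__}: {exc}"
    # Messages of arbitrary exceptions may carry user data, so only the
    # class name is recorded.
    return f"Unexpected error: {type(exc).__name__}"
```

For example, a `RateLimitError("Rate limit exceeded")` is recorded with its message, whereas a `ValueError` whose message contains a field value is reduced to `"Unexpected error: ValueError"`.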

## Creating custom evaluators

12 changes: 6 additions & 6 deletions sdks/python/src/learning_commons_evaluators/__init__.py
@@ -22,11 +22,11 @@
AuthenticationError,
ConfigurationError,
EvaluatorError,
EvaluatorRetryableError,
EvaluatorTimeoutError,
InputValidationError,
NetworkError,
OutputValidationError,
RateLimitError,
ValidationError,
RequestTimeoutError,
wrap_provider_error,
)

@@ -113,31 +113,31 @@
"EvaluatorError",
"EvaluatorMaturity",
"EvaluatorMetadata",
"EvaluatorRetryableError",
"EvaluatorTimeoutError",
"GoogleLLMProviderConfig",
"AnyInputSpec",
"GradeInputField",
"GradeInputSpec",
"InputField",
"InputSpec",
"InputT",
"InputValidationError",
"TextInputSpec",
"LLMProvider",
"LLMProviderConfig",
"Logger",
"NetworkError",
"OpenAILLMProviderConfig",
"OutputT",
"OutputValidationError",
"PromptSettings",
"RateLimitError",
"RequestTimeoutError",
"SDK_LOGGER_NAME",
"Status",
"TelemetryConfig",
"TextComplexityEvaluationInput",
"TextInputField",
"TokenUsage",
"ValidationError",
"VocabularyEvaluationInput",
"VocabularyEvaluationSettings",
"VocabularyEvaluator",
16 changes: 10 additions & 6 deletions sdks/python/src/learning_commons_evaluators/errors.py
@@ -5,11 +5,13 @@
AuthenticationError,
ConfigurationError,
EvaluatorError,
EvaluatorRetryableError,
EvaluatorTimeoutError,
InputValidationError,
NetworkError,
OutputValidationError,
RateLimitError,
ValidationError,
RequestTimeoutError,
format_error_for_metadata,
sanitize_pydantic_errors,
wrap_provider_error,
)

@@ -18,10 +20,12 @@
"AuthenticationError",
"ConfigurationError",
"EvaluatorError",
"EvaluatorRetryableError",
"EvaluatorTimeoutError",
"InputValidationError",
"NetworkError",
"OutputValidationError",
"RateLimitError",
"ValidationError",
"RequestTimeoutError",
"format_error_for_metadata",
"sanitize_pydantic_errors",
"wrap_provider_error",
]
67 changes: 58 additions & 9 deletions sdks/python/src/learning_commons_evaluators/evaluators/base.py
@@ -8,6 +8,7 @@
from collections.abc import Awaitable, Callable
from typing import Any, Generic, TypeVar, overload

from langchain_core.exceptions import OutputParserException
from pydantic import BaseModel
from pydantic import ValidationError as PydanticValidationError

@@ -22,6 +23,9 @@
)
from learning_commons_evaluators.schemas.errors import (
EvaluatorError,
OutputValidationError,
format_error_for_metadata,
sanitize_pydantic_errors,
wrap_provider_error,
)
from learning_commons_evaluators.schemas.evaluator import (
@@ -105,7 +109,7 @@ async def evaluate(
on success.

Raises:
ValidationError: Input fails validation.
InputValidationError: Input fails validation.
ConfigurationError: No provider config for the required LLM provider.
APIError (or subclasses): The LLM API call failed.

@@ -133,7 +137,7 @@ async def evaluate(
return result
except Exception as e:
evaluation_metadata.status = Status.failed
evaluation_metadata.error_details = str(e)
evaluation_metadata.error_details = format_error_for_metadata(e)
raise
finally:
evaluation_metadata.processing_time_ms = (time.perf_counter() - start) * 1000
@@ -236,7 +240,7 @@ async def execute_step(
return result
except Exception as e:
step_metadata.status = Status.failed
step_metadata.error_details = str(e)
step_metadata.error_details = format_error_for_metadata(e)
raise
finally:
step_metadata.processing_time_ms = (time.perf_counter() - start) * 1000
@@ -310,7 +314,15 @@ async def execute_prompt_chain_step(

Raises:
ConfigurationError: No provider config for ``prompt_settings.provider_type``.
EvaluatorError: SDK errors, including :func:`~learning_commons_evaluators.schemas.errors.wrap_provider_error` output for LangChain or HTTP failures (typically :class:`~learning_commons_evaluators.schemas.errors.APIError` subclasses). Pydantic :exc:`pydantic.ValidationError` from output parsing is re-raised unchanged.
OutputValidationError: The LLM response didn't satisfy the expected
output schema — invalid JSON (LangChain ``OutputParserException``),
JSON that didn't match the Pydantic model (``pydantic.ValidationError``),
a non-object JSON value when using ``json_dict_normalizer``, or
``TypeError`` / ``ValueError`` from ``json_dict_normalizer`` itself.
The original exception is reachable via ``__cause__``.
APIError (or other ``EvaluatorError`` subclasses):
:func:`~learning_commons_evaluators.schemas.errors.wrap_provider_error`
output for LangChain or HTTP failures from the LLM provider.
ValueError: If ``json_dict_normalizer`` is set but ``parser_output_type`` is omitted.
"""
if json_dict_normalizer is not None and parser_output_type is None:
@@ -333,8 +345,25 @@ async def _run_chain() -> BaseModel | str:
loose = JsonOutputParser()
parsed_dict = await loose.ainvoke(ai_message)
if not isinstance(parsed_dict, dict):
parsed_dict = dict(parsed_dict)
normalized = json_dict_normalizer(parsed_dict)
# JSON parsed cleanly but the top-level value isn't an object
# (e.g. the LLM returned a JSON array or scalar). That's an
# output-shape failure, not a parse failure — surface it as
# OutputValidationError so callers can treat it consistently
# with schema-mismatch errors, and avoid the TypeError that
# ``dict(parsed_dict)`` would raise on a non-dict.
raise OutputValidationError(
"Model output is not a JSON object",
provider=prompt_settings.provider_type,
model=prompt_settings.model,
)
try:
normalized = json_dict_normalizer(parsed_dict)
except (TypeError, ValueError) as norm_err:
raise OutputValidationError(
"Model output could not be normalized before validation",
provider=prompt_settings.provider_type,
model=prompt_settings.model,
) from norm_err
return parser_output_type.model_validate(normalized)

parser = JsonOutputParser(pydantic_object=parser_output_type)
@@ -344,12 +373,32 @@ async def _run_chain() -> BaseModel | str:
return parser_output_type.model_validate(raw)
except EvaluatorError:
raise
except PydanticValidationError:
raise
except (PydanticValidationError, OutputParserException) as e:
# The provider returned a response that didn't match the expected schema.
# This covers both Pydantic schema-validation failures and LangChain
# JSON-parse failures (``OutputParserException``). Wrap so callers can
# discriminate output-parse failures from other API errors and so the
# message that lands in ``EvaluationMetadata.error_details`` stays
# sanitized — the original error (which may include LLM output snippets)
# is reachable via ``__cause__`` for debugging.
validation_errors = (
sanitize_pydantic_errors(e.errors())
if isinstance(e, PydanticValidationError)
else None
)
raise OutputValidationError(
provider=prompt_settings.provider_type,
model=prompt_settings.model,
validation_errors=validation_errors,
) from e
except (KeyboardInterrupt, SystemExit):
raise
except Exception as e:
raise wrap_provider_error(e) from e
raise wrap_provider_error(
e,
provider=prompt_settings.provider_type,
model=prompt_settings.model,
) from e

try:
return await self.execute_step(
Original file line number Diff line number Diff line change
@@ -13,7 +13,7 @@
ConventionalityEvaluationSettings,
ConventionalityOutput,
)
from learning_commons_evaluators.schemas.errors import ValidationError
from learning_commons_evaluators.schemas.errors import InputValidationError
from learning_commons_evaluators.schemas.evaluator import (
EvaluationAnswer,
EvaluationExplanation,
@@ -70,6 +70,6 @@
"TextComplexityEvaluationInput",
"TextInputField",
"TokenUsage",
"ValidationError",
"InputValidationError",
"prompt_settings_to_extras_value",
]