learning-commons-org · czi-fsisenda · May 16, 2026
diff --git a/sdks/python/Makefile b/sdks/python/Makefile
@@ -14,7 +14,7 @@ SETTINGS_DST := src/learning_commons_evaluators/settings
 
 .PHONY: help build check-build test unit-test contract-test \
         generate-settings check-generated sync-settings check-sync \
-        lint format format-check typecheck pip-check verify coverage
+        lint lint-fix format format-check typecheck pip-check verify coverage
 
 help:
 	@echo "Usage: make <target>"
@@ -23,6 +23,7 @@ help:
 	@echo "  check-build        Verify build artifacts are up to date (use in CI)"
 	@echo ""
 	@echo "  lint               Ruff linter (src/, tests/, scripts/)"
+	@echo "  lint-fix           Ruff linter with --fix (safe auto-fixes)"
 	@echo "  format             Apply Ruff formatter"
 	@echo "  format-check       Fail if Ruff would reformat any file"
 	@echo "  typecheck          Mypy on src package + tests"
@@ -58,6 +59,9 @@ check-build: check-generated check-sync
 lint:
 	$(RUFF) check src tests scripts
 
+lint-fix:
+	$(RUFF) check --fix src tests scripts
+
 format:
 	$(RUFF) format src tests scripts
 

diff --git a/sdks/python/README.md b/sdks/python/README.md
@@ -302,32 +302,98 @@ config = create_config_no_telemetry(logger=create_silent_logger())
 
 ## Error handling
 
+During a normal `evaluate()` / `evaluate_sync()` run, failures from evaluator input checks, configuration, LLM prompt steps, and output validation typically surface as subclasses of `EvaluatorError`. Failures inside LLM prompt steps are wrapped at the boundary so callers see a predictable, sanitized hierarchy instead of raw LangChain, OpenAI, Anthropic, or HTTP-client exceptions. A few documented paths still raise standard Python exceptions — for example `ValueError` from `execute_prompt_chain_step` when `json_dict_normalizer` is set without `parser_output_type`, and `RuntimeError` from `evaluate_sync()` when an asyncio event loop is already running on the current thread. Those are programmer errors, not evaluation failures.
+
+### Hierarchy
+
+```
+EvaluatorError
+├── ConfigurationError       — bad config (missing provider, unknown model, malformed settings)
+├── InputValidationError     — caller-supplied input failed validation
+└── APIError                 — failures originating in the LLM provider call
+    ├── AuthenticationError  — 401 / 403
+    ├── RateLimitError       — 429; carries `retry_after` (seconds)
+    ├── NetworkError         — connection refused, DNS failure, broken TLS
+    ├── RequestTimeoutError  — request exceeded the configured timeout
+    └── OutputValidationError — LLM response failed to parse or didn't match the expected schema
+```
+
+`InputValidationError` is named that way deliberately to avoid collision with `pydantic.ValidationError`. `RequestTimeoutError` is named that way to avoid shadowing the builtin `TimeoutError`. There is **no** `ValidationError` or `EvaluatorTimeoutError` in the public API.
+
+### Knowing when to retry
+
+Every `EvaluatorError` exposes a boolean `retryable` attribute. This is the single signal callers should consult when wrapping `evaluate()` in retry logic — there is no separate marker class to check. Subclasses set sensible defaults:
+
+- Retryable by default: `RateLimitError`, `NetworkError`, `RequestTimeoutError`, `OutputValidationError`, and any `APIError` with a 5xx status code.
+- Not retryable: `ConfigurationError`, `InputValidationError`, `AuthenticationError`, and `APIError` with a 4xx status code.
+
+`retryable` is also accepted as an `__init__` kwarg on `APIError` and `NetworkError` if you need to flag a specific instance differently (e.g. a permanently-bad hostname).
+
 ```python
-from learning_commons_evaluators import (
-    ConfigurationError,  # Missing/invalid config
-    ValidationError,     # Invalid input
-    AuthenticationError, # Invalid API keys (401/403)
-    RateLimitError,      # Rate limit exceeded (429) - has retry_after
-    NetworkError,        # Network failures
-    EvaluatorTimeoutError,  # Request timeout
-    APIError,            # Other API errors
-)
+import time
+from learning_commons_evaluators import EvaluatorError, RateLimitError
+
+for attempt in range(3):
+    try:
+        result = evaluator.evaluate_sync(input)
+        break
+    except EvaluatorError as e:
+        if not e.retryable or attempt == 2:
+            raise
+        delay = e.retry_after if isinstance(e, RateLimitError) and e.retry_after else 2 ** attempt
+        time.sleep(delay)  # retry_after is in seconds
+```
+
+### Sanitization and debugging context
+
+Error **messages** (the value returned by `str(err)`) are short and controlled. Raw provider strings — which may contain prompt echoes, user text, or fragments of API keys — are **not** interpolated into the SDK exception's message. Structured detail lives on attributes instead:
+
+- `status_code` on `APIError` — HTTP status from the provider, when one was returned. Populated from the provider exception's `.status_code` attribute when present (preferred over message regex).
+- `retry_after` on `RateLimitError` — suggested delay before retry, **in seconds**, or `None` if the provider didn't return a `Retry-After` header.
+- `provider` on `APIError` — the `LLMProvider` being called when the failure occurred.
+- `model` on `APIError` — the model ID requested.
+- `response_body` on `APIError` — decoded response body. Opt-in for debugging; may contain echoed prompt content, so treat as sensitive.
+- `request_id` on `APIError` — provider request ID, useful for support escalation.
+- `validation_errors` on `OutputValidationError` — per-field entries from Pydantic's `errors()` API after `sanitize_pydantic_errors` (`loc`, `type`, optional `url`, and safe primitive `ctx` only — no `input` or `msg`, which can echo model output).
+
+The original provider exception is preserved on `__cause__` (via `raise … from e`), so debuggers, tracebacks, and `logging.exception()` retain full detail even though `str(err)` is sanitized.
+
+```python
+import logging
+import time
+
+from learning_commons_evaluators import APIError, OutputValidationError, RateLimitError
 
+log = logging.getLogger(__name__)
 try:
     result = evaluator.evaluate_sync(input)
-except ConfigurationError as e:
-    print(f"Config issue: {e}")
-except ValidationError as e:
-    print(f"Invalid input: {e}")
 except RateLimitError as e:
-    print(f"Rate limited, retry after {e.retry_after}ms")
+    time.sleep(e.retry_after or 30)  # seconds
+except OutputValidationError as e:
+    # Structured entries omit Pydantic msg/input (may echo LLM text); use __cause__ for full detail.
+    log.warning("Bad LLM output: %s", e.validation_errors)
+    # Original pydantic.ValidationError / OutputParserException available as e.__cause__
 except APIError as e:
-    print(f"API error (retryable={e.retryable}): {e}")
+    log.error(
+        "Provider call failed",
+        extra={
+            "provider": e.provider,
+            "model": e.model,
+            "status": e.status_code,
+            "request_id": e.request_id,
+        },
+    )
+    raise
 ```
 
-Failures inside LLM prompt steps are passed through `wrap_provider_error()` (see `learning_commons_evaluators.schemas.errors`) so you typically see `APIError` subclasses rather than raw LangChain or HTTP client exceptions. Use `EvaluatorTimeoutError` for timeouts (the package does not export a `TimeoutError` alias, to avoid shadowing the Python builtin).
+### Metadata and telemetry
+
+On evaluation failure, `evaluation_metadata.status` is set to `failed` and `evaluation_metadata.error_details` is populated before `evaluate()` / `evaluate_sync()` re-raises (no result object is returned). `error_details` is itself sanitized:
+
+- SDK errors record `"ClassName: <sanitized message>"`.
+- Any other exception that escapes records only `"Unexpected error: ClassName"` — the message is omitted because arbitrary `ValueError`/`AttributeError`/etc. messages may contain user data or field values that aren't safe for telemetry.
 
-On evaluation failure, `metadata.status` and `error_details` are set on the in-memory metadata object for the run and appear on the evaluation end log line; `BaseEvaluator.evaluate` / `evaluate_sync` still re-raises and does not return a result object.
+The same policy applies to per-step `StepMetadata.error_details`. Both fields are emitted on the evaluation end log line.
 
 ## Creating custom evaluators
 

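Note on the `error_details` policy in the README section above: the two-branch rule (SDK errors keep their message, anything else records only the class name) can be sketched as follows. This is a minimal illustration, not the body of `format_error_for_metadata`, which this diff does not show. The classes are local stand-ins and `format_error_for_metadata_sketch` is a hypothetical name.

```python
class EvaluatorError(Exception):
    """Stand-in for the SDK base error; the real class lives in the package."""


class ConfigurationError(EvaluatorError):
    """Stand-in for the SDK's configuration error."""


def format_error_for_metadata_sketch(exc: BaseException) -> str:
    """Hypothetical re-implementation of the policy described in the README."""
    name = type(exc).__name__
    if isinstance(exc, EvaluatorError):
        # SDK messages are assumed to be sanitized already, so they are kept.
        return f"{name}: {exc}"
    # Arbitrary exceptions may carry user data, so only the class name is recorded.
    return f"Unexpected error: {name}"


print(format_error_for_metadata_sketch(ConfigurationError("no provider configured")))
print(format_error_for_metadata_sketch(ValueError("secret user text")))
```

The second call drops the message entirely, which is the behavior the README describes for exceptions that escape the SDK hierarchy.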
diff --git a/sdks/python/src/learning_commons_evaluators/__init__.py b/sdks/python/src/learning_commons_evaluators/__init__.py
@@ -22,11 +22,11 @@
     AuthenticationError,
     ConfigurationError,
     EvaluatorError,
-    EvaluatorRetryableError,
-    EvaluatorTimeoutError,
+    InputValidationError,
     NetworkError,
+    OutputValidationError,
     RateLimitError,
-    ValidationError,
+    RequestTimeoutError,
     wrap_provider_error,
 )
 
@@ -113,31 +113,31 @@
     "EvaluatorError",
     "EvaluatorMaturity",
     "EvaluatorMetadata",
-    "EvaluatorRetryableError",
-    "EvaluatorTimeoutError",
     "GoogleLLMProviderConfig",
     "AnyInputSpec",
     "GradeInputField",
     "GradeInputSpec",
     "InputField",
     "InputSpec",
     "InputT",
+    "InputValidationError",
     "TextInputSpec",
     "LLMProvider",
     "LLMProviderConfig",
     "Logger",
     "NetworkError",
     "OpenAILLMProviderConfig",
     "OutputT",
+    "OutputValidationError",
     "PromptSettings",
     "RateLimitError",
+    "RequestTimeoutError",
     "SDK_LOGGER_NAME",
     "Status",
     "TelemetryConfig",
     "TextComplexityEvaluationInput",
     "TextInputField",
     "TokenUsage",
-    "ValidationError",
     "VocabularyEvaluationInput",
     "VocabularyEvaluationSettings",
     "VocabularyEvaluator",

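Note for callers affected by the export changes above: `EvaluatorRetryableError` is no longer exported, and the README in this PR points to the `retryable` attribute as the single retry signal. A caller that previously tested for the marker class (an assumption about how it was used) could switch to a check like the one below. The classes are local stand-ins, the base-class default of `False` is an assumption, and `should_retry` is a hypothetical helper.

```python
class EvaluatorError(Exception):
    """Stand-in base class; the False default here is an assumption."""

    retryable: bool = False


class NetworkError(EvaluatorError):
    """Stand-in; listed as retryable by default in the README."""

    retryable = True


class InputValidationError(EvaluatorError):
    """Stand-in for the name that replaces `ValidationError` in the exports."""


def should_retry(exc: BaseException) -> bool:
    # Presumed old style: isinstance(exc, EvaluatorRetryableError)
    # New style: consult the attribute carried by every EvaluatorError.
    return isinstance(exc, EvaluatorError) and exc.retryable


print(should_retry(NetworkError("connection refused")))
print(should_retry(InputValidationError("grade out of range")))
print(should_retry(RuntimeError("not an SDK error")))
```

Imports of `ValidationError` and `EvaluatorTimeoutError` need the same treatment, since the export list now offers `InputValidationError` and `RequestTimeoutError` instead.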
diff --git a/sdks/python/src/learning_commons_evaluators/errors.py b/sdks/python/src/learning_commons_evaluators/errors.py
@@ -5,11 +5,13 @@
     AuthenticationError,
     ConfigurationError,
     EvaluatorError,
-    EvaluatorRetryableError,
-    EvaluatorTimeoutError,
+    InputValidationError,
     NetworkError,
+    OutputValidationError,
     RateLimitError,
-    ValidationError,
+    RequestTimeoutError,
+    format_error_for_metadata,
+    sanitize_pydantic_errors,
     wrap_provider_error,
 )
 
@@ -18,10 +20,12 @@
     "AuthenticationError",
     "ConfigurationError",
     "EvaluatorError",
-    "EvaluatorRetryableError",
-    "EvaluatorTimeoutError",
+    "InputValidationError",
     "NetworkError",
+    "OutputValidationError",
     "RateLimitError",
-    "ValidationError",
+    "RequestTimeoutError",
+    "format_error_for_metadata",
+    "sanitize_pydantic_errors",
     "wrap_provider_error",
 ]
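
Note on `sanitize_pydantic_errors`, newly exported above: the README describes its output as keeping `loc`, `type`, an optional `url`, and safe primitive `ctx` values while dropping `input` and `msg`. A rough sketch of that filtering is below. The function body is not part of this diff, so `sanitize_errors_sketch` is a hypothetical name, the choice of which `ctx` types count as safe is an assumption, and the sample error entry is made up for illustration.

```python
from typing import Any

# Assumption: only numeric and boolean ctx values are treated as safe here.
_SAFE_CTX_TYPES = (bool, int, float)


def sanitize_errors_sketch(errors: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep structural fields; drop anything that may echo model output."""
    sanitized = []
    for err in errors:
        entry: dict[str, Any] = {
            "loc": tuple(err.get("loc", ())),
            "type": err.get("type"),
        }
        if "url" in err:
            entry["url"] = err["url"]
        ctx = {
            key: value
            for key, value in err.get("ctx", {}).items()
            if isinstance(value, _SAFE_CTX_TYPES)
        }
        if ctx:
            entry["ctx"] = ctx
        sanitized.append(entry)
    return sanitized


# Illustrative entry shaped like one item from pydantic's errors() list.
raw = [
    {
        "type": "greater_than",
        "loc": ("score",),
        "msg": "Input should be greater than 0",
        "input": -5,
        "ctx": {"gt": 0, "note": "free text from the model"},
        "url": "https://example.invalid/greater_than",
    }
]
clean = sanitize_errors_sketch(raw)
print(clean)
```

An allow-list of fields, as opposed to a deny-list, means a new key added by a future Pydantic release would be dropped by default.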
diff --git a/sdks/python/src/learning_commons_evaluators/evaluators/base.py b/sdks/python/src/learning_commons_evaluators/evaluators/base.py
@@ -8,6 +8,7 @@
 from collections.abc import Awaitable, Callable
 from typing import Any, Generic, TypeVar, overload
 
+from langchain_core.exceptions import OutputParserException
 from pydantic import BaseModel
 from pydantic import ValidationError as PydanticValidationError
 
@@ -22,6 +23,9 @@
 )
 from learning_commons_evaluators.schemas.errors import (
     EvaluatorError,
+    OutputValidationError,
+    format_error_for_metadata,
+    sanitize_pydantic_errors,
     wrap_provider_error,
 )
 from learning_commons_evaluators.schemas.evaluator import (
@@ -105,7 +109,7 @@ async def evaluate(
             on success.
 
         Raises:
-            ValidationError: Input fails validation.
+            InputValidationError: Input fails validation.
             ConfigurationError: No provider config for the required LLM provider.
             APIError (or subclasses): The LLM API call failed.
 
@@ -133,7 +137,7 @@ async def evaluate(
             return result
         except Exception as e:
             evaluation_metadata.status = Status.failed
-            evaluation_metadata.error_details = str(e)
+            evaluation_metadata.error_details = format_error_for_metadata(e)
             raise
         finally:
             evaluation_metadata.processing_time_ms = (time.perf_counter() - start) * 1000
@@ -236,7 +240,7 @@ async def execute_step(
             return result
         except Exception as e:
             step_metadata.status = Status.failed
-            step_metadata.error_details = str(e)
+            step_metadata.error_details = format_error_for_metadata(e)
             raise
         finally:
             step_metadata.processing_time_ms = (time.perf_counter() - start) * 1000
@@ -310,7 +314,15 @@ async def execute_prompt_chain_step(
 
         Raises:
             ConfigurationError: No provider config for ``prompt_settings.provider_type``.
-            EvaluatorError: SDK errors, including :func:`~learning_commons_evaluators.schemas.errors.wrap_provider_error` output for LangChain or HTTP failures (typically :class:`~learning_commons_evaluators.schemas.errors.APIError` subclasses). Pydantic :exc:`pydantic.ValidationError` from output parsing is re-raised unchanged.
+            OutputValidationError: The LLM response didn't satisfy the expected
+                output schema — invalid JSON (LangChain ``OutputParserException``),
+                JSON that didn't match the Pydantic model (``pydantic.ValidationError``),
+                a non-object JSON value when using ``json_dict_normalizer``, or
+                ``TypeError`` / ``ValueError`` from ``json_dict_normalizer`` itself.
+                The original exception is reachable via ``__cause__``.
+            APIError (or other ``EvaluatorError`` subclasses):
+                :func:`~learning_commons_evaluators.schemas.errors.wrap_provider_error`
+                output for LangChain or HTTP failures from the LLM provider.
             ValueError: If ``json_dict_normalizer`` is set but ``parser_output_type`` is omitted.
         """
         if json_dict_normalizer is not None and parser_output_type is None:
@@ -333,8 +345,25 @@ async def _run_chain() -> BaseModel | str:
                     loose = JsonOutputParser()
                     parsed_dict = await loose.ainvoke(ai_message)
                     if not isinstance(parsed_dict, dict):
-                        parsed_dict = dict(parsed_dict)
-                    normalized = json_dict_normalizer(parsed_dict)
+                        # JSON parsed cleanly but the top-level value isn't an object
+                        # (e.g. the LLM returned a JSON array or scalar). That's an
+                        # output-shape failure, not a parse failure — surface it as
+                        # OutputValidationError so callers can treat it consistently
+                        # with schema-mismatch errors, and avoid the TypeError that
+                        # ``dict(parsed_dict)`` would raise on a non-dict.
+                        raise OutputValidationError(
+                            "Model output is not a JSON object",
+                            provider=prompt_settings.provider_type,
+                            model=prompt_settings.model,
+                        )
+                    try:
+                        normalized = json_dict_normalizer(parsed_dict)
+                    except (TypeError, ValueError) as norm_err:
+                        raise OutputValidationError(
+                            "Model output could not be normalized before validation",
+                            provider=prompt_settings.provider_type,
+                            model=prompt_settings.model,
+                        ) from norm_err
                     return parser_output_type.model_validate(normalized)
 
                 parser = JsonOutputParser(pydantic_object=parser_output_type)
@@ -344,12 +373,32 @@ async def _run_chain() -> BaseModel | str:
                 return parser_output_type.model_validate(raw)
             except EvaluatorError:
                 raise
-            except PydanticValidationError:
-                raise
+            except (PydanticValidationError, OutputParserException) as e:
+                # The provider returned a response that didn't match the expected schema.
+                # This covers both Pydantic schema-validation failures and LangChain
+                # JSON-parse failures (``OutputParserException``). Wrap so callers can
+                # discriminate output-parse failures from other API errors and so the
+                # message that lands in ``EvaluationMetadata.error_details`` stays
+                # sanitized — the original error (which may include LLM output snippets)
+                # is reachable via ``__cause__`` for debugging.
+                validation_errors = (
+                    sanitize_pydantic_errors(e.errors())
+                    if isinstance(e, PydanticValidationError)
+                    else None
+                )
+                raise OutputValidationError(
+                    provider=prompt_settings.provider_type,
+                    model=prompt_settings.model,
+                    validation_errors=validation_errors,
+                ) from e
             except (KeyboardInterrupt, SystemExit):
                 raise
             except Exception as e:
-                raise wrap_provider_error(e) from e
+                raise wrap_provider_error(
+                    e,
+                    provider=prompt_settings.provider_type,
+                    model=prompt_settings.model,
+                ) from e
 
         try:
             return await self.execute_step(

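Note on the wrapping pattern in the hunk above: the change raises `OutputValidationError` with a fixed message and chains the original exception with `raise ... from e`, so `str(err)` stays controlled while `__cause__` keeps the detail. The standalone sketch below reproduces that shape with the standard-library `json` module in place of LangChain and Pydantic. The classes are local stand-ins, `parse_object_sketch` is a hypothetical helper, and the "not valid JSON" message is invented for the example.

```python
import json


class EvaluatorError(Exception):
    """Stand-in for the SDK base error."""


class OutputValidationError(EvaluatorError):
    """Stand-in; the real class also carries provider, model, and validation_errors."""


def parse_object_sketch(raw_text: str) -> dict:
    try:
        parsed = json.loads(raw_text)
    except json.JSONDecodeError as e:
        # Fixed message: the raw text is deliberately not interpolated.
        raise OutputValidationError("Model output is not valid JSON") from e
    if not isinstance(parsed, dict):
        # Valid JSON, wrong shape (array or scalar): no underlying exception to chain.
        raise OutputValidationError("Model output is not a JSON object")
    return parsed


caught = None
try:
    parse_object_sketch("{not json")
except OutputValidationError as err:
    caught = err
    print(str(err))                      # controlled message only
    print(type(err.__cause__).__name__)  # original parser error is still attached
```

Logging `caught.__cause__` is where the raw parser detail would show up, so that value deserves the same care as the `response_body` attribute described in the README.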
diff --git a/sdks/python/src/learning_commons_evaluators/schemas/__init__.py b/sdks/python/src/learning_commons_evaluators/schemas/__init__.py
@@ -13,7 +13,7 @@
     ConventionalityEvaluationSettings,
     ConventionalityOutput,
 )
-from learning_commons_evaluators.schemas.errors import ValidationError
+from learning_commons_evaluators.schemas.errors import InputValidationError
 from learning_commons_evaluators.schemas.evaluator import (
     EvaluationAnswer,
     EvaluationExplanation,
@@ -70,6 +70,6 @@
     "TextComplexityEvaluationInput",
     "TextInputField",
     "TokenUsage",
-    "ValidationError",
+    "InputValidationError",
     "prompt_settings_to_extras_value",
 ]