feat(python-sdk)!: error handling improvements by czi-fsisenda · Pull Request #65 · learning-commons-org/evaluators

czi-fsisenda · 2026-05-16T00:30:09Z

Summary

Renames public validation and timeout types (InputValidationError, RequestTimeoutError) so they do not collide with builtin or Pydantic names, and documents a single retryable signal on EvaluatorError for client retry logic.
Expands provider error wrapping: sanitized SDK messages, structured fields (status_code, response_body, request_id, provider, model), HTTP routing (including 408 → RequestTimeoutError), OutputValidationError for bad LLM output, and sanitize_pydantic_errors for safe telemetry.
wrap_provider_error applies narrow timeout/network message patterns only when no HTTP status was extracted, so 4xx responses are not reclassified from body wording alone (aligns with the documented preference for structured signals over message parsing).
Adds make lint-fix and broad README + unit test coverage for the new hierarchy.

Test plan

Unit tests (tests/schemas/test_errors.py, tests/evaluators/test_base.py, schema and vocabulary tests updated for the new types)
Verified TestWrapProviderError including new cases for structured status vs. message-pattern precedence

Copilot

Pull request overview

This PR revamps Python SDK error handling with a renamed input-validation error, sanitized provider/output error wrapping, retryability signals, and updated documentation/tests.

Changes:

Replaces ValidationError/EvaluatorTimeoutError public API names with InputValidationError/RequestTimeoutError.
Adds structured API error context, output-validation wrapping, metadata-safe error formatting, and Pydantic error sanitization helpers.
Updates tests and README to cover the new hierarchy and behavior.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 6 comments.

Show a summary per file

File	Description
`sdks/python/src/learning_commons_evaluators/schemas/errors.py`	Defines the new error hierarchy, wrapping, retryability, and sanitization helpers.
`sdks/python/src/learning_commons_evaluators/evaluators/base.py`	Uses sanitized metadata errors and wraps LLM output parsing/schema failures.
`sdks/python/src/learning_commons_evaluators/schemas/evaluator.py`	Updates input validation aggregation to use `InputValidationError`.
`sdks/python/src/learning_commons_evaluators/schemas/common_inputs.py`	Raises `InputValidationError` from common input fields.
`sdks/python/src/learning_commons_evaluators/schemas/__init__.py`	Updates schema-level error re-export.
`sdks/python/src/learning_commons_evaluators/errors.py`	Updates package error re-exports.
`sdks/python/src/learning_commons_evaluators/__init__.py`	Updates top-level public exports.
`sdks/python/README.md`	Documents the new hierarchy, retry behavior, and sanitized metadata behavior.
`sdks/python/tests/schemas/test_errors.py`	Expands coverage for error hierarchy, wrapping, and sanitization helpers.
`sdks/python/tests/schemas/test_evaluator_schemas.py`	Updates evaluator schema validation tests for `InputValidationError`.
`sdks/python/tests/schemas/test_common_inputs.py`	Updates common input validation tests for `InputValidationError`.
`sdks/python/tests/evaluators/test_vocabulary.py`	Updates vocabulary evaluator validation expectations.
`sdks/python/tests/evaluators/test_base.py`	Updates base evaluator tests for sanitized metadata and output-validation wrapping.

Comments suppressed due to low confidence (2)

sdks/python/src/learning_commons_evaluators/schemas/errors.py:452

This final fallback makes provider exceptions with only message-based network/timeout signals (for example ECONNREFUSED, ENOTFOUND, or request timed out) non-retryable generic APIErrors. The previous implementation routed those to network/timeout errors, and the new docs still describe NetworkError as covering connection refused/DNS failures. Consider preserving a bounded keyword fallback while keeping the returned SDK message sanitized.

    return APIError(default_message, retryable=False, **common)

sdks/python/src/learning_commons_evaluators/schemas/errors.py:448

HTTP 408 responses are not handled as request timeouts here, so a provider exception with status_code=408 becomes a non-retryable generic APIError because it is below the 5xx threshold. Since RequestTimeoutError itself defaults to status 408 and is documented as retryable, route status 408 to RequestTimeoutError before the generic status-code branch.

    if status_code is not None:
        return APIError(
            f"API request failed (HTTP {status_code})",
            status_code=status_code,
            retryable=status_code >= _RETRYABLE_STATUS_MIN,

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+        # Honour subclass class-level default unless caller explicitly overrides.
+        if retryable is not None:
+            self.retryable = retryable



 ## Error handling

+Every exception the SDK raises inherits from `EvaluatorError`. Failures inside LLM prompt steps are wrapped at the boundary so callers see a predictable, sanitized hierarchy instead of raw LangChain, OpenAI, Anthropic, or HTTP-client exceptions.


+    """Strip raw input values from a list of :meth:`pydantic.ValidationError.errors` entries.
+
+    Pydantic's per-error dicts include an ``input`` key containing the raw
+    value that failed validation. For LLM outputs that value may echo prompt
+    content, so we drop it before the error reaches logs or telemetry. The
+    remaining keys (``loc``, ``type``, ``msg``, ``ctx`` if present) describe
+    the schema and the failure mode but not the offending content.
+    """
+    return [{key: value for key, value in error.items() if key != "input"} for error in errors]
+
+


                raise
-            except PydanticValidationError:
-                raise
+            except (PydanticValidationError, OutputParserException) as e:


Copilot

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (2)

sdks/python/src/learning_commons_evaluators/schemas/errors.py:516

This timeout check applies the message-regex fallback even when a structured non-timeout HTTP status is present. For example, a provider error with status_code=400 or 401 whose message mentions an invalid timeout setting would be classified as retryable RequestTimeoutError before the 4xx/auth handling below, contrary to the documented preference for structured attributes over message parsing. The message-only timeout fallback should be limited to status-less errors (or 408 specifically).

    if _is_timeout(error, status_code):
        return RequestTimeoutError(status_code=status_code, **common)

sdks/python/src/learning_commons_evaluators/schemas/errors.py:519

The network detector also runs its message-pattern fallback even when a structured HTTP status code was extracted. A provider 4xx/404 response whose message contains phrases like "could not resolve" (for example, an invalid model/resource message) will be returned as retryable NetworkError before the model-not-found/auth/rate-limit/status handling below. As with timeouts, the message fallback should not override a known HTTP status.

    if _is_network(error):
        return NetworkError(**common)

+        clean: dict[str, Any] = {
+            key: value for key, value in error.items() if key not in {"input", "ctx"}
+        }


+- `model` on `APIError` — the model ID requested.
+- `response_body` on `APIError` — decoded response body. Opt-in for debugging; may contain echoed prompt content, so treat as sensitive.
+- `request_id` on `APIError` — provider request ID, useful for support escalation.
+- `validation_errors` on `OutputValidationError` — per-field details from Pydantic's `errors()` API with raw input values stripped (safe for logs).


+    Prefers the structured attribute that openai-python and anthropic-python
+    both expose (``.status_code``). Falls back to ``.status`` (used by some
+    httpx-based clients) and finally to a regex on the message — narrowed to
+    HTTP-shaped 4xx/5xx codes near the start of the message to reduce false
+    positives like model names or token counts.
+    """
+    for attr in ("status_code", "status"):
+        value = getattr(error, attr, None)
+        if isinstance(value, int):
+            return value


Copilot

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (2)

sdks/python/src/learning_commons_evaluators/schemas/errors.py:519

The message-pattern fallback still runs even when _provider_status_code() extracted a structured HTTP status. Because _is_timeout() and _is_network() both inspect str(error) unconditionally, a 400/401/404 provider response whose body mentions "timeout" or "connection refused" will be reclassified as RequestTimeoutError/NetworkError before the status-code routing below. This contradicts the documented structured-signal precedence; guard the message-based checks so they only apply when status_code is None (while preserving typed timeout/network exceptions if intended).

    if _is_timeout(error, status_code):
        return RequestTimeoutError(status_code=status_code, **common)

    if _is_network(error):
        return NetworkError(**common)

sdks/python/src/learning_commons_evaluators/schemas/errors.py:355

The .text fallback is outside the best-effort try, so response objects whose text property raises (for example, an unread/streaming HTTP response) can make wrap_provider_error() raise a new exception while trying to wrap the original provider failure. Keep the text fallback inside a guarded block, similar to the .json() call, so error wrapping never masks the original error.

    json_fn = getattr(response, "json", None)
    if callable(json_fn):
        try:
            return json_fn()
        except Exception:  # noqa: BLE001 — body is best-effort debug data
            pass
    return getattr(response, "text", None)

+        clean: dict[str, Any] = {
+            key: value for key, value in error.items() if key not in {"input", "ctx"}
+        }
+        ctx = error.get("ctx")
+        if isinstance(ctx, Mapping):
+            safe_ctx = {k: v for k, v in ctx.items() if _is_safe_ctx_value(v)}
+            if safe_ctx:
+                clean["ctx"] = safe_ctx
+        sanitized.append(clean)


+        # First entry: ctx had only an unsafe value, so the whole ctx key is gone.
+        assert sanitized[0] == {
+            "loc": ("answer",),
+            "type": "value_error",
+            "msg": "Value error, rejected text: <sensitive prompt echo>",
+        }


Copilot

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 4 comments.

+    but custom validators can attach *anything* there, including exception
+    objects whose ``str()`` may echo the offending input. We restrict to
+    JSON-primitive values and tuples/lists of the same.
+    """
+    if value is None or isinstance(value, (bool, int, float, str)):


+    SDK errors (:class:`EvaluatorError` subclasses) already carry sanitized
+    messages, so include both the class name and the message for debuggability.
+    Any other exception is rendered as just its class name — the message may
+    contain raw values (e.g. an :class:`AttributeError` echoing a field name
+    from user input) and isn't safe to put in telemetry.
+    """
+    if isinstance(error, EvaluatorError):
+        return f"{type(error).__name__}: {error}"


+    *,
+    default_message: str = "API request failed",


    def __init__(
        self,
        message: str,
+        *,
        status_code: int | None = None,
-        retryable: bool = False,
-        code: str | None = None,
-    ):
-        super().__init__(message, code)
+        retryable: bool | None = None,
+        provider: LLMProvider | None = None,


czi-fsisenda added 3 commits May 15, 2026 17:12

feat(python-sdk): error handling improvements

28437e8

chore: update error handling section in README

e442885

style: linting fixes

029b791

czi-fsisenda requested review from adnanrhussain, Copilot, georgemelvin and jecortez May 16, 2026 00:30

Copilot started reviewing on behalf of czi-fsisenda May 16, 2026 00:30 View session

Copilot AI reviewed May 16, 2026

View reviewed changes

czi-fsisenda added 4 commits May 15, 2026 17:37

feat: lint-fix make command

d255b28

style: more linting

4b963a0

build: lint and type issues addressed

59b0416

chore: addressing PR comments

86ac788

czi-fsisenda requested a review from Copilot May 16, 2026 01:42

Copilot started reviewing on behalf of czi-fsisenda May 16, 2026 01:42 View session

Copilot AI reviewed May 16, 2026

View reviewed changes

czi-fsisenda requested a review from Copilot May 16, 2026 01:56

Copilot started reviewing on behalf of czi-fsisenda May 16, 2026 01:56 View session

Copilot AI reviewed May 16, 2026

View reviewed changes

czi-fsisenda marked this pull request as draft May 18, 2026 15:54

chore: more PR comment fixes

1450071

czi-fsisenda requested a review from Copilot May 19, 2026 01:28

Copilot started reviewing on behalf of czi-fsisenda May 19, 2026 01:28 View session

Copilot AI reviewed May 19, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(python-sdk)!: error handling improvements#65

feat(python-sdk)!: error handling improvements#65
czi-fsisenda wants to merge 8 commits into
mainfrom
fsisenda/sdk_python_errors

czi-fsisenda commented May 16, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants


		## Error handling

		Every exception the SDK raises inherits from `EvaluatorError`. Failures inside LLM prompt steps are wrapped at the boundary so callers see a predictable, sanitized hierarchy instead of raw LangChain, OpenAI, Anthropic, or HTTP-client exceptions.

Conversation

czi-fsisenda commented May 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

czi-fsisenda commented May 16, 2026 •

edited

Loading