Bump azure-ai-evaluation from 1.15.0 to 1.16.8 by dependabot[bot] · Pull Request #73 · Azure-Samples/Agentic-Evaluations

dependabot · 2026-05-20T04:40:51Z

Bumps azure-ai-evaluation from 1.15.0 to 1.16.8.

Release notes

Sourced from azure-ai-evaluation's releases.

azure-ai-evaluation_1.16.8

1.16.8 (2026-05-19)

Features Added

App Insights logging now forwards arbitrary evaluator-specific keys from each event's properties payload as a single gen_ai.evaluation.properties JSON attribute (carried inside internal_properties). Previously only the four red-team keys (attack_success, attack_technique, attack_complexity, attack_success_threshold) were forwarded; structured outputs such as rubric dimension_scores were silently dropped. Payloads larger than 7500 characters are replaced with a valid JSON marker ({"truncated": true, "original_size_bytes": <n>}) so consumers can always json.loads the value. Non-dict properties payloads are now safely ignored instead of raising in the red-team forwarder.
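The truncation behaviour described above can be sketched as follows. This is a minimal illustration, not the SDK's code: the function name `serialize_properties` is hypothetical, and how the size is measured (characters for the limit, UTF-8 bytes for `original_size_bytes`) is an assumption based on the wording of the release note.

```python
import json

MAX_PROPERTIES_CHARS = 7500  # limit quoted in the release note


def serialize_properties(properties):
    """Return a JSON string for the attribute, or None for non-dict payloads.

    Hypothetical helper; the SDK's internal function and names may differ.
    """
    if not isinstance(properties, dict):
        # Non-dict payloads are ignored instead of raising.
        return None
    encoded = json.dumps(properties)
    if len(encoded) > MAX_PROPERTIES_CHARS:
        # Replace the payload rather than cutting it, so the value
        # always remains parseable with json.loads.
        marker = {
            "truncated": True,
            "original_size_bytes": len(encoded.encode("utf-8")),  # assumed unit
        }
        return json.dumps(marker)
    return encoded
```

Replacing an oversized payload with a marker, instead of slicing the string, is what lets consumers call `json.loads` unconditionally.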

azure-ai-evaluation_1.16.7

1.16.7 (2026-05-07)

Features Added

Added extra_headers keyword argument to RaiServiceEvaluatorBase (and all content safety evaluators) to allow passing custom HTTP headers to all backend RAI service calls. SDK-owned headers (Authorization, User-Agent, Content-Type, aml-user-token, x-ms-client-request-id) cannot be overridden by extra_headers.

Added status field ("completed", "error", "skipped") on evaluation result items to indicate evaluator execution outcome.

Added skipped and errored counts to result_counts and per_testing_criteria_results in AOAI evaluation summaries.

Added skipped to ResultCount and skipped/errored to PerTestingCriteriaResult typed contracts.
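As an illustration of the extra_headers rule above (SDK-owned headers cannot be overridden), a merge along these lines would give that behaviour. This is a sketch under assumptions: `merge_headers` is a hypothetical name, the case-insensitive comparison is assumed, and the release note does not say whether the SDK drops a conflicting header silently or reports it.

```python
# Header names listed in the release note as SDK-owned (lower-cased here).
PROTECTED_HEADERS = {
    "authorization",
    "user-agent",
    "content-type",
    "aml-user-token",
    "x-ms-client-request-id",
}


def merge_headers(sdk_headers, extra_headers=None):
    """Combine SDK headers with caller-supplied ones, keeping SDK-owned values.

    Hypothetical helper, not the SDK's implementation.
    """
    merged = dict(sdk_headers)
    for name, value in (extra_headers or {}).items():
        if name.lower() in PROTECTED_HEADERS:
            # Assumed behaviour: the caller's value is dropped silently.
            continue
        merged[name] = value
    return merged
```

For example, passing `{"authorization": "...", "x-custom-trace": "abc"}` as `extra_headers` would add only `x-custom-trace` to the outgoing request headers.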

Bugs Fixed

_TaskNavigationEfficiencyEvaluator now accepts JSON-stringified response and ground_truth inputs (e.g., from data pipelines that serialize list/tuple inputs to strings). String inputs are parsed as JSON; on parse failure the original value is preserved so downstream validation surfaces the error as before.

Fixed error blame attribution in _get_single_run_results to perform a case-insensitive comparison when checking the AOAI error code for UserError, ensuring failed evaluation runs are correctly classified as user errors regardless of server-side casing.

Fixed deflection_rate evaluator showing incorrect pass/fail labels where all results were labeled "pass" regardless of the actual score. The inverse metric adjustment was overriding the evaluator's correct string labels, remapping every result to "pass".

Fixed evaluate() raising EvaluationException: (InternalError) unhashable type: 'list' when an evaluator emitted a list value under a _result-suffixed column. Binary aggregation now skips such columns with a warning instead of aborting the entire run.

Fixed task_adherence red team scoring by adding scenario=redteam to the RAI scorer evaluation payload, ensuring the server-side score mapping correctly routes to Direct mapping for attack success determination.

Fixed row classification double-counting in _calculate_aoai_evaluation_summary where errored rows were counted separately and could also be counted as passed/failed. Rows are now classified into mutually exclusive buckets with priority: passed > failed > errored > skipped.

Fixed row classification where rows with empty or missing results lists were incorrectly counted as "passed" (the condition passed_count == len(results) - error_count evaluated 0 == 0 as True).

Fixed _get_metric_result prefix matching where shorter metric names (e.g., xpia) could match before longer, more-specific ones (e.g., xpia_manipulated_content). Now sorts by length descending for correct longest-prefix matching.

Fixed non-dict _properties values from evaluators causing downstream issues. Values that are not dicts are now logged and dropped gracefully.

Fixed filename length error in _inline_image by catching OSError/ValueError during local path resolution and falling back to returning a text chunk instead of throwing.
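Two of the fixes in this list, the JSON-string input handling and the longest-prefix metric matching, follow patterns that are easy to show in isolation. The sketch below is illustrative only: `parse_if_json_string` and `match_metric` are hypothetical names, and the real `_TaskNavigationEfficiencyEvaluator` and `_get_metric_result` code may be structured differently.

```python
import json


def parse_if_json_string(value):
    """Parse a JSON string; on failure return the original value unchanged.

    Leaving the value untouched lets later validation report the error.
    """
    if not isinstance(value, str):
        return value
    try:
        return json.loads(value)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return value


def match_metric(column, metric_names):
    """Return the longest metric name that is a prefix of `column`, or None."""
    # Sorting by length, longest first, stops "xpia" from shadowing
    # "xpia_manipulated_content".
    for name in sorted(metric_names, key=len, reverse=True):
        if column.startswith(name):
            return name
    return None
```

With the metric names `xpia` and `xpia_manipulated_content`, a column such as `xpia_manipulated_content_score` resolves to the longer name regardless of the order in which the names are supplied.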

Other Changes

Moved token usage attributes (gen_ai.evaluation.usage.input_tokens, gen_ai.evaluation.usage.output_tokens) from standard App Insights event attributes into the internal_properties JSON bag to align with internal telemetry conventions.

azure-ai-evaluation_1.16.6

1.16.6 (2026-04-27)

Bugs Fixed

Fixed evaluation token usage not being emitted in the genai evaluation event, causing token consumption metrics to be missing from telemetry.

Fixed multi-turn red team attacks (RedTeamingAttack-based strategies like MultiTurn) failing silently with PyRIT 0.11. Two bugs were patched at the SDK level: (1) RedTeamingAttack._setup_async raised RuntimeError: Conversation already exists because it seeded prepended conversation messages before calling set_system_prompt; now patched per-instance on the adversarial chat target to tolerate existing conversation history. (2) RedTeamingAttack._generate_next_prompt_async returned context.next_message without calling .duplicate_message(), causing sqlite3.IntegrityError: UNIQUE constraint failed: PromptMemoryEntries.id on the second turn; now patched at module load with an idempotent wrapper that duplicates the message before returning.

Fixed sensitive_data_leakage red team attacks producing 100% false-pass rates. _extract_context_items in the Foundry execution path only handled list or dict shapes for messages[0].context; pre-curated SDL attack objectives store the document text as a str with sibling context_type/tool_name fields, so the document was silently dropped and a fallback synthesized a context item from the user prompt. The agent never received the sensitive document content and could not leak it, causing the evaluator to score every attempt as a pass. Added str handling (both message-level and top-level), normalized raw string entries inside list-shaped context, and gated the context_type fallback so it only runs when no usable context was extracted (including the context: null case).
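The shape handling described in the sensitive_data_leakage fix can be sketched roughly as below. This is not the SDK's `_extract_context_items`: the function name, the `content` key, and the output item shape are assumptions made for illustration, and only the message-level case is shown.

```python
def extract_context_items(message):
    """Normalize message["context"] (str, dict, list, or None) to a list of dicts.

    Hypothetical sketch; key names such as "content" are assumed.
    """
    context = message.get("context")
    if context is None:
        # Covers both a missing key and an explicit `context: null`.
        return []
    if isinstance(context, str):
        # Pre-curated objectives store the document text as a plain string
        # with sibling context_type / tool_name fields.
        item = {"content": context}
        for key in ("context_type", "tool_name"):
            if message.get(key) is not None:
                item[key] = message[key]
        return [item]
    if isinstance(context, dict):
        return [context]
    if isinstance(context, list):
        items = []
        for entry in context:
            if isinstance(entry, str):
                items.append({"content": entry})  # raw string inside a list
            elif isinstance(entry, dict):
                items.append(entry)
        return items
    return []
```

An empty return value is the signal that no usable context was found; under the fix described above, that is the only case in which a fallback context item would be synthesized.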

Commits

adcf0f3 Bump azure-ai-evaluation version to 1.16.8
dfa2b30 Move evaluator properties changelog entry to 1.16.8 section
1aa3506 Run black on _evaluate.py to satisfy CI lint
42c68b2 Address Copilot review feedback
b3dfccb azure-ai-evaluation: forward evaluator properties to App Insights
e2cb236 Set CHANGELOG date to 2026-05-07 and bump version to 1.16.7
fd63edb Fix TaskNavigationEfficiencyEvaluator threshold defaulting to 3.0 for binary ...
6a0a5aa Standradize Task Navigation Efficiency Output (#46474)
fd3c19d Accept JSON string inputs in TaskNavigationEfficiencyEvaluator (#46760)
369603e fix filename length limit error (#46771)
Additional commits viewable in compare view

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.

Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

@dependabot rebase will rebase this PR
@dependabot recreate will recreate this PR, overwriting any edits that have been made to it
@dependabot show <dependency name> ignore conditions will show all of the ignore conditions of the specified dependency
@dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
@dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
@dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

Bumps [azure-ai-evaluation](https://github.com/Azure/azure-sdk-for-python) from 1.15.0 to 1.16.8.
- [Release notes](https://github.com/Azure/azure-sdk-for-python/releases)
- [Commits](Azure/azure-sdk-for-python@azure-ai-evaluation_1.15.0...azure-ai-evaluation_1.16.8)

---
updated-dependencies:
- dependency-name: azure-ai-evaluation
  dependency-version: 1.16.8
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

dependabot bot added the dependencies (Pull requests that update a dependency file) and python (Pull requests that update python code) labels on May 20, 2026

Copilot AI review requested due to automatic review settings May 20, 2026 04:40

dependabot Bot mentioned this pull request May 20, 2026

Bump azure-ai-evaluation from 1.15.0 to 1.16.7 #68

Closed

Copilot AI reviewed May 20, 2026

dependabot[bot] wants to merge 1 commit into main from dependabot/pip/azure-ai-evaluation-1.16.8
