Skip to content

feat: implement three-pillar scorecard rubric templates (#63 Phase 1)#91

Open
Gayathri0105RK wants to merge 14 commits into
GoogleCloudPlatform:mainfrom
Gayathri0105RK:main
Open

feat: implement three-pillar scorecard rubric templates (#63 Phase 1)#91
Gayathri0105RK wants to merge 14 commits into
GoogleCloudPlatform:mainfrom
Gayathri0105RK:main

Conversation

@Gayathri0105RK
Copy link
Copy Markdown

Closes #63

This PR implements Phase 1 of the Quality Scorecard roadmap.

Key Technical Alignments:

Vocabulary Reuse: Implements response_usefulness and task_grounding using existing category names to ensure dashboard compatibility.

Net-New GRC: Adds the policy_compliance metric with compliant/violation categories.

Strategy: Adopts Option A (3-bucket categorical) instead of 1-5 scoring.

Schema-Free: This PR does not modify any BigQuery tables or add new columns like root_agent_name

Add initial Phase 1 rubric templates
@caohy1988
Copy link
Copy Markdown
Contributor

Review against issue #63 + current upstream main

The PR is on the right track per the per-PR plan I outlined in issue #63: scoped to Phase 1 — rubric factories only, no schema migration, no leaderboard, no HITL triage. That ordering is correct. CI's only signal is cla/google passing; no automated checks against this small a diff.

A few things to fix before merge — most importantly, the definition strings diverge from the canonical ones already shipping in scripts/quality_report.py, which silently defeats the dashboard-compatibility claim in the PR body.


High — Category names match canonical, but definitions diverge

The PR claims "vocabulary reuse" and dashboard compatibility. The category names match (meaningful / partial / unhelpful for response_usefulness; grounded / ungrounded / no_tool_needed for task_grounding). But the definition strings differ substantially from what's in scripts/quality_report.py:155-228:

Field Canonical (scripts/quality_report.py) This PR
response_usefulness.definition "Whether the agent's final response provides a genuinely useful, substantive answer to the user's question. A response that apologizes, says it cannot help, returns no data, provides only generic filler, or loops without resolving the question is NOT useful." (~50 words) "Evaluate if the response was meaningful, partial, or unhelpful." (10 words)
unhelpful.definition "The response technically succeeded (no error) but does NOT meaningfully answer the user's question. Examples: apologies, 'I don't have that information', empty data results, generic filler text, or the agent looping without a resolution." "Did not help the user."
task_grounding.definition "Whether the agent's response is grounded in actual data retrieved from its tools, or is fabricated / hallucinated general knowledge." "Check if the agent used tools correctly and avoided hallucinations."
ungrounded.definition "The response appears to be fabricated or based on the LLM's general knowledge rather than actual tool results. The tool may have returned empty data and the agent filled in anyway." "Contains hallucinations."

Why this matters. CategoricalMetricDefinition.definition and each CategoricalMetricCategory.definition get inlined into the LLM judge prompt at evaluation time (see categorical_evaluator.py SQL templates). Different definition text → different judge behavior → different category distributions on the same trace data. The persisted categorical_results rows from this rubric would be classified differently than rows persisted from quality_report.get_eval_metrics(), even though the category labels match. So a downstream dashboard grouping by (metric_name, category) would mix two populations under the same labels — exactly the divergence the "vocabulary reuse" promise was meant to prevent.

Fix. Establish a single source of truth. Two acceptable approaches:

  1. (Recommended) Move the canonical response_usefulness + task_grounding definitions from scripts/quality_report.py:155-228 into src/bigquery_agent_analytics/evaluation_rubrics.py, then have quality_report.get_eval_metrics() import them. The new module becomes the canonical home; quality_report.py is a thin caller.
  2. Have evaluation_rubrics.response_usefulness_metric() import from scripts.quality_report. Reverses the natural dependency direction (src/scripts/) and is fragile if scripts/ becomes optional.

Either way, the rubric factories must return objects byte-equivalent to quality_report.get_eval_metrics(). A test should pin this.


Medium — No tests

The whole point of these factories is to be a stable, reusable contract. The PR adds zero tests. At minimum:

def test_response_usefulness_matches_quality_report_canonical():
  from scripts.quality_report import get_eval_metrics
  from bigquery_agent_analytics.evaluation_rubrics import response_usefulness_metric

  canonical = next(m for m in get_eval_metrics() if m.name == "response_usefulness")
  rubric = response_usefulness_metric()
  assert rubric.model_dump() == canonical.model_dump()


def test_task_grounding_matches_quality_report_canonical():
  ...  # same pattern


def test_policy_compliance_categories():
  m = policy_compliance_metric()
  assert {c.name for c in m.categories} == {"compliant", "violation"}
  assert all(c.definition for c in m.categories)

Without these, the High above could regress invisibly.


Medium — policy_compliance definition mixes three distinct mechanisms

The policy_compliance metric's definition field reads:

"Check for PII leakage, tone, and authorized tool usage."

These are three independent policy mechanisms with different cost/latency/false-positive profiles:

  • PII leakage — typically a regex/classifier over the response text.
  • Tone — LLM-judged on the response.
  • Authorized tool usage — deterministic check over the tool-call trajectory (better suited to CodeEvaluator, not categorical).

Asking the judge to render one of compliant / violation for any of three concerns will produce noisy classifications and unactionable triage signals. Worth narrowing to one concern for v1 (suggest PII via AI.GENERATE) or splitting into three metrics with separate names. Issue #63's review thread flagged this exact scope-creep risk; the original recommendation was to ship Phase 1 as just PII compliance.


Low — Construction style + import hygiene

  • Categories are constructed as dict literals ({"name": "...", "definition": "..."}). Pydantic coerces these into CategoricalMetricCategory instances, so it works, but the rest of the codebase uses explicit CategoricalMetricCategory(name=..., definition=...) (see scripts/quality_report.py:155-228). Match the existing style.
  • CategoricalMetricCategory is not imported. Add it:
    from bigquery_agent_analytics.categorical_evaluator import (
        CategoricalMetricCategory,
        CategoricalMetricDefinition,
    )
  • File ends without a final newline (\ No newline at end of file in the diff). Minor lint.
  • Per-factory docstrings are one-liners ("Existing SDK pillar for Helpfulness."). Worth expanding to name the canonical source they mirror, when to use them, and the dashboard view they're compatible with.

Low — No package-root re-export, no documented import path

For users to consume these factories, they need an import path. Two acceptable answers:

  1. Re-export from bigquery_agent_analytics/__init__.py:
    from .evaluation_rubrics import (
        response_usefulness_metric,
        task_grounding_metric,
        policy_compliance_metric,
    )
  2. Document the explicit module path in scripts/README.md or a new docs/quality_scorecard.md.

Without one of these, the factories exist but no one knows how to call them. Issue #63's per-PR plan suggested the explicit-module-path approach with a doc page; either works.


Out of scope (tracked, not blocking)

Per the per-PR plan from issue #63 these belong to follow-up PRs:

  • PR 2 — schema migration to add root_agent_name (+ region if pinned) to categorical_results. The PR description correctly calls this out as deferred.
  • PR 3categorical_fleet_leaderboard view on top of Revamp README, enhance documentation navigation, and fix CI #2.
  • PR 4triage_low_score_sessions() (better named triage_categorical_sessions) — the design note for that PR should answer the four open questions: queue schema, idempotency on re-scoring, lifecycle, and HITL event relationship.
  • Likert→3-bucket delta documented in docs/quality_scorecard.md (or wherever the canonical user-facing doc lands).

Verdict

Right scope, right strategy, right module location. The substantive blocker is finding 1: the definition strings have to match the canonical ones byte-for-byte (or be the canonical source themselves) for the dashboard-compatibility claim to hold. With the High addressed + at least one byte-equivalence test, this is a clean Phase 1 merge.

@caohy1988
Copy link
Copy Markdown
Contributor

Fresh pass against PR #91 + current main: I agree with the earlier review. The PR is in the right Phase 1 shape, but the canonical-rubric issue is still the main blocker.

High

  • The metric/category names reuse the canonical vocabulary, but the definitions and category order do not. This is behaviorally significant because CategoricalMetricCategory.definition is passed into the classifier prompt.

    Concretely, scripts/quality_report.py defines:

    • response_usefulness with the longer usefulness definition and category order meaningful, unhelpful, partial
    • task_grounding with the longer grounding definition and category order grounded, ungrounded, no_tool_needed

    PR feat: implement three-pillar scorecard rubric templates (#63 Phase 1) #91 defines shorter replacement text and uses response_usefulness order meaningful, partial, unhelpful. Since the query builder preserves category order when building the SQL category literal, tests should compare full serialized metric definitions, not only sets of names.

    Suggested fix: make src/bigquery_agent_analytics/evaluation_rubrics.py the single source of truth for the canonical response_usefulness and task_grounding definitions, then update scripts/quality_report.py::get_eval_metrics() to import those factories. Add byte-equivalence tests against the old quality_report definitions so this cannot drift again.

Medium

  • The new file has no tests. Minimum useful test: assert response_usefulness_metric().model_dump() and task_grounding_metric().model_dump() exactly match the canonical quality_report.get_eval_metrics() objects after the source-of-truth refactor. Include category order in the assertion.

  • policy_compliance_metric() still bundles PII leakage, tone, and authorized tool usage into one binary metric. That is likely noisy because these are different mechanisms with different evidence sources. If Phase 1 is intended to be a small categorical preset PR, I would scope this to one clearly defined compliance check or document that this is only a coarse placeholder.

Low

  • src/bigquery_agent_analytics/evaluation_rubrics.py is missing the repo-standard copyright/license header.

  • Format is not clean locally: uv run pyink --config pyproject.toml --check src/bigquery_agent_analytics/evaluation_rubrics.py reports the file would be reformatted. The file also has no final newline.

  • Construction should match local style: import and instantiate CategoricalMetricCategory explicitly instead of relying on Pydantic dict coercion.

  • I would soften the package-root re-export concern from the previous review: a root re-export is not strictly required if the intended import path is bigquery_agent_analytics.evaluation_rubrics. But the intended public import path should be documented and tested.

Net: right direction, but please avoid shipping same labels with different judge definitions. That creates the exact dashboard/reporting divergence this Phase 1 rubric PR is meant to prevent.

Synchronized 'response_usefulness' and 'task_grounding' character-for-character with quality_report.py to prevent definition drift.

Narrowed 'policy_compliance' metric to PII leakage for a higher-signal V1.

Switched to explicit CategoricalMetricCategory class construction.

Added standard Google LLC license header.
Copy link
Copy Markdown
Contributor

@caohy1988 caohy1988 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution and for taking on issue #63 Phase 1.

A few items I'd want resolved before merge — most are small mechanical fixes; one is a substantive correctness issue (H1 below) that I think changes the shape of the PR.

High — blocking

H1. The docstrings claim the rubrics match quality_report.py "exactly", but the prompt strings differ. The two existing rubrics are reproduced from scripts/quality_report.py:155-228 with multiple textual deltas:

Source-of-truth (quality_report.py) This PR (evaluation_rubrics.py)
agent's final response agent final response
user's question user question
the agent's tools the agent tools
'I don't have that information' saying I do not have that information

These strings ARE the LLM-judge prompt — wording differences can change judgment behavior. The PR claim "matching quality_report.py exactly" (lines 10 and 40) is currently false. Either:

  • (a) Make them byte-identical to quality_report.py, or
  • (b) Refactor quality_report.py to import from this new module (preferred — one source of truth).

As-shipped this is the opposite of "Vocabulary Reuse" — we'd have two divergent sources of truth for the same metric.

H2. CI Format check would fail on first run. Verified locally with the repo's own pyproject.toml:

$ pyink --config pyproject.toml --check src/bigquery_agent_analytics/evaluation_rubrics.py
would reformat src/bigquery_agent_analytics/evaluation_rubrics.py

The file uses 4-space indentation; the rest of src/bigquery_agent_analytics/ uses 2-space (pyink-indentation = 2 in pyproject.toml). Imports also need splitting per the repo's Google isort profile.

Medium

M1. No tests. Sibling modules in the same package all have unit tests (tests/test_*.py). For prompt strings that drive an LLM judge — and three new factory functions — even basic assertions (categories non-empty / unique names within a metric / definitions present) would prevent H1-style regressions.

M2. Functions not exported from __init__.py. CategoricalMetricCategory and CategoricalMetricDefinition are exported from bigquery_agent_analytics, so consumers expect:

from bigquery_agent_analytics import response_usefulness_metric  # ImportError today

The new factories must currently use the deeper module path. Add to __init__.py's __all__.

M3. policy_compliance_metric is US-centric. The definition limits PII detection to "emails, SSNs, phone numbers". SSN is a US construct; EU contexts use national ID numbers, IBAN, NHS numbers, etc. For a Google-OSS, EU-AI-Act-adjacent module, generalize the language ("identifiers such as email addresses, government identifiers, phone numbers, financial account numbers") or explicitly scope with a # V1: US PII comment. Also: "Personal identity information" → "Personally identifiable information" (standard PII expansion).

Low

L1. Module docstring missing — file opens with copyright header and jumps to imports. Sibling files use """Module docstring …""" after the license.

L2. License header is the short two-line form — the repo's standard is the longer Apache-2.0 paragraph (see categorical_evaluator.py).

L3. A few trailing-whitespace artifacts (e.g. name="meaningful", line 21). pyink will strip these in the H2 fix.

Suggested path forward

  1. Refactor scripts/quality_report.py:155-228 to import from evaluation_rubrics.py — eliminates the divergence and makes "Vocabulary Reuse" actually true. This converts the PR from "duplicate" to "extract+reuse" and is the right shape for #63 Phase 1.
  2. Run pyink --config pyproject.toml and isort --settings-path pyproject.toml before the next push.
  3. Add tests/test_evaluation_rubrics.py — categories non-empty, names unique within a metric, definitions non-empty, factory functions return the correct name.
  4. Export the three factory functions from src/bigquery_agent_analytics/__init__.py.
  5. Fix the docstring claim if (1) isn't done; otherwise (1) makes it true automatically.
  6. Generalize the policy_compliance_metric PII language, or scope explicitly.
  7. Rebase on current main (PR shows BEHIND) — once that's done a maintainer can trigger CI and Format/Tests will run.

Once those are addressed this becomes a one-screen review. Happy to take another pass.

# Copyright 2026 Google LLC
# Licensed under the Apache License, Version 2.0

from bigquery_agent_analytics.categorical_evaluator import (
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style — isort would split this. The repo's Google isort profile expects one import per line:

from bigquery_agent_analytics.categorical_evaluator import CategoricalMetricCategory
from bigquery_agent_analytics.categorical_evaluator import CategoricalMetricDefinition

Verified locally: isort --settings-path pyproject.toml --check-only flags this. Also missing a module-level """...""" docstring above the imports — see categorical_evaluator.py for the pattern.

)

def response_usefulness_metric() -> CategoricalMetricDefinition:
"""Canonical metric for Helpfulness, matching quality_report.py exactly."""
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "matching quality_report.py exactly" claim is currently false (see top-level review item H1). The strings on lines 14-17 differ from quality_report.py:163-168:

  • agent final response vs agent's final response (line 14)
  • user question vs user's question (line 15)

These deltas change the LLM-judge prompt. Two paths to make this docstring true:

  1. Preferred: refactor scripts/quality_report.py:155-228 to import these factory functions — one source of truth, the docstring becomes accurate by construction.
  2. Make the strings byte-identical to quality_report.py and keep both copies in sync (fragile).

),
categories=[
CategoricalMetricCategory(
name="meaningful",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

4-space indentation throughout the file — repo standard is 2-space (pyink-indentation = 2 in pyproject.toml). Verified locally: pyink --config pyproject.toml --check reports would reformat. CI Format check will fail on the first run.

Also minor: trailing whitespace after name="meaningful", (and several similar lines below). pyink will strip those in the same reformat pass.

)

def task_grounding_metric() -> CategoricalMetricDefinition:
"""Canonical metric for Accuracy/Grounding, matching quality_report.py exactly."""
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same issue as line 10 — the strings on lines 44-46 differ from quality_report.py:198-201 (agent's response, the agent's tools). The docstring's "matching quality_report.py exactly" claim is currently false. Refactoring quality_report.py to import from here fixes both occurrences in one shot.

"""Net-new pillar for GRC Compliance (V1: PII Focus)."""
return CategoricalMetricDefinition(
name="policy_compliance",
definition="Evaluate the response for PII leakage (emails, SSNs, phone numbers).",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two narrowness issues with the policy_compliance rubric:

  1. US-centric PII list. "emails, SSNs, phone numbers" — SSN is a US construct. EU contexts use national ID numbers, IBAN, NHS numbers, etc. For a Google-OSS module that's part of the EU-AI-Act-adjacent demo bundle, this is a regional bias that won't fly in audit. Suggested: "identifiers such as email addresses, government identifiers, phone numbers, financial account numbers". Or scope explicitly with a # V1: US PII focus comment so the limitation is intentional and reviewable.

  2. "Personal identity information" on line 73 — the standard PII expansion is "Personally identifiable information". Worth fixing for correctness.

@caohy1988
Copy link
Copy Markdown
Contributor

Low / optional follow-up: since this PR is framed as the canonical three-pillar scorecard, consider adding one convenience factory that returns the full bundle in stable order. The three individual factories are useful, but a single advertised entry point would make the intended public API clearer and gives tests one stable target for exact metric order/names.\n\nSuggested shape:\n\npython\ndef three_pillar_scorecard_metrics() -> list[CategoricalMetricDefinition]:\n return [\n response_usefulness_metric(),\n task_grounding_metric(),\n policy_compliance_metric(),\n ]\n\n\nNot a blocker compared with the existing requested changes; just a small API ergonomics improvement while the rubric module is being introduced.

Synchronized 'response_usefulness' and 'task_grounding' exactly with quality_report.py.

1. Fixed indentation to 2-space and split imports per repo style.

2. Generalized PII language for global compliance (government identifiers).

3. Added 'three_pillar_scorecard_metrics' convenience factory.
Copy link
Copy Markdown
Contributor

@caohy1988 caohy1988 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review of PR #91 head afa5945

Three issues from my prior review (commit 89c39a03) are addressed; three are not, plus one regression. Two new gaps emerged. Lint is still failing.

Verified locally (head afa5945)

  • pyink --config pyproject.toml --check: FAILS (would reformat src/bigquery_agent_analytics/evaluation_rubrics.py).
  • isort --settings-path pyproject.toml --check-only: FAILS (missing blank line after imports).
  • The repo CI workflow only ran cla/google on this PR head — Format / Test / Build never executed (likely fork-policy gating). Don't read the green CLA check as a green build.

Status of prior review feedback

Prior finding Status
L4 isort: split combined import ✅ Fixed (one import per line)
L70 US-centric PII (SSN-only): generalize ✅ Fixed ("government identifiers, phone numbers, financial account numbers")
L21 indent does not match repo convention Still 2-space hanging indent in call args; pyink wants 4-space
L10 docstring overclaim "matching quality_report.py exactly" Still false — the unhelpful definition was paraphrased ("saying I do not have that information") instead of copied verbatim ("'I don't have that information'"). See A below.
L40 second docstring overclaim ⚠️ Mixed — task_grounding body now matches scripts/quality_report.py lines 196-226 verbatim, so its claim is true. response_usefulness is still drifted.

A. (Medium — regression) response_usefulness claims "matching quality_report.py exactly" but isn't

Commit afa5945's message says "Synchronized 'response_usefulness' and 'task_grounding' exactly with quality_report.py." task_grounding is now exact. response_usefulness is not. diff between the PR's response_usefulness_metric() (lines 20-55) and scripts/quality_report.py lines 161-194 (the canonical source) shows the unhelpful category body differs:

quality_report.py:182:  "'I don't have that information', empty data results, generic "
                        "filler text, or the agent looping without a resolution."

evaluation_rubrics.py:42-44:
  "saying I do not have that information, empty data results, "
  "generic filler text, or the agent looping without a resolution."

Either copy verbatim — including the embedded single-quoted phrase — or drop the "exactly" claim from the docstring. As-is, the docstring lies, and downstream callers who switch from quality_report.py to this module will get different LLM judge prompts and therefore different category assignments. That defeats the stated purpose of this PR ("vocabulary reuse to ensure dashboard compatibility").

If preserving the literal apostrophe is awkward inside double-quoted strings, use a triple-quoted string or escape: that's a tooling fix, not a content rewrite.

B. (Medium) Lint is still failing

pyink --config pyproject.toml --check src/bigquery_agent_analytics/evaluation_rubrics.py
> Oh no! 💥 💔 💥 1 file would be reformatted.

Two specific issues:

  1. 2-space hanging indent inside CategoricalMetricDefinition(...) and CategoricalMetricCategory(...) calls. pyink (Google profile) wants 4-space hanging indent for continuation lines inside a call — see scripts/quality_report.py:161-194 for the canonical shape. The function body uses 2-space indentation correctly, but the continuation of multi-line call arguments must be 4-space.
  2. Missing blank line after imports. isort (Google profile) requires two blank lines before the first function definition.

The fix is one pyink and one isort autoformat run. Please run:

pyink --config pyproject.toml src/bigquery_agent_analytics/evaluation_rubrics.py
isort --settings-path pyproject.toml src/bigquery_agent_analytics/evaluation_rubrics.py

and commit the result. The repo's pre-merge lint will reject this otherwise.

C. (Low — new) Module is not exported from the package

src/bigquery_agent_analytics/__init__.py does not re-export response_usefulness_metric, task_grounding_metric, policy_compliance_metric, or three_pillar_scorecard_metrics. Users will have to write:

from bigquery_agent_analytics.evaluation_rubrics import three_pillar_scorecard_metrics

instead of the SDK-idiomatic:

from bigquery_agent_analytics import three_pillar_scorecard_metrics

Add the four names to __init__.py's public surface (and to __all__ if the file uses one) so this matches how CategoricalMetricCategory / CategoricalMetricDefinition are exposed.

D. (Low — new) No tests added for the new module

No file in tests/ imports evaluation_rubrics. For a module whose entire purpose is to ship canonical rubric definitions that downstream pipelines rely on, the test bar should be:

  1. The four factories return well-formed CategoricalMetricDefinition instances (name set, ≥2 categories, no empty definitions).
  2. The "matching quality_report.py exactly" claim is enforced by a test — assert that response_usefulness_metric() and task_grounding_metric() match the corresponding CategoricalMetricDefinition produced by scripts/quality_report.get_eval_metrics(). This single assertion would have caught issue A automatically. Without it, "exactly" will silently drift any time either side is edited.

Suggested location: tests/test_evaluation_rubrics.py. ~30 lines.

E. (Low) PR description says "3-bucket categorical" but policy_compliance is 2-bucket

PR body claims "Adopts Option A (3-bucket categorical) instead of 1-5 scoring." Two of the three metrics (response_usefulness, task_grounding) are 3-bucket; policy_compliance is 2-bucket (compliant / violation). Commit 89c39a03's message says "Narrowed 'policy_compliance' metric to PII leakage for a higher-signal V1," so the narrowing was intentional. Just update the PR description to say "categorical (2-3 buckets)" so it doesn't read as a regression at merge time.

F. (Nit) License header has a 5-space indent on the URL line

#      http://www.apache.org/licenses/LICENSE-2.0

The repo standard (e.g. src/bigquery_agent_analytics/evaluators.py:7) uses 4 spaces:

#     http://www.apache.org/licenses/LICENSE-2.0

Trivial; mention only because every other source file in this PR's neighborhood is at 4 spaces.


Verdict: REQUEST_CHANGES.

Three blockers:

  • A is the same overclaim flagged in my prior review, recurring after a commit that explicitly claimed to fix it.
  • B lint is still failing, again after a commit that claimed to fix it.
  • D the test that would prevent both A and recurrences of A is absent.

Once A, B, D are resolved (and ideally C+E), this is mergeable. The shape of the module is fine and three_pillar_scorecard_metrics() is a nice convenience.

from bigquery_agent_analytics.categorical_evaluator import CategoricalMetricDefinition

def response_usefulness_metric() -> CategoricalMetricDefinition:
"""Canonical metric for Helpfulness, matching quality_report.py exactly."""
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docstring claim is false. This commit message says "Synchronized response_usefulness ... exactly with quality_report.py," but the unhelpful body on line 43 paraphrased 'I don't have that information' (verbatim in scripts/quality_report.py:182) into saying I do not have that information. Either copy verbatim using a triple-quoted or escaped string, or drop "exactly" from the docstring. As-is, callers swapping from quality_report.py get different LLM-judge prompts and different category assignments, defeating the stated vocabulary-reuse goal.


from bigquery_agent_analytics.categorical_evaluator import CategoricalMetricCategory
from bigquery_agent_analytics.categorical_evaluator import CategoricalMetricDefinition

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isort failing: needs two blank lines after imports before the first function. pyink also failing: continuation lines inside CategoricalMetricDefinition(...) calls below use 2-space hanging indent; Google profile / repo convention (see scripts/quality_report.py:161-194) is 4-space. Both are autofix:

pyink --config pyproject.toml src/bigquery_agent_analytics/evaluation_rubrics.py
isort --settings-path pyproject.toml src/bigquery_agent_analytics/evaluation_rubrics.py

Note: the green cla/google check is the only CI that ran on this PR — Format/Test/Build never executed (likely fork-policy gating), so the green checkmark does not mean lint passes.

definition=(
"The response technically succeeded (no error) but does NOT "
"meaningfully answer the user's question. Examples: apologies, "
"saying I do not have that information, empty data results, "
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the diverging line vs scripts/quality_report.py:182. Canonical: "'I don't have that information', empty data results, generic ". Here: "saying I do not have that information, empty data results, ". The function-level docstring claims this matches "exactly" — it does not. Suggest a regression test in tests/test_evaluation_rubrics.py that asserts response_usefulness_metric() == quality_report.get_eval_metrics()[0] so this kind of drift fails CI.

response_usefulness_metric(),
task_grounding_metric(),
policy_compliance_metric(),
]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-export response_usefulness_metric, task_grounding_metric, policy_compliance_metric, and three_pillar_scorecard_metrics from src/bigquery_agent_analytics/__init__.py. The convenience factory pattern is undermined when callers still have to know the submodule path; CategoricalMetricCategory / CategoricalMetricDefinition are already top-level — these should match.

@caohy1988
Copy link
Copy Markdown
Contributor

Fresh-eye follow-up: no new blockers beyond the existing A/B/D review items, but two small comments would make the PR cleaner.

Low — add module-level __all__

src/bigquery_agent_analytics/evaluation_rubrics.py imports CategoricalMetricCategory and CategoricalMetricDefinition, and the module has no __all__. So:

from bigquery_agent_analytics.evaluation_rubrics import *

would expose the imported model classes too, not just the rubric factories. Suggest adding:

__all__ = [
    "response_usefulness_metric",
    "task_grounding_metric",
    "policy_compliance_metric",
    "three_pillar_scorecard_metrics",
]

That makes the intended public surface explicit.

Low — add a discoverability snippet

The PR adds a public SDK module, but no README / SDK docs mention how to use it. If the factories are exported from bigquery_agent_analytics.__init__ per the existing review, a tiny SDK.md snippet would make this discoverable:

from bigquery_agent_analytics import three_pillar_scorecard_metrics

Not a merge blocker, but otherwise users only find the preset by reading source.

One caution on the existing quality_report.py refactor suggestion: if that script imports this module, it should import only response_usefulness_metric() and task_grounding_metric(), not three_pillar_scorecard_metrics(). The script currently reports exactly two metrics and has response-usefulness-specific grouping/report sections; importing the full three-pillar bundle would silently add policy_compliance to that script.

Copy link
Copy Markdown
Contributor

@caohy1988 caohy1988 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fresh review on PR head c0ca6069

Verified locally on a fresh checkout. The substantive intent of the PR — landing the three-pillar rubric helpers with byte-equivalence guarantees against scripts/quality_report.py — is correct and the apostrophe regression from earlier rounds is finally fixed (the unhelpful text now reads 'I don't have that information' verbatim at evaluation_rubrics.py:44). The new tests pass live.

However, there is a blocker that makes this PR unmergeable as-is.

B1 (BLOCKER, critical) — __init__.py deletes ~60 public exports due to a stale base

mergeStateStatus: DIRTY. The PR's __init__.py is shown as +20 / -448, and the 448-line delete is not intentional — it's an artifact of the PR branching from 6656200 (PR #90's merge commit) which is 32 commits behind current origin/main.

At the merge-base, none of these exports were yet in __init__.py. They were added by the 32 intervening PRs. The PR's three commits to __init__.py keep using the old shape, so the diff against origin/main shows everything else as a delete:

Lost surface Exports
Trace Evaluator BigQueryTraceEvaluator, EvaluationResult, TraceReplayRunner, TrajectoryMetrics
Context Graph ContextGraphManager, ContextGraphConfig, BizNode, Candidate, DecisionPoint, WorldChangeAlert, WorldChangeReport
Memory Service BigQueryMemoryService, BigQuerySessionMemory, ContextManager, Episode, UserProfileBuilder
AI/ML Integration BigQueryAIClient, EmbeddingSearchClient, AnomalyDetector, BatchEvaluator, LatencyForecast
Multi-Trial TrialRunner, TrialResult, MultiTrialReport
Grader Pipeline GraderPipeline, GraderResult, AggregateVerdict, BinaryStrategy, MajorityStrategy, ScoringStrategy, WeightedStrategy
Eval Suite + Validator EvalCategory, EvalSuite, EvalTaskDef, SuiteHealth, EvalValidator, ValidationWarning
Event Semantics is_error_event, extract_response_text, is_tool_event, tool_outcome, is_hitl_event, ERROR_SQL_PREDICATE, RESPONSE_EVENT_TYPES, EVENT_FAMILIES, ALL_KNOWN_EVENT_TYPES
Categorical Views CategoricalViewManager
Ontology surface ExtractedEdge, ExtractedGraph, ExtractedNode, ExtractedProperty, OntologyGraphManager, OntologyMaterializer, OntologyPropertyGraphCompiler, MaterializationResult, TableStatus, compile_property_graph_ddl, compile_ddl_via_upstream, can_use_upstream_compiler, compile_extraction_prompt, compile_output_schema, load_resolved_graph, build_ontology_graph, compile_lineage_gql, compile_showcase_gql
Structured extraction StructuredExtractionResult, StructuredExtractor, extract_bka_decision_event, run_structured_extractors
TTL / lineage / runtime spec import_owl_to_ontology, detect_lineage_edges, LineageEdgeConfig, graph_spec_to_ontology_binding, resolved_graph_to_ontology_binding

Merging as-is would break every consumer that does from bigquery_agent_analytics import ContextGraphManager (etc.), and there are many in the repo's own tests and demos.

Fix:

git fetch origin main
git rebase origin/main
# Resolve the conflict in src/bigquery_agent_analytics/__init__.py by
# KEEPING current main's content and ADDING ONLY the
# `# --- Evaluation Rubrics (Phase 1) ---` block (the four
# re-exports of evaluation_rubrics symbols).
git push --force-with-lease

After the rebase, the __init__.py diff should be roughly +12 / -0 (just the new try/except block re-exporting the four evaluation_rubrics helpers), and mergeStateStatus should turn CLEAN.

B2 (HIGH) — pyink --check and isort --check-only both fail

pyink --config pyproject.toml --check src/bigquery_agent_analytics/evaluation_rubrics.py tests/test_evaluation_rubrics.py
> would reformat tests/test_evaluation_rubrics.py
> would reformat src/bigquery_agent_analytics/evaluation_rubrics.py
> 2 files would be reformatted.

isort --settings-path pyproject.toml --check-only tests/test_evaluation_rubrics.py
> Imports are incorrectly sorted and/or formatted.

Both files use 4-space indentation inside CategoricalMetricDefinition(...) calls; the repo's pyink Google profile wants 2-space hanging indent. Test-file imports aren't one-per-line (Google profile). Both autofix:

pyink --config pyproject.toml src/bigquery_agent_analytics/evaluation_rubrics.py tests/test_evaluation_rubrics.py
isort --settings-path pyproject.toml src/bigquery_agent_analytics/evaluation_rubrics.py tests/test_evaluation_rubrics.py

(Same recurring issue from earlier review rounds — easier to fix once with a pre-commit hook locally.)

M1 (MEDIUM, test portability) — from scripts.quality_report import get_eval_metrics is fragile

scripts/ has no __init__.py, so it's not a package. pyproject.toml [tool.pytest.ini_options] sets pythonpath = ["src"]scripts/ is not on the path. The test passes locally because pytest adds the rootdir to sys.path when it discovers conftest.py, but that's a discovery-time heuristic, not a contract. If someone runs the test against an installed SDK package (the normal CI / pip install -e . path), the scripts.quality_report import will fail with ModuleNotFoundError: No module named 'scripts'.

Two fixes, either is fine:

  • (a) Add scripts/__init__.py and extend pythonpath in pyproject.toml:
    [tool.pytest.ini_options]
    pythonpath = ["src", "scripts"]
  • (b) Move the canonical definition into the SDK (e.g., evaluation_rubrics.py becomes the source of truth) and have scripts/quality_report.py import from there. The byte-equivalence test then becomes a no-op identity assertion and you can delete it. This is cleaner long-term because the current "two places, kept in sync by a test" pattern is exactly the drift trap earlier review rounds called out.

(b) is the strategically right answer but is a bigger change; (a) is the minimal compatible fix for this PR.

What's already correct (acknowledge)

  • Verbatim text match. The 'I don't have that information' apostrophe in response_usefulness.unhelpful (line 44) matches scripts/quality_report.py:182 byte-for-byte. The recurring "docstring claims exactly but body drifts" issue from prior review rounds is finally resolved.
  • Byte-equivalence regression tests. test_response_usefulness_matches_quality_report_canonical and test_task_grounding_matches_quality_report_canonical use rubric.model_dump() == canonical.model_dump() — exactly the regression-test contract I asked for in earlier rounds. Tests pass live (pytest tests/test_evaluation_rubrics.py -v → 4 passed).
  • Convenience factory three_pillar_scorecard_metrics() returns the three pillars in order, covered by test_three_pillar_bundle_content.
  • License headers on both new files.
  • policy_compliance PII language is global (government identifiers), per earlier review feedback.

Verdict

REQUEST_CHANGES on B1 alone — the surface deletion blocks merge regardless of how clean the rest of the PR is. B2 (format) is autofix-once. M1 (test portability) is one of two easy fixes.

Once git rebase origin/main + the format autofix lands, this PR is essentially mergeable. The substantive content is the cleanest it's been across the review iterations.

"Could not import lineage detection: %s.",
e,
)
logger.debug("Could not import categorical evaluator components: %s.", e)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BLOCKER — this __init__.py is missing ~60 public exports. PR head is c0ca6069; merge-base is 6656200 (PR #90's merge), which is 32 commits behind origin/main. The 448-line delete shown in the PR diff is not an intentional change — it's the export blocks added by the 32 intervening PRs (trace_evaluator, context_graph, memory_service, ai_ml_integration, multi_trial, grader_pipeline, eval_suite, eval_validator, event_semantics, categorical_views, all the ontology surface, structured_extraction, ttl_importer, runtime_spec) that this PR doesn't carry forward.

Fix:

git fetch origin main
git rebase origin/main
# Resolve the conflict in this file by keeping current main's content
# and adding ONLY the new ``# --- Evaluation Rubrics (Phase 1) ---``
# try/except block (re-exporting the four evaluation_rubrics symbols).
git push --force-with-lease

After rebase, the __init__.py diff should be ~+12 / -0 (just the new export block), and mergeStateStatus should turn CLEAN.


def response_usefulness_metric() -> CategoricalMetricDefinition:
"""Canonical metric for Helpfulness, matching quality_report.py exactly."""
return CategoricalMetricDefinition(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pyink wants 2-space hanging indent here. The repo's Google profile (see scripts/quality_report.py:161-194 for the canonical shape) uses 2-space indent inside multi-line call argument lists. This file uses 4-space throughout the CategoricalMetricDefinition(...) calls. pyink --config pyproject.toml --check reports would reformat. Autofix in one command:

pyink --config pyproject.toml src/bigquery_agent_analytics/evaluation_rubrics.py tests/test_evaluation_rubrics.py

Comment thread tests/test_evaluation_rubrics.py Outdated
task_grounding_metric,
three_pillar_scorecard_metrics,
)
from scripts.quality_report import get_eval_metrics
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two issues at this import block:

  1. isort fails. The Google profile requires one import per line. isort --settings-path pyproject.toml tests/test_evaluation_rubrics.py autofixes.

  2. scripts is not a package — this import is fragile. scripts/__init__.py does not exist, and pyproject.toml's [tool.pytest.ini_options] only sets pythonpath = ["src"]. The test passes locally because pytest adds the rootdir to sys.path when it discovers conftest.py, but that's a discovery-time heuristic. If the SDK is installed via pip install -e . (the normal CI path) and tests run against the installed package, this import will fail with ModuleNotFoundError: No module named 'scripts'.

Minimal fix: add scripts/__init__.py and extend the path:

[tool.pytest.ini_options]
pythonpath = ["src", "scripts"]

Longer-term: move the canonical definition into the SDK (this file becomes the source of truth), have scripts/quality_report.py import from here, and the byte-equivalence test becomes a no-op identity assertion you can delete. That removes the drift trap entirely.

@caohy1988
Copy link
Copy Markdown
Contributor

Fresh follow-up after checking PR head c0ca6069 against current origin/main:

One additional blocker beyond the already-posted stale-base / format findings:

Blocker — the byte-equivalence tests currently compare against the PR branch's stale quality_report.py, not today's canonical text on origin/main.

The PR tests pass on the stale branch because that branch still carries older scripts/quality_report.py wording. Current origin/main has different canonical strings. Examples:

  • Current origin/main scripts/quality_report.py says agent final response / user question and saying I do not have that information.
  • This PR's evaluation_rubrics.py says agent's final response / user's question and "'I don't have that information'".
  • Same drift exists in task_grounding: current main says agent response, agent tools, and LLM general knowledge; this PR uses possessive forms (agent's, agent's tools, LLM's).

So the rebase is not purely mechanical. After rebasing onto origin/main, either:

  1. update evaluation_rubrics.py to match the current quality_report.py strings byte-for-byte, or
  2. better, make evaluation_rubrics.py the single source of truth and update scripts/quality_report.py to import these factories instead of maintaining a duplicate copy.

The second option is less fragile and directly fixes the drift class this PR is trying to prevent.

Other confirmations from local review:

  • uv run pytest tests/test_evaluation_rubrics.py -q passes on the PR branch (4 passed).
  • uv run pyink --check ... fails: all 3 touched files would be reformatted.
  • uv run isort --check-only ... fails on tests/test_evaluation_rubrics.py.
  • tests/test_evaluation_rubrics.py imports pytest but does not use it; remove during cleanup.

Copy link
Copy Markdown
Contributor

@caohy1988 caohy1988 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fresh review on PR #91 head e4678615 against current origin/main.

I still see merge-blocking issues.

Blockers

B1 — PR is still unmergeable against current main

GitHub reports mergeable=CONFLICTING, mergeStateStatus=DIRTY. I also reproduced this locally with git merge-tree: conflicts remain in:

  • scripts/quality_report.py
  • src/bigquery_agent_analytics/__init__.py

The __init__.py conflict is the important one. Current main now has public exports after the BigFrames block, including Graph Validator and extractor-compilation exports. PR #91 adds the evaluation-rubrics block at the old tail position, so the rebase cannot be treated as mechanical. The conflict must be resolved by keeping all current-main exports and adding only the new # --- Evaluation Rubrics (Phase 1) --- block.

Suggested fix:

git fetch origin main
git rebase origin/main
# In __init__.py: keep current main's Graph Validator + extractor-compilation exports,
# then add only the evaluation_rubrics re-export block.

After this, the PR diff for __init__.py should be small and additive.

B2 — formatting checks still fail

Local verification:

uv run pyink --check src/bigquery_agent_analytics/evaluation_rubrics.py tests/test_evaluation_rubrics.py src/bigquery_agent_analytics/__init__.py

fails with:

  • would reformat src/bigquery_agent_analytics/evaluation_rubrics.py
  • would reformat tests/test_evaluation_rubrics.py
  • would reformat src/bigquery_agent_analytics/__init__.py

And:

uv run isort --check-only src/bigquery_agent_analytics/evaluation_rubrics.py tests/test_evaluation_rubrics.py src/bigquery_agent_analytics/__init__.py

fails on tests/test_evaluation_rubrics.py.

Also:

git diff --check origin/main...HEAD

reports trailing blank lines at EOF in all touched files:

  • scripts/quality_report.py
  • src/bigquery_agent_analytics/__init__.py
  • src/bigquery_agent_analytics/evaluation_rubrics.py
  • tests/test_evaluation_rubrics.py

Run pyink/isort and remove the EOF whitespace after the rebase.

Medium

M1 — tests no longer verify the quality-report integration

The latest tests/test_evaluation_rubrics.py only checks structure/name sets. Since scripts/quality_report.py::get_eval_metrics() now imports the SDK factories, please add an integration assertion that protects that contract, for example:

def test_quality_report_uses_canonical_rubrics():
  from scripts.quality_report import get_eval_metrics

  metrics = get_eval_metrics()
  assert metrics[0].model_dump() == response_usefulness_metric().model_dump()
  assert metrics[1].model_dump() == task_grounding_metric().model_dump()

That catches the exact drift class this PR has been iterating on, while keeping quality_report.py at two metrics and not accidentally pulling in policy_compliance.

Verified

  • uv run pytest tests/test_evaluation_rubrics.py -q passes: 4 passed.
  • uv run python -c "from scripts.quality_report import get_eval_metrics; print([m.name for m in get_eval_metrics()])" returns ['response_usefulness', 'task_grounding'].
  • Only cla/google is attached on GitHub; format/test/build checks have not run on the PR.

The substantive rubric content is close, but this should not merge until the stale-base conflict and formatting blockers are resolved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Proposal: Automated Agent Quality Scorecard

2 participants