Refactor: Rename CodeEvaluator to SystemEvaluator by gigistark-google · Pull Request #121 · GoogleCloudPlatform/BigQuery-Agent-Analytics-SDK

gigistark-google · 2026-05-06T18:51:41Z

Rename CodeEvaluator to SystemEvaluator to align with its focus on system-level metrics. A CodeEvaluator alias is kept in evaluators.py for backward-compatibility.

caohy1988

Review of PR #121 head `914119a`

This is the first PR in a three-PR stack (#121 → #122 → #123). Review focuses on this PR's stated scope: "Rename CodeEvaluator to SystemEvaluator, kept as backward-compat alias."

Blockers

B1 — Stale base, 24 commits behind `origin/main`

Branch base is 292320b. origin/main is at 8824582. The PR's mergeStateStatus: UNKNOWN is consistent with that. Several of the 32 commits since base touch evaluators.py (e.g. PR #126 added retry-on-gate-failure orchestrator). Rebase needed before merge regardless of the review outcome.

git fetch origin main
git rebase origin/main
git push --force-with-lease

Findings on the rename itself

M1 — Incomplete rename across first-party code

After the PR, 18 files still contain CodeEvaluator. Classifying:

Category	Files	Status
Backward-compat alias path (intentional)	`evaluators.py` (line 505: `CodeEvaluator = SystemEvaluator`), `__init__.py` (re-exports both)	✅ Correct
Stale docstring references inside renamed files	`client.py:880,885` ("CodeEvaluator metrics are computed..."), `cli.py:409`, `udf_kernels.py:28`	⚠️ Cosmetic, but inconsistent with the file's own code now using `SystemEvaluator`
Dead imports (imported but unused)	`client.py:76`, `grader_pipeline.py:51`, `cli.py:42` all do `from .evaluators import CodeEvaluator, SystemEvaluator` but only use `SystemEvaluator`	⚠️ `CodeEvaluator` import is dead weight; pyflakes would flag
First-party callers NOT renamed	`deploy/remote_function/dispatch.py` (14 references — every factory entry still calls `CodeEvaluator.latency(...)` etc.); `examples/e2e_demo.py` (5 refs); `examples/agent_improvement_cycle/eval/operational_metrics.py` (5 refs); 3 notebooks (`e2e_notebook_demo.ipynb`, `nba_agent_trace_analysis_notebook.ipynb`, `context_graph_adcp_demo.ipynb`); `docs/implementation_plan_remote_function.md` (7 refs)	❌ These are first-party; they should use the new name
Tests	`tests/test_udf_kernels.py` (10 refs), `tests/test_sdk_evaluators.py`, `tests/test_pr16_fixes.py`, `tests/test_pr17_fixes.py`, `tests/test_sdk_client.py`	⚠️ Tests should either be renamed or explicitly test the alias path

"A CodeEvaluator alias is kept in evaluators.py for backward-compatibility" justifies the alias for external callers, but the standard for first-party code, examples, tests, and docs is to use the new name. The PR leaves these inconsistent.

Specifically deploy/remote_function/dispatch.py is the deployed-runtime façade — keeping it on CodeEvaluator is awkward because the deploy code is owned by this repo, not by external callers.

L1 — Dead imports

In client.py:76, grader_pipeline.py:51, cli.py:42, both CodeEvaluator and SystemEvaluator are imported but only SystemEvaluator is used in the bodies. Either drop the CodeEvaluator import from these files or document why it's needed (it isn't — these are SDK internals, not the alias's customers).

What looks correct

✅ SystemEvaluator is the consistent name throughout evaluators.py (164+ references).
✅ __init__.py re-exports both CodeEvaluator and SystemEvaluator, preserving the public surface for external callers (from bigquery_agent_analytics import CodeEvaluator).
✅ The alias is correctly defined: CodeEvaluator = SystemEvaluator (class identity preserved — verified by python -c "from bigquery_agent_analytics import CodeEvaluator, SystemEvaluator; assert CodeEvaluator is SystemEvaluator" after a rebase test).
✅ Test count on PR 121 head: same set of evaluator tests pass.

Verdict

Request changes, blockers being B1 (rebase) and M1 (incomplete first-party rename). Suggested approach:

Rebase on origin/main.
Finish the rename in deploy/, examples/ (including notebooks), docs/implementation_plan_remote_function.md, and the relevant test files. Leave the alias only for external callers (who can keep using CodeEvaluator).
Drop the dead CodeEvaluator imports from client.py, grader_pipeline.py, cli.py.

After those changes the alias stays a true backward-compat path rather than an internal inconsistency.

caohy1988 · 2026-05-11T22:36:55Z

Fresh follow-up after the existing REQUEST_CHANGES review:

One additional polish item I don't see captured yet:

Low / release-note gap — the public rename needs an Unreleased changelog entry and an alias policy.

This PR introduces SystemEvaluator as the preferred public name while keeping CodeEvaluator = SystemEvaluator as a compatibility alias. That's the right compatibility shape, but CHANGELOG.md still only documents the older CodeEvaluator behavior from v0.2.2 and has no note for the rename or the support policy for the alias.

Suggested addition under Unreleased after the rebase:

Added SystemEvaluator as the preferred name for deterministic/code-defined metrics.
Kept CodeEvaluator as a backward-compatible alias; state whether it is permanent or deprecated-but-supported.

That matters because users reading release notes need to know whether to migrate imports immediately or whether existing CodeEvaluator code remains supported.

gigistark-google · 2026-05-12T20:00:07Z

Thanks Haiyuan! Addressed all comments on this PR.

gigistark-google

Thanks Haiyuan! Addressed all comments on this PR.

caohy1988

Fresh review of PR #121 head 8e3569d against current origin/main.

I still see merge-blocking issues.

Blockers

B1 — PR is still conflicting against current `main`

GitHub reports mergeable=CONFLICTING, mergeStateStatus=DIRTY. I also reproduced the conflict locally with git merge-tree; the remaining hard conflict is in CHANGELOG.md because current main added the --events-bq-query-file entry while this PR adds the SystemEvaluator rename entry in the same section.

Please rebase on current origin/main and keep both changelog entries. Until this is resolved, GitHub cannot merge the PR.

B2 — `CodeEvaluator().name` changed, so the alias is not fully backward compatible

The PR keeps CodeEvaluator = SystemEvaluator, which preserves imports and isinstance behavior, but it also changes the constructor default from name="code_evaluator" to name="system_evaluator".

Current main:

from bigquery_agent_analytics import CodeEvaluator
assert CodeEvaluator().name == "code_evaluator"

PR head:

from bigquery_agent_analytics import CodeEvaluator, SystemEvaluator
print(CodeEvaluator is SystemEvaluator)  # True
print(CodeEvaluator().name)              # "system_evaluator"

That is observable API behavior because Client._evaluate_code() serializes evaluator.name into EvaluationReport.evaluator_name, and remote-function / dashboard consumers read that field. A caller using the documented backward-compatible CodeEvaluator() alias now emits a different evaluator_name than before.

Please either:

preserve the legacy default when users instantiate CodeEvaluator() directly, or
explicitly document this as a breaking serialized-output change and add a regression test covering the intended behavior.

Given the PR says "CodeEvaluator as a backward-compatible alias (deprecated but supported)", I think preserving the old default is the right fix.

B3 — format checks are not clean

Local checks:

uv run pyink --check \
  src/bigquery_agent_analytics/evaluators.py \
  src/bigquery_agent_analytics/__init__.py \
  src/bigquery_agent_analytics/client.py \
  src/bigquery_agent_analytics/grader_pipeline.py \
  src/bigquery_agent_analytics/cli.py \
  deploy/remote_function/dispatch.py \
  tests/test_sdk_evaluators.py \
  tests/test_sdk_client.py \
  tests/test_grader_pipeline.py \
  tests/test_pr16_fixes.py \
  tests/test_pr17_fixes.py \
  tests/test_udf_kernels.py

passes.

But:

uv run isort --check-only ...

fails on deploy/remote_function/dispatch.py. The required import order is:

 from bigquery_agent_analytics import Client
-from bigquery_agent_analytics import SystemEvaluator
 from bigquery_agent_analytics import LLMAsJudge
 from bigquery_agent_analytics import serialize
+from bigquery_agent_analytics import SystemEvaluator
 from bigquery_agent_analytics import TraceFilter

And:

git diff --check origin/main...HEAD

fails with trailing whitespace in examples/agent_improvement_cycle/DEMO_NARRATION.md.

Both need to be clean before merge.

Verified

uv run pytest tests/test_sdk_evaluators.py tests/test_sdk_client.py tests/test_grader_pipeline.py tests/test_pr16_fixes.py tests/test_pr17_fixes.py tests/test_udf_kernels.py -q passes: 337 passed.
uv run pytest tests/test_cli.py tests/test_remote_function.py tests/test_streaming_evaluation.py -q passes: 187 passed.
uv run python -m py_compile ... over the changed Python entrypoints passes.
Package-root alias smoke check passes structurally: CodeEvaluator is SystemEvaluator and from bigquery_agent_analytics import SystemEvaluator works.

The rename is much cleaner than the prior review pass, but the PR should not merge until the conflict, format failures, and CodeEvaluator().name compatibility regression are resolved.

caohy1988

Follow-up to my earlier REQUEST_CHANGES — additional blockers found on a second pass. None of these are caught by the test suite (the tests import SystemEvaluator directly, so they don't exercise the docs/notebook/CHANGELOG drift).

Additional blockers

1. `SDK.md` misrepresents evaluator behavior

SDK.md:120-127 changes the pre-built descriptions from "fails when X exceeds the budget" to "score degrades linearly as X approaches threshold" / "penalizes". The code at src/bigquery_agent_analytics/evaluators.py:325-369 still returns binary 1.0 if observed <= threshold else 0.0. Readers will expect partial credit and get hard cutoffs.

Fix: Either revert SDK.md to the binary pass/fail wording, or implement linear degradation in code. This PR should not silently flip the documented behavior contract during a rename.

2. `docs/implementation_plan_remote_function.md` mixes a design change with the rename

Lines 220-249 change EVALUATOR_FACTORIES from a single-lambda dict to a (with_threshold, without_threshold) tuple-of-lambdas dict, and silently drops the \"llm-judge\": None entry. That's a design-document mutation unrelated to renaming.

Fix: Revert the structural change to the doc, or split it into its own PR. A rename PR should not be editing an implementation plan's design.

3. Notebook reformat noise is uninspectable

examples/context_graph_adcp_demo.ipynb was re-saved with 2-space JSON indent.
examples/e2e_notebook_demo.ipynb and examples/nba_agent_trace_analysis_notebook.ipynb stayed at 1-space.
All three notebooks also had every cell's \"source\" field rewritten from a string to an array-of-strings (\"source\": \"code\\nmore\" → \"source\": [\"code\\n\", \"more\"]).

Result: 6014 / 818 / 1041-line diffs nobody can audit for accidental content changes. Inconsistent treatment within the same PR is itself a smell.

Fix: Either re-save all three with consistent settings (one Jupyter version, one indent), or revert the notebooks (git checkout origin/main -- examples/*.ipynb) and apply a targeted sed rename for the four CodeEvaluator → SystemEvaluator occurrences. Either way, the notebook diff after this fix should be small enough to read end-to-end.

4. `CHANGELOG.md` promises a deprecation the code doesn't implement

CHANGELOG.md:12 says CodeEvaluator is "deprecated but supported". The code at src/bigquery_agent_analytics/evaluators.py:633 is a bare CodeEvaluator = SystemEvaluator assignment — no warnings.warn(DeprecationWarning, …).

Fix: Pick one. Either drop "deprecated" from CHANGELOG (alias-only, no warning), or wire an actual DeprecationWarning — e.g. wrap with a thin subclass whose __init_subclass__ / __new__ raises the warning. Silent passthrough is fine; lying about it in CHANGELOG isn't.

Soft blockers

5. `docs/design.md:211` ASCII art damaged

One row of box-drawing characters got shortened from

└──────────────────────┘  └──────────────────────┘  └──────────────────┘

to

└──────────────────┘  └──────────────────┘  └──────────────────┘

…breaking the visual alignment with the boxes above. Stray edit from the search-and-replace pass; revert.

6. The `CodeEvaluator().name == \"system_evaluator\"` blocker has a wider blast radius

The rename also changes EvaluationReport.evaluator_name, which is populated from evaluator.name at src/bigquery_agent_analytics/client.py:952, 993, 1136, 1208, 1242. Any downstream consumer that does report.evaluator_name == \"code_evaluator\" silently misses now.

Fix: Address the constructor default AND every report-emission site in one pass — don't iterate.

Nits (non-blocking but worth catching here)

7. `GraderPipeline.add_code_grader` name vs docstring mismatch

src/bigquery_agent_analytics/grader_pipeline.py:280 keeps the method name add_code_grader but the docstring now says "Adds a SystemEvaluator grader to the pipeline." Either alias an add_system_grader method (and keep add_code_grader for back-compat) or use neutral language in the docstring ("Adds a code-based grader").

8. Internal CLI constant `_CODE_EVALUATORS` not renamed

src/bigquery_agent_analytics/cli.py:217 still uses _CODE_EVALUATORS as the dict name even though every value now constructs a SystemEvaluator. Internal scaffolding only, but inconsistent with the stated goal of a thorough rename.

Summary

My existing four blockers cover the merge-conflict, name-emission backward-compat, and lint failures. The four new blockers above are: SDK.md behavior misrepresentation, smuggled-in design plan change, notebook reformat that makes the PR un-auditable, and CHANGELOG making a deprecation claim the code doesn't deliver. The author should split notebook reformatting into its own PR (or revert it entirely) and pick a deprecation story for CodeEvaluator and stick to it.

Rename CodeEvaluator to SystemEvaluator to align with its focus on system-level metrics. A CodeEvaluator alias is kept in evaluators.py for backward-compatibility.

- Set SystemEvaluator default name to 'code_evaluator' for backward compatibility. - Update docstring in grader_pipeline.py to use neutral language. - Rename internal CLI constant _CODE_EVALUATORS to _SYSTEM_EVALUATORS. - Fix ASCII art in docs/design.md. - Clean up CHANGELOG.md and SDK.md.

gigistark-google

Thank you so much for your detailed feedback! It is much appreciated! I am definitely rusty on GitHub so apologies in particular for missing some of these formatting/rebase things.

All feedback has now been addressed.

gigistark-google mentioned this pull request May 6, 2026

Refactor: Rename CodeEvaluator to SystemEvaluator #110

Closed

caohy1988 suggested changes May 11, 2026

View reviewed changes

This was referenced May 11, 2026

Stage identical baseline copies and test renames for lineage tracking #122

Open

Unify One-Sided & Side-by-Side Performance Metrics in the PerformanceEvaluator. #123

Open

gigistark-google force-pushed the pr1_rename branch 3 times, most recently from 70b4fac to 8e3569d Compare May 12, 2026 19:58

gigistark-google commented May 12, 2026

View reviewed changes

caohy1988 suggested changes May 12, 2026

View reviewed changes

Refactor: Rename CodeEvaluator to SystemEvaluator

3454132

Rename CodeEvaluator to SystemEvaluator to align with its focus on system-level metrics. A CodeEvaluator alias is kept in evaluators.py for backward-compatibility.

gigistark-google force-pushed the pr1_rename branch from 8e3569d to ee0ffbc Compare May 12, 2026 23:23

gigistark-google force-pushed the pr1_rename branch from ee0ffbc to d442663 Compare May 12, 2026 23:29

gigistark-google commented May 12, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor: Rename CodeEvaluator to SystemEvaluator#121

Refactor: Rename CodeEvaluator to SystemEvaluator#121
gigistark-google wants to merge 2 commits into
GoogleCloudPlatform:mainfrom
gigistark-google:pr1_rename

gigistark-google commented May 6, 2026

Uh oh!

caohy1988 left a comment

Uh oh!

caohy1988 commented May 11, 2026

Uh oh!

gigistark-google commented May 12, 2026

Uh oh!

gigistark-google left a comment

Uh oh!

caohy1988 left a comment

Uh oh!

caohy1988 left a comment

Uh oh!

gigistark-google left a comment •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

gigistark-google commented May 6, 2026

Uh oh!

caohy1988 left a comment

Choose a reason for hiding this comment

Review of PR #121 head 914119a

Blockers

B1 — Stale base, 24 commits behind origin/main

Findings on the rename itself

M1 — Incomplete rename across first-party code

L1 — Dead imports

What looks correct

Verdict

Uh oh!

caohy1988 commented May 11, 2026

Uh oh!

gigistark-google commented May 12, 2026

Uh oh!

gigistark-google left a comment

Choose a reason for hiding this comment

Uh oh!

caohy1988 left a comment

Choose a reason for hiding this comment

Blockers

B1 — PR is still conflicting against current main

B2 — CodeEvaluator().name changed, so the alias is not fully backward compatible

B3 — format checks are not clean

Verified

Uh oh!

caohy1988 left a comment

Choose a reason for hiding this comment

Additional blockers

1. SDK.md misrepresents evaluator behavior

2. docs/implementation_plan_remote_function.md mixes a design change with the rename

3. Notebook reformat noise is uninspectable

4. CHANGELOG.md promises a deprecation the code doesn't implement

Soft blockers

5. docs/design.md:211 ASCII art damaged

6. The CodeEvaluator().name == \"system_evaluator\" blocker has a wider blast radius

Nits (non-blocking but worth catching here)

7. GraderPipeline.add_code_grader name vs docstring mismatch

8. Internal CLI constant _CODE_EVALUATORS not renamed

Summary

Uh oh!

gigistark-google left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Review of PR #121 head `914119a`

B1 — Stale base, 24 commits behind `origin/main`

B1 — PR is still conflicting against current `main`

B2 — `CodeEvaluator().name` changed, so the alias is not fully backward compatible

1. `SDK.md` misrepresents evaluator behavior

2. `docs/implementation_plan_remote_function.md` mixes a design change with the rename

4. `CHANGELOG.md` promises a deprecation the code doesn't implement

5. `docs/design.md:211` ASCII art damaged

6. The `CodeEvaluator().name == \"system_evaluator\"` blocker has a wider blast radius

7. `GraderPipeline.add_code_grader` name vs docstring mismatch

8. Internal CLI constant `_CODE_EVALUATORS` not renamed

gigistark-google left a comment •

edited

Loading