Skip to content

Refactor: Rename CodeEvaluator to SystemEvaluator#121

Open
gigistark-google wants to merge 2 commits into
GoogleCloudPlatform:mainfrom
gigistark-google:pr1_rename
Open

Refactor: Rename CodeEvaluator to SystemEvaluator#121
gigistark-google wants to merge 2 commits into
GoogleCloudPlatform:mainfrom
gigistark-google:pr1_rename

Conversation

@gigistark-google
Copy link
Copy Markdown

Rename CodeEvaluator to SystemEvaluator to align with its focus on system-level metrics. A CodeEvaluator alias is kept in evaluators.py for backward-compatibility.

Copy link
Copy Markdown
Contributor

@caohy1988 caohy1988 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review of PR #121 head 914119a

This is the first PR in a three-PR stack (#121#122#123). Review focuses on this PR's stated scope: "Rename CodeEvaluator to SystemEvaluator, kept as backward-compat alias."

Blockers

B1 — Stale base, 24 commits behind origin/main

Branch base is 292320b. origin/main is at 8824582. The PR's mergeStateStatus: UNKNOWN is consistent with that. Several of the 32 commits since base touch evaluators.py (e.g. PR #126 added retry-on-gate-failure orchestrator). Rebase needed before merge regardless of the review outcome.

git fetch origin main
git rebase origin/main
git push --force-with-lease

Findings on the rename itself

M1 — Incomplete rename across first-party code

After the PR, 18 files still contain CodeEvaluator. Classifying:

Category Files Status
Backward-compat alias path (intentional) evaluators.py (line 505: CodeEvaluator = SystemEvaluator), __init__.py (re-exports both) ✅ Correct
Stale docstring references inside renamed files client.py:880,885 ("CodeEvaluator metrics are computed..."), cli.py:409, udf_kernels.py:28 ⚠️ Cosmetic, but inconsistent with the file's own code now using SystemEvaluator
Dead imports (imported but unused) client.py:76, grader_pipeline.py:51, cli.py:42 all do from .evaluators import CodeEvaluator, SystemEvaluator but only use SystemEvaluator ⚠️ CodeEvaluator import is dead weight; pyflakes would flag
First-party callers NOT renamed deploy/remote_function/dispatch.py (14 references — every factory entry still calls CodeEvaluator.latency(...) etc.); examples/e2e_demo.py (5 refs); examples/agent_improvement_cycle/eval/operational_metrics.py (5 refs); 3 notebooks (e2e_notebook_demo.ipynb, nba_agent_trace_analysis_notebook.ipynb, context_graph_adcp_demo.ipynb); docs/implementation_plan_remote_function.md (7 refs) ❌ These are first-party; they should use the new name
Tests tests/test_udf_kernels.py (10 refs), tests/test_sdk_evaluators.py, tests/test_pr16_fixes.py, tests/test_pr17_fixes.py, tests/test_sdk_client.py ⚠️ Tests should either be renamed or explicitly test the alias path

"A CodeEvaluator alias is kept in evaluators.py for backward-compatibility" justifies the alias for external callers, but the standard for first-party code, examples, tests, and docs is to use the new name. The PR leaves these inconsistent.

Specifically deploy/remote_function/dispatch.py is the deployed-runtime façade — keeping it on CodeEvaluator is awkward because the deploy code is owned by this repo, not by external callers.

L1 — Dead imports

In client.py:76, grader_pipeline.py:51, cli.py:42, both CodeEvaluator and SystemEvaluator are imported but only SystemEvaluator is used in the bodies. Either drop the CodeEvaluator import from these files or document why it's needed (it isn't — these are SDK internals, not the alias's customers).

What looks correct

  • SystemEvaluator is the consistent name throughout evaluators.py (164+ references).
  • __init__.py re-exports both CodeEvaluator and SystemEvaluator, preserving the public surface for external callers (from bigquery_agent_analytics import CodeEvaluator).
  • ✅ The alias is correctly defined: CodeEvaluator = SystemEvaluator (class identity preserved — verified by python -c "from bigquery_agent_analytics import CodeEvaluator, SystemEvaluator; assert CodeEvaluator is SystemEvaluator" after a rebase test).
  • ✅ Test count on PR 121 head: same set of evaluator tests pass.

Verdict

Request changes, blockers being B1 (rebase) and M1 (incomplete first-party rename). Suggested approach:

  1. Rebase on origin/main.
  2. Finish the rename in deploy/, examples/ (including notebooks), docs/implementation_plan_remote_function.md, and the relevant test files. Leave the alias only for external callers (who can keep using CodeEvaluator).
  3. Drop the dead CodeEvaluator imports from client.py, grader_pipeline.py, cli.py.

After those changes the alias stays a true backward-compat path rather than an internal inconsistency.

@caohy1988
Copy link
Copy Markdown
Contributor

Fresh follow-up after the existing REQUEST_CHANGES review:

One additional polish item I don't see captured yet:

Low / release-note gap — the public rename needs an Unreleased changelog entry and an alias policy.

This PR introduces SystemEvaluator as the preferred public name while keeping CodeEvaluator = SystemEvaluator as a compatibility alias. That's the right compatibility shape, but CHANGELOG.md still only documents the older CodeEvaluator behavior from v0.2.2 and has no note for the rename or the support policy for the alias.

Suggested addition under Unreleased after the rebase:

  • Added SystemEvaluator as the preferred name for deterministic/code-defined metrics.
  • Kept CodeEvaluator as a backward-compatible alias; state whether it is permanent or deprecated-but-supported.

That matters because users reading release notes need to know whether to migrate imports immediately or whether existing CodeEvaluator code remains supported.

@gigistark-google gigistark-google force-pushed the pr1_rename branch 3 times, most recently from 70b4fac to 8e3569d Compare May 12, 2026 19:58
@gigistark-google
Copy link
Copy Markdown
Author

Thanks Haiyuan! Addressed all comments on this PR.

Copy link
Copy Markdown
Author

@gigistark-google gigistark-google left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Haiyuan! Addressed all comments on this PR.

Copy link
Copy Markdown
Contributor

@caohy1988 caohy1988 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fresh review of PR #121 head 8e3569d against current origin/main.

I still see merge-blocking issues.

Blockers

B1 — PR is still conflicting against current main

GitHub reports mergeable=CONFLICTING, mergeStateStatus=DIRTY. I also reproduced the conflict locally with git merge-tree; the remaining hard conflict is in CHANGELOG.md because current main added the --events-bq-query-file entry while this PR adds the SystemEvaluator rename entry in the same section.

Please rebase on current origin/main and keep both changelog entries. Until this is resolved, GitHub cannot merge the PR.

B2 — CodeEvaluator().name changed, so the alias is not fully backward compatible

The PR keeps CodeEvaluator = SystemEvaluator, which preserves imports and isinstance behavior, but it also changes the constructor default from name="code_evaluator" to name="system_evaluator".

Current main:

from bigquery_agent_analytics import CodeEvaluator
assert CodeEvaluator().name == "code_evaluator"

PR head:

from bigquery_agent_analytics import CodeEvaluator, SystemEvaluator
print(CodeEvaluator is SystemEvaluator)  # True
print(CodeEvaluator().name)              # "system_evaluator"

That is observable API behavior because Client._evaluate_code() serializes evaluator.name into EvaluationReport.evaluator_name, and remote-function / dashboard consumers read that field. A caller using the documented backward-compatible CodeEvaluator() alias now emits a different evaluator_name than before.

Please either:

  • preserve the legacy default when users instantiate CodeEvaluator() directly, or
  • explicitly document this as a breaking serialized-output change and add a regression test covering the intended behavior.

Given the PR says "CodeEvaluator as a backward-compatible alias (deprecated but supported)", I think preserving the old default is the right fix.

B3 — format checks are not clean

Local checks:

uv run pyink --check \
  src/bigquery_agent_analytics/evaluators.py \
  src/bigquery_agent_analytics/__init__.py \
  src/bigquery_agent_analytics/client.py \
  src/bigquery_agent_analytics/grader_pipeline.py \
  src/bigquery_agent_analytics/cli.py \
  deploy/remote_function/dispatch.py \
  tests/test_sdk_evaluators.py \
  tests/test_sdk_client.py \
  tests/test_grader_pipeline.py \
  tests/test_pr16_fixes.py \
  tests/test_pr17_fixes.py \
  tests/test_udf_kernels.py

passes.

But:

uv run isort --check-only ...

fails on deploy/remote_function/dispatch.py. The required import order is:

 from bigquery_agent_analytics import Client
-from bigquery_agent_analytics import SystemEvaluator
 from bigquery_agent_analytics import LLMAsJudge
 from bigquery_agent_analytics import serialize
+from bigquery_agent_analytics import SystemEvaluator
 from bigquery_agent_analytics import TraceFilter

And:

git diff --check origin/main...HEAD

fails with trailing whitespace in examples/agent_improvement_cycle/DEMO_NARRATION.md.

Both need to be clean before merge.

Verified

  • uv run pytest tests/test_sdk_evaluators.py tests/test_sdk_client.py tests/test_grader_pipeline.py tests/test_pr16_fixes.py tests/test_pr17_fixes.py tests/test_udf_kernels.py -q passes: 337 passed.
  • uv run pytest tests/test_cli.py tests/test_remote_function.py tests/test_streaming_evaluation.py -q passes: 187 passed.
  • uv run python -m py_compile ... over the changed Python entrypoints passes.
  • Package-root alias smoke check passes structurally: CodeEvaluator is SystemEvaluator and from bigquery_agent_analytics import SystemEvaluator works.

The rename is much cleaner than the prior review pass, but the PR should not merge until the conflict, format failures, and CodeEvaluator().name compatibility regression are resolved.

Copy link
Copy Markdown
Contributor

@caohy1988 caohy1988 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow-up to my earlier REQUEST_CHANGES — additional blockers found on a second pass. None of these are caught by the test suite (the tests import SystemEvaluator directly, so they don't exercise the docs/notebook/CHANGELOG drift).


Additional blockers

1. SDK.md misrepresents evaluator behavior

SDK.md:120-127 changes the pre-built descriptions from "fails when X exceeds the budget" to "score degrades linearly as X approaches threshold" / "penalizes". The code at src/bigquery_agent_analytics/evaluators.py:325-369 still returns binary 1.0 if observed <= threshold else 0.0. Readers will expect partial credit and get hard cutoffs.

Fix: Either revert SDK.md to the binary pass/fail wording, or implement linear degradation in code. This PR should not silently flip the documented behavior contract during a rename.

2. docs/implementation_plan_remote_function.md mixes a design change with the rename

Lines 220-249 change EVALUATOR_FACTORIES from a single-lambda dict to a (with_threshold, without_threshold) tuple-of-lambdas dict, and silently drops the \"llm-judge\": None entry. That's a design-document mutation unrelated to renaming.

Fix: Revert the structural change to the doc, or split it into its own PR. A rename PR should not be editing an implementation plan's design.

3. Notebook reformat noise is uninspectable

  • examples/context_graph_adcp_demo.ipynb was re-saved with 2-space JSON indent.
  • examples/e2e_notebook_demo.ipynb and examples/nba_agent_trace_analysis_notebook.ipynb stayed at 1-space.
  • All three notebooks also had every cell's \"source\" field rewritten from a string to an array-of-strings (\"source\": \"code\\nmore\"\"source\": [\"code\\n\", \"more\"]).

Result: 6014 / 818 / 1041-line diffs nobody can audit for accidental content changes. Inconsistent treatment within the same PR is itself a smell.

Fix: Either re-save all three with consistent settings (one Jupyter version, one indent), or revert the notebooks (git checkout origin/main -- examples/*.ipynb) and apply a targeted sed rename for the four CodeEvaluatorSystemEvaluator occurrences. Either way, the notebook diff after this fix should be small enough to read end-to-end.

4. CHANGELOG.md promises a deprecation the code doesn't implement

CHANGELOG.md:12 says CodeEvaluator is "deprecated but supported". The code at src/bigquery_agent_analytics/evaluators.py:633 is a bare CodeEvaluator = SystemEvaluator assignment — no warnings.warn(DeprecationWarning, …).

Fix: Pick one. Either drop "deprecated" from CHANGELOG (alias-only, no warning), or wire an actual DeprecationWarning — e.g. wrap with a thin subclass whose __init_subclass__ / __new__ raises the warning. Silent passthrough is fine; lying about it in CHANGELOG isn't.


Soft blockers

5. docs/design.md:211 ASCII art damaged

One row of box-drawing characters got shortened from

└──────────────────────┘  └──────────────────────┘  └──────────────────┘

to

└──────────────────┘  └──────────────────┘  └──────────────────┘

…breaking the visual alignment with the boxes above. Stray edit from the search-and-replace pass; revert.

6. The CodeEvaluator().name == \"system_evaluator\" blocker has a wider blast radius

The rename also changes EvaluationReport.evaluator_name, which is populated from evaluator.name at src/bigquery_agent_analytics/client.py:952, 993, 1136, 1208, 1242. Any downstream consumer that does report.evaluator_name == \"code_evaluator\" silently misses now.

Fix: Address the constructor default AND every report-emission site in one pass — don't iterate.


Nits (non-blocking but worth catching here)

7. GraderPipeline.add_code_grader name vs docstring mismatch

src/bigquery_agent_analytics/grader_pipeline.py:280 keeps the method name add_code_grader but the docstring now says "Adds a SystemEvaluator grader to the pipeline." Either alias an add_system_grader method (and keep add_code_grader for back-compat) or use neutral language in the docstring ("Adds a code-based grader").

8. Internal CLI constant _CODE_EVALUATORS not renamed

src/bigquery_agent_analytics/cli.py:217 still uses _CODE_EVALUATORS as the dict name even though every value now constructs a SystemEvaluator. Internal scaffolding only, but inconsistent with the stated goal of a thorough rename.


Summary

My existing four blockers cover the merge-conflict, name-emission backward-compat, and lint failures. The four new blockers above are: SDK.md behavior misrepresentation, smuggled-in design plan change, notebook reformat that makes the PR un-auditable, and CHANGELOG making a deprecation claim the code doesn't deliver. The author should split notebook reformatting into its own PR (or revert it entirely) and pick a deprecation story for CodeEvaluator and stick to it.

Rename CodeEvaluator to SystemEvaluator to align with its focus on system-level metrics. A CodeEvaluator alias is kept in evaluators.py for backward-compatibility.
- Set SystemEvaluator default name to 'code_evaluator' for backward compatibility.
- Update docstring in grader_pipeline.py to use neutral language.
- Rename internal CLI constant _CODE_EVALUATORS to _SYSTEM_EVALUATORS.
- Fix ASCII art in docs/design.md.
- Clean up CHANGELOG.md and SDK.md.
Copy link
Copy Markdown
Author

@gigistark-google gigistark-google left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much for your detailed feedback! It is much appreciated! I am definitely rusty on GitHub so apologies in particular for missing some of these formatting/rebase things.

All feedback has now been addressed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants