Refactor: Rename CodeEvaluator to SystemEvaluator#121
Conversation
caohy1988
left a comment
There was a problem hiding this comment.
Review of PR #121 head 914119a
This is the first PR in a three-PR stack (#121 → #122 → #123). Review focuses on this PR's stated scope: "Rename CodeEvaluator to SystemEvaluator, kept as backward-compat alias."
Blockers
B1 — Stale base, 24 commits behind origin/main
Branch base is 292320b. origin/main is at 8824582. The PR's mergeStateStatus: UNKNOWN is consistent with that. Several of the 32 commits since base touch evaluators.py (e.g. PR #126 added retry-on-gate-failure orchestrator). Rebase needed before merge regardless of the review outcome.
git fetch origin main
git rebase origin/main
git push --force-with-leaseFindings on the rename itself
M1 — Incomplete rename across first-party code
After the PR, 18 files still contain CodeEvaluator. Classifying:
| Category | Files | Status |
|---|---|---|
| Backward-compat alias path (intentional) | evaluators.py (line 505: CodeEvaluator = SystemEvaluator), __init__.py (re-exports both) |
✅ Correct |
| Stale docstring references inside renamed files | client.py:880,885 ("CodeEvaluator metrics are computed..."), cli.py:409, udf_kernels.py:28 |
SystemEvaluator |
| Dead imports (imported but unused) | client.py:76, grader_pipeline.py:51, cli.py:42 all do from .evaluators import CodeEvaluator, SystemEvaluator but only use SystemEvaluator |
CodeEvaluator import is dead weight; pyflakes would flag |
| First-party callers NOT renamed | deploy/remote_function/dispatch.py (14 references — every factory entry still calls CodeEvaluator.latency(...) etc.); examples/e2e_demo.py (5 refs); examples/agent_improvement_cycle/eval/operational_metrics.py (5 refs); 3 notebooks (e2e_notebook_demo.ipynb, nba_agent_trace_analysis_notebook.ipynb, context_graph_adcp_demo.ipynb); docs/implementation_plan_remote_function.md (7 refs) |
❌ These are first-party; they should use the new name |
| Tests | tests/test_udf_kernels.py (10 refs), tests/test_sdk_evaluators.py, tests/test_pr16_fixes.py, tests/test_pr17_fixes.py, tests/test_sdk_client.py |
"A CodeEvaluator alias is kept in evaluators.py for backward-compatibility" justifies the alias for external callers, but the standard for first-party code, examples, tests, and docs is to use the new name. The PR leaves these inconsistent.
Specifically deploy/remote_function/dispatch.py is the deployed-runtime façade — keeping it on CodeEvaluator is awkward because the deploy code is owned by this repo, not by external callers.
L1 — Dead imports
In client.py:76, grader_pipeline.py:51, cli.py:42, both CodeEvaluator and SystemEvaluator are imported but only SystemEvaluator is used in the bodies. Either drop the CodeEvaluator import from these files or document why it's needed (it isn't — these are SDK internals, not the alias's customers).
What looks correct
- ✅
SystemEvaluatoris the consistent name throughoutevaluators.py(164+ references). - ✅
__init__.pyre-exports bothCodeEvaluatorandSystemEvaluator, preserving the public surface for external callers (from bigquery_agent_analytics import CodeEvaluator). - ✅ The alias is correctly defined:
CodeEvaluator = SystemEvaluator(class identity preserved — verified bypython -c "from bigquery_agent_analytics import CodeEvaluator, SystemEvaluator; assert CodeEvaluator is SystemEvaluator"after a rebase test). - ✅ Test count on PR 121 head: same set of evaluator tests pass.
Verdict
Request changes, blockers being B1 (rebase) and M1 (incomplete first-party rename). Suggested approach:
- Rebase on
origin/main. - Finish the rename in
deploy/,examples/(including notebooks),docs/implementation_plan_remote_function.md, and the relevant test files. Leave the alias only for external callers (who can keep usingCodeEvaluator). - Drop the dead
CodeEvaluatorimports fromclient.py,grader_pipeline.py,cli.py.
After those changes the alias stays a true backward-compat path rather than an internal inconsistency.
|
Fresh follow-up after the existing REQUEST_CHANGES review: One additional polish item I don't see captured yet: Low / release-note gap — the public rename needs an This PR introduces Suggested addition under
That matters because users reading release notes need to know whether to migrate imports immediately or whether existing |
70b4fac to
8e3569d
Compare
|
Thanks Haiyuan! Addressed all comments on this PR. |
gigistark-google
left a comment
There was a problem hiding this comment.
Thanks Haiyuan! Addressed all comments on this PR.
caohy1988
left a comment
There was a problem hiding this comment.
Fresh review of PR #121 head 8e3569d against current origin/main.
I still see merge-blocking issues.
Blockers
B1 — PR is still conflicting against current main
GitHub reports mergeable=CONFLICTING, mergeStateStatus=DIRTY. I also reproduced the conflict locally with git merge-tree; the remaining hard conflict is in CHANGELOG.md because current main added the --events-bq-query-file entry while this PR adds the SystemEvaluator rename entry in the same section.
Please rebase on current origin/main and keep both changelog entries. Until this is resolved, GitHub cannot merge the PR.
B2 — CodeEvaluator().name changed, so the alias is not fully backward compatible
The PR keeps CodeEvaluator = SystemEvaluator, which preserves imports and isinstance behavior, but it also changes the constructor default from name="code_evaluator" to name="system_evaluator".
Current main:
from bigquery_agent_analytics import CodeEvaluator
assert CodeEvaluator().name == "code_evaluator"PR head:
from bigquery_agent_analytics import CodeEvaluator, SystemEvaluator
print(CodeEvaluator is SystemEvaluator) # True
print(CodeEvaluator().name) # "system_evaluator"That is observable API behavior because Client._evaluate_code() serializes evaluator.name into EvaluationReport.evaluator_name, and remote-function / dashboard consumers read that field. A caller using the documented backward-compatible CodeEvaluator() alias now emits a different evaluator_name than before.
Please either:
- preserve the legacy default when users instantiate
CodeEvaluator()directly, or - explicitly document this as a breaking serialized-output change and add a regression test covering the intended behavior.
Given the PR says "CodeEvaluator as a backward-compatible alias (deprecated but supported)", I think preserving the old default is the right fix.
B3 — format checks are not clean
Local checks:
uv run pyink --check \
src/bigquery_agent_analytics/evaluators.py \
src/bigquery_agent_analytics/__init__.py \
src/bigquery_agent_analytics/client.py \
src/bigquery_agent_analytics/grader_pipeline.py \
src/bigquery_agent_analytics/cli.py \
deploy/remote_function/dispatch.py \
tests/test_sdk_evaluators.py \
tests/test_sdk_client.py \
tests/test_grader_pipeline.py \
tests/test_pr16_fixes.py \
tests/test_pr17_fixes.py \
tests/test_udf_kernels.pypasses.
But:
uv run isort --check-only ...fails on deploy/remote_function/dispatch.py. The required import order is:
from bigquery_agent_analytics import Client
-from bigquery_agent_analytics import SystemEvaluator
from bigquery_agent_analytics import LLMAsJudge
from bigquery_agent_analytics import serialize
+from bigquery_agent_analytics import SystemEvaluator
from bigquery_agent_analytics import TraceFilterAnd:
git diff --check origin/main...HEADfails with trailing whitespace in examples/agent_improvement_cycle/DEMO_NARRATION.md.
Both need to be clean before merge.
Verified
uv run pytest tests/test_sdk_evaluators.py tests/test_sdk_client.py tests/test_grader_pipeline.py tests/test_pr16_fixes.py tests/test_pr17_fixes.py tests/test_udf_kernels.py -qpasses:337 passed.uv run pytest tests/test_cli.py tests/test_remote_function.py tests/test_streaming_evaluation.py -qpasses:187 passed.uv run python -m py_compile ...over the changed Python entrypoints passes.- Package-root alias smoke check passes structurally:
CodeEvaluator is SystemEvaluatorandfrom bigquery_agent_analytics import SystemEvaluatorworks.
The rename is much cleaner than the prior review pass, but the PR should not merge until the conflict, format failures, and CodeEvaluator().name compatibility regression are resolved.
caohy1988
left a comment
There was a problem hiding this comment.
Follow-up to my earlier REQUEST_CHANGES — additional blockers found on a second pass. None of these are caught by the test suite (the tests import SystemEvaluator directly, so they don't exercise the docs/notebook/CHANGELOG drift).
Additional blockers
1. SDK.md misrepresents evaluator behavior
SDK.md:120-127 changes the pre-built descriptions from "fails when X exceeds the budget" to "score degrades linearly as X approaches threshold" / "penalizes". The code at src/bigquery_agent_analytics/evaluators.py:325-369 still returns binary 1.0 if observed <= threshold else 0.0. Readers will expect partial credit and get hard cutoffs.
Fix: Either revert SDK.md to the binary pass/fail wording, or implement linear degradation in code. This PR should not silently flip the documented behavior contract during a rename.
2. docs/implementation_plan_remote_function.md mixes a design change with the rename
Lines 220-249 change EVALUATOR_FACTORIES from a single-lambda dict to a (with_threshold, without_threshold) tuple-of-lambdas dict, and silently drops the \"llm-judge\": None entry. That's a design-document mutation unrelated to renaming.
Fix: Revert the structural change to the doc, or split it into its own PR. A rename PR should not be editing an implementation plan's design.
3. Notebook reformat noise is uninspectable
examples/context_graph_adcp_demo.ipynbwas re-saved with 2-space JSON indent.examples/e2e_notebook_demo.ipynbandexamples/nba_agent_trace_analysis_notebook.ipynbstayed at 1-space.- All three notebooks also had every cell's
\"source\"field rewritten from a string to an array-of-strings (\"source\": \"code\\nmore\"→\"source\": [\"code\\n\", \"more\"]).
Result: 6014 / 818 / 1041-line diffs nobody can audit for accidental content changes. Inconsistent treatment within the same PR is itself a smell.
Fix: Either re-save all three with consistent settings (one Jupyter version, one indent), or revert the notebooks (git checkout origin/main -- examples/*.ipynb) and apply a targeted sed rename for the four CodeEvaluator → SystemEvaluator occurrences. Either way, the notebook diff after this fix should be small enough to read end-to-end.
4. CHANGELOG.md promises a deprecation the code doesn't implement
CHANGELOG.md:12 says CodeEvaluator is "deprecated but supported". The code at src/bigquery_agent_analytics/evaluators.py:633 is a bare CodeEvaluator = SystemEvaluator assignment — no warnings.warn(DeprecationWarning, …).
Fix: Pick one. Either drop "deprecated" from CHANGELOG (alias-only, no warning), or wire an actual DeprecationWarning — e.g. wrap with a thin subclass whose __init_subclass__ / __new__ raises the warning. Silent passthrough is fine; lying about it in CHANGELOG isn't.
Soft blockers
5. docs/design.md:211 ASCII art damaged
One row of box-drawing characters got shortened from
└──────────────────────┘ └──────────────────────┘ └──────────────────┘
to
└──────────────────┘ └──────────────────┘ └──────────────────┘
…breaking the visual alignment with the boxes above. Stray edit from the search-and-replace pass; revert.
6. The CodeEvaluator().name == \"system_evaluator\" blocker has a wider blast radius
The rename also changes EvaluationReport.evaluator_name, which is populated from evaluator.name at src/bigquery_agent_analytics/client.py:952, 993, 1136, 1208, 1242. Any downstream consumer that does report.evaluator_name == \"code_evaluator\" silently misses now.
Fix: Address the constructor default AND every report-emission site in one pass — don't iterate.
Nits (non-blocking but worth catching here)
7. GraderPipeline.add_code_grader name vs docstring mismatch
src/bigquery_agent_analytics/grader_pipeline.py:280 keeps the method name add_code_grader but the docstring now says "Adds a SystemEvaluator grader to the pipeline." Either alias an add_system_grader method (and keep add_code_grader for back-compat) or use neutral language in the docstring ("Adds a code-based grader").
8. Internal CLI constant _CODE_EVALUATORS not renamed
src/bigquery_agent_analytics/cli.py:217 still uses _CODE_EVALUATORS as the dict name even though every value now constructs a SystemEvaluator. Internal scaffolding only, but inconsistent with the stated goal of a thorough rename.
Summary
My existing four blockers cover the merge-conflict, name-emission backward-compat, and lint failures. The four new blockers above are: SDK.md behavior misrepresentation, smuggled-in design plan change, notebook reformat that makes the PR un-auditable, and CHANGELOG making a deprecation claim the code doesn't deliver. The author should split notebook reformatting into its own PR (or revert it entirely) and pick a deprecation story for CodeEvaluator and stick to it.
Rename CodeEvaluator to SystemEvaluator to align with its focus on system-level metrics. A CodeEvaluator alias is kept in evaluators.py for backward-compatibility.
8e3569d to
ee0ffbc
Compare
- Set SystemEvaluator default name to 'code_evaluator' for backward compatibility. - Update docstring in grader_pipeline.py to use neutral language. - Rename internal CLI constant _CODE_EVALUATORS to _SYSTEM_EVALUATORS. - Fix ASCII art in docs/design.md. - Clean up CHANGELOG.md and SDK.md.
ee0ffbc to
d442663
Compare
Rename CodeEvaluator to SystemEvaluator to align with its focus on system-level metrics. A CodeEvaluator alias is kept in evaluators.py for backward-compatibility.