Skip to content

Refactor: Rename CodeEvaluator to SystemEvaluator#110

Closed
gigistark-google wants to merge 3 commits into
GoogleCloudPlatform:mainfrom
gigistark-google:gigistark-google
Closed

Refactor: Rename CodeEvaluator to SystemEvaluator#110
gigistark-google wants to merge 3 commits into
GoogleCloudPlatform:mainfrom
gigistark-google:gigistark-google

Conversation

@gigistark-google
Copy link
Copy Markdown

Rename CodeEvaluator to SystemEvaluator to align with its focus on system-level metrics.

Some of the files import and export both CodeEvaluator and SystemEvaluator to maintain strict backward compatibility.

  • The Alias (evaluators.py): We defined CodeEvaluator = SystemEvaluator inside evaluators.py.
  • The Imports (init.py, client.py, cli.py, grader_pipeline.py): By importing and re-exporting both CodeEvaluator and SystemEvaluator, we ensure:
    • Existing user scripts and pipelines (e.g., notebooks or custom scripts written by other team members) that currently import CodeEvaluator from the SDK will continue working seamlessly without any import errors.
    • New code and updated documentation can immediately transition to importing and instantiating SystemEvaluator.

Once the team has successfully migrated all internal pipelines to SystemEvaluator, we can safely delete the CodeEvaluator alias and double-imports in a future major version release.

Rename CodeEvaluator to SystemEvaluator to align with its focus on system-level metrics. A CodeEvaluator alias is kept in evaluators.py for backward-compatibility.
@gigistark-google gigistark-google force-pushed the gigistark-google branch 22 times, most recently from a8fc1a0 to d079d31 Compare May 3, 2026 17:46
@gigistark-google gigistark-google force-pushed the gigistark-google branch 6 times, most recently from ed76ad1 to bd557c6 Compare May 3, 2026 18:39
@gigistark-google gigistark-google force-pushed the gigistark-google branch 16 times, most recently from 23391df to d48e249 Compare May 4, 2026 02:06
…Evaluator, but don't add new metrics to the MultiTrialPerformance Evaluator yet

Purged obsolete criteria-list LLMAsJudge implementations, replacing them natively with PerformanceEvaluator for folded Tone, Faithfulness, Correctness, and Efficiency evaluations.

- Decoupled system and performance modules cleanly, making system_evaluator.py pure to SystemEvaluator.
- Overrode the backwards-compatible LLMAsJudge subclass in evaluators.py with required static factories for correctness, hallucination, and sentiment.
- PURGED criteria-list BQML execution code from client.py, and deleted legacy _criteria and _JudgeCriterion list validations throughout test suites.
- Fixed Jupyter event-loop context constraints via robust asyncio running event-loop setters inside Client._evaluate_performance.
- Refactored strip_markdown_fences in utils.py to drop trailing prose after fenced markdown closing backticks cleanly.
- Verified 1,997 collected unit tests PASSING 100% green successfully.

TAG=agy
CONV=bf5607ce-a7fc-4a29-a7fb-c6074580e613
Copy link
Copy Markdown
Contributor

@caohy1988 caohy1988 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the substantial refactor. There's a lot of good motion here — extracting SystemEvaluator, splitting performance vs system, adding the rich Aggregate Grader. Two CI-blocking quality issues and one substantive correctness issue I'd want addressed before merge.

High — blocking

H1. Backward compatibility for LLMAsJudge is broken, despite the PR description claiming it's preserved.

The PR body says: "Existing user scripts and pipelines that currently import CodeEvaluator from the SDK will continue working seamlessly without any import errors." CodeEvaluator → SystemEvaluator is fine (clean alias). But for LLMAsJudge, verified live with the PR head:

>>> from bigquery_agent_analytics import LLMAsJudge
>>> j = LLMAsJudge.correctness()
>>> j.project_id, j.dataset_id
('proj', 'ds')                              # ← hardcoded NONSENSE defaults
>>> hasattr(j, 'add_criterion')
False                                       # ← old method missing
>>> import inspect; inspect.signature(j.evaluate_session)
(session_id, golden_trajectory=None, …, use_llm_judge=False, …) -> EvaluationResult
                                            # ← old signature was
                                            #    (trace_text, final_response) -> SessionScore
>>> from bigquery_agent_analytics.evaluators import _JudgeCriterion
ImportError                                 # ← was importable on main

So existing user code along these lines:

judge = LLMAsJudge.correctness()
judge.add_criterion(name="x", prompt_template="...", score_key="x", threshold=0.5)
score: SessionScore = await judge.evaluate_session(trace_text, final_response)

will either (a) AttributeError at add_criterion, or (b) silently mis-route (trace_text, final_response) into (session_id, golden_trajectory) and try to query bogus BigQuery dataset proj.ds.agent_events. Two paths forward:

  • Make the shim a real adapter — reimplement add_criterion + the old (trace_text, final_response) signature on LLMAsJudge in evaluators.py. Drop the hardcoded project_id="proj" / dataset_id="ds" defaults; the old LLMAsJudge ran via google-genai directly and didn't need a BQ project.
  • Or be honest about the break — drop the seamless-compat claim from the PR description, document the migration path, and bump a major version.

Currently the PR claims a property it doesn't deliver, which is the worst of both worlds.

H2. CI Format check (pyink) will fail on first run. Verified locally with the repo's own pyproject.toml:

$ pyink --config pyproject.toml --check src/.../...py tests/...
9 files would be reformatted

Affected: evaluators.py, performance_evaluator.py, client.py, and the four migrated test files. Mostly mechanical: missing blank lines after imports, long-form def __init__ wraps, trailing whitespace.

H3. CI isort check will fail. 8 files have import-order or grouping violations: system_evaluator.py, performance_evaluator.py, multi_trial_performance_evaluator.py, aggregate_grader.py, evaluators.py, trace_evaluator.py, grader_pipeline.py, __init__.py. Sample violation:

- from .utils import _parse_json_from_text, strip_markdown_fences
+ from .utils import _parse_json_from_text
+ from .utils import strip_markdown_fences

H2 + H3 are mechanical: pyink --config pyproject.toml and isort --settings-path pyproject.toml followed by a single squash-fixup commit resolves both.

Medium

M1. Wildcard imports in shim files are an anti-pattern and create order-dependent name collisions. evaluators.py:18-19 does from .system_evaluator import * then from .performance_evaluator import *. system_evaluator.py:580 already aliases from .performance_evaluator import PerformanceEvaluator as LLMAsJudge, so the symbol LLMAsJudge ends up defined three times in the shim's namespace before line 21 finally defines its own. Replace import * with explicit import lines + __all__.

M2. The architectural claim "Decoupled system and performance modules cleanly, making system_evaluator.py pure to SystemEvaluator" (commit d80d5f8 message) is false. system_evaluator.py:580 imports from performance_evaluator.py to alias LLMAsJudge — the opposite of "pure to SystemEvaluator." The alias belongs in the shim layer (evaluators.py), not the new pure module.

M3. TestMergeCriterionMissingCriteria deleted but _merge_criterion_reports still in client.py (lines 1082, 1116, 2516). The function has two callers and zero remaining tests. The deleted test exercised the missing-criterion-defaults-to-0.0 fail path — now untested. Either keep the test (port it to use the new shapes) or delete _merge_criterion_reports if it's no longer needed.

M4. The LLMAsJudge shim's _criteria property fabricates synthetic objects with prompt_template="". Anything that introspects judge._criteria[0].prompt_template (debugging tools, downstream test fixtures, Phase 1 scorecard rubric inspection) silently gets the empty string.

M5. Strip internal-tooling tags from commit d80d5f8. The commit message ends with:

TAG=agy
CONV=bf5607ce-a7fc-4a29-a7fb-c6074580e613

Looks like internal provenance; doesn't belong in OSS git history. git commit --amend before the next push.

M6. PR title understates scope. "Refactor: Rename CodeEvaluator to SystemEvaluator" implies a pure rename, but commit d80d5f8 admits "Unify One-Sided & Side-by-Side Performance Metrics in the PerformanceEvaluator", "Purged obsolete criteria-list LLMAsJudge implementations", "Decoupled system and performance modules", "Fixed Jupyter event-loop context constraints", "Refactored strip_markdown_fences in utils.py". Either rename the PR to reflect the full scope, or split commit 3 into its own PR for cleaner review.

Low

L1. evaluators.py doesn't have from __future__ import annotations. The -> LLMAsJudge forward reference in static factories happens to work at runtime because the class body is complete by the time @staticmethod annotations evaluate, but it's brittle — one refactor away from NameError.

L2. evaluators.py:20 re-exports _parse_json_from_text and _extract_json_from_text (underscore-prefixed names imply "internal"). Either drop the underscore in utils.py or don't re-export.

L3. Test class deletions in test_pr16_fixes.py removed coverage for _JudgeCriterion. If the type is genuinely gone (it is — the import fails now), the deletions are correct, but there's no migration story for users who relied on it. Document in CHANGELOG.

L4. Commit 1 (the rename) and commit 3 (the substantive refactor) could be one PR each. Reviewers focused on the rename commit may miss the behavior changes in commit 3.

Verification I ran locally

Check Result
Package import
pytest tests/ (with pytest-asyncio from pyproject.toml test deps) 1952 passed, 18 skipped — matches the PR's claim
pyink --config pyproject.toml --check on changed files 9 files would be reformatted — Format check would fail in CI
isort --settings-path pyproject.toml --check-only on changed files 8 files have import-order violations — would fail in CI
LLMAsJudge.correctness() backward-compat sanity Broken — see H1

Suggested path forward

  1. Reshape the LLMAsJudge shim so old call patterns either work or fail with a clear DeprecationWarning + ImportError("use PerformanceEvaluator.evaluate_session instead"). Silent mis-routing is the worst outcome.
  2. Drop project_id="proj" / dataset_id="ds" hardcoded defaults in evaluators.py:25-29.
  3. Run pyink --config pyproject.toml and isort --settings-path pyproject.toml; squash-fixup.
  4. Move the LLMAsJudge alias OUT of system_evaluator.py:580 and INTO the shim.
  5. Replace wildcard imports in shims with explicit imports + __all__.
  6. Strip TAG=agy / CONV=… from commit d80d5f8's message.
  7. Either retitle the PR to reflect the full scope or split commit 3 into its own PR.
  8. Either restore the TestMergeCriterionMissingCriteria test against the new shapes, or delete _merge_criterion_reports.
  9. Once these land, ask a maintainer to trigger CI — Format/Tests should then go green.

Once H1, H2, H3 are addressed, this becomes a one-screen re-review.

``evaluate()`` function orchestrates batch evaluation using BigQuery's
native AI functions for scalable, zero-ETL assessment.
from typing import Optional
from .system_evaluator import *
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wildcard imports + missing __all__ discipline = order-dependent name collision (M1).

system_evaluator.py:580 already aliases from .performance_evaluator import PerformanceEvaluator as LLMAsJudge. So by the time line 21 of this file defines class LLMAsJudge(PerformanceEvaluator), the symbol has already been bound twice in this namespace (once from each star import), and only line 21 wins.

Suggested:

from .system_evaluator import (
    CodeEvaluator, EvaluationReport, SessionScore, SystemEvaluator,
    SESSION_SUMMARY_QUERY,
)
from .performance_evaluator import PerformanceEvaluator
from .utils import _parse_json_from_text, _extract_json_from_text, strip_markdown_fences

__all__ = [
    "CodeEvaluator", "EvaluationReport", "SessionScore", "SystemEvaluator",
    "PerformanceEvaluator", "LLMAsJudge",
    "_parse_json_from_text", "_extract_json_from_text", "strip_markdown_fences",
    "SESSION_SUMMARY_QUERY",
]

Re-exporting _parse_json_from_text and _extract_json_from_text is a Low concern (L2): the leading underscore implies "internal" — either drop the underscore in utils.py or stop re-exporting them here.

from bigquery_agent_analytics.evaluators import (
CodeEvaluator, LLMAsJudge,
def __init__(self, name: str = "llm_judge", model: Optional[str] = None, threshold: float = 0.5, *args, **kwargs):
super().__init__(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hardcoded NONSENSE defaults — project_id="proj" and dataset_id="ds" (part of H1).

The pre-built static factories below pass project_id="proj" and dataset_id="ds" explicitly, so they always land here. The constructed object is technically a PerformanceEvaluator against BigQuery dataset proj.ds.agent_events — which doesn't exist anywhere. Any method on the object that hits BQ will fail at runtime.

The OLD LLMAsJudge ran via google-genai directly and didn't need a BQ project at all. The shim should preserve that path: either accept project_id=None and gate BQ-bound methods with a clear error, or implement a compatibility adapter that doesn't construct a PerformanceEvaluator for the API-only judge case.

threshold=threshold,
)
return judge
def correctness(threshold: float = 0.5, model: Optional[str] = None) -> LLMAsJudge:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The static factories don't preserve the old API contract (H1).

A user with code from main:

judge = LLMAsJudge.correctness()
judge.add_criterion(name="x", prompt_template="...", score_key="x", threshold=0.5)
score: SessionScore = await judge.evaluate_session(trace_text, final_response)

…will hit:

  1. add_criterionAttributeError (this shim doesn't define it).
  2. If they skip step 1, evaluate_session(trace_text, final_response) mis-routes the args into PerformanceEvaluator.evaluate_session(session_id, golden_trajectory, ...)trace_text becomes session_id, final_response becomes golden_trajectory (wrong type). Then it tries to query BigQuery dataset proj.ds.agent_events and fails.

To honor the seamless-compat claim in the PR description, the shim needs to genuinely implement the OLD (trace_text, final_response) -> SessionScore path against google-genai, plus add_criterion. The current static factories just construct broken objects.


# LLMAsJudge is completely folded into PerformanceEvaluator in performance_evaluator.py.
# Safe alias preserved here for backward-compatibility.
from .performance_evaluator import PerformanceEvaluator as LLMAsJudge
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "decoupled system and performance modules cleanly, making system_evaluator.py pure to SystemEvaluator" claim from commit d80d5f8 is false at this line (M2).

system_evaluator.py imports from performance_evaluator.py here just to expose a backward-compat alias. That's the opposite of "pure to SystemEvaluator" — it makes system_evaluator depend on performance_evaluator.

Move this alias OUT of system_evaluator.py and INTO the shim (evaluators.py), where the rest of the backward-compat plumbing already lives. system_evaluator.py should have no LLMAsJudge symbol at all — that's the whole point of the rename.

Comment thread tests/test_pr16_fixes.py
@@ -139,125 +139,7 @@ def test_hitl_empty_results(self):


# ================================================================== #
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests removed but _merge_criterion_reports is still live (M3).

The deletions in this hunk remove TestMergeCriterionMissingCriteria (test_missing_criterion_fails_session + test_all_criteria_present_passes). Both tests imported _JudgeCriterion from evaluators.py, which is no longer exported — so the deletion makes sense as a mechanical cleanup.

But _merge_criterion_reports itself is still in client.py:2516 and called from client.py:1082 and client.py:1116. The missing-criterion-defaults-to-0.0 contract that those tests exercised is now untested. Either:

  • Port the deleted tests to use the new shapes (a public JudgeCriterion would help — see L3 in the top-level review), or
  • Remove _merge_criterion_reports if the callers no longer need the merge logic post-refactor.

@caohy1988
Copy link
Copy Markdown
Contributor

Fresh-eye follow-up: I found three additional issues worth addressing alongside the existing review.

High

  1. AggregateGrader.add_performance_grader() is non-functional with the real PerformanceEvaluator.

src/bigquery_agent_analytics/aggregate_grader.py:450-458 calls:

await evaluator.evaluate_session(trace_text=trace_text, final_response=final_response)

but PerformanceEvaluator.evaluate_session() takes session_id, not trace_text / final_response. It also returns EvaluationResult, which has eval_status, not passed. The current tests mask this by monkeypatching evaluate_session with an AsyncMock that returns SessionScore, so the test is not exercising the real public contract. This also makes the aggregate-grader example in SDK.md fail for users.

Suggested fix: either keep add_performance_grader() session-id based and plumb session_id through AggregateGrader.evaluate(...), or keep text-based grading separate as an LLMAsJudge/rubric adapter. Add one test that uses a real PerformanceEvaluator subclass/fake overriding get_session_trace() rather than replacing evaluate_session() with a different signature/return type.

Medium

  1. PerformanceEvaluator.evaluate_session() no longer computes overall_score for non-side-by-side paths.

src/bigquery_agent_analytics/performance_evaluator.py:761-766 now sets:

overall_score=scores.get("llm_judge_correctness")

That means deterministic trajectory evaluation, one-sided LLM judge, and custom-rubric-only evaluations can produce valid scores but return overall_score=None. The previous implementation computed the mean of all scores when any scores existed. This is observable with a trajectory-only result: {"trajectory_exact_match": 1.0, "step_efficiency": 1.0} but overall_score is None.

Suggested fix: restore the prior sum(scores.values()) / len(scores) behavior, or document and test a deliberate replacement.

  1. Client.evaluate(PerformanceEvaluator(...)) imports an undeclared dependency in running event loops.

src/bigquery_agent_analytics/client.py:994-998 does:

if loop.is_running():
  import nest_asyncio
  nest_asyncio.apply()

eval_results = loop.run_until_complete(evaluate_all())

nest-asyncio is not in pyproject.toml, so notebook / async-app callers can hit ModuleNotFoundError only on this path. The same file already has _run_sync() at client.py:270-287, which handles running loops via a thread executor without adding a dependency.

Suggested fix: reuse _run_sync(evaluate_all()) here instead of importing nest_asyncio, or add nest-asyncio as a declared dependency if that behavior is intentional.

Low / fold into the existing H1 compatibility fix

  1. LLMAsJudge.correctness(model=...) silently ignores the model override.

src/bigquery_agent_analytics/evaluators.py:53-54 passes llm_judge_model=model, but LLMAsJudge.__init__() forwards only its model parameter to PerformanceEvaluator, not kwargs["llm_judge_model"]. So:

LLMAsJudge.correctness(model="gemini-custom").llm_judge_model

currently resolves to the default gemini-2.5-flash. This should be covered by the backward-compat tests for the shim.

These are in addition to the existing changes-requested review; the first one is the main new blocker because it affects a newly documented public path.

@gigistark-google
Copy link
Copy Markdown
Author

Thanks for this detailed feedback Haiyuan! I appreciate it! I have addressed all of this feedback 3 new separate pull requests: #121, #122, #123. Look forward to getting your thoughts on these!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants