
Unify One-Sided & Side-by-Side Performance Metrics in the PerformanceEvaluator. #123

Open

gigistark-google wants to merge 3 commits into GoogleCloudPlatform:main from gigistark-google:pr3_refactor

Conversation

@gigistark-google

This change isolates system metrics in system_evaluator and performance metrics in performance_evaluator.

Note that in order to pass pyink --config pyproject.toml --check, some additional files were reformatted as part of this change.

Rename CodeEvaluator to SystemEvaluator to align with its focus on system-level metrics. A CodeEvaluator alias is kept in evaluators.py for backward-compatibility.
…Evaluator, but don't add new metrics to the MultiTrialPerformanceEvaluator yet

Purged obsolete criteria-list LLMAsJudge implementations, replacing them natively with PerformanceEvaluator for folded Tone, Faithfulness, Correctness, and Efficiency evaluations.

- Decoupled the system and performance modules cleanly, so system_evaluator.py contains only SystemEvaluator.
- Retained a backward-compatible LLMAsJudge subclass in evaluators.py, exposing the static factories for correctness, hallucination, and sentiment.
- Removed the criteria-list BQML execution code from client.py and deleted the legacy _criteria and _JudgeCriterion list validations throughout the test suites.
- Fixed Jupyter event-loop conflicts by handling an already-running asyncio event loop inside Client._evaluate_performance (see the sketch below).
- Refactored strip_markdown_fences in utils.py so it cleanly drops any trailing prose after the closing backticks of a fenced block.
- Verified that all 1,997 collected unit tests pass.
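
A minimal sketch of the event-loop handling described above, assuming the fix falls back to a worker thread when a loop is already running (the actual Client._evaluate_performance wiring may differ):

    import asyncio
    import concurrent.futures

    def _run_sync(coro):
        """Run coro to completion whether or not an event loop is running."""
        try:
            asyncio.get_running_loop()
        except RuntimeError:
            return asyncio.run(coro)  # no running loop: plain scripts, pytest
        # A loop is already running (e.g. Jupyter): execute the coroutine on
        # a fresh loop in a worker thread and block this thread on the result.
        with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
            return pool.submit(asyncio.run, coro).result()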

Contributor

@caohy1988 left a comment


Review of PR #123 head 081a92d

Third PR in the stack (#121 → #122 → #123). Verified locally on a fresh checkout. The split-and-shim pattern is well executed — class identity is preserved across the backward-compat shims, the full test suite passes (1772 passed, 5 skipped), and lint is clean. But the PR also performs silent public-API removals that the CHANGELOG doesn't fully document, and that's the blocker.

What was verified clean

  • Test suite green: pytest tests/ -q → 1772 passed, 5 skipped, 17 warnings.

  • Lint clean: pyink --check src/ tests/ and isort --check-only src/ tests/ both pass.

  • Class identity preserved across shims:

    from bigquery_agent_analytics import CodeEvaluator, SystemEvaluator
    assert CodeEvaluator is SystemEvaluator                              # True
    
    from bigquery_agent_analytics.evaluators import CodeEvaluator as A
    from bigquery_agent_analytics.system_evaluator import SystemEvaluator as B
    assert A is B                                                         # True
    
    from bigquery_agent_analytics.trace_evaluator import BigQueryTraceEvaluator
    from bigquery_agent_analytics.performance_evaluator import PerformanceEvaluator
    assert BigQueryTraceEvaluator is PerformanceEvaluator                # True
    
    from bigquery_agent_analytics.grader_pipeline import GraderPipeline
    from bigquery_agent_analytics.aggregate_grader import AggregateGrader
    assert GraderPipeline is AggregateGrader                              # True

    The four shim files (evaluators.py, trace_evaluator.py, multi_trial.py, grader_pipeline.py) are now 17–239 lines each, re-exporting from the canonical new modules. Standard Python import semantics give a single class object, accessible from both old and new paths. A minimal sketch of the pattern follows this list.

  • The split is clean: system_evaluator.py contains only SystemEvaluator; performance_evaluator.py houses PerformanceEvaluator (with both one-sided and side-by-side judge prompts unified at lines 824-874); aggregate_grader.py houses AggregateGrader; multi_trial_performance_evaluator.py houses the multi-trial harness.
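
A minimal sketch of the shim pattern verified above, using evaluators.py as the example (the real shim re-exports more names than shown):

    """Backward-compatibility module mapping for the evaluator split."""
    from bigquery_agent_analytics.system_evaluator import SystemEvaluator

    # Re-export under the old name: one class object, so
    # `CodeEvaluator is SystemEvaluator` holds and isinstance checks
    # written against either name keep working.
    CodeEvaluator = SystemEvaluator

    __all__ = ["CodeEvaluator", "SystemEvaluator"]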

Blockers

B1 — Public API removed silently; CHANGELOG migration story is incomplete

The PR deletes four documented public symbols from evaluators.py:

  • render_ai_generate_judge_query
  • AI_GENERATE_JUDGE_BATCH_QUERY
  • LLM_JUDGE_BATCH_QUERY
  • split_judge_prompt_template

These were the SQL-batch path for LLM-as-judge — submitting AI.GENERATE via BigQuery rather than per-session API calls. The previous release, [0.2.3] (2026-04-27), was specifically a fix to that path:

Fixed

  • LLM-as-Judge AI.GENERATE path now executes against current BigQuery. Earlier versions emitted a table-valued [...]

Eight days later this PR removes the feature wholesale. The Unreleased CHANGELOG entry covers:

Purged obsolete criteria-list LLMAsJudge implementations, replacing them natively with PerformanceEvaluator [...]
Migration Story: Users should no longer construct _JudgeCriterion objects directly. Instead, use LLMAsJudge.add_criterion(...).

But this only addresses the _JudgeCriterion constructor change. It does not mention:

  1. render_ai_generate_judge_query is gone (grep -rn "render_ai_generate_judge_query" src/ → 0 hits).
  2. AI_GENERATE_JUDGE_BATCH_QUERY is gone.
  3. LLM_JUDGE_BATCH_QUERY is gone.
  4. split_judge_prompt_template is gone.
  5. Client.evaluate(LLMAsJudge_instance) now raises TypeError; client.py:864-912 shows the new dispatch only accepts SystemEvaluator or PerformanceEvaluator. Anyone who did client.evaluate(evaluator=LLMAsJudge.correctness()) (a documented pattern in the previous SDK.md and shipped tests) will now get a runtime error.
  6. The live integration test that guarded the BQ-side judge SQL is deleted (tests/test_ai_generate_judge_live.py, 203 lines).

Three asks:

  • Restore the public symbols as backward-compat shims in evaluators.py, even if they only emit a DeprecationWarning. They were public exports in the previous SDK.md. (One possible shape is sketched after this list.)

  • Extend the CHANGELOG migration story to spell out:

    • That Client.evaluate(LLMAsJudge) no longer works and how to migrate (presumably PerformanceEvaluator + judge factory).
    • That the AI.GENERATE batch path is gone; per-session API calls are the new path.
    • That callers depending on render_ai_generate_judge_query / AI_GENERATE_JUDGE_BATCH_QUERY need to migrate.
  • Restore live integration coverage for the new per-session API path. The deleted file's docstring is explicit about why it existed:

    "These tests submit the exact SQL produced by render_ai_generate_judge_query to a real BigQuery project [...] Mocks alone won't catch that class of bug."

    The new _evaluate_performance path in client.py uses client.aio.models.generate_content (in performance_evaluator.py:850) — that's an entirely different integration contract, with no live test guarding it. The 0.2.3 fix's lesson applies in reverse: now that the SDK calls Vertex AI directly per-session, a mock-only test surface won't catch when (e.g.) the genai client signature changes or the response schema drifts. A skip-by-default live test for the new path would restore the safety net.
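
One possible shape for the first ask, as a minimal sketch: keep the four names importable from evaluators.py via a PEP 562 module-level __getattr__ and warn on access. The warning text and stub behavior here are illustrative only, not a drop-in patch:

    # In evaluators.py -- illustrative only, not the repo's actual code.
    import warnings

    _REMOVED = (
        "render_ai_generate_judge_query",
        "AI_GENERATE_JUDGE_BATCH_QUERY",
        "LLM_JUDGE_BATCH_QUERY",
        "split_judge_prompt_template",
    )

    def __getattr__(name):
        # PEP 562 module-level hook: fires on `from evaluators import X`
        # for any name not defined in this module.
        if name in _REMOVED:
            warnings.warn(
                f"{name} was removed along with the AI.GENERATE batch path; "
                "migrate to PerformanceEvaluator's per-session judge calls.",
                DeprecationWarning,
                stacklevel=2,
            )
            def _stub(*args, **kwargs):
                raise NotImplementedError(
                    f"{name}: the AI.GENERATE batch path no longer exists"
                )
            return _stub
        raise AttributeError(f"module {__name__!r} has no attribute {name!r}")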

B2 — Stack inherits 24-commits-behind stale base

Same as #121 / #122. After the rebase, several files in PR 123's diff (notably client.py, dashboard/app.py, examples/*.ipynb) will need conflict resolution. The client.py change is particularly large (97+ / 406-) — worth re-checking after rebase that no surface intended to stay was lost during conflict resolution.

Concerns

C1 — client.py evaluate() docstring claim is inconsistent with implementation

client.py:875-876 reads:

"PerformanceEvaluator metrics use BQML's ML.GENERATE_TEXT for zero-ETL evaluation."

But the new _evaluate_performance dispatch (and PerformanceEvaluator.judge() at performance_evaluator.py:824-874) calls client.aio.models.generate_content directly through google-genai. There's no ML.GENERATE_TEXT call in the path. The docstring describes the deleted path, not the new one. Fix the docstring to match the actual implementation ("uses the Vertex AI Generative API per session") or, if BQML support is still planned, mark it as a TODO.

C2 — dashboard/app.py 142+/90- — what changed?

Not strictly in the evaluator/grader/multi-trial scope. Worth a sentence in the PR body explaining whether this is a related cleanup or just pyink reformatting (the PR body says "some additional files were reformatted as part of this change"). If it's purely formatting, the diff stat should be much smaller than 142+/90-. Suggest a glance at this file's actual changes to confirm.

C3 — Notebook diffs are massive

Three notebooks have major reformats: e2e_notebook_demo.ipynb (238/210), nba_agent_trace_analysis_notebook.ipynb (115/99), context_graph_adcp_demo.ipynb (1041/1019). These are notebook JSON, so the line counts are misleading — but if the PR body intent is "isolate system metrics in system_evaluator and performance metrics in performance_evaluator," the notebooks should change only insofar as their imports point at the new modules. A 1041-line change in context_graph_adcp_demo.ipynb is too big for that. Worth a confirmation: did the cells re-execute and serialize new outputs into the notebook diff? If yes, those changes should be a separate PR or stripped (the standard repo practice is nbstripout or equivalent for executed-cell noise).

C4 — examples/agent_improvement_cycle/agent/tools.py 35+/24- and improver_agent.py 64+/76-

Substantive logic changes in an example agent. Not in the stated PR scope ("isolate system / performance metrics"). What changed and why? If it's part of the rename refactor, OK. If it's an unrelated improvement, split it out.

What's already correct (acknowledge)

  • ✅ The shim pattern in evaluators.py, trace_evaluator.py, multi_trial.py, grader_pipeline.py is textbook. Re-exports preserve class identity. Hint to future readers via the module docstring ("Backward-compatibility module mapping for ...") is helpful.
  • ✅ system_evaluator.py and performance_evaluator.py are clean splits — no cross-module circular dependencies, no shared mutable state.
  • ✅ Lint clean across the entire src/ and tests/ (the PR description's caveat about "some additional files were reformatted as part of this change" matches the observable lint state).
  • ✅ Test suite green: 1772 passed.
  • ✅ utils.py (new file, 110 lines) hosts _parse_json_from_text — the right place for it after the evaluators split (a sketch of what such a helper typically does follows this list).
  • ✅ Unified one-sided / side-by-side judge prompts in PerformanceEvaluator (lines 824-874): the PR title's main claim is delivered.
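
For context on the utils.py point, a hypothetical sketch of the kind of helper _parse_json_from_text names; the actual implementation may differ:

    import json
    import re

    def _parse_json_from_text(text: str) -> dict:
        """Extract the first JSON object from LLM output that may wrap it
        in markdown fences or surrounding prose."""
        fenced = re.search(r"```(?:json)?\s*(\{.*\})\s*```", text, re.DOTALL)
        if fenced:
            return json.loads(fenced.group(1))
        # Fall back to the outermost braces in the raw text.
        start, end = text.find("{"), text.rfind("}")
        if start == -1 or end <= start:
            raise ValueError("no JSON object found in text")
        return json.loads(text[start : end + 1])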

Verdict

Request changes primarily on B1 (silent public API removal + missing CHANGELOG coverage + deleted live integration test with no replacement). The architecture of the refactor is sound; the unstated migration cost is what blocks merge.

B2 (rebase) and C1-C4 are normal cleanup. Once B1's symbols are either restored as deprecated shims or the CHANGELOG honestly states they're removed (with migration paths spelled out), this PR is mergeable.

@caohy1988
Contributor

Fresh follow-up after the existing REQUEST_CHANGES review:

High — bq-agent-sdk evaluate --evaluator=llm-judge is now broken by the same LLMAsJudge / Client.evaluate mismatch.

The review already calls out that Client.evaluate(LLMAsJudge_instance) now raises TypeError, but there is a concrete CLI path still constructing exactly that unsupported object:

  • src/bigquery_agent_analytics/cli.py maps correctness, hallucination, and sentiment to LLMAsJudge.*() in _LLM_JUDGES.
  • The command then calls client.evaluate(evaluator=ev, ...) unconditionally.
  • Client.evaluate() now accepts only SystemEvaluator | PerformanceEvaluator, so a real CLI invocation with --evaluator=llm-judge --criterion=correctness fails with Unsupported evaluator type: <class '...LLMAsJudge'>.

The current CLI test misses this because it mocks client.evaluate and only asserts that the constructed object is an LLMAsJudge; it never exercises the real Client.evaluate type gate.

Fix options:

  1. Keep LLMAsJudge supported in Client.evaluate() as a backward-compatible shim, or
  2. Change the CLI to build the new supported PerformanceEvaluator path for LLM judging, and update tests to exercise the real type compatibility instead of only the mock call argument (a sketch of such a test follows).
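
A minimal sketch of the test side of option 2, assuming _LLM_JUDGES maps criterion names to zero-argument factories (exact signatures may differ):

    from bigquery_agent_analytics.cli import _LLM_JUDGES
    from bigquery_agent_analytics.performance_evaluator import PerformanceEvaluator
    from bigquery_agent_analytics.system_evaluator import SystemEvaluator

    def test_cli_evaluators_pass_client_type_gate():
        # Exercise the real isinstance gate described at client.py:864-912
        # instead of mocking client.evaluate away.
        for name, factory in _LLM_JUDGES.items():
            ev = factory()
            assert isinstance(ev, (SystemEvaluator, PerformanceEvaluator)), (
                f"--evaluator=llm-judge --criterion={name} would raise "
                "'Unsupported evaluator type' in Client.evaluate()"
            )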

Until this is fixed, this PR silently regresses a documented/user-facing CLI evaluation mode, not just the Python API.

