feat(llma): add evaluation report agent #54364
andrewm4894 wants to merge 4 commits into andy/llma-eval-reports-1-models-api
Conversation
Hey @andrewm4894! 👋
This pull request seems to contain no description. Please add useful context, rationale, and/or any other information that will help make sense of this change now and in the distant Mars-based future.
Warning: This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
This stack of pull requests is managed by Graphite. Learn more about stacking.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a7d458be2e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you:
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
if content.citations:
    refs_lines = []
    for i, c in enumerate(content.citations, 1):
        refs_lines.append(f"{i}. `{c.generation_id}` — {c.reason}")
    content.sections.append(
Enforce section cap after adding references
This appends a References section after validation has already accepted up to MAX_REPORT_SECTIONS, so a valid 6-section report with citations becomes 7 sections and violates the schema/prompt contract. That can break downstream assumptions about section bounds (for rendering or delivery formatting) specifically when the agent uses the full section budget and includes citations.
Useful? React with 👍 / 👎.
Valid -- the References append can push section count past MAX_REPORT_SECTIONS when the agent uses all 6 slots. Will fix by replacing the last agent section with References when at the cap (the agent's add_section tool already blocks at MAX, so the replaced section is the least important one), or by accounting for the References slot in validation. Pushing a fix.
Fixed in 914abb7 — when at MAX_REPORT_SECTIONS, References replaces the last agent section instead of exceeding the cap.
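As a rough sketch of what that fix could look like (the `ReportSection`/`ReportContent` stand-ins and the `append_references` name below are illustrative, not the PR's actual code):

```python
from dataclasses import dataclass, field


@dataclass
class ReportSection:  # stand-in for the PR's section model
    title: str
    content: str


@dataclass
class ReportContent:  # stand-in for EvalReportContent
    sections: list[ReportSection] = field(default_factory=list)


MAX_REPORT_SECTIONS = 6  # section budget discussed in this thread


def append_references(content: ReportContent, refs_lines: list[str]) -> None:
    """Add a References section without ever exceeding MAX_REPORT_SECTIONS."""
    references = ReportSection(title="References", content="\n".join(refs_lines))
    if len(content.sections) >= MAX_REPORT_SECTIONS:
        # At the cap: overwrite the last agent-authored section instead of appending a 7th.
        content.sections[-1] = references
    else:
        content.sections.append(references)
```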
    AND timestamp < '{ts_end}'
    {filter_clause}
ORDER BY timestamp DESC
LIMIT {limit}
Clamp sample size before interpolating LIMIT
The limit argument is not bounded before being inserted into the HogQL LIMIT, so the model can be steered (e.g., via custom report guidance) into requesting very large result sets, causing expensive queries and oversized tool responses that can exhaust context or hit activity timeouts. This is most likely to surface when prompt injection or malformed tool planning asks for a large limit.
Useful? React with 👍 / 👎.
Agreed, clamping limit before interpolation is the right call. Will add limit = min(max(1, limit), 500) to both sample_eval_results and get_top_failure_reasons. Same fix addresses the greptile comment at line 209.
Fixed in 914abb7 — both functions now clamp limit to [1, 500].
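A minimal sketch of the clamp the thread agrees on; the helper name is illustrative and the merged code in 914abb7 may inline the expression instead:

```python
def _clamp_limit(limit: int, floor: int = 1, ceiling: int = 500) -> int:
    """Bound an LLM-supplied limit before it is interpolated into a HogQL LIMIT clause."""
    return min(max(floor, limit), ceiling)


# Usage inside sample_eval_results / get_top_failure_reasons (illustrative):
#   limit = _clamp_limit(limit)   # 1_000_000 -> 500, 0 -> 1, 50 -> 50
```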
Prompt To Fix All With AI
This is a comment left during a code review.
Path: posthog/temporal/llm_analytics/eval_reports/report_agent/graph.py
Line: 58
Comment:
**Unnecessary late import — already imported at module level**
`_ch_ts` and `_execute_hogql` are imported late here to avoid a perceived circular dependency, but `graph.py` already imports from `tools` at the top level (`from posthog.temporal.llm_analytics.eval_reports.report_agent.tools import EVAL_REPORT_TOOLS`). Since no circular import actually exists, these can simply join that top-level import.
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: posthog/temporal/llm_analytics/eval_reports/report_agent/graph.py
Line: 67-96
Comment:
**Duplicate HogQL queries — OnceAndOnlyOnce**
The two queries here are character-for-character copies of the ones in `get_summary_metrics` in `tools.py` (lines 84–114). The current and previous period selects, the `countIf` expressions, and the row-parsing arithmetic are all repeated. If the evaluation schema changes (e.g. a new applicable flag), both places must be updated in sync.
Consider extracting the raw query execution and row-parsing into a shared private helper (e.g. `_fetch_period_counts(team_id, evaluation_id, ts_start, ts_end)`) that both `_compute_metrics` and `get_summary_metrics` call.
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: posthog/temporal/llm_analytics/eval_reports/report_agent/test/test_graph.py
Line: 67-90
Comment:
**Non-parameterized tests for trend direction**
`test_populated_metrics_stable_trend`, `test_populated_metrics_trend_up`, and `test_populated_metrics_trend_down` are structurally identical — they differ only in `(pass_rate, previous_pass_rate, expected_phrase)`. Per project convention, these should be a single `@parameterized.expand` test.
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: posthog/temporal/llm_analytics/eval_reports/report_agent/tools.py
Line: 206-209
Comment:
**Unbounded LLM-controlled `limit` in query tools**
The LLM can pass any integer as `limit`. Without a cap, a misbehaving or confused agent could request `limit=1_000_000` and generate a very expensive ClickHouse scan. The same applies to `get_top_failure_reasons` (line 601). A simple guard like `limit = min(max(1, limit), 500)` keeps things predictable.
How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "feat(llma): add evaluation report agent"
    with zero counts and logs the exception. The agent cannot fabricate numbers
    because this function is the sole source of truth for `content.metrics`.
    """
    from posthog.temporal.llm_analytics.eval_reports.report_agent.tools import _ch_ts, _execute_hogql
Unnecessary late import — already imported at module level
_ch_ts and _execute_hogql are imported late here to avoid a perceived circular dependency, but graph.py already imports from tools at the top level (from posthog.temporal.llm_analytics.eval_reports.report_agent.tools import EVAL_REPORT_TOOLS). Since no circular import actually exists, these can simply join that top-level import.
Prompt To Fix With AI
This is a comment left during a code review.
Path: posthog/temporal/llm_analytics/eval_reports/report_agent/graph.py
Line: 58
Comment:
**Unnecessary late import — already imported at module level**
`_ch_ts` and `_execute_hogql` are imported late here to avoid a perceived circular dependency, but `graph.py` already imports from `tools` at the top level (`from posthog.temporal.llm_analytics.eval_reports.report_agent.tools import EVAL_REPORT_TOOLS`). Since no circular import actually exists, these can simply join that top-level import.
How can I resolve this? If you propose a fix, please make it concise.
Good catch -- you're right, there's no circular import here. graph.py already imports from tools at module level. Will hoist _ch_ts and _execute_hogql into that top-level import.
Fixed in 914abb7 — hoisted to top-level imports.
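For reference, the hoisted import would look roughly like this (a sketch only; the exact import list in 914abb7 is not shown in this thread):

```python
from posthog.temporal.llm_analytics.eval_reports.report_agent.tools import (
    EVAL_REPORT_TOOLS,
    _ch_ts,
    _execute_hogql,
)
```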
current_rows = _execute_hogql(
    team_id,
    f"""
    SELECT
        countIf(properties.$ai_evaluation_result = true AND (isNull(properties.$ai_evaluation_applicable) OR properties.$ai_evaluation_applicable != false)) as pass_count,
        countIf(properties.$ai_evaluation_result = false AND (isNull(properties.$ai_evaluation_applicable) OR properties.$ai_evaluation_applicable != false)) as fail_count,
        countIf(properties.$ai_evaluation_applicable = false) as na_count,
        count() as total
    FROM events
    WHERE event = '$ai_evaluation'
        AND properties.$ai_evaluation_id = '{evaluation_id}'
        AND timestamp >= '{ts_start}'
        AND timestamp < '{ts_end}'
    """,
)

previous_rows = _execute_hogql(
    team_id,
    f"""
    SELECT
        countIf(properties.$ai_evaluation_result = true AND (isNull(properties.$ai_evaluation_applicable) OR properties.$ai_evaluation_applicable != false)) as pass_count,
        countIf(properties.$ai_evaluation_result = false AND (isNull(properties.$ai_evaluation_applicable) OR properties.$ai_evaluation_applicable != false)) as fail_count,
        count() as total
    FROM events
    WHERE event = '$ai_evaluation'
        AND properties.$ai_evaluation_id = '{evaluation_id}'
        AND timestamp >= '{ts_prev_start}'
        AND timestamp < '{ts_start}'
    """,
)
Duplicate HogQL queries — OnceAndOnlyOnce
The two queries here are character-for-character copies of the ones in get_summary_metrics in tools.py (lines 84–114). The current and previous period selects, the countIf expressions, and the row-parsing arithmetic are all repeated. If the evaluation schema changes (e.g. a new applicable flag), both places must be updated in sync.
Consider extracting the raw query execution and row-parsing into a shared private helper (e.g. _fetch_period_counts(team_id, evaluation_id, ts_start, ts_end)) that both _compute_metrics and get_summary_metrics call.
Prompt To Fix With AI
This is a comment left during a code review.
Path: posthog/temporal/llm_analytics/eval_reports/report_agent/graph.py
Line: 67-96
Comment:
**Duplicate HogQL queries — OnceAndOnlyOnce**
The two queries here are character-for-character copies of the ones in `get_summary_metrics` in `tools.py` (lines 84–114). The current and previous period selects, the `countIf` expressions, and the row-parsing arithmetic are all repeated. If the evaluation schema changes (e.g. a new applicable flag), both places must be updated in sync.
Consider extracting the raw query execution and row-parsing into a shared private helper (e.g. `_fetch_period_counts(team_id, evaluation_id, ts_start, ts_end)`) that both `_compute_metrics` and `get_summary_metrics` call.
How can I resolve this? If you propose a fix, please make it concise.
Intentional duplication for now. _compute_metrics and get_summary_metrics serve different trust boundaries: _compute_metrics produces the authoritative EvalReportMetrics dataclass (attached directly to the report, agent cannot mutate), while get_summary_metrics is an agent-facing tool that returns a JSON string the LLM reads for analysis context. They also diverge slightly (the tool returns the previous na_count, _compute_metrics doesn't).
Extracting a shared helper would couple the trusted-metrics path to the tool's return shape. If the queries diverge further (e.g., adding cost breakdowns only to the tool), the shared helper becomes a leaky abstraction. Happy to revisit if they stay identical after the stack lands.
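For context, the helper the reviewer suggested (and the author chose not to extract for now) might look something like this; the name, signature, and return shape are taken from the review comment rather than from any merged code:

```python
def _fetch_period_counts(team_id: int, evaluation_id: str, ts_start: str, ts_end: str) -> dict[str, int]:
    """Run the shared pass/fail/na count query for one time window and parse the single result row."""
    rows = _execute_hogql(
        team_id,
        f"""
        SELECT
            countIf(properties.$ai_evaluation_result = true AND (isNull(properties.$ai_evaluation_applicable) OR properties.$ai_evaluation_applicable != false)) as pass_count,
            countIf(properties.$ai_evaluation_result = false AND (isNull(properties.$ai_evaluation_applicable) OR properties.$ai_evaluation_applicable != false)) as fail_count,
            countIf(properties.$ai_evaluation_applicable = false) as na_count,
            count() as total
        FROM events
        WHERE event = '$ai_evaluation'
            AND properties.$ai_evaluation_id = '{evaluation_id}'
            AND timestamp >= '{ts_start}'
            AND timestamp < '{ts_end}'
        """,
    )
    # Assumes each row is a (pass_count, fail_count, na_count, total) tuple, as in the diff above.
    pass_count, fail_count, na_count, total = rows[0] if rows else (0, 0, 0, 0)
    return {"pass_count": pass_count, "fail_count": fail_count, "na_count": na_count, "total": total}
```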
def test_populated_metrics_stable_trend(self):
    metrics = EvalReportMetrics(
        total_runs=100,
        pass_count=80,
        fail_count=20,
        na_count=0,
        pass_rate=80.0,
        previous_pass_rate=80.0,
    )
    content = _fallback_content("Helpfulness", metrics, "validation failed")
    self.assertEqual(len(content.sections), 1)
    body = content.sections[0].content
    self.assertIn("80.0%", body)
    self.assertIn("stable", body)

def test_populated_metrics_trend_up(self):
    metrics = EvalReportMetrics(total_runs=10, pass_count=9, fail_count=1, pass_rate=90.0, previous_pass_rate=70.0)
    content = _fallback_content("X", metrics, "why")
    self.assertIn("up from", content.sections[0].content)

def test_populated_metrics_trend_down(self):
    metrics = EvalReportMetrics(total_runs=10, pass_count=5, fail_count=5, pass_rate=50.0, previous_pass_rate=80.0)
    content = _fallback_content("X", metrics, "why")
    self.assertIn("down from", content.sections[0].content)
Non-parameterized tests for trend direction
test_populated_metrics_stable_trend, test_populated_metrics_trend_up, and test_populated_metrics_trend_down are structurally identical — they differ only in (pass_rate, previous_pass_rate, expected_phrase). Per project convention, these should be a single @parameterized.expand test.
Prompt To Fix With AI
This is a comment left during a code review.
Path: posthog/temporal/llm_analytics/eval_reports/report_agent/test/test_graph.py
Line: 67-90
Comment:
**Non-parameterized tests for trend direction**
`test_populated_metrics_stable_trend`, `test_populated_metrics_trend_up`, and `test_populated_metrics_trend_down` are structurally identical — they differ only in `(pass_rate, previous_pass_rate, expected_phrase)`. Per project convention, these should be a single `@parameterized.expand` test.
How can I resolve this? If you propose a fix, please make it concise.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Good call -- this is a textbook case for @parameterized.expand. Will consolidate the three trend-direction tests into a single parameterized test.
Fixed in 914abb7 — consolidated into @parameterized.expand.
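A sketch of what the consolidated test could look like, assuming the `parameterized` package; `EvalReportMetrics` and `_fallback_content` come from the PR's modules, while the class name and tuples below are illustrative and the actual test in 914abb7 may differ:

```python
import unittest

from parameterized import parameterized


class TestFallbackContentTrend(unittest.TestCase):  # hypothetical class name for this sketch
    @parameterized.expand(
        [
            # (case name, pass_rate, previous_pass_rate, expected phrase in the fallback body)
            ("stable", 80.0, 80.0, "stable"),
            ("up", 90.0, 70.0, "up from"),
            ("down", 50.0, 80.0, "down from"),
        ]
    )
    def test_populated_metrics_trend(self, _name, pass_rate, previous_pass_rate, expected_phrase):
        # Per the original tests, the trend phrase depends only on the two rates.
        metrics = EvalReportMetrics(
            total_runs=10, pass_count=5, fail_count=5, pass_rate=pass_rate, previous_pass_rate=previous_pass_rate
        )
        content = _fallback_content("X", metrics, "why")
        self.assertIn(expected_phrase, content.sections[0].content)
```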
def sample_eval_results(
    state: Annotated[dict, InjectedState],
    filter: str = "all",
    limit: int = 50,
Unbounded LLM-controlled limit in query tools
The LLM can pass any integer as limit. Without a cap, a misbehaving or confused agent could request limit=1_000_000 and generate a very expensive ClickHouse scan. The same applies to get_top_failure_reasons (line 601). A simple guard like limit = min(max(1, limit), 500) keeps things predictable.
Prompt To Fix With AI
This is a comment left during a code review.
Path: posthog/temporal/llm_analytics/eval_reports/report_agent/tools.py
Line: 206-209
Comment:
**Unbounded LLM-controlled `limit` in query tools**
The LLM can pass any integer as `limit`. Without a cap, a misbehaving or confused agent could request `limit=1_000_000` and generate a very expensive ClickHouse scan. The same applies to `get_top_failure_reasons` (line 601). A simple guard like `limit = min(max(1, limit), 500)` keeps things predictable.
How can I resolve this? If you propose a fix, please make it concise.
Agreed, same issue as the codex comment above. Will add limit = min(max(1, limit), 500) guard to both sample_eval_results and get_top_failure_reasons.
a7d458b to 7f1ec1f (Compare)
Add LangGraph-based agent for generating evaluation reports. Includes 11 tools for querying ClickHouse metrics/examples/trends and producing structured output, a system prompt with pass/fail analysis guidance, and typed schema for report content (sections, metrics, citations). Pure logic with no side effects — wired to Temporal in a follow-up PR.
7b3cefa to 628c96b (Compare)
7f1ec1f to b44acb4 (Compare)
- Cap References section: replace last agent section when at MAX_REPORT_SECTIONS instead of exceeding the limit
- Clamp LLM-controlled limit to [1, 500] in sample_eval_results and get_top_failure_reasons to prevent expensive queries
- Hoist _ch_ts and _execute_hogql to top-level imports in graph.py
- Parameterize trend direction tests with @parameterized.expand
New compact overview tool returns every eval result as a condensed row (verdict | generation_id | truncated reasoning). Gives the agent a full picture of all results early in its workflow before drilling into specific examples. Added to suggested workflow as step 2.
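A rough illustration of the condensed row format described above; only the `verdict | generation_id | truncated reasoning` shape comes from the commit message, while the function name and truncation length are assumptions:

```python
def _format_overview_row(verdict: str, generation_id: str, reasoning: str, max_len: int = 80) -> str:
    """Render one eval result as a single compact line for the overview tool."""
    truncated = reasoning if len(reasoning) <= max_len else reasoning[: max_len - 1] + "…"
    return f"{verdict} | {generation_id} | {truncated}"
```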
When there are more than 500 eval results in the period, return a random sample of 500 instead of all results to avoid blowing up the LLM context window. Header indicates total count and that results are sampled.
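A minimal sketch of that sampling behaviour using the standard library; the constant name, function name, and header wording are assumptions:

```python
import random

OVERVIEW_SAMPLE_CAP = 500  # assumed name for the 500-row cap described above


def sample_overview_rows(rows: list[str]) -> tuple[str, list[str]]:
    """Return a header describing the result set plus at most OVERVIEW_SAMPLE_CAP rows."""
    total = len(rows)
    if total <= OVERVIEW_SAMPLE_CAP:
        return f"{total} results (all shown)", rows
    sampled = random.sample(rows, OVERVIEW_SAMPLE_CAP)
    return f"{total} results total; showing a random sample of {OVERVIEW_SAMPLE_CAP}", sampled
```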

Problem
The evaluation report system needs an LLM agent that can analyze evaluation data and produce structured reports with metrics, examples, and citations.
Part 2 of 5 in a stacked PR series. Depends on #54363.
Stack:
Changes
- EvalReportContent with sections, metrics, citations
- EvalReportAgentState

How did you test this code?
No
Docs update
skip-inkeep-docs
🤖 LLM context
Agent-assisted PR stack creation from a single reference branch.