feat(llma): add evaluation report agent#54364

Open
andrewm4894 wants to merge 4 commits into andy/llma-eval-reports-1-models-api from andy/llma-eval-reports-2-report-agent
Conversation


@andrewm4894 commented Apr 13, 2026

Problem

The evaluation report system needs an LLM agent that can analyze evaluation data and produce structured reports with metrics, examples, and citations.

Part 2 of 5 in a stacked PR series. Depends on #54363.

Stack:

  1. Models + API (feat(llma): add evaluation report models and API #54363)
  2. Report Agent (this PR)
  3. Delivery + Workflows (feat(llma): add evaluation report delivery and temporal workflows #54365)
  4. Frontend (feat(llma): add evaluation reports frontend #54366)
  5. Wiring + Generated (feat(llma): wire up auto-create hook, generated types, and MCP tools #54367)

Changes

  • Add LangGraph-based report agent with 11 tools for ClickHouse queries and structured output
  • System prompt with pass/fail analysis guidance
  • Typed schema: EvalReportContent with sections, metrics, citations
  • Agent state management via EvalReportAgentState
  • Constants and types for workflow integration
  • Pure logic with zero side effects — wired to Temporal in PR 3

How did you test this code?

  • Unit tests for agent graph construction and tool binding
  • Schema validation tests for report content structure
  • Tool unit tests for query building and output formatting
  • All tests pass locally

No

Docs update

skip-inkeep-docs

🤖 LLM context

Agent-assisted PR stack creation from a single reference branch.

@github-actions (Contributor)

Hey @andrewm4894! 👋
This pull request seems to contain no description. Please add useful context, rationale, and/or any other information that will help make sense of this change now and in the distant Mars-based future.


andrewm4894 commented Apr 13, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.

This stack of pull requests is managed by Graphite. Learn more about stacking.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a7d458be2e


Comment on lines +297 to +301
if content.citations:
    refs_lines = []
    for i, c in enumerate(content.citations, 1):
        refs_lines.append(f"{i}. `{c.generation_id}` — {c.reason}")
    content.sections.append(

P2: Enforce section cap after adding references

This appends a References section after validation has already accepted up to MAX_REPORT_SECTIONS, so a valid 6-section report with citations becomes 7 sections and violates the schema/prompt contract. That can break downstream assumptions about section bounds (for rendering or delivery formatting) specifically when the agent uses the full section budget and includes citations.


andrewm4894 (Member Author):

Valid -- the References append can push section count past MAX_REPORT_SECTIONS when the agent uses all 6 slots. Will fix by replacing the last agent section with References when at the cap (the agent's add_section tool already blocks at MAX, so the replaced section is the least important one), or by accounting for the References slot in validation. Pushing a fix.
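The replace-at-cap approach can be sketched as follows. This is an illustrative sketch, not the actual diff — the function name, the list-of-sections representation, and the cap value are assumptions:

```python
MAX_REPORT_SECTIONS = 6  # assumed cap from the schema/prompt contract


def append_references(sections: list, references_section) -> list:
    """Add the References section without exceeding the section cap.

    When the agent has already used the full section budget, replace the
    last agent-authored section (the add_section tool blocks at MAX, so the
    final slot holds the least important section) instead of growing past
    MAX_REPORT_SECTIONS.
    """
    if len(sections) >= MAX_REPORT_SECTIONS:
        sections[-1] = references_section
    else:
        sections.append(references_section)
    return sections
```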

andrewm4894 (Member Author):

Fixed in 914abb7 — when at MAX_REPORT_SECTIONS, References replaces the last agent section instead of exceeding the cap.

AND timestamp < '{ts_end}'
{filter_clause}
ORDER BY timestamp DESC
LIMIT {limit}

P2: Clamp sample size before interpolating LIMIT

The limit argument is not bounded before being inserted into the HogQL LIMIT, so the model can be steered (e.g., via custom report guidance) into requesting very large result sets, causing expensive queries and oversized tool responses that can exhaust context or hit activity timeouts. This is most likely to surface when prompt injection or malformed tool planning asks for a large limit.


andrewm4894 (Member Author):

Agreed, clamping limit before interpolation is the right call. Will add limit = min(max(1, limit), 500) to both sample_eval_results and get_top_failure_reasons. Same fix addresses the greptile comment at line 209.
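The guard itself is a one-liner; sketched here as a shared helper (the helper name is illustrative — the actual fix inlines the clamp in each tool):

```python
MAX_TOOL_LIMIT = 500  # cap agreed in the review discussion


def clamp_limit(limit: int, cap: int = MAX_TOOL_LIMIT) -> int:
    """Bound an LLM-supplied limit to [1, cap] before it reaches a query.

    Applied at the top of sample_eval_results and get_top_failure_reasons,
    so a steered or confused agent cannot trigger an expensive scan or an
    oversized tool response.
    """
    return min(max(1, limit), cap)
```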

andrewm4894 (Member Author):

Fixed in 914abb7 — both functions now clamp limit to [1, 500].

@greptile-apps

greptile-apps bot commented Apr 13, 2026


    with zero counts and logs the exception. The agent cannot fabricate numbers
    because this function is the sole source of truth for `content.metrics`.
    """
    from posthog.temporal.llm_analytics.eval_reports.report_agent.tools import _ch_ts, _execute_hogql
greptile-apps (Contributor):

P2 Unnecessary late import — already imported at module level

_ch_ts and _execute_hogql are imported late here to avoid a perceived circular dependency, but graph.py already imports from tools at the top level (from posthog.temporal.llm_analytics.eval_reports.report_agent.tools import EVAL_REPORT_TOOLS). Since no circular import actually exists, these can simply join that top-level import.


andrewm4894 (Member Author):

Good catch -- you're right, there's no circular import here. graph.py already imports from tools at module level. Will hoist _ch_ts and _execute_hogql into that top-level import.

andrewm4894 (Member Author):

Fixed in 914abb7 — hoisted to top-level imports.

Comment on lines +67 to +96
current_rows = _execute_hogql(
    team_id,
    f"""
    SELECT
        countIf(properties.$ai_evaluation_result = true AND (isNull(properties.$ai_evaluation_applicable) OR properties.$ai_evaluation_applicable != false)) as pass_count,
        countIf(properties.$ai_evaluation_result = false AND (isNull(properties.$ai_evaluation_applicable) OR properties.$ai_evaluation_applicable != false)) as fail_count,
        countIf(properties.$ai_evaluation_applicable = false) as na_count,
        count() as total
    FROM events
    WHERE event = '$ai_evaluation'
        AND properties.$ai_evaluation_id = '{evaluation_id}'
        AND timestamp >= '{ts_start}'
        AND timestamp < '{ts_end}'
    """,
)

previous_rows = _execute_hogql(
    team_id,
    f"""
    SELECT
        countIf(properties.$ai_evaluation_result = true AND (isNull(properties.$ai_evaluation_applicable) OR properties.$ai_evaluation_applicable != false)) as pass_count,
        countIf(properties.$ai_evaluation_result = false AND (isNull(properties.$ai_evaluation_applicable) OR properties.$ai_evaluation_applicable != false)) as fail_count,
        count() as total
    FROM events
    WHERE event = '$ai_evaluation'
        AND properties.$ai_evaluation_id = '{evaluation_id}'
        AND timestamp >= '{ts_prev_start}'
        AND timestamp < '{ts_start}'
    """,
)
greptile-apps (Contributor):

P2 Duplicate HogQL queries — OnceAndOnlyOnce

The two queries here are character-for-character copies of the ones in get_summary_metrics in tools.py (lines 84–114). The current and previous period selects, the countIf expressions, and the row-parsing arithmetic are all repeated. If the evaluation schema changes (e.g. a new applicable flag), both places must be updated in sync.

Consider extracting the raw query execution and row-parsing into a shared private helper (e.g. _fetch_period_counts(team_id, evaluation_id, ts_start, ts_end)) that both _compute_metrics and get_summary_metrics call.


andrewm4894 (Member Author):

Intentional duplication for now. _compute_metrics and get_summary_metrics serve different trust boundaries: _compute_metrics produces the authoritative EvalReportMetrics dataclass (attached directly to the report, agent cannot mutate), while get_summary_metrics is an agent-facing tool that returns a JSON string the LLM reads for analysis context. They also diverge slightly (the tool returns the previous na_count, _compute_metrics doesn't).

Extracting a shared helper would couple the trusted-metrics path to the tool's return shape. If the queries diverge further (e.g., adding cost breakdowns only to the tool), the shared helper becomes a leaky abstraction. Happy to revisit if they stay identical after the stack lands.
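If the two paths do converge later, one lighter-weight extraction would share only the query text rather than the fetch-and-parse pipeline — a sketch, with the function name and `include_na` flag being hypothetical:

```python
def period_counts_query(evaluation_id: str, ts_start: str, ts_end: str, include_na: bool = True) -> str:
    """Build the pass/fail-count HogQL used by both the trusted metrics path
    and the agent-facing tool; each caller keeps its own row parsing, so the
    trusted path is not coupled to the tool's return shape."""
    na_line = "countIf(properties.$ai_evaluation_applicable = false) as na_count," if include_na else ""
    return f"""
    SELECT
        countIf(properties.$ai_evaluation_result = true AND (isNull(properties.$ai_evaluation_applicable) OR properties.$ai_evaluation_applicable != false)) as pass_count,
        countIf(properties.$ai_evaluation_result = false AND (isNull(properties.$ai_evaluation_applicable) OR properties.$ai_evaluation_applicable != false)) as fail_count,
        {na_line}
        count() as total
    FROM events
    WHERE event = '$ai_evaluation'
        AND properties.$ai_evaluation_id = '{evaluation_id}'
        AND timestamp >= '{ts_start}'
        AND timestamp < '{ts_end}'
    """
```

A single source for the `countIf` expressions means a schema change (e.g. a new applicable flag) lands in one place, while divergence like the previous-period `na_count` stays a caller-side flag.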

Comment on lines +67 to +90
def test_populated_metrics_stable_trend(self):
    metrics = EvalReportMetrics(
        total_runs=100,
        pass_count=80,
        fail_count=20,
        na_count=0,
        pass_rate=80.0,
        previous_pass_rate=80.0,
    )
    content = _fallback_content("Helpfulness", metrics, "validation failed")
    self.assertEqual(len(content.sections), 1)
    body = content.sections[0].content
    self.assertIn("80.0%", body)
    self.assertIn("stable", body)

def test_populated_metrics_trend_up(self):
    metrics = EvalReportMetrics(total_runs=10, pass_count=9, fail_count=1, pass_rate=90.0, previous_pass_rate=70.0)
    content = _fallback_content("X", metrics, "why")
    self.assertIn("up from", content.sections[0].content)

def test_populated_metrics_trend_down(self):
    metrics = EvalReportMetrics(total_runs=10, pass_count=5, fail_count=5, pass_rate=50.0, previous_pass_rate=80.0)
    content = _fallback_content("X", metrics, "why")
    self.assertIn("down from", content.sections[0].content)
greptile-apps (Contributor):

P2 Non-parameterized tests for trend direction

test_populated_metrics_stable_trend, test_populated_metrics_trend_up, and test_populated_metrics_trend_down are structurally identical — they differ only in (pass_rate, previous_pass_rate, expected_phrase). Per project convention, these should be a single @parameterized.expand test.


andrewm4894 (Member Author):

Good call -- this is a textbook case for @parameterized.expand. Will consolidate the three trend-direction tests into a single parameterized test.
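The consolidated shape is table-driven: one parameter tuple per trend direction. Sketched below with a stand-in trend function and stdlib `subTest` so it runs standalone — the real consolidation uses `@parameterized.expand` over the same tuples against `_fallback_content`, and the stand-in's thresholds are assumed, not taken from the diff:

```python
import unittest


def trend_phrase(pass_rate: float, previous_pass_rate: float) -> str:
    # Stand-in for the trend wording inside _fallback_content (assumed logic).
    if pass_rate > previous_pass_rate:
        return "up from"
    if pass_rate < previous_pass_rate:
        return "down from"
    return "stable"


# (name, pass_rate, previous_pass_rate, expected_phrase)
TREND_CASES = [
    ("stable", 80.0, 80.0, "stable"),
    ("trend_up", 90.0, 70.0, "up from"),
    ("trend_down", 50.0, 80.0, "down from"),
]


class TestTrendDirection(unittest.TestCase):
    def test_trend_direction(self):
        # One table-driven test replaces the three copy-pasted methods.
        for name, pass_rate, previous, expected in TREND_CASES:
            with self.subTest(name):
                self.assertIn(expected, trend_phrase(pass_rate, previous))
```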

andrewm4894 (Member Author):

Fixed in 914abb7 — consolidated into @parameterized.expand.

Comment on lines +206 to +209
def sample_eval_results(
    state: Annotated[dict, InjectedState],
    filter: str = "all",
    limit: int = 50,
greptile-apps (Contributor):

P2 Unbounded LLM-controlled limit in query tools

The LLM can pass any integer as limit. Without a cap, a misbehaving or confused agent could request limit=1_000_000 and generate a very expensive ClickHouse scan. The same applies to get_top_failure_reasons (line 601). A simple guard like limit = min(max(1, limit), 500) keeps things predictable.


andrewm4894 (Member Author):

Agreed, same issue as the codex comment above. Will add limit = min(max(1, limit), 500) guard to both sample_eval_results and get_top_failure_reasons.

andrewm4894 (Member Author):

Fixed in 914abb7 — same clamp as above.

@andrewm4894 force-pushed the andy/llma-eval-reports-2-report-agent branch from a7d458b to 7f1ec1f on April 13, 2026 at 22:27
Add LangGraph-based agent for generating evaluation reports. Includes
11 tools for querying ClickHouse metrics/examples/trends and producing
structured output, a system prompt with pass/fail analysis guidance,
and typed schema for report content (sections, metrics, citations).
Pure logic with no side effects — wired to Temporal in a follow-up PR.
@andrewm4894 force-pushed the andy/llma-eval-reports-1-models-api branch from 7b3cefa to 628c96b on April 13, 2026 at 22:34
@andrewm4894 force-pushed the andy/llma-eval-reports-2-report-agent branch from 7f1ec1f to b44acb4 on April 13, 2026 at 22:34
- Cap References section: replace last agent section when at
  MAX_REPORT_SECTIONS instead of exceeding the limit
- Clamp LLM-controlled limit to [1, 500] in sample_eval_results
  and get_top_failure_reasons to prevent expensive queries
- Hoist _ch_ts and _execute_hogql to top-level imports in graph.py
- Parameterize trend direction tests with @parameterized.expand
New compact overview tool returns every eval result as a condensed
row (verdict | generation_id | truncated reasoning). Gives the agent
a full picture of all results early in its workflow before drilling
into specific examples. Added to suggested workflow as step 2.
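A plausible shape for the condensed row — the function name, verdict labels, and truncation length here are assumptions, not taken from the commit:

```python
def format_overview_row(verdict, generation_id: str, reasoning: str, max_reason_len: int = 80) -> str:
    """Render one eval result as `verdict | generation_id | truncated reasoning`."""
    if verdict is None:
        label = "N/A"  # evaluation marked not applicable
    elif verdict:
        label = "PASS"
    else:
        label = "FAIL"
    # Collapse newlines and truncate so every result fits on one compact line.
    reason = reasoning.strip().replace("\n", " ")
    if len(reason) > max_reason_len:
        reason = reason[: max_reason_len - 1] + "…"
    return f"{label} | {generation_id} | {reason}"
```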
When there are more than 500 eval results in the period, return a
random sample of 500 instead of all results to avoid blowing up the
LLM context window. Header indicates total count and that results
are sampled.
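The sampling guard can be sketched as follows (function name, cap constant, and header wording are assumptions):

```python
import random

OVERVIEW_SAMPLE_CAP = 500  # keeps the overview within the LLM context budget


def sample_rows(rows: list, cap: int = OVERVIEW_SAMPLE_CAP) -> tuple[list, str]:
    """Return all rows when under the cap, else a random sample plus a header
    telling the agent the total count and that results are sampled."""
    total = len(rows)
    if total <= cap:
        return rows, f"{total} results"
    return random.sample(rows, cap), f"random sample of {cap} of {total} results"
```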