feat(llma): add evaluation report agent #54364
andrewm4894 wants to merge 4 commits into andy/llma-eval-reports-1-models-api
Conversation
Hey @andrewm4894! 👋
This pull request seems to contain no description. Please add useful context, rationale, and/or any other information that will help make sense of this change now and in the distant Mars-based future.
Warning: This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
This stack of pull requests is managed by Graphite. Learn more about stacking.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a7d458be2e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you:
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
if content.citations:
    refs_lines = []
    for i, c in enumerate(content.citations, 1):
        refs_lines.append(f"{i}. `{c.generation_id}` — {c.reason}")
    content.sections.append(
Enforce section cap after adding references
This appends a References section after validation has already accepted up to MAX_REPORT_SECTIONS, so a valid 6-section report with citations becomes 7 sections and violates the schema/prompt contract. That can break downstream assumptions about section bounds (for rendering or delivery formatting) specifically when the agent uses the full section budget and includes citations.
Useful? React with 👍 / 👎.
Valid -- the References append can push section count past MAX_REPORT_SECTIONS when the agent uses all 6 slots. Will fix by replacing the last agent section with References when at the cap (the agent's add_section tool already blocks at MAX, so the replaced section is the least important one), or by accounting for the References slot in validation. Pushing a fix.
Fixed in 914abb7 — when at MAX_REPORT_SECTIONS, References replaces the last agent section instead of exceeding the cap.
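As a rough sketch of what that fix could look like (the `ReportSection`/`ReportContent` stand-ins and the `append_references` name below are illustrative, not the PR's actual code):

```python
from dataclasses import dataclass, field


@dataclass
class ReportSection:  # stand-in for the PR's section model
    title: str
    content: str


@dataclass
class ReportContent:  # stand-in for EvalReportContent
    sections: list[ReportSection] = field(default_factory=list)


MAX_REPORT_SECTIONS = 6  # section budget discussed in this thread


def append_references(content: ReportContent, refs_lines: list[str]) -> None:
    """Add a References section without ever exceeding MAX_REPORT_SECTIONS."""
    references = ReportSection(title="References", content="\n".join(refs_lines))
    if len(content.sections) >= MAX_REPORT_SECTIONS:
        # At the cap: overwrite the last agent-authored section instead of appending a 7th.
        content.sections[-1] = references
    else:
        content.sections.append(references)
```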
    AND timestamp < '{ts_end}'
    {filter_clause}
ORDER BY timestamp DESC
LIMIT {limit}
Clamp sample size before interpolating LIMIT
The limit argument is not bounded before being inserted into the HogQL LIMIT, so the model can be steered (e.g., via custom report guidance) into requesting very large result sets, causing expensive queries and oversized tool responses that can exhaust context or hit activity timeouts. This is most likely to surface when prompt injection or malformed tool planning asks for a large limit.
Useful? React with 👍 / 👎.
Agreed, clamping limit before interpolation is the right call. Will add limit = min(max(1, limit), 500) to both sample_eval_results and get_top_failure_reasons. Same fix addresses the greptile comment at line 209.
Fixed in 914abb7 — both functions now clamp limit to [1, 500].
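A minimal sketch of the clamp the thread agrees on; the helper name is illustrative and the merged code in 914abb7 may inline the expression instead:

```python
def _clamp_limit(limit: int, floor: int = 1, ceiling: int = 500) -> int:
    """Bound an LLM-supplied limit before it is interpolated into a HogQL LIMIT clause."""
    return min(max(floor, limit), ceiling)


# Usage inside sample_eval_results / get_top_failure_reasons (illustrative):
#   limit = _clamp_limit(limit)   # 1_000_000 -> 500, 0 -> 1, 50 -> 50
```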
Prompt To Fix All With AI
This is a comment left during a code review.
Path: posthog/temporal/llm_analytics/eval_reports/report_agent/graph.py
Line: 58
Comment:
**Unnecessary late import — already imported at module level**
`_ch_ts` and `_execute_hogql` are imported late here to avoid a perceived circular dependency, but `graph.py` already imports from `tools` at the top level (`from posthog.temporal.llm_analytics.eval_reports.report_agent.tools import EVAL_REPORT_TOOLS`). Since no circular import actually exists, these can simply join that top-level import.
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: posthog/temporal/llm_analytics/eval_reports/report_agent/graph.py
Line: 67-96
Comment:
**Duplicate HogQL queries — OnceAndOnlyOnce**
The two queries here are character-for-character copies of the ones in `get_summary_metrics` in `tools.py` (lines 84–114). The current and previous period selects, the `countIf` expressions, and the row-parsing arithmetic are all repeated. If the evaluation schema changes (e.g. a new applicable flag), both places must be updated in sync.
Consider extracting the raw query execution and row-parsing into a shared private helper (e.g. `_fetch_period_counts(team_id, evaluation_id, ts_start, ts_end)`) that both `_compute_metrics` and `get_summary_metrics` call.
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: posthog/temporal/llm_analytics/eval_reports/report_agent/test/test_graph.py
Line: 67-90
Comment:
**Non-parameterized tests for trend direction**
`test_populated_metrics_stable_trend`, `test_populated_metrics_trend_up`, and `test_populated_metrics_trend_down` are structurally identical — they differ only in `(pass_rate, previous_pass_rate, expected_phrase)`. Per project convention, these should be a single `@parameterized.expand` test.
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: posthog/temporal/llm_analytics/eval_reports/report_agent/tools.py
Line: 206-209
Comment:
**Unbounded LLM-controlled `limit` in query tools**
The LLM can pass any integer as `limit`. Without a cap, a misbehaving or confused agent could request `limit=1_000_000` and generate a very expensive ClickHouse scan. The same applies to `get_top_failure_reasons` (line 601). A simple guard like `limit = min(max(1, limit), 500)` keeps things predictable.
How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "feat(llma): add evaluation report agent"
    with zero counts and logs the exception. The agent cannot fabricate numbers
    because this function is the sole source of truth for `content.metrics`.
    """
    from posthog.temporal.llm_analytics.eval_reports.report_agent.tools import _ch_ts, _execute_hogql
Unnecessary late import — already imported at module level
_ch_ts and _execute_hogql are imported late here to avoid a perceived circular dependency, but graph.py already imports from tools at the top level (from posthog.temporal.llm_analytics.eval_reports.report_agent.tools import EVAL_REPORT_TOOLS). Since no circular import actually exists, these can simply join that top-level import.
Prompt To Fix With AI
This is a comment left during a code review.
Path: posthog/temporal/llm_analytics/eval_reports/report_agent/graph.py
Line: 58
Comment:
**Unnecessary late import — already imported at module level**
`_ch_ts` and `_execute_hogql` are imported late here to avoid a perceived circular dependency, but `graph.py` already imports from `tools` at the top level (`from posthog.temporal.llm_analytics.eval_reports.report_agent.tools import EVAL_REPORT_TOOLS`). Since no circular import actually exists, these can simply join that top-level import.
How can I resolve this? If you propose a fix, please make it concise.
Good catch -- you're right, there's no circular import here. graph.py already imports from tools at module level. Will hoist _ch_ts and _execute_hogql into that top-level import.
Fixed in 914abb7 — hoisted to top-level imports.
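For reference, the hoisted import would look roughly like this (a sketch only; the exact import list in 914abb7 is not shown in this thread):

```python
from posthog.temporal.llm_analytics.eval_reports.report_agent.tools import (
    EVAL_REPORT_TOOLS,
    _ch_ts,
    _execute_hogql,
)
```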
current_rows = _execute_hogql(
    team_id,
    f"""
    SELECT
        countIf(properties.$ai_evaluation_result = true AND (isNull(properties.$ai_evaluation_applicable) OR properties.$ai_evaluation_applicable != false)) as pass_count,
        countIf(properties.$ai_evaluation_result = false AND (isNull(properties.$ai_evaluation_applicable) OR properties.$ai_evaluation_applicable != false)) as fail_count,
        countIf(properties.$ai_evaluation_applicable = false) as na_count,
        count() as total
    FROM events
    WHERE event = '$ai_evaluation'
        AND properties.$ai_evaluation_id = '{evaluation_id}'
        AND timestamp >= '{ts_start}'
        AND timestamp < '{ts_end}'
    """,
)

previous_rows = _execute_hogql(
    team_id,
    f"""
    SELECT
        countIf(properties.$ai_evaluation_result = true AND (isNull(properties.$ai_evaluation_applicable) OR properties.$ai_evaluation_applicable != false)) as pass_count,
        countIf(properties.$ai_evaluation_result = false AND (isNull(properties.$ai_evaluation_applicable) OR properties.$ai_evaluation_applicable != false)) as fail_count,
        count() as total
    FROM events
    WHERE event = '$ai_evaluation'
        AND properties.$ai_evaluation_id = '{evaluation_id}'
        AND timestamp >= '{ts_prev_start}'
        AND timestamp < '{ts_start}'
    """,
)
Duplicate HogQL queries — OnceAndOnlyOnce
The two queries here are character-for-character copies of the ones in get_summary_metrics in tools.py (lines 84–114). The current and previous period selects, the countIf expressions, and the row-parsing arithmetic are all repeated. If the evaluation schema changes (e.g. a new applicable flag), both places must be updated in sync.
Consider extracting the raw query execution and row-parsing into a shared private helper (e.g. _fetch_period_counts(team_id, evaluation_id, ts_start, ts_end)) that both _compute_metrics and get_summary_metrics call.
Prompt To Fix With AI
This is a comment left during a code review.
Path: posthog/temporal/llm_analytics/eval_reports/report_agent/graph.py
Line: 67-96
Comment:
**Duplicate HogQL queries — OnceAndOnlyOnce**
The two queries here are character-for-character copies of the ones in `get_summary_metrics` in `tools.py` (lines 84–114). The current and previous period selects, the `countIf` expressions, and the row-parsing arithmetic are all repeated. If the evaluation schema changes (e.g. a new applicable flag), both places must be updated in sync.
Consider extracting the raw query execution and row-parsing into a shared private helper (e.g. `_fetch_period_counts(team_id, evaluation_id, ts_start, ts_end)`) that both `_compute_metrics` and `get_summary_metrics` call.
How can I resolve this? If you propose a fix, please make it concise.
Intentional duplication for now. _compute_metrics and get_summary_metrics serve different trust boundaries: _compute_metrics produces the authoritative EvalReportMetrics dataclass (attached directly to the report, agent cannot mutate), while get_summary_metrics is an agent-facing tool that returns a JSON string the LLM reads for analysis context. They also diverge slightly (the tool returns the previous na_count, _compute_metrics doesn't).
Extracting a shared helper would couple the trusted-metrics path to the tool's return shape. If the queries diverge further (e.g., adding cost breakdowns only to the tool), the shared helper becomes a leaky abstraction. Happy to revisit if they stay identical after the stack lands.
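For context, the helper the reviewer suggested (and the author chose not to extract for now) might look something like this; the name, signature, and return shape are taken from the review comment rather than from any merged code:

```python
def _fetch_period_counts(team_id: int, evaluation_id: str, ts_start: str, ts_end: str) -> dict[str, int]:
    """Run the shared pass/fail/na count query for one time window and parse the single result row."""
    rows = _execute_hogql(
        team_id,
        f"""
        SELECT
            countIf(properties.$ai_evaluation_result = true AND (isNull(properties.$ai_evaluation_applicable) OR properties.$ai_evaluation_applicable != false)) as pass_count,
            countIf(properties.$ai_evaluation_result = false AND (isNull(properties.$ai_evaluation_applicable) OR properties.$ai_evaluation_applicable != false)) as fail_count,
            countIf(properties.$ai_evaluation_applicable = false) as na_count,
            count() as total
        FROM events
        WHERE event = '$ai_evaluation'
            AND properties.$ai_evaluation_id = '{evaluation_id}'
            AND timestamp >= '{ts_start}'
            AND timestamp < '{ts_end}'
        """,
    )
    # Assumes each row is a (pass_count, fail_count, na_count, total) tuple, as in the diff above.
    pass_count, fail_count, na_count, total = rows[0] if rows else (0, 0, 0, 0)
    return {"pass_count": pass_count, "fail_count": fail_count, "na_count": na_count, "total": total}
```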
def test_populated_metrics_stable_trend(self):
    metrics = EvalReportMetrics(
        total_runs=100,
        pass_count=80,
        fail_count=20,
        na_count=0,
        pass_rate=80.0,
        previous_pass_rate=80.0,
    )
    content = _fallback_content("Helpfulness", metrics, "validation failed")
    self.assertEqual(len(content.sections), 1)
    body = content.sections[0].content
    self.assertIn("80.0%", body)
    self.assertIn("stable", body)

def test_populated_metrics_trend_up(self):
    metrics = EvalReportMetrics(total_runs=10, pass_count=9, fail_count=1, pass_rate=90.0, previous_pass_rate=70.0)
    content = _fallback_content("X", metrics, "why")
    self.assertIn("up from", content.sections[0].content)

def test_populated_metrics_trend_down(self):
    metrics = EvalReportMetrics(total_runs=10, pass_count=5, fail_count=5, pass_rate=50.0, previous_pass_rate=80.0)
    content = _fallback_content("X", metrics, "why")
    self.assertIn("down from", content.sections[0].content)
Non-parameterized tests for trend direction
test_populated_metrics_stable_trend, test_populated_metrics_trend_up, and test_populated_metrics_trend_down are structurally identical — they differ only in (pass_rate, previous_pass_rate, expected_phrase). Per project convention, these should be a single @parameterized.expand test.
Prompt To Fix With AI
This is a comment left during a code review.
Path: posthog/temporal/llm_analytics/eval_reports/report_agent/test/test_graph.py
Line: 67-90
Comment:
**Non-parameterized tests for trend direction**
`test_populated_metrics_stable_trend`, `test_populated_metrics_trend_up`, and `test_populated_metrics_trend_down` are structurally identical — they differ only in `(pass_rate, previous_pass_rate, expected_phrase)`. Per project convention, these should be a single `@parameterized.expand` test.
How can I resolve this? If you propose a fix, please make it concise.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Good call -- this is a textbook case for @parameterized.expand. Will consolidate the three trend-direction tests into a single parameterized test.
Fixed in 914abb7 — consolidated into @parameterized.expand.
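A sketch of what the consolidated test could look like, assuming the `parameterized` package; `EvalReportMetrics` and `_fallback_content` come from the PR's modules, while the class name and tuples below are illustrative and the actual test in 914abb7 may differ:

```python
import unittest

from parameterized import parameterized


class TestFallbackContentTrend(unittest.TestCase):  # hypothetical class name for this sketch
    @parameterized.expand(
        [
            # (case name, pass_rate, previous_pass_rate, expected phrase in the fallback body)
            ("stable", 80.0, 80.0, "stable"),
            ("up", 90.0, 70.0, "up from"),
            ("down", 50.0, 80.0, "down from"),
        ]
    )
    def test_populated_metrics_trend(self, _name, pass_rate, previous_pass_rate, expected_phrase):
        # Per the original tests, the trend phrase depends only on the two rates.
        metrics = EvalReportMetrics(
            total_runs=10, pass_count=5, fail_count=5, pass_rate=pass_rate, previous_pass_rate=previous_pass_rate
        )
        content = _fallback_content("X", metrics, "why")
        self.assertIn(expected_phrase, content.sections[0].content)
```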
def sample_eval_results(
    state: Annotated[dict, InjectedState],
    filter: str = "all",
    limit: int = 50,
Unbounded LLM-controlled limit in query tools
The LLM can pass any integer as limit. Without a cap, a misbehaving or confused agent could request limit=1_000_000 and generate a very expensive ClickHouse scan. The same applies to get_top_failure_reasons (line 601). A simple guard like limit = min(max(1, limit), 500) keeps things predictable.
Prompt To Fix With AI
This is a comment left during a code review.
Path: posthog/temporal/llm_analytics/eval_reports/report_agent/tools.py
Line: 206-209
Comment:
**Unbounded LLM-controlled `limit` in query tools**
The LLM can pass any integer as `limit`. Without a cap, a misbehaving or confused agent could request `limit=1_000_000` and generate a very expensive ClickHouse scan. The same applies to `get_top_failure_reasons` (line 601). A simple guard like `limit = min(max(1, limit), 500)` keeps things predictable.
How can I resolve this? If you propose a fix, please make it concise.
Agreed, same issue as the codex comment above. Will add limit = min(max(1, limit), 500) guard to both sample_eval_results and get_top_failure_reasons.
a7d458b to 7f1ec1f (Compare)
Add LangGraph-based agent for generating evaluation reports. Includes 11 tools for querying ClickHouse metrics/examples/trends and producing structured output, a system prompt with pass/fail analysis guidance, and typed schema for report content (sections, metrics, citations). Pure logic with no side effects — wired to Temporal in a follow-up PR.
7b3cefa to 628c96b (Compare)
7f1ec1f to b44acb4 (Compare)
- Cap References section: replace last agent section when at MAX_REPORT_SECTIONS instead of exceeding the limit
- Clamp LLM-controlled limit to [1, 500] in sample_eval_results and get_top_failure_reasons to prevent expensive queries
- Hoist _ch_ts and _execute_hogql to top-level imports in graph.py
- Parameterize trend direction tests with @parameterized.expand
New compact overview tool returns every eval result as a condensed row (verdict | generation_id | truncated reasoning). Gives the agent a full picture of all results early in its workflow before drilling into specific examples. Added to suggested workflow as step 2.
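A rough illustration of the condensed row format described above; only the `verdict | generation_id | truncated reasoning` shape comes from the commit message, while the function name and truncation length are assumptions:

```python
def _format_overview_row(verdict: str, generation_id: str, reasoning: str, max_len: int = 80) -> str:
    """Render one eval result as a single compact line for the overview tool."""
    truncated = reasoning if len(reasoning) <= max_len else reasoning[: max_len - 1] + "…"
    return f"{verdict} | {generation_id} | {truncated}"
```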
When there are more than 500 eval results in the period, return a random sample of 500 instead of all results to avoid blowing up the LLM context window. Header indicates total count and that results are sampled.
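A minimal sketch of that sampling behaviour using the standard library; the constant name, function name, and header wording are assumptions:

```python
import random

OVERVIEW_SAMPLE_CAP = 500  # assumed name for the 500-row cap described above


def sample_overview_rows(rows: list[str]) -> tuple[str, list[str]]:
    """Return a header describing the result set plus at most OVERVIEW_SAMPLE_CAP rows."""
    total = len(rows)
    if total <= OVERVIEW_SAMPLE_CAP:
        return f"{total} results (all shown)", rows
    sampled = random.sample(rows, OVERVIEW_SAMPLE_CAP)
    return f"{total} results total; showing a random sample of {OVERVIEW_SAMPLE_CAP}", sampled
```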

Problem
The evaluation report system needs an LLM agent that can analyze evaluation data and produce structured reports with metrics, examples, and citations.
Part 2 of 5 in a stacked PR series. Depends on #54363.
Stack:
Changes
- EvalReportContent with sections, metrics, citations
- EvalReportAgentState

How did you test this code?
No
Docs update
skip-inkeep-docs
🤖 LLM context
Agent-assisted PR stack creation from a single reference branch.