
feat(llma): add evaluation report delivery and temporal workflows #54365

Open

andrewm4894 wants to merge 2 commits into andy/llma-eval-reports-2-report-agent from andy/llma-eval-reports-3-delivery-workflows

Conversation


@andrewm4894 andrewm4894 commented Apr 13, 2026

Problem

The report agent (PR 2) produces structured content but needs an execution pipeline to schedule generation, deliver results via email/Slack, and manage the report lifecycle.

Part 3 of 5 in a stacked PR series. Depends on #54364.

Stack:

  1. Models + API (feat(llma): add evaluation report models and API #54363)
  2. Report Agent (feat(llma): add evaluation report agent #54364)
  3. Delivery + Workflows (this PR)
  4. Frontend (feat(llma): add evaluation reports frontend #54366)
  5. Wiring + Generated (feat(llma): wire up auto-create hook, generated types, and MCP tools #54367)

Changes

  • Email and Slack delivery renderers that convert EvalReportContent to formatted output
  • HTML email template for evaluation reports
  • 7 Temporal activities (fetch report, run agent, save results, deliver, check schedules, check count triggers, update next delivery)
  • 3 Temporal workflows:
    • GenerateEvaluationReportWorkflow — runs agent and delivers
    • CheckScheduledReportsWorkflow — hourly schedule check
    • CheckCountTriggeredReportsWorkflow — 5-minute poll for count-based triggers
  • Schedule registration in the global Temporal schedule list
  • Cooldown (min 60 min) and daily cap (max 10) for count-based triggers

After merge, the pipeline is live but idle — no enabled report rows exist yet.
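The cooldown and daily-cap gating for count-based triggers can be sketched as a small gate function. This is an illustrative sketch only — the function name and signature are invented here, though the 60-minute cooldown and 10-run daily cap match the figures above:

```python
import datetime as dt

def should_run(now, last_delivered_at, cooldown_minutes, runs_today, daily_cap=10):
    """Gate a count-triggered report run (illustrative; names are invented)."""
    # Cooldown: skip if the last delivery was too recent
    if last_delivered_at is not None:
        if now - last_delivered_at < dt.timedelta(minutes=cooldown_minutes):
            return False
    # Daily cap: skip if too many runs have already happened today
    if runs_today >= daily_cap:
        return False
    return True

now = dt.datetime(2026, 4, 13, 12, 0)
assert should_run(now, now - dt.timedelta(minutes=90), 60, 3) is True
assert should_run(now, now - dt.timedelta(minutes=30), 60, 0) is False  # cooldown
assert should_run(now, None, 60, 10) is False                           # daily cap
```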

How did you test this code?

  • Activity unit tests for fetch, save, and delivery operations
  • Delivery tests for email and Slack rendering from report content
  • All tests pass locally


Docs update

skip-inkeep-docs

🤖 LLM context

Agent-assisted PR stack creation from a single reference branch.

@github-actions

Hey @andrewm4894! 👋
This pull request seems to contain no description. Please add useful context, rationale, and/or any other information that will help make sense of this change now and in the distant Mars-based future.

Member Author

andrewm4894 commented Apr 13, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.

This stack of pull requests is managed by Graphite.

_inline_email_styles,
_linkify_uuids,
_render_metrics_block_html,
_render_metrics_block_mrkdwn,
Contributor
Import of undefined function: _render_metrics_block_mrkdwn is imported but does not exist in delivery.py. The actual function is named _render_metrics_slack_blocks. This will cause an ImportError when the test module is loaded.

# Line 11 should be removed or changed to:
from posthog.temporal.llm_analytics.eval_reports.delivery import (
    # ... other imports ...
    _render_metrics_slack_blocks,  # not _render_metrics_block_mrkdwn
)
Suggested change
_render_metrics_block_mrkdwn,
_render_metrics_slack_blocks,

Spotted by Graphite


Member Author

Valid catch. The function was renamed from _render_metrics_block_mrkdwn to _render_metrics_slack_blocks (returns Block Kit blocks instead of a string). The test imports and assertions both need updating — will fix.
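For reference, the updated assertions would target a list-of-dicts shape rather than a mrkdwn string. A minimal sketch, assuming only what is stated above (the renamed function returns Block Kit blocks); the sample block below is illustrative, not the real function's output:

```python
def check_metrics_blocks(blocks):
    # Shape checks an updated test might make on Block Kit output
    assert isinstance(blocks, list)
    assert all(isinstance(block, dict) and "type" in block for block in blocks)

# Illustrative Block Kit output — not what _render_metrics_slack_blocks actually returns
sample = [{"type": "section", "text": {"type": "mrkdwn", "text": "Pass rate: 80%"}}]
check_metrics_blocks(sample)
```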

Contributor

greptile-apps bot commented Apr 13, 2026

Security Review

  • HogQL injection surface (activities.py lines 77–91 and 134–155): evaluation_id and formatted datetime strings are interpolated directly into parse_select() f-strings. The project's security guide explicitly flags this pattern as vulnerable and requires ast.Constant placeholders. The values currently come from the database (UUIDs and datetimes), but the string-interpolation pattern creates an injection surface that would be immediately exploitable if any of those values were ever sourced from user input.
Prompt To Fix All With AI
This is a comment left during a code review.
Path: posthog/temporal/llm_analytics/eval_reports/test/test_delivery.py
Line: 11

Comment:
**`_render_metrics_block_mrkdwn` does not exist — file will fail to import**

This import references a name that doesn't exist in `delivery.py`. The actual function defined there is `_render_metrics_slack_blocks`, which returns `list[dict]` (Block Kit blocks with an ASCII bar chart), not a string. The tests in `TestMetricsBlockMrkdwn` (lines 234–248) call this function and assert on `*Pass rate:*` and `*N/A:*`, which are not present in the Block Kit output. The entire test module will crash with `ImportError` before any test runs.

Either a separate `_render_metrics_block_mrkdwn` function (returning a plain mrkdwn string) needs to be added to `delivery.py`, or the tests should be updated to use `_render_metrics_slack_blocks` with assertions appropriate for its list-of-dicts shape.

```suggestion
    _render_metrics_block_html,
    _render_metrics_slack_blocks,
```

How can I resolve this? If you propose a fix, please make it concise.



# Match any UUID in content — strips surrounding punctuation (backticks, angle brackets, etc.)
# so we don't depend on how the LLM formats generation references.
UUID_LINK_PATTERN = re.compile(r"[`<]*([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})[`>]*")
Contributor
P1 UUID pattern matches bare UUIDs — contradicts test and docstring intent

The quantifier * (zero or more) on both delimiters makes the pattern match any UUID whether or not it's surrounded by backticks or angle brackets. The docstring for _linkify_uuids says "Replace backtick-wrapped UUIDs", and test_no_match_without_backticks explicitly asserts that a plain UUID without delimiters returns zero matches — that test will fail with this pattern. Changing * to + limits matching to UUIDs that actually have at least one surrounding delimiter.

Suggested change
UUID_LINK_PATTERN = re.compile(r"[`<]*([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})[`>]*")
UUID_LINK_PATTERN = re.compile(r"[`<]+([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})[`>]+")

Member Author
Good catch — * (zero-or-more) means the pattern matches bare UUIDs without any delimiters, which contradicts the test expectation. Changing to + so only delimited UUIDs are matched. Will fix.
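The behavioural difference between the two quantifiers can be demonstrated directly (the pattern strings come from the diff; the `LOOSE`/`STRICT` names are just for illustration):

```python
import re

# Original pattern: zero-or-more delimiters, so a bare UUID still matches
LOOSE = re.compile(r"[`<]*([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})[`>]*")
# Fixed pattern: at least one delimiter required on each side
STRICT = re.compile(r"[`<]+([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})[`>]+")

uuid = "0f8fad5b-d9cb-469f-a165-70867728950e"
bare = f"run {uuid} finished"
wrapped = f"run `{uuid}` finished"

assert LOOSE.findall(bare) == [uuid]      # matches even without backticks
assert STRICT.findall(bare) == []         # bare UUIDs are ignored
assert STRICT.findall(wrapped) == [uuid]  # delimited UUIDs still match
```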

Comment on lines +390 to +399

all_failed = (
    had_any_target
    and len(all_errors) > 0
    and (
        # Check if every attempted target failed. We count attempts implicitly by
        # summing target counts — an exception in delivery appends 1 error, and
        # targets with an attempt that succeeded don't append anything.
        len(all_errors) >= len(email_targets) + len(slack_targets)
    )
)
Contributor
P1 all_failed overcounts errors when an email target has multiple addresses

deliver_email_report iterates over comma-separated addresses inside a single target object, appending one error per failed address. deliver_report then compares len(all_errors) >= len(email_targets), where email_targets counts target objects (not individual addresses). If a single target holds a@b.com, c@d.com and only the first send fails, all_errors has 1 entry and len(email_targets) is also 1, so all_failed is True — but one address was delivered successfully. This causes an incorrect FAILED status, a RuntimeError, and a Temporal retry that will duplicate delivery to the successful address.


Member Author
Valid. When an email target contains comma-separated addresses (e.g. a@b.com, c@d.com), each failed send appends one error, but len(email_targets) counts target objects not individual addresses. Will fix by counting the actual number of delivery attempts (individual email addresses + slack channels) instead of target objects.
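A sketch of the intended fix — count individual delivery attempts rather than target objects (the helper name and target shape are illustrative, not the actual code):

```python
def count_delivery_attempts(email_targets, slack_targets):
    """Count individual sends: one per email address, one per Slack channel."""
    email_attempts = sum(
        len([addr for addr in target["value"].split(",") if addr.strip()])
        for target in email_targets
    )
    return email_attempts + len(slack_targets)

email_targets = [{"value": "a@b.com, c@d.com"}]  # one target, two addresses
all_errors = ["smtp error for a@b.com"]          # only the first send failed

attempts = count_delivery_attempts(email_targets, [])
all_failed = attempts > 0 and len(all_errors) >= attempts
assert attempts == 2
assert all_failed is False  # one of two attempts succeeded, so not "all failed"
```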

Comment on lines +77 to +91

).select_related("evaluation")

for report in reports:
    # Cooldown: skip if last delivery was too recent
    if report.last_delivered_at:
        cooldown_delta = dt.timedelta(minutes=report.cooldown_minutes)
        if (now - report.last_delivered_at) < cooldown_delta:
            continue

    # Daily cap: skip if too many runs today
    today_start = now.replace(hour=0, minute=0, second=0, microsecond=0)
    today_runs = EvaluationReportRun.objects.filter(
        report=report,
        created_at__gte=today_start,
    ).count()
Contributor
P1 security HogQL f-string interpolation — use ast.Constant placeholders instead

Both this query and the one in _find_nth_eval_timestamp (lines 134–155) embed evaluation_id and datetime strings directly into parse_select() f-strings. The security guide (.agents/security.md) flags this pattern: "User data embedded in f-string — can escape context!" and requires ast.Constant placeholders. Even though evaluation_id is a UUID from the database, semgrep's hogql-fstring-audit rule will flag these, and if the value were ever sourced from user input upstream the injection surface would already exist.

Use placeholders:

from posthog.hogql import ast as hogql_ast

query = parse_select(
    """
    SELECT count() as total
    FROM events
    WHERE event = '$ai_evaluation'
        AND properties.$ai_evaluation_id = {eval_id}
        AND timestamp >= {since}
    """,
    placeholders={
        "eval_id": hogql_ast.Constant(value=str(report.evaluation_id)),
        "since": hogql_ast.Constant(value=since_str),
    },
)

Apply the same fix to _find_nth_eval_timestamp.


Member Author
Agreed — even though evaluation_id is a UUID from our database and since_str is a formatted datetime, using ast.Constant placeholders is the right pattern per our security guidelines. Will fix both queries to use placeholders.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 394a1ada63


Comment on lines 441 to 443

    create_batch_generation_summarization_schedule,
    create_trace_clustering_coordinator_schedule,
    create_generation_clustering_coordinator_schedule,
P1 Restore WA weekly digest schedule registration

This change removes create_wa_weekly_digest_schedule from posthog/temporal/schedule.py and also drops it from the schedules list, so a_init_general_queue_schedules() will no longer create/update the wa-weekly-digest-schedule. The wa-weekly-digest workflow still exists in products/web_analytics/backend/temporal/weekly_digest/workflows.py, so this effectively disables automatic Web analytics weekly digest deliveries in production.


Member Author
Good catch — this was an accidental deletion during a rebase. The WA weekly digest schedule should remain. Will restore it.
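The restore amounts to putting the factory back in the schedule list — a sketch with stand-in factories (only the function names come from the review; the real modules are not reproduced here):

```python
# Stand-ins for the real schedule factories referenced in posthog/temporal/schedule.py
def create_wa_weekly_digest_schedule():
    return {"id": "wa-weekly-digest-schedule"}

def create_batch_generation_summarization_schedule():
    return {"id": "batch-generation-summarization-schedule"}

schedules = [
    create_batch_generation_summarization_schedule,
    create_wa_weekly_digest_schedule,  # restored — dropped accidentally in the rebase
]
ids = [factory()["id"] for factory in schedules]
assert "wa-weekly-digest-schedule" in ids
```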

Comment on lines +390 to +398

all_failed = (
    had_any_target
    and len(all_errors) > 0
    and (
        # Check if every attempted target failed. We count attempts implicitly by
        # summing target counts — an exception in delivery appends 1 error, and
        # targets with an attempt that succeeded don't append anything.
        len(all_errors) >= len(email_targets) + len(slack_targets)
    )

P1 Count per-recipient delivery attempts before marking all failed

The all_failed check compares len(all_errors) to the number of target objects, but email delivery can attempt multiple recipients inside a single target (value is split by commas). If one recipient fails and another succeeds under one email target, len(all_errors) can still be >= 1, so the run is incorrectly marked FAILED and raises RuntimeError, triggering unnecessary Temporal retries and duplicate sends to recipients that already got the report.


Member Author
Duplicate of the Greptile comment above — same issue. Will fix by counting individual delivery attempts rather than target objects.

andrewm4894 force-pushed the andy/llma-eval-reports-2-report-agent branch from a7d458b to 7f1ec1f on April 13, 2026 at 22:27
andrewm4894 force-pushed the andy/llma-eval-reports-3-delivery-workflows branch from 394a1ad to b927375 on April 13, 2026 at 22:27
andrewm4894 force-pushed the andy/llma-eval-reports-2-report-agent branch from 7f1ec1f to b44acb4 on April 13, 2026 at 22:34
andrewm4894 force-pushed the andy/llma-eval-reports-3-delivery-workflows branch 3 times, most recently from bc73add to d286f52 on April 13, 2026 at 22:46
Add LangGraph-based agent for generating evaluation reports. Includes
11 tools for querying ClickHouse metrics/examples/trends and producing
structured output, a system prompt with pass/fail analysis guidance,
and typed schema for report content (sections, metrics, citations).
Pure logic with no side effects — wired to Temporal in a follow-up PR.

Add email and Slack delivery renderers, 7 Temporal activities,
3 workflows (GenerateEvaluationReport, CheckScheduledReports,
CheckCountTriggeredReports), and schedule registration. The pipeline
is live after merge but idle until report rows are created with
enabled=True.
andrewm4894 force-pushed the andy/llma-eval-reports-3-delivery-workflows branch from d286f52 to 7b533aa on April 13, 2026 at 22:48