feat(llma): add evaluation report delivery and temporal workflows #54365
andrewm4894 wants to merge 2 commits into andy/llma-eval-reports-2-report-agent from
Conversation
Hey @andrewm4894! 👋
This pull request seems to contain no description. Please add useful context, rationale, and/or any other information that will help make sense of this change now and in the distant Mars-based future.
Warning: This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
This stack of pull requests is managed by Graphite. Learn more about stacking.
```python
    _inline_email_styles,
    _linkify_uuids,
    _render_metrics_block_html,
    _render_metrics_block_mrkdwn,
```
Import of undefined function: _render_metrics_block_mrkdwn is imported but does not exist in delivery.py. The actual function is named _render_metrics_slack_blocks. This will cause an ImportError when the test module is loaded.
```python
# Line 11 should be removed or changed to:
from posthog.temporal.llm_analytics.eval_reports.delivery import (
    # ... other imports ...
    _render_metrics_slack_blocks,  # not _render_metrics_block_mrkdwn
)
```

```diff
- _render_metrics_block_mrkdwn,
+ _render_metrics_slack_blocks,
```
Spotted by Graphite
Valid catch. The function was renamed from _render_metrics_block_mrkdwn to _render_metrics_slack_blocks (returns Block Kit blocks instead of a string). The test imports and assertions both need updating — will fix.
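For reference, Block Kit output from such a renderer is a list of block dicts rather than a single mrkdwn string. A minimal sketch of the shape — the helper name, metric fields, and layout here are assumptions, not the actual delivery.py implementation:

```python
def render_metrics_slack_blocks(metrics: dict[str, float]) -> list[dict]:
    # Hypothetical sketch: one Block Kit "section" block per metric.
    # The real _render_metrics_slack_blocks may format values differently.
    return [
        {
            "type": "section",
            "text": {"type": "mrkdwn", "text": f"*{name}*: {value}"},
        }
        for name, value in metrics.items()
    ]


blocks = render_metrics_slack_blocks({"pass_rate": 0.92, "total_runs": 150})
```

Returning structured blocks (instead of one string) lets Slack render each metric as its own section and makes the test assertions target block fields rather than substring matches.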
```python
# Match any UUID in content — strips surrounding punctuation (backticks, angle brackets, etc.)
# so we don't depend on how the LLM formats generation references.
UUID_LINK_PATTERN = re.compile(r"[`<]*([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})[`>]*")
```
UUID pattern matches bare UUIDs — contradicts test and docstring intent
The quantifier * (zero or more) on both delimiters makes the pattern match any UUID whether or not it's surrounded by backticks or angle brackets. The docstring for _linkify_uuids says "Replace backtick-wrapped UUIDs", and test_no_match_without_backticks explicitly asserts that a plain UUID without delimiters returns zero matches — that test will fail with this pattern. Changing * to + limits matching to UUIDs that actually have at least one surrounding delimiter.
```diff
- UUID_LINK_PATTERN = re.compile(r"[`<]*([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})[`>]*")
+ UUID_LINK_PATTERN = re.compile(r"[`<]+([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})[`>]+")
```
Path: posthog/temporal/llm_analytics/eval_reports/delivery.py, line 21.
Good catch — * (zero-or-more) means the pattern matches bare UUIDs without any delimiters, which contradicts the test expectation. Changing to + so only delimited UUIDs are matched. Will fix.
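The behavioral difference is easy to check in isolation (a standalone sketch, not the project's test suite):

```python
import re

UUID = "0b54ab02-9f2c-4d6e-8a1b-3c5d7e9f0a1b"

# `*` (zero or more) matches bare UUIDs; `+` requires at least one
# delimiter on each side, so only wrapped UUIDs are linkified.
lax = re.compile(r"[`<]*([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})[`>]*")
strict = re.compile(r"[`<]+([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})[`>]+")

assert lax.search(UUID) is not None            # bare UUID matches with *
assert strict.search(UUID) is None             # bare UUID rejected with +
assert strict.search(f"`{UUID}`") is not None  # backtick-wrapped matches
```

Note the `+` form requires a delimiter on both sides, which matches the `_linkify_uuids` docstring's "backtick-wrapped" intent.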
```python
all_failed = (
    had_any_target
    and len(all_errors) > 0
    and (
        # Check if every attempted target failed. We count attempts implicitly by
        # summing target counts — an exception in delivery appends 1 error, and
        # targets with an attempt that succeeded don't append anything.
        len(all_errors) >= len(email_targets) + len(slack_targets)
    )
)
```
all_failed overcounts errors when an email target has multiple addresses
deliver_email_report iterates over comma-separated addresses inside a single target object, appending one error per failed address. deliver_report then compares len(all_errors) >= len(email_targets), where email_targets counts target objects (not individual addresses). If a single target holds a@b.com, c@d.com and only the first send fails, all_errors has 1 entry and len(email_targets) is also 1, so all_failed is True — but one address was delivered successfully. This causes an incorrect FAILED status, a RuntimeError, and a Temporal retry that will duplicate delivery to the successful address.
Path: posthog/temporal/llm_analytics/eval_reports/delivery.py, lines 390-399.
Valid. When an email target contains comma-separated addresses (e.g. a@b.com, c@d.com), each failed send appends one error, but len(email_targets) counts target objects not individual addresses. Will fix by counting the actual number of delivery attempts (individual email addresses + slack channels) instead of target objects.
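A sketch of that fix — counting per-address delivery attempts instead of target objects. The target shape and field names here are assumptions for illustration:

```python
def count_delivery_attempts(email_targets: list[dict], slack_targets: list[dict]) -> int:
    # Hypothetical: an email target's "value" may hold comma-separated
    # addresses, and each address is one delivery attempt.
    attempts = sum(
        len([a for a in t["value"].split(",") if a.strip()]) for t in email_targets
    )
    return attempts + len(slack_targets)  # one attempt per Slack channel


# One email target with two addresses = two attempts, so a single
# failed address no longer marks the whole run as FAILED.
attempts = count_delivery_attempts([{"value": "a@b.com, c@d.com"}], [])
all_errors = ["a@b.com: SMTP send failed"]
all_failed = len(all_errors) >= attempts
```

With the denominator counting attempts, partial failures produce a partial-failure status instead of a spurious `RuntimeError` and Temporal retry.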
```python
).select_related("evaluation")


for report in reports:
    # Cooldown: skip if last delivery was too recent
    if report.last_delivered_at:
        cooldown_delta = dt.timedelta(minutes=report.cooldown_minutes)
        if (now - report.last_delivered_at) < cooldown_delta:
            continue

    # Daily cap: skip if too many runs today
    today_start = now.replace(hour=0, minute=0, second=0, microsecond=0)
    today_runs = EvaluationReportRun.objects.filter(
        report=report,
        created_at__gte=today_start,
    ).count()
```
HogQL f-string interpolation — use `ast.Constant` placeholders instead
Both this query and the one in _find_nth_eval_timestamp (lines 134–155) embed evaluation_id and datetime strings directly into parse_select() f-strings. The security guide (.agents/security.md) flags this pattern: "User data embedded in f-string — can escape context!" and requires ast.Constant placeholders. Even though evaluation_id is a UUID from the database, semgrep's hogql-fstring-audit rule will flag these, and if the value were ever sourced from user input upstream the injection surface would already exist.
Use placeholders:

```python
from posthog.hogql import ast as hogql_ast

query = parse_select(
    """
    SELECT count() as total
    FROM events
    WHERE event = '$ai_evaluation'
    AND properties.$ai_evaluation_id = {eval_id}
    AND timestamp >= {since}
    """,
    placeholders={
        "eval_id": hogql_ast.Constant(value=str(report.evaluation_id)),
        "since": hogql_ast.Constant(value=since_str),
    },
)
```

Apply the same fix to `_find_nth_eval_timestamp`.
Path: posthog/temporal/llm_analytics/eval_reports/activities.py, lines 77-91.
Agreed — even though evaluation_id is a UUID from our database and since_str is a formatted datetime, using ast.Constant placeholders is the right pattern per our security guidelines. Will fix both queries to use placeholders.
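The underlying risk generalizes beyond HogQL. A standalone illustration of why spliced values are dangerous while placeholders stay inert — plain strings here, not the HogQL API:

```python
def build_unsafe(eval_id: str) -> str:
    # Unsafe: the value lands inside the quoted literal, so a crafted
    # value can close the quote and extend the WHERE clause.
    return f"SELECT count() FROM events WHERE eval_id = '{eval_id}'"


def build_safe(eval_id: str) -> tuple[str, dict]:
    # Safe shape: the query text stays constant and the value travels
    # separately, analogous to ast.Constant placeholders in HogQL.
    return "SELECT count() FROM events WHERE eval_id = {eval_id}", {"eval_id": eval_id}


malicious = "x' OR '1'='1"
unsafe = build_unsafe(malicious)
query, params = build_safe(malicious)
```

In the unsafe version the injected quote escapes the string literal and rewrites the query; in the safe version the query text never changes regardless of the value.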
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 394a1ada63
```python
    create_batch_generation_summarization_schedule,
    create_trace_clustering_coordinator_schedule,
    create_generation_clustering_coordinator_schedule,
```
Restore WA weekly digest schedule registration
This change removes create_wa_weekly_digest_schedule from posthog/temporal/schedule.py and also drops it from the schedules list, so a_init_general_queue_schedules() will no longer create/update the wa-weekly-digest-schedule. The wa-weekly-digest workflow still exists in products/web_analytics/backend/temporal/weekly_digest/workflows.py, so this effectively disables automatic Web analytics weekly digest deliveries in production.
Good catch — this was an accidental deletion during a rebase. The WA weekly digest schedule should remain. Will restore it.
```python
all_failed = (
    had_any_target
    and len(all_errors) > 0
    and (
        # Check if every attempted target failed. We count attempts implicitly by
        # summing target counts — an exception in delivery appends 1 error, and
        # targets with an attempt that succeeded don't append anything.
        len(all_errors) >= len(email_targets) + len(slack_targets)
    )
)
```
Count per-recipient delivery attempts before marking all failed
The all_failed check compares len(all_errors) to the number of target objects, but email delivery can attempt multiple recipients inside a single target (value is split by commas). If one recipient fails and another succeeds under one email target, len(all_errors) can still be >= 1, so the run is incorrectly marked FAILED and raises RuntimeError, triggering unnecessary Temporal retries and duplicate sends to recipients that already got the report.
Duplicate of the Greptile comment above — same issue. Will fix by counting individual delivery attempts rather than target objects.
Force-pushed a7d458b to 7f1ec1f
Force-pushed 394a1ad to b927375
Force-pushed 7f1ec1f to b44acb4
Force-pushed bc73add to d286f52
Add LangGraph-based agent for generating evaluation reports. Includes 11 tools for querying ClickHouse metrics/examples/trends and producing structured output, a system prompt with pass/fail analysis guidance, and typed schema for report content (sections, metrics, citations). Pure logic with no side effects — wired to Temporal in a follow-up PR.
Add email and Slack delivery renderers, 7 Temporal activities, 3 workflows (GenerateEvaluationReport, CheckScheduledReports, CheckCountTriggeredReports), and schedule registration. The pipeline is live after merge but idle until report rows are created with enabled=True.
Force-pushed d286f52 to 7b533aa

Problem
The report agent (PR 2) produces structured content but needs an execution pipeline to schedule generation, deliver results via email/Slack, and manage lifecycle.
Part 3 of 5 in a stacked PR series. Depends on #54364.
Stack:
Changes
- Email/Slack renderers: `EvalReportContent` to formatted output
- `GenerateEvaluationReportWorkflow` — runs agent and delivers
- `CheckScheduledReportsWorkflow` — hourly schedule check
- `CheckCountTriggeredReportsWorkflow` — 5-minute poll for count-based triggers

After merge, the pipeline is live but idle — no enabled report rows exist yet.
How did you test this code?
No
Docs update
skip-inkeep-docs
🤖 LLM context
Agent-assisted PR stack creation from a single reference branch.