
feat: agentic extraction & search pipelines (single-loop v2) #34

Open
yilu331 wants to merge 103 commits into main from
feat/agentic-extraction-search

Conversation


@yilu331 yilu331 commented Apr 23, 2026

Summary

Adds an agentic (single-loop, tool-calling) alternative to the classic extraction and search pipelines. Opt-in via Config.extraction_backend="agentic" / search_backend="agentic"; defaults unchanged.
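
The opt-in shape can be sketched as a minimal config (a dataclass stand-in here; the real Config is a schema class, so the field names follow the PR text but everything else is illustrative):

```python
from dataclasses import dataclass
from typing import Literal

Backend = Literal["classic", "agentic"]

@dataclass
class Config:
    # Defaults stay "classic"; agentic is opt-in per backend.
    extraction_backend: Backend = "classic"
    search_backend: Backend = "classic"

# Opt in to both agentic pipelines explicitly:
cfg = Config(extraction_backend="agentic", search_backend="agentic")
```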

  • ExtractionAgent — one adaptive agent loop per enabled (extractor_config × window). Drives run_tool_loop over EXTRACTION_TOOLS (search / get / create / delete user profile + user playbook + finish), accumulates PlanOps on ExtractionCtx.plan, and applies them atomically at commit time through deterministic invariants.
  • SearchAgent — read-only mirror; returns a typed SearchResult(answer, outcome, budget_exceeded, trace). AgenticSearchService.search walks the trace, harvests entity IDs, and populates UnifiedSearchResponse.{profiles,user_playbooks,agent_playbooks} plus the new .agent_answer field.
  • Atomic tool surface mapping 1:1 to existing BaseStorage methods — no new storage API.
  • Invariants at commit time (spec §6): A search-before-create, B delete-known-id-only, D plan-size cap, E no-duplicate-create, F no-duplicate-delete, H source_span present, J scope-match, K deletes-without-creates. Plus a pre-invariant resolver that cancels create + delete-of-tentative-id pairs (self-correction oscillations).
  • 34 eval fixtures + semantic judge: 12 mutation patterns, 18 Supermemory failure modes, 4 agent-loop behaviors. Group 3 runs fully deterministic (mocked-LLM replay) in CI.
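
The trace-harvest step in the SearchAgent bullet can be sketched as follows; the trace tuples and search-tool names here are illustrative assumptions, not the exact internal types:

```python
def harvest_entity_ids(trace):
    """Walk a tool-call trace and collect entity ids per kind.

    Assumes each trace entry is (tool_name, result_rows) with dict rows
    carrying an "id" key; the real trace objects are typed.
    """
    ids = {"profiles": set(), "user_playbooks": set()}
    for tool_name, rows in trace:
        if tool_name == "search_user_profiles":
            ids["profiles"].update(row["id"] for row in rows)
        elif tool_name == "search_user_playbooks":
            ids["user_playbooks"].update(row["id"] for row in rows)
    return ids
```

The service would then fetch each harvested id from storage and populate the corresponding UnifiedSearchResponse lists.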

Diagrams

Classic vs agentic — extraction flow


Classic (extraction_backend=classic) — single-shot LLM per configured extractor, post-hoc LLM-based deduplication:

graph TD
    C1[publish_request] --> C2[pre-flight gate<br/>_cheap_should_run_reject]
    C2 -->|pass| C3[ProfileGenerationService<br/>single-shot LLM per config]
    C2 -->|pass| C4[PlaybookGenerationService<br/>single-shot LLM per config]
    C3 --> C3d[ProfileDeduplicator<br/>LLM-based dedup pass]
    C4 --> C4d[PlaybookDeduplicator<br/>LLM-based dedup pass]
    C3d --> C5[(profiles)]
    C4d --> C6[(user_playbooks)]
    C6 --> C7[PlaybookAggregator<br/>user → agent_playbooks]

Agentic (extraction_backend=agentic) — one adaptive agent loop per extractor × window, plan+commit with deterministic invariants:

graph TD
    A1[publish_request] --> A2[pre-flight gate<br/>_cheap_should_run_reject]
    A2 -->|pass, for each<br/>extractor_config| A3[ExtractionAgent.run<br/>run_tool_loop]
    A3 -->|tool calls| A4{atomic tools}
    A4 -->|read| A4r[search_* · get_*]
    A4 -->|mutate| A4m[create_* · delete_*<br/>→ ctx.plan]
    A4r --> A3
    A4m --> A3
    A3 -->|finish / max_steps| A5[commit_plan]
    A5 --> A6[resolve tentative-id<br/>oscillations]
    A6 --> A7[hard + soft invariants]
    A7 --> A8[apply_plan_op]
    A8 --> A9[(profiles)]
    A8 --> A10[(user_playbooks)]
    A10 --> A11[PlaybookAggregator<br/>unchanged]

    style A5 fill:#d0ecd8
    style A6 fill:#d0ecd8
    style A7 fill:#d0ecd8

Key structural shifts:

  • Deduplication moves from post-hoc LLM passes (classic) to commit-time invariants (agentic). Invariant A (search-before-create) prevents duplicates at the source instead of pruning them after.
  • Persistence moves from direct writes (classic) to a plan+commit stage (agentic). Mutation tools append to ctx.plan; commit_plan applies the plan atomically only after the loop terminates.
  • PlaybookAggregator trigger is preserved identically across both backends.

Why agentic over classic?

The classic pipeline is a single-shot LLM per extractor (+ RAG for search) with a post-hoc LLM-based deduplication pass. It ships a lot of state into one prompt and hopes the model gets everything right in one pass. Five failure modes surfaced in the Supermemory literature (and directly exercised by the Group 2 eval fixtures) illustrate where this breaks — and how the agentic design mitigates each:

| Failure mode | Classic behaviour | Agentic mitigation |
| --- | --- | --- |
| Temporal supersession — "I've moved to Seattle" when an old profile says Austin | LLM emits a new "lives in Seattle" profile; the old one survives unless post-hoc dedup's text-similarity heuristic catches it | Invariant A forces search-before-create. The agent finds the stale profile, deletes it, and creates the new one in the same plan — a true supersession, not a duplicate. |
| Contradiction resolution — "I'm not vegetarian, never was" negating a stored profile | The current window contains no positive fact; the single-shot LLM may emit nothing; the stale profile survives | The agent can actively search for contradicting memory and issue delete ops on profiles the current statement invalidates. |
| Multi-hop reasoning — "what DB tool should I suggest for this user?" requires linking a profile (user is a data scientist) to a playbook (prefer polars for data) | Single-shot RAG returns one ranked list per query; chaining across entity types is not supported | SearchAgent re-queries iteratively: first profiles, then — knowing the user's role — relevant playbooks. |
| Empty-result confabulation — asking about memory that does not exist | The classic LLM may invent plausible answers from training priors | The search prompt instructs the agent to say "no evidence in memory"; tool-trace grounding means every assertion must trace to a retrieved id. |
| Nonmatching-constraint application — user's broad rule ("metric units only") should apply to a query that doesn't lexically match the rule | RAG may miss the rule when current-query vocabulary doesn't overlap | The agent re-queries by intent ("measurement preferences") rather than verbatim, surfaces the rule, applies it. |

Structural benefits beyond the failure modes are detailed in the commit_plan section below (atomic cancellation, commit-time invariants, incremental cost, auditability).

Trade-off honesty: agentic has more LLM turns per run than classic's single-shot. The bet is that looping over a narrow tool surface with commit-time invariants is a net win on quality and correctness vs. a single larger prompt with post-hoc cleanup. Group 1/2 real-LLM eval runs will quantify the trade-off; default-flip from classic → agentic is gated on those numbers.

Agent loop shape

  ExtractionAgent                   SearchAgent
  ───────────────                   ───────────
  for each (extractor × window):    for each query:
    search / get (read)               search / get (read, re-query)
    create / delete (plan op)         finish(answer)
    finish                                │
       │                                  ▼
       ▼                             {answer, trace, outcome}
  commit_plan(ctx):                       │
    resolve tentative-id pairs            ▼
    run hard + soft invariants       harvest ids from trace
    apply surviving ops ──► storage  fetch entities by id
                                     populate UnifiedSearchResponse
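
Both loop shapes above share one primitive. A minimal sketch, assuming a call_llm callable that picks the next tool from the trace (the real run_tool_loop drives native LiteLLM tool-calling with a structured-output fallback):

```python
def run_tool_loop(call_llm, handlers, max_steps):
    """Drive a tool-calling loop until finish() or the step budget runs out."""
    trace = []
    for _ in range(max_steps):
        tool_name, args = call_llm(trace)  # model chooses the next tool call
        trace.append((tool_name, args))
        if tool_name == "finish":
            return {"outcome": "finished", "trace": trace}
        # Read tools return data to the model; mutate tools append plan ops.
        handlers[tool_name](args)
    return {"outcome": "max_steps", "trace": trace}
```

Both agents reuse this skeleton; only the registered handlers and prompts differ.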

commit_plan — the atomic commit stage

The agentic extraction pipeline separates planning from persistence. During the agent loop, mutation tools (create_user_profile, create_user_playbook, delete_user_profile, delete_user_playbook) don't touch storage — they append PlanOps to ExtractionCtx.plan. Once the agent calls finish (or the loop hits max_steps), commit_plan(ctx, storage, outcome) runs a deterministic pipeline:

  1. Oscillation resolver. Scan the plan for matching create + delete-of-tentative-id pairs and cancel both — the agent changed its mind mid-run, so the net effect is zero writes.
  2. Hard invariants (A / B / D / F / J). Violations drop the offending ops entirely:
    • A — create must follow at least one search in the same run (no ungrounded writes).
    • B — delete target must be a known id (surfaced by a prior search/get, or issued as a tentative_id by a create this run).
    • D — plan size capped at 30 ops.
    • F — no duplicate deletes on the same id.
    • J — all ops scoped to the run's user_id / extractor_name.
  3. Soft invariants (E / H / K). Violations are logged but the ops still apply:
    • E — no duplicate creates with identical content.
    • H — every create has a non-empty source_span.
    • K — deletes without a matching create are allowed (user explicitly asked to remove stored memory).
  4. Apply surviving ops atomically via apply_plan_op — each op dispatched to the right BaseStorage method (add_user_profile, delete_profiles_by_ids, save_user_playbooks, delete_user_playbooks_by_ids).
  5. Return CommitResult(applied, violations, outcome) for the caller to surface as warnings.
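
Steps 2-4 can be condensed into a sketch over dict-shaped ops (simplified assumptions; the real pipeline works on typed PlanOps, runs the oscillation resolver first, and checks the full A/B/D/F/J and E/H/K sets):

```python
def commit_plan(plan, known_ids, searched_this_run, cap=30):
    """Deterministic commit: drop hard violations, log soft ones, keep the rest."""
    applied, violations = [], []
    seen_delete_ids = set()
    if len(plan) > cap:                      # D (hard): plan-size cap
        violations.append(("D", None))
        plan = plan[:cap]
    for op in plan:
        if op["kind"] == "create" and not searched_this_run:
            violations.append(("A", op))     # hard: no ungrounded writes
            continue
        if op["kind"] == "delete":
            if op["id"] not in known_ids:
                violations.append(("B", op)) # hard: unknown delete target
                continue
            if op["id"] in seen_delete_ids:
                violations.append(("F", op)) # hard: duplicate delete
                continue
            seen_delete_ids.add(op["id"])
        if op["kind"] == "create" and not op.get("source_span"):
            violations.append(("H", op))     # soft: logged, op still applies
        applied.append(op)
    return {"applied": applied, "violations": violations}
```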

Why plan+commit instead of direct writes?

  • Atomic cancellation. The agent can create-then-delete-tentative (self-correction) without burning a write-then-delete round-trip on the database.
  • Determinism at commit time. LLM non-determinism is confined to the loop. Once the plan is frozen, what survives is a pure function of the plan + invariants — so two runs with the same plan always commit the same ops.
  • Replaces post-hoc deduplication. The classic path runs ProfileDeduplicator / PlaybookDeduplicator after the orchestrator finishes, re-prompting an LLM to find duplicates. The agentic path folds that discipline into the loop via invariant A (search-before-create) — the agent is required to look before it writes, so duplicates don't enter storage in the first place.
  • Uniform failure recovery. max_steps and error outcomes both go through the same commit path. On error the plan is discarded; on max_steps the plan commits as if the agent had called finish — best-effort partial progress per spec §7.
  • Testable in isolation. Invariants are pure functions of ExtractionCtx. 24 unit tests in test_invariants.py exercise every code / severity / drop-set combination without touching storage or LLMs.
  • Auditability. CommitResult.violations is the single source of truth for "what the agent tried but was prevented from doing" — surfaced to callers as warnings and logged at INFO with the affected op indices.
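
The atomic-cancellation bullet is the oscillation resolver from step 1 of the commit stage. Sketched as a pure function over assumed dict-shaped ops (the tentative_id name follows the PR text):

```python
def resolve_oscillations(plan):
    """Cancel create + delete-of-tentative-id pairs so neither hits storage."""
    tentative_creates = {
        op["tentative_id"]: i
        for i, op in enumerate(plan)
        if op["kind"] == "create" and op.get("tentative_id")
    }
    cancelled = set()
    for i, op in enumerate(plan):
        if op["kind"] == "delete" and op.get("id") in tentative_creates:
            cancelled.add(i)                            # the delete
            cancelled.add(tentative_creates[op["id"]])  # its matching create
    return [op for i, op in enumerate(plan) if i not in cancelled]
```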

Architecture iteration history (Phase A → v2 single-loop)

The first round of agentic landed as the spec described: 6 reader agents per lane + 2 critics + a reconciler for extraction; 6 search agents + 2 synthesizers + reconciler for search. Telemetry from the Group 3 fixtures and early benchmark runs showed that complexity wasn't pulling its weight, so this PR also lands the v2 single-loop refactor that strips the architecture down to one extraction agent and one search agent driven by the same run_tool_loop primitive.

What was removed (commits 5f0bcaf, d35da1d, ffc6a71, 8052df8):

  • 6-reader stack + reader prompts (prompt_bank/profile_reader_*, prompt_bank/playbook_reader_*).
  • critics.py — ProfileCritic, PlaybookCritic, Reconciler, and the _supersede / _merge helpers (subsumed by atomic delete + create tools, gated by the same invariants).
  • 6-search-agent stack + 2-synthesizer-stack from search; replaced with one SearchAgent and an answer-trace harvester.
  • ProfileDeduplicator / PlaybookDeduplicator (replaced by invariant A at commit time).

What replaced them (commits bab6e62, 91319de, 690dccc, e5cc383, 17e08ba):

  • extraction_agent.py::ExtractionAgent.run(window, extractor_config) — one tool-calling loop per (extractor × window).
  • search_agent.py::SearchAgent.run(query) — one tool-calling loop per query, returning a typed SearchResult.
  • agentic_adapter.py::AgenticExtractionRunner rewritten for the single-loop shape; BaseGenerationService.run's per-extractor iteration is preserved (multi-extractor workflows still scale linearly).
  • HandlerBundle promoted to a stable contract for tool handlers; search_answer typed end-to-end.

The v2 design spec lives at docs/superpowers/specs/2026-04-24-agentic-v2-single-loop-design.md; the original Phase-A spec stays in-tree for context.

Prompt + tool tuning iterations

Once the v2 architecture stabilised, prompts and tool-loop budgets were tuned against the LongMemEval golden set. The relevant commits:

  • Tighter max_steps budgets + budget-in-prompt — 0abaa50 caps extraction at 4 and search at 3 by default and threads the budget into the prompts so the agent reasons about its remaining tool calls.
  • Search agent v1.2.0 — verbatim-first — 863df90 rewrites the search prompt to copy dates / IDs verbatim, prefer enumeration of unique names for counts, and inspect the top-10 snippets when the query is temporal or count-shaped. Tuning iterations 7b44ae4, 081628b, 3a6c74d, 38c2565, bd47944.
  • Extraction agent v1.1.0 → v1.4.0 — atomic-fact profiles (c462b64), structured trigger + markdown-bullet content for playbooks (c1d02e6), aggregator v2.1.0 (6d5df02), then iter 1–3 tightening (b4d3560, c73852b, 0a046c5) for temporal markers, countable items, and the pre-create checklist.
  • Self-improvement framing — 93efc71 rewrites both system prompts so the agent explicitly reasons about what changed in storage between turns.
  • Per-entity specialisation — 166418f splits extraction tools and prompts per entity kind so the agent's mental model of "what create_* does to which table" is concrete instead of templated.

The reviewer-facing changelog is the prompt-bank diff plus the LongMemEval-S spot-check report on the enterprise PR.

Logging + observability

  • Per-run summary — ac15700 adds an INFO-level summary at the end of each agentic pipeline run (elapsed wall-clock, total turns, prompt + completion tokens, cost, model). Same shape for extraction and search.
  • Tool-call rendering in llm_io.log — c47bde2, 1fa9091, 3a53729 route tool-call requests and responses through format_messages_for_logging so the round-trip is reconstructible from the log.
  • TZ-aware timestamps — 122490b + 95808ea switch logging to ISO 8601 extended offset (-07:00) and append the local TZ abbreviation when tzdata is available; falls back to +00:00 on minimal containers.
  • Uvicorn access log alignment — 465c4ed wires uvicorn.access through the same formatter so request logs share the same timestamp shape as the agent's llm_io.log.
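
A minimal illustration of that timestamp shape (not the PR's formatter; the tzdata abbreviation suffix and uvicorn wiring are omitted):

```python
import logging
from datetime import datetime

class IsoTzFormatter(logging.Formatter):
    """Render record timestamps as ISO 8601 extended with a numeric offset."""

    def formatTime(self, record, datefmt=None):
        # astimezone() attaches the local zone; isoformat() then emits e.g.
        # 2026-04-23T09:15:02.123-07:00 (or +00:00 when no zone is known).
        local = datetime.fromtimestamp(record.created).astimezone()
        return local.isoformat(timespec="milliseconds")
```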

Profile_id shortening (12-hex)

f8ba5b4 shrinks the profile id from a long opaque slug to 12 hex characters. The motivation is LLM tool-call reliability: with the longer slug, gpt-5-mini occasionally truncated or fat-fingered the id when emitting delete_user_profile(id=...), causing invariant B violations and forcing wasted re-search turns. The 12-hex form is short enough to copy reliably and large enough (a 2^48 ≈ 2.8 × 10^14 namespace) to keep collision risk per user negligible. 5926766 standardises the infinity-TTL sentinel to NEVER_EXPIRES_TIMESTAMP so storage layers don't paper over the change with magic numbers.
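
The negligible-collision claim follows from a standard birthday bound. A quick check, with a per-user profile count that is a hypothetical scale rather than a number from this PR:

```python
# 12 hex chars = 48 bits of id space.
NAMESPACE = 16 ** 12  # == 2 ** 48, about 2.8e14

def collision_probability(n, namespace=NAMESPACE):
    """Birthday-bound approximation of P(any collision) among n random ids."""
    return n * (n - 1) / (2 * namespace)

# Even a (hypothetical) 100k profiles for a single user stays far below 1e-4.
per_user_risk = collision_probability(100_000)
```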

API rename: search_profiles → search_user_profiles

Bundled into this PR (commit 9336495). Aligns the granular profile-search endpoint with the rest of the user_* CRUD surface (get_user_profile, add_user_profile, delete_user_profile, search_user_playbooks, search_agent_playbooks). The previous search_profiles was the only outlier.

Affects: Python client method ReflexioClient.search_user_profiles, lib mixin, HTTP route /api/search_user_profiles, CLI subcommand reflexio user-profiles search-user-profiles, retrieval-latency benchmark, all tests + docs. No backward-compat shim — outright rename per pre-1.0 API discipline.

Bug fixes shipped alongside

  • 9a82308 — wires AgenticSearchService into the public /api/search path (was reachable only via the dispatcher).
  • b3e3986 — request_id now threads through ExtractionCtx so agentic-created profiles + playbooks carry the same provenance as classic.
  • 5ad98e9 — AgenticSearchService.search rejects empty user_id / agent_version at the API boundary instead of silently coercing to empty (would have leaked across users on SqliteStorage).
  • fb8f999 — agentic search handlers now compute the query embedding (was missing for the FTS-only path).
  • a8de04f — apply_plan_op derives expiration_timestamp from the requested TTL (previously left at 0 for non-infinity TTLs).
  • e0e9651 — pre-invariant resolver in commit_plan cancels create + delete-of-tentative-id oscillations.
  • 6763330 — structured-output retries: _make_request parse failures burn a retry attempt instead of returning raw content (mitigates MiniMax JSON flakiness).

CodeRabbit review-loop fixes

f7485b3 addresses the substantive comments from the CodeRabbit review:

  • run_tool_loop: respect max_steps in the fallback path; per-tool latency_ms is now measured from a per-tool clock instead of the turn-start clock (fixes inflated late-tool timings).
  • _format_tool_calls: json.dumps(parsed_args) is wrapped with a repr() fallback so the logging path never raises on non-serialisable arguments.
  • commit_plan: validate DeleteUserPlaybookOp.id before int() and surface a clear TypeError for non-numeric ids.
  • _migrate_field_names: strip null extraction_backend / search_backend so existing stored configs migrate cleanly.
  • Eval suite hardening: JudgeScore floats constrained to [0, 1], golden cases sorted by YAML id (not filename), rubrics aligned with the JudgeScore schema, mixed-TTL fixture uses schema-valid enum values.
  • Test coverage: source_span and notes asserted on positive-path domain-entity tests, known_ids asserted on the search-handler tests, soft-violation case in test_agentic_adapter switched to code E (was B, which is hard).

Test plan

  • 1967 non-e2e tests pass; ruff + pyright clean on prod code.
  • 4/4 Group 3 fixtures pass via eval_runner.run_fixture (includes oscillated_self_correction exercising create + delete-of-tentative-id cancellation).
  • test_agentic_v2_e2e verifies the full GenerationService.run flow (pre-filter → per-extractor iteration → ExtractionAgent → commit → PlaybookAggregator trigger).
  • Structured-output parse retry covered by test_litellm_client_unit.py::TestStructuredOutputRetry (malformed-JSON burns a retry; final attempt raises LiteLLMClientError).
  • All CodeRabbit review-loop fixes covered by their corresponding unit tests.
  • Real-LLM Group 1/2 eval run (requires API keys; deferred to nightly CI).
  • LongMemEval-S oracle full-run (in flight on the enterprise PR).

Out of scope

  • Default-flip from classic to agentic — manual decision gated on golden-set eval.
  • Rerun-on-agentic wiring — output_pending_status currently only flows through classic services.


coderabbitai Bot commented Apr 23, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Adds an agentic-v2 tool-calling pipeline for extraction and search (tool registry, tool-loop, agents, plan/invariant system), extends LLM client for tool-calls and cost accounting, adds extraction metadata fields and DB migration, removes legacy deduplicators, updates config/type-checking, and adds extensive eval fixtures and tests.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Configuration & Type-Checking<br/>pyproject.toml, pyrightconfig.json | Add polars to dev deps; add Pyright settings (include/exclude and reportMissingImports="warning"). |
| Config Schema<br/>reflexio/models/config_schema.py | Add extraction_backend and search_backend fields (Literal["classic","agentic"]) defaulting to "classic". |
| LLM Client & Model Defaults<br/>reflexio/server/llm/litellm_client.py, reflexio/server/llm/model_defaults.py, reflexio/server/llm/__init__.py | Introduce ToolCallingChatResponse and StructuredOutputParseError; propagate tools, tool_choice, model_role; compute/log USD cost; add EXTRACTION_AGENT/SEARCH_AGENT roles and provider defaults; re-export ToolCallingChatResponse. |
| Tool-Calling Core<br/>reflexio/server/llm/tools.py | New tool system: Tool, ToolRegistry, OpenAI-compatible tool specs, supports_tool_calling(), run_tool_loop() with native multi-turn tool-calling and structured-output fallback; models for turns/traces/results. |
| Agentic Extraction: Plan, Tools, Invariants<br/>reflexio/server/services/extraction/plan.py, .../invariants.py, .../tools.py | Add plan/op models, ExtractionCtx, Violation/CommitResult, oscillation resolution, commit logic enforcing hard/soft invariants, atomic tool handlers, apply_plan_op, and tool registries. |
| Agent Runners & Adapter<br/>reflexio/server/services/extraction/extraction_agent.py, .../agentic_adapter.py | New ExtractionAgent and AgenticExtractionRunner to run tool-loop extraction, integrate with publish flow, handle warnings/aggregation. |
| Agentic Search & Service<br/>reflexio/server/services/search/search_agent.py, .../agentic_search_service.py, .../plan.py | Add SearchAgent, AgenticSearchService, SearchResult model; harvest IDs from tool-loop trace and fetch/order results in-memory. |
| Service Dispatch & Factories<br/>reflexio/server/services/generation_service.py, reflexio/server/services/unified_search_service.py | Add build_extraction_service and build_search_service; introduce UnifiedSearchService wrapper; dispatch by Config. |
| Logging / Service Utils<br/>reflexio/server/services/service_utils.py, reflexio/server/uvicorn_logging.py, reflexio/server/__init__.py | TZ-aware log formatter, improved LLM I/O formatting to render tool_calls, use AccessFormatter for uvicorn access logs, and helper to format tool-calls for logging. |
| Data Model & Storage Schema<br/>reflexio/models/api_schema/domain/entities.py, .../retriever_schema.py, .../playbook_service_utils.py, .../profile_generation_service_utils.py, reflexio/server/services/storage/sqlite_storage/* | Add optional source_span, notes, reader_angle to profiles/playbooks/structured content; add agent_answer to UnifiedSearchResponse; extend SQLite DDL/mappers and add migration _migrate_agentic_signals() to backfill columns. |
| Deduplication Removal<br/>reflexio/server/services/profile/profile_deduplicator.py, reflexio/server/services/playbook/playbook_deduplicator.py, related callers/tests | Remove profile and playbook deduplicator modules and tests; remove dedup invocation from generation services and helper wrapper is_deduplicator_enabled. |
| Feature Flags & Registry<br/>reflexio/server/site_var/feature_flags.py, reflexio/test_support/llm_model_registry.py | Remove is_deduplicator_enabled() convenience; delete deduplication entries from test model registry. |
| Prompts<br/>reflexio/server/prompt/prompt_bank/... | Add extraction-agent and search-agent prompt templates (multiple versions); delete legacy playbook/profile deduplication prompts. |
| CLI & Misc<br/>reflexio/cli/commands/setup_cmd.py, reflexio/cli/run_services.py, reflexio/lib/_base.py | Adjust Claude Code setup to manage Stop hook and remove expert-only command directory on non-expert installs; small formatting changes. |
| Tests & Eval Framework<br/>tests/eval/*, many tests/server/services/extraction/*, tests/server/services/search/*, etc. | Large addition of eval fixtures (many JSON/YAML), Polars aggregator, judge harness, extensive unit/integration tests for tool-loop, agents, plan/invariants, tools, storage migrations; add test helpers and adjust many tests to new schemas/behaviors; remove dedup-related tests. |
| Test Helpers & Fixtures<br/>reflexio/test_support/llm_mock.py, reflexio/test_support/llm_model_registry.py, tests/conftest.py | Add helpers for crafting tool-call LiteLLM responses and new pytest fixtures for tool-call sequences; cleanup model registry test entries. |

Sequence Diagram(s)

sequenceDiagram
  participant Client as LiteLLMClient
  participant LLM as LiteLLM (provider)
  participant Runner as run_tool_loop
  participant Registry as ToolRegistry
  participant Storage as Storage

  Client->>Runner: run_tool_loop(messages, tools, model_role)
  Runner->>LLM: generate_chat_response(..., tools, tool_choice="auto", model=resolved)
  LLM-->>Runner: response (choices with tool_calls or content)
  alt native tool-calling (tool_calls present)
    Runner->>Registry: parse tool_calls, handle(name,args_json,ctx)
    Registry->>Runner: handler result and ctx updates
    Runner->>Storage: apply_plan_op or fetch entities as needed
    Runner->>LLM: append assistant messages of tool results -> continue loop or finish
  else fallback (no native support)
    Runner->>LLM: generate_chat_response(response_format=fallback_schema)
    LLM-->>Runner: parsed structured schema -> synthetic tool calls
    Runner->>Registry: dispatch synthetic calls to fallback tool
    Registry->>Storage: persist or fetch results
  end
  Runner-->>Client: ToolLoopResult / ToolCallingChatResponse / SearchResult

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

Poem

🐰 I hopped into the code today,
Tools now chatter, plans at play,
Old dedup tricks quietly stray,
Agents fetch and store the way —
Hooray for hops in bytes and hay! 🥕



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 10

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
reflexio/server/services/profile/profile_generation_service_utils.py (1)

75-105: ⚠️ Potential issue | 🟡 Minor

Update ProfileAddItem docstring to match the expanded schema.

The attributes section is now stale after adding source_span, notes, and reader_angle.

Suggested doc fix
     Attributes:
         content (str): The profile content based on content definition
         time_to_live (str): Time to live for the profile - one of: 'one_day', 'one_week', 'one_month', 'one_quarter', 'one_year', 'infinity'
         metadata (str, optional): Metadata extracted for the profile based on metadata definition
+        source_span (str, optional): Verbatim evidence span supporting this profile item
+        notes (str, optional): Extraction caveats/confidence/alternatives
+        reader_angle (str, optional): Reader perspective that surfaced this item
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@reflexio/server/services/profile/profile_generation_service_utils.py` around
lines 75 - 105, Update the ProfileAddItem class docstring so the Attributes
section reflects the expanded schema: add entries for source_span (str | None)
with description "Verbatim excerpt from the source that most directly supports
this profile item", notes (str | None) with description "Free-form extraction
notes — confidence, caveats, or alternative readings", and reader_angle (str |
None) with description "The extraction perspective or reader role that surfaced
this item"; also ensure the existing time_to_live entry lists the current
literal options ("one_day", "one_week", "one_month", "one_quarter", "one_year",
"infinity") and keep content, metadata entries matching their Field descriptions
in the class.
🧹 Nitpick comments (27)
tests/eval/golden_set/extraction/polars_vs_pandas.yaml (1)

14-18: Tighten source-span evidence for the comparative preference.

At Line 17, requiring only "polars" is a bit permissive for a profile that asserts “over pandas for new work.” Consider anchoring to a contrastive phrase so extractors don’t pass with a weaker “user uses polars” span.

♻️ Proposed fixture tightening
-    must_include_in_source_span: "polars"
+    must_include_in_source_span: "don't want agents to suggest pandas for new code"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/eval/golden_set/extraction/polars_vs_pandas.yaml` around lines 14 - 18,
The fixture's evidence requirement is too weak—update the second item's
must_include_in_source_span (the YAML key in the entry whose content starts
"User's team standardized on polars last quarter.") to require a contrastive
phrase such as "polars over pandas" or "polars over pandas for new work" instead
of just "polars" so extractors must capture the comparative preference.
reflexio/server/prompt/prompt_bank/profile_reader_facts/v1.0.0.prompt.md (1)

11-11: Minor wording polish: hyphenate the compound adjective.

Use “third-person” for grammatical clarity in the instruction text.

✏️ Suggested edit
-  - content: one-sentence statement of the fact, written in third person
+  - content: one-sentence statement of the fact, written in third-person
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@reflexio/server/prompt/prompt_bank/profile_reader_facts/v1.0.0.prompt.md` at
line 11, The prompt instruction text currently reads "one-sentence statement of
the fact, written in third person"; update that phrase to hyphenate the compound
adjective as "written in third-person" for grammatical clarity—edit the content
string in prompt_bank/profile_reader_facts/v1.0.0.prompt.md (the line containing
"one-sentence statement of the fact, written in third person") to use
"third-person".
reflexio/server/services/unified_search_service.py (1)

279-297: Class contract is unclear: it doesn’t currently wrap execution.

The docstring says this wraps run_unified_search, but the class currently only stores dependencies. Consider either adding a thin forwarding method or adjusting the docstring to avoid implying executable behavior.

💡 Minimal docstring-only correction
 class UnifiedSearchService:
-    """Class handle for the classic unified search pipeline.
+    """Dependency handle for the classic unified search pipeline.
 
-    Wraps :func:`run_unified_search` so the dispatcher factory can return an
-    object whose ``__class__.__name__`` can be inspected uniformly alongside
-    the agentic search service (Phase 4).
+    Used by dispatcher code that needs a concrete class instance for
+    backend identity/selection alongside the agentic search service.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@reflexio/server/services/unified_search_service.py` around lines 279 - 297,
The class docstring claims UnifiedSearchService "wraps run_unified_search" but
the class only stores dependencies; add a thin forwarding method (e.g., run or
execute) on UnifiedSearchService that calls the top-level run_unified_search
function using the stored self.llm_client and self.request_context (and forwards
any parameters/kwargs) and returns its result, or alternatively update the
docstring to remove the "wraps run_unified_search" line; reference the
UnifiedSearchService class, its __init__, and the run_unified_search function
when making the change.
tests/server/services/playbook/test_structured_playbook_content.py (1)

15-25: Assert notes in the optional-fields test.

notes is provided but not asserted, so a regression on that field could pass unnoticed.

Suggested test fix
 def test_structured_playbook_content_accepts_optional_fields() -> None:
     c = StructuredPlaybookContent(
         trigger="t",
         content="c",
         rationale="r",
         source_span="quote",
         notes="confidence=0.9",
         reader_angle="trigger",
     )
     assert c.source_span == "quote"
+    assert c.notes == "confidence=0.9"
     assert c.reader_angle == "trigger"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/server/services/playbook/test_structured_playbook_content.py` around
lines 15-25: The test test_structured_playbook_content_accepts_optional_fields
creates a StructuredPlaybookContent with notes="confidence=0.9" but never
asserts that field; add an assertion to verify the notes field (e.g., assert
c.notes == "confidence=0.9") in the test to ensure the optional notes property
is checked and protected against regressions.
tests/eval/test_agentic_vs_classic_search_integration.py (1)

22-23: Tighten score assertions to enforce normalized bounds.

Use a full [0.0, 1.0] range check instead of only lower-bound checks.

Suggested test hardening
-    assert c_score.answer_correctness >= 0.0
-    assert a_score.answer_correctness >= 0.0
+    assert 0.0 <= c_score.answer_correctness <= 1.0
+    assert 0.0 <= a_score.answer_correctness <= 1.0
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/eval/test_agentic_vs_classic_search_integration.py` around lines 22 -
23: The assertions for c_score.answer_correctness and a_score.answer_correctness
only check a lower bound; update the test in
test_agentic_vs_classic_search_integration.py to assert both lower and upper
bounds so each score is within [0.0, 1.0] (e.g., replace the two lines
referencing c_score.answer_correctness and a_score.answer_correctness with range
checks that ensure >= 0.0 and <= 1.0). Ensure both the c_score and a_score
variables use the tightened assertions so the test enforces normalized score
bounds.
reflexio/server/prompt/prompt_bank/profile_critic/v1.0.0.prompt.md (1)

9-16: Add an explicit “exactly one action per candidate” rule.

Lines 9-16 list valid actions, but a strict one-action constraint will reduce ambiguous/duplicate tool calls in critic loops.

🛠️ Prompt hardening snippet
 You are a profile critic. Three angle readers (facts / context / temporal) produced
 the candidate profile items below. You must decide, for each one, whether to:
+Emit exactly one action per candidate.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@reflexio/server/prompt/prompt_bank/profile_critic/v1.0.0.prompt.md` around
lines 9-16, add an explicit rule to the action list that each profile
candidate must have exactly one chosen action to avoid ambiguous or duplicate
decisions; update the section that currently lists `accept`, `refine`, `reject`,
and `flag_cross_entity_conflict` to include a one-line rule such as "Choose
exactly one action per candidate — do not select multiple actions" and, where
`refine` is described, clarify that refinement implies final acceptance so it
counts as the single action; ensure the new wording appears alongside the
existing bullet list so reviewers and downstream critic loops enforce a single
action invariant.
reflexio/server/prompt/prompt_bank/profile_search_temporal/v1.0.0.prompt.md (1)

11-14: Make temporal retrieval size explicit for deterministic behavior.

Line 11 omits top_k, so retrieval behavior depends on backend defaults and may drift between providers/configs.

♻️ Suggested prompt tweak
-Use `search_profiles(respect_ttl=false)` to include expired items — they may
+Use `search_profiles(query, top_k=10, respect_ttl=false)` to include expired items — they may
 be the PREVIOUS state of something the query is asking about. Then
 `submit_candidates`, flagging in `why` whether the item is current vs
 superseded.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@reflexio/server/prompt/prompt_bank/profile_search_temporal/v1.0.0.prompt.md`
around lines 11-14: The prompt currently calls
search_profiles(respect_ttl=false) without specifying top_k, making retrieval
size nondeterministic; update the prompt text to call
search_profiles(respect_ttl=false, top_k=<N>) (or a named variable like
top_k_temporal) so the retrieval count is explicit and deterministic, and ensure
any downstream references (e.g., submit_candidates / why) assume that same top_k
value.
tests/eval/test_agentic_vs_classic_extraction_integration.py (1)

16-25: This test currently verifies judge plumbing, not backend comparison.

With identical stubs on lines 18-23, it can't catch extraction quality regressions between classic and agentic, despite the test's current name.

🧪 Suggested clarity + tighter assertions
-def test_agentic_vs_classic_extraction(extraction_case, extraction_judge):
-    """For each golden case, the stubbed judge returns a parseable score."""
+def test_extraction_judge_harness_smoke(extraction_case, extraction_judge):
+    """For each golden case, judge output is parseable (no backend comparison)."""
@@
-    assert c_score.signal_f1 >= 0.0
-    assert a_score.signal_f1 >= 0.0
+    assert 0.0 <= c_score.signal_f1 <= 1.0
+    assert 0.0 <= a_score.signal_f1 <= 1.0
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/eval/test_agentic_vs_classic_extraction_integration.py` around lines 16
- 25: The test test_agentic_vs_classic_extraction currently uses identical
stubbed outputs (classic_out and agentic_out) so it only validates judge
plumbing; replace those empty identical stubs with real extraction outputs or
distinct fixtures produced by the classic and agentic extraction paths (or call
the actual extractor functions) and then assert a meaningful comparison using
extraction_judge.score (e.g., compute c_score and a_score via
extraction_judge.score(expected=extraction_case, actual=<classic_output>) and
extraction_judge.score(expected=extraction_case, actual=<agentic_output>) and
assert a tighter condition such as a_score.signal_f1 >= c_score.signal_f1 or
that the difference is within an acceptable delta) to catch regressions between
the two backends.
reflexio/server/services/storage/sqlite_storage/_base.py (1)

852-868: Guard migration DDL with the storage lock.

Lines 859-868 run PRAGMA/ALTER TABLE/commit without self._lock. Wrapping this block keeps migration behavior consistent with the rest of this class's thread-safety pattern.

Suggested patch
 def _migrate_agentic_signals(self) -> None:
@@
-        for table in ("profiles", "user_playbooks"):
-            cols = {
-                row["name"]
-                for row in self.conn.execute(f"PRAGMA table_info({table})").fetchall()
-            }
-            for col in ("source_span", "notes", "reader_angle"):
-                if col not in cols:
-                    self.conn.execute(f"ALTER TABLE {table} ADD COLUMN {col} TEXT")  # noqa: S608
-                    logger.info("Added %s column to %s", col, table)
-        self.conn.commit()
+        with self._lock:
+            for table in ("profiles", "user_playbooks"):
+                cols = {
+                    row["name"]
+                    for row in self.conn.execute(f"PRAGMA table_info({table})").fetchall()
+                }
+                for col in ("source_span", "notes", "reader_angle"):
+                    if col not in cols:
+                        self.conn.execute(f"ALTER TABLE {table} ADD COLUMN {col} TEXT")  # noqa: S608
+                        logger.info("Added %s column to %s", col, table)
+            self.conn.commit()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@reflexio/server/services/storage/sqlite_storage/_base.py` around lines 852 -
868: The _migrate_agentic_signals method executes PRAGMA/ALTER TABLE/commit
without using the class storage lock, which breaks the class's thread-safety
pattern; wrap the whole migration loop and commit in the existing self._lock
(use the same lock acquisition method used elsewhere in this class, e.g., with
self._lock:) so PRAGMA/ALTER/commit are executed while holding self._lock to
serialize schema changes and database access in _migrate_agentic_signals.
tests/eval/aggregate.py (1)

26-39: Make aggregated output order deterministic.

Add a final sort by backend so weekly reports and diffs are stable run-to-run.

Suggested patch
     return (
         pl.scan_parquet(results_path)
         .group_by("backend")
         .agg(
@@
         )
+        .sort("backend")
         .collect()
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/eval/aggregate.py` around lines 26-39: The aggregated DataFrame
currently returned from the pipeline can vary order between runs; update the
pipeline in tests/eval/aggregate.py so the result is deterministically ordered
by adding a sort on the "backend" column before calling .collect() (i.e., after
the .agg(...) step). Modify the return chain that builds the scan_parquet ->
group_by("backend") -> .agg(...) sequence to include a .sort("backend") prior to
.collect() so weekly reports and diffs are stable.
reflexio/server/prompt/prompt_bank/playbook_reader_trigger/v1.0.0.prompt.md (1)

13-17: Avoid natural-language fallback content for missing behavior.

Using "defer to other rules" as content can look like a real instruction downstream. Prefer a neutral sentinel plus explicit note tag so it’s easier to filter.

Suggested patch
-  - content: the behaviour if stated; else "defer to other rules"
+  - content: the behaviour if stated; else "[UNSPECIFIED_BEHAVIOR]"
@@
-  - notes: confidence and trigger-type tag ("event", "threshold", "keyword")
+  - notes: confidence, trigger-type tag ("event", "threshold", "keyword"),
+    and `behavior_unspecified=true` when content is unspecified
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@reflexio/server/prompt/prompt_bank/playbook_reader_trigger/v1.0.0.prompt.md`
around lines 13-17: The current playbook fields use the natural-language
fallback "defer to other rules" for the content key which can be misinterpreted
downstream; update the prompt schema entries (look for the keys content,
rationale, source_span, notes, reader_angle in v1.0.0.prompt.md) to replace that
phrase with a neutral sentinel token (e.g., "<NO_BEHAVIOR_SPECIFIED>") and add
an explicit note tag in notes (e.g., "note: no-behavior-specified") so consumers
can reliably filter missing behavior without treating the text as an actionable
instruction.
reflexio/server/prompt/prompt_bank/profile_synthesizer/v1.0.0.prompt.md (1)

13-20: Make the tool-call termination contract explicit.

Add a hard rule that rank(ordered_ids) must be called exactly once before finish, and IDs must come only from provided candidates, to reduce loop/output ambiguity.

Suggested patch
 Tools:
   - `rank(ordered_ids)` — emit the final ordered ID list
   - `drop(id, reason)` — exclude a candidate
   - `flag_cross_entity_conflict(id, reason)` — flag contradictions with
     the playbook lane
   - `finish`
 
 Use the `other_lane` summary only for cross-checking coherence.
+You must call `rank(...)` exactly once before `finish`.
+Only include IDs that appear in `CANDIDATES`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@reflexio/server/prompt/prompt_bank/profile_synthesizer/v1.0.0.prompt.md`
around lines 13-20: The prompt must enforce a strict tool-call termination
contract: require that rank(ordered_ids) is invoked exactly once before finish
and that ordered_ids are drawn only from the provided candidate set (no new or
transformed IDs), disallow additional rank calls or other tool calls that change
the candidate set after rank, and ensure finish() is called only after that
single rank call; update the prompt language around the tools (rank, drop,
flag_cross_entity_conflict, finish) to state these rules explicitly so the model
cannot call rank multiple times, cannot call finish before rank, and must not
emit IDs that aren't in the provided candidates.
tests/eval/conftest.py (1)

25-29: Docstring inconsistency: sorted by filename, not id.

The docstring says "sorted by id" but the code sorts by filename via sorted((_GOLDEN / kind).glob("*.yaml")). If files are named to match their IDs (e.g., mixed_ttl.yaml), this works, but the comment should reflect the actual behavior.

📝 Suggested fix
 def _load(kind: str) -> list[dict[str, Any]]:
-    """Load every YAML golden file under ``golden_set/<kind>/`` sorted by id."""
+    """Load every YAML golden file under ``golden_set/<kind>/`` sorted by filename."""
     return [
         yaml.safe_load(p.read_text()) for p in sorted((_GOLDEN / kind).glob("*.yaml"))
     ]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/eval/conftest.py` around lines 25-29: The docstring for _load is
incorrect: it claims results are "sorted by id" while the implementation sorts
the file paths lexicographically; update the _load function's docstring to state
that YAML files under golden_set/<kind>/ are loaded and returned in
filename-sorted (lexicographic) order, and optionally note that filenames should
correspond to IDs if ordering by ID is required; keep the description concise
and reference the existing function name _load.
tests/eval/judge.py (1)

62-66: Template rendering edge case.

Using str.replace() for template substitution is simple but could behave unexpectedly if expected or actual contain the literal strings "{expected}" or "{actual}". For eval test harness purposes with controlled golden cases, this is likely fine, but worth noting.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/eval/judge.py` around lines 62-66: The current substitution on
self.rubric["prompt"] uses chained str.replace on the prompt variable which can
mis-substitute if the values contain literal "{expected}" or "{actual}"; change
the rendering to use a safe templating approach (e.g., Python's
str.format/format_map with explicit placeholders or
string.Template.safe_substitute) so you pass expected and actual as named fields
instead of doing raw replace — update the code that builds prompt (the prompt
variable / self.rubric["prompt"] usage) to use one of these safe template
methods and pass {"expected": expected, "actual": actual}.
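A sketch of the safer rendering with `str.format` follows; the `TEMPLATE` literal is a hypothetical stand-in for `self.rubric["prompt"]`.

```python
# Hypothetical stand-in for self.rubric["prompt"].
TEMPLATE = "Expected:\n{expected}\n\nActual:\n{actual}"


def render(expected: str, actual: str) -> str:
    # str.format fills each placeholder exactly once; brace-like text
    # inside the substituted values is never re-scanned. With chained
    # .replace() calls, an `expected` value containing the literal
    # "{actual}" would be clobbered by the second replace.
    return TEMPLATE.format(expected=expected, actual=actual)


out = render("value containing {actual} literally", "REAL")
```

`string.Template.safe_substitute` is the equivalent choice when unknown placeholders should survive silently rather than raise `KeyError`.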
reflexio/server/services/generation_service.py (1)

396-397: Broaden dispatcher return contracts to match actual runtime branches.

Lines 396 and 434 currently advertise concrete classic services, but both functions can return agentic implementations. Use a shared protocol (or explicit union) so callers/type-checkers see the real contract.

♻️ Proposed refactor
-from typing import TYPE_CHECKING
+from typing import TYPE_CHECKING, Any, Protocol
@@
 if TYPE_CHECKING:
     from reflexio.server.services.unified_search_service import UnifiedSearchService
+
+
+class ExtractionServiceLike(Protocol):
+    def run(self, request: Any) -> Any: ...
+
+
+class SearchServiceLike(Protocol):
+    def search(self, request: Any) -> Any: ...
@@
 def build_extraction_service(
@@
-) -> ProfileGenerationService:
+) -> ExtractionServiceLike:
@@
 def build_search_service(
@@
-) -> UnifiedSearchService:
+) -> SearchServiceLike:

Also applies to: 434-435

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@reflexio/server/services/generation_service.py` around lines 396-397: The
dispatcher functions in generation_service.py currently declare a concrete
return type of ProfileGenerationService but actually may return agentic
implementations; update their return annotations to reflect the true contract by
using a shared Protocol (e.g., define an IProfileGenerationService protocol
covering the methods used by callers) or an explicit Union (e.g.,
Union[ClassicProfileGenerationService, AgenticProfileGenerationService]) and
change the two dispatcher function signatures to return that Protocol/Union
instead of the concrete ProfileGenerationService; ensure imports are added and
update any type hints or docstrings referencing the old concrete type
accordingly.
tests/server/services/test_agentic_backend_pipeline_integration.py (1)

81-87: Minor: Response count comment is slightly misleading.

The comment says "6 readers finish + 2 critics finish = 8 LLM calls" but 10 responses are provided. While providing extras is fine for robustness, the comment could be clearer. Similarly, the search response math: 6 agent submissions + 2×(rank + finish) = 10 calls, which matches the code.

Consider updating the comment to match the actual expectation or explain why extras are provided.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/server/services/test_agentic_backend_pipeline_integration.py` around
lines 81-87, update the misleading inline comments near extract_responses and
search_responses to accurately describe the number of mocked LLM calls produced
by make_tc: for extract_responses note that although 8 calls are expected (6
readers + 2 critics), 10 responses are supplied as extras for robustness, and
for search_responses clarify that the sequence produces 10 calls (6
submit_candidates + 2×(rank + finish)); adjust the comment text next to the
make_tc, extract_responses, and search_responses variables so the math matches
the actual arrays or explicitly state that extras are intentionally included.
reflexio/server/services/extraction/agentic_extraction_service.py (2)

200-210: Blind Exception catch is acceptable here but could be narrowed.

The _safe_result helper intentionally catches all exceptions to prevent one reader failure from killing extraction. This is a reasonable resilience pattern. The static analysis warning (BLE001) can be suppressed with a comment if desired.

For completeness, consider catching (TimeoutError, RuntimeError, ValueError) to avoid masking truly unexpected errors during development, or add a # noqa: BLE001 comment if the broad catch is intentional.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@reflexio/server/services/extraction/agentic_extraction_service.py` around
lines 200-210: The _safe_result helper currently catches all exceptions, which
triggers static-analysis BLE001; either narrow the except to expected failures
(e.g., except (concurrent.futures.TimeoutError, RuntimeError, ValueError) as e)
so truly unexpected errors still surface, or explicitly mark the broad catch as
intentional by adding a suppression comment (e.g., "# noqa: BLE001") above the
except block; update the logger.warning call in _safe_result to still log the
caught exception variable (e) so debugging info is preserved.

174-188: Critics run sequentially despite critic_workers parameter.

The constructor accepts critic_workers (line 97) and stores it (line 104), but the critics are invoked sequentially in _run_profile_critic and _run_playbook_critic rather than in parallel. This differs from the search service's synthesizer parallelization pattern.

If parallelization is intended, wrap the critic calls in a ThreadPoolExecutor. If sequential execution is intentional (e.g., to simplify debugging or due to cross-lane dependencies), consider removing the unused critic_workers parameter or documenting why it's reserved for future use.

Option 1: Parallelize critics (if intended)
+    def _run_critics(
+        self,
+        profile_cands: list[Any],
+        playbook_cands: list[Any],
+    ) -> tuple[list[VettedProfile], list[CrossEntityFlag], list[VettedPlaybook], list[CrossEntityFlag]]:
+        executor = ThreadPoolExecutor(max_workers=self._critic_workers)
+        try:
+            profile_fut = executor.submit(
+                self._run_profile_critic, profile_cands, playbook_cands
+            )
+            playbook_fut = executor.submit(
+                self._run_playbook_critic, playbook_cands, profile_cands
+            )
+            vetted_profiles, profile_flags = profile_fut.result(timeout=30.0)
+            vetted_playbooks, playbook_flags = playbook_fut.result(timeout=30.0)
+        finally:
+            executor.shutdown(wait=False, cancel_futures=True)
+        return vetted_profiles, profile_flags, vetted_playbooks, playbook_flags
Option 2: Remove unused parameter (if sequential is intentional)
     def __init__(
         self,
         *,
         llm_client: LiteLLMClient,
         request_context: RequestContext,
         reader_workers: int = 6,
-        critic_workers: int = 2,
     ) -> None:
         self.client = llm_client
         self.request_context = request_context
         self.storage = request_context.storage
         self.prompt_manager = request_context.prompt_manager
         self._reader_workers = min(reader_workers, 6)
-        self._critic_workers = min(critic_workers, 2)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@reflexio/server/services/extraction/agentic_extraction_service.py` around
lines 174-188: The critics are currently invoked sequentially in
_run_profile_critic and _run_playbook_critic despite the constructor storing
critic_workers; change to run the two critic.review calls in parallel using
ThreadPoolExecutor(max_workers=self.critic_workers) so both ProfileCritic.review
and PlaybookCritic.review run concurrently—create the two critic instances
(ProfileCritic and PlaybookCritic), submit their .review(...) calls (note you
must still call summarize(playbook_cands) for the profile critic and
summarize(profile_cands) for the playbook critic), collect future.result() for
each, and propagate/handle exceptions as before; alternatively, if sequential
behavior is intentional, remove the critic_workers parameter and its storage
from the constructor instead of adding parallelism.
reflexio/server/services/search/agentic_search_service.py (4)

100-107: TODO placeholder for reconciliation.

The cross-entity flags are currently only logged rather than being used to drive reconciliation. The TODO is noted, but ensure this is tracked for Phase 6+ completion.

Would you like me to open an issue to track the search reconciliation implementation?

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@reflexio/server/services/search/agentic_search_service.py` around lines 100 -
107: The code currently collects cross-entity flags into all_flags (from p_flags
+ b_flags) and only logs them instead of invoking reconciliation; replace the
TODO by wiring these flags into the Reconciler path: pass all_flags (or derived
lanes) into the Reconciler call inside the AgenticSearchService (or whichever
function constructs reconciliation) instead of using empty lanes that cause
out-of-range errors, and ensure you handle empty/all_flags cases and surface
errors via logger.error; keep the existing logger.info line for visibility but
add the Reconciler invocation using the same all_flags variable and add
appropriate validation/try-except around Reconciler usage.

211-216: Verify cross-lane summary variable naming.

The variable naming here is slightly confusing but appears correct on close inspection:

  • playbook_other_lane = summary of profile hits (what playbook synthesizer sees as "other lane")
  • profile_other_lane = summary of playbook hits (what profile synthesizer sees as "other lane")

Consider renaming for clarity: profile_synth_sees_playbooks / playbook_synth_sees_profiles or similar, to make the data flow more obvious.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@reflexio/server/services/search/agentic_search_service.py` around lines 211 -
216: The two variables playbook_other_lane and profile_other_lane are correct but
easy to misread (playbook_other_lane holds a summary of profile hits and
profile_other_lane holds playbook hits); rename them to clearer identifiers
(e.g., playbook_synth_sees_profiles and profile_synth_sees_playbooks) and update
all references where summarize is called with profile_batches and
playbook_batches respectively so the names reflect the data flow (symbols to
change: playbook_other_lane, profile_other_lane, summarize usage,
profile_batches, playbook_batches).
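A minimal sketch of the suggested rename; `summarize` and the batch lists are stand-ins for the service's real objects.

```python
def summarize(batches: list[str]) -> str:
    # Stand-in for the service's lane summarizer.
    return " | ".join(batches)


profile_batches = ["profile hit A"]
playbook_batches = ["playbook hit B"]

# Each synthesizer cross-checks against the OTHER lane; the new names
# encode the reader rather than the source lane, so the crossed
# assignment below no longer looks like a bug.
profile_synth_sees_playbooks = summarize(playbook_batches)
playbook_synth_sees_profiles = summarize(profile_batches)
```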

267-285: Potential silent data loss if ID mapping fails.

In _assemble_ranked, if a ranked ID doesn't exist in the id_to_profile/id_to_playbook dicts, it's silently skipped (lines 286-287). This is likely intentional for resilience, but consider logging a warning when ranked IDs can't be resolved, as it may indicate a bug in the agent/synthesizer pipeline.

Add logging for unmapped IDs
+        missing_profiles = [i for i in p_ids if i not in id_to_profile]
+        missing_playbooks = [i for i in b_ids if i not in id_to_playbook]
+        if missing_profiles:
+            logger.warning("ranked profile IDs not found in hits: %s", missing_profiles)
+        if missing_playbooks:
+            logger.warning("ranked playbook IDs not found in hits: %s", missing_playbooks)
         ranked_profiles = [id_to_profile[i] for i in p_ids if i in id_to_profile]
         ranked_playbooks = [id_to_playbook[i] for i in b_ids if i in id_to_playbook]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@reflexio/server/services/search/agentic_search_service.py` around lines 267 -
285: In _assemble_ranked, when resolving ranked IDs against id_to_profile and
id_to_playbook, add a warning log for any ranked ID that cannot be found instead
of silently skipping it: detect missing keys when looking up id_to_profile and
id_to_playbook inside the _assemble_ranked flow and call the service logger
(e.g., self.logger.warn or processLogger.warn used in this module) with a short
message including the missing id and context (e.g., "missing profile for id
{id}" or "missing playbook for id {id}") so operators can trace pipeline issues;
keep behavior unchanged otherwise.

199-202: Consider catching more specific exceptions for agent failures.

The blind Exception catch here is intentional for fault tolerance (preventing one agent failure from killing the pipeline), but it may mask programming errors. Consider catching (TimeoutError, FuturesTimeoutError, RuntimeError) or similar known failure modes, and letting truly unexpected exceptions bubble up during development.

That said, for a production-grade resilient system, the current approach is defensible. This is a low-priority suggestion.

Possible refinement
-            except Exception as e:
+            except (FuturesTimeoutError, RuntimeError, ValueError) as e:
                 logger.warning("search agent failed: %s: %s", type(e).__name__, e)
                 partial = True

Or keep Exception but add explicit re-raise for certain types you want to surface.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@reflexio/server/services/search/agentic_search_service.py` around lines 199 -
202: The current except Exception block around the agent loop (which logs via
logger.warning and sets partial before returning batches, partial) is too broad;
change it to catch only expected failure modes such as TimeoutError,
concurrent.futures.TimeoutError (or FuturesTimeoutError), and RuntimeError so
programming errors can surface, or alternatively keep Exception but explicitly
re-raise the exception types you want surfaced (note that KeyboardInterrupt and
SystemExit already bypass except Exception, since they subclass BaseException
directly) before logging and setting partial; update the
except clause that wraps logger.warning and the partial assignment accordingly
and reference logger.warning, partial, and the surrounding agent processing loop
(the code that returns batches, partial).
reflexio/server/llm/model_defaults.py (1)

219-275: Several providers lack agentic role mappings.

The gemini, deepseek, openrouter, minimax, dashscope, xai, moonshot, and zai providers don't define angle_reader, critic, synthesizer, or reconciler model mappings. While the fallthrough logic in _auto_detect_model handles this gracefully (skipping to the next provider), users relying solely on these providers will get a RuntimeError when attempting to use agentic pipelines.

If this is intentional (requiring anthropic/openai for agentic features), consider adding a comment or logging when agentic roles are unavailable for the detected provider. Otherwise, consider adding reasonable defaults for these providers.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@reflexio/server/llm/model_defaults.py` around lines 219-275: The
ProviderDefaults entries for providers "gemini", "deepseek", "openrouter",
"minimax", "dashscope", "xai", "moonshot", and "zai" are missing agentic role
mappings (angle_reader, critic, synthesizer, reconciler), causing RuntimeError
in agentic pipelines when _auto_detect_model falls back; update each
ProviderDefaults instance in model_defaults.py to either include reasonable
default model strings for those four agentic roles or add a clear comment/log
when those roles are unsupported for the provider, and ensure any change
references the ProviderDefaults constructor and the _auto_detect_model code path
so behavior is explicit for users relying solely on those providers.
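If requiring anthropic/openai for agentic features is intentional, a hedged sketch of the suggested logging guard follows. The role names come from the finding; the real ProviderDefaults shape and _auto_detect_model signature may differ.

```python
import logging

logger = logging.getLogger(__name__)

# Roles the agentic pipelines resolve through provider defaults.
AGENTIC_ROLES = ("angle_reader", "critic", "synthesizer", "reconciler")


def missing_agentic_roles(provider: str, role_models: dict[str, str]) -> list[str]:
    """Return agentic roles the provider lacks, warning when any are missing."""
    missing = [role for role in AGENTIC_ROLES if role not in role_models]
    if missing:
        logger.warning(
            "provider %s lacks agentic models for %s; falling through to next provider",
            provider,
            ", ".join(missing),
        )
    return missing
```

Called just before the fallthrough in _auto_detect_model, this turns an eventual opaque RuntimeError into a traceable log line.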
reflexio/server/llm/tools.py (2)

236-251: Latency measurement includes LLM response time, not just tool execution.

The t0 timestamp is set before the generate_chat_response call (line 217), so latency_ms in the trace includes both LLM round-trip time and tool handler execution. If the intent is to measure only tool execution time, t0 should be reset inside the tool processing loop. If the combined latency is intentional, consider renaming to turn_latency_ms for clarity.

⏱️ Option: Measure only tool handler latency
             for tc in tool_calls:
                 name = tc.function.name
                 args_json = tc.function.arguments
+                t_tool = time.monotonic()
                 result = registry.handle(name, args_json, ctx)
                 try:
                     args_dict = json.loads(args_json or "{}")
                 except json.JSONDecodeError:
                     args_dict = {}
                 trace.turns.append(
                     ToolLoopTurn(
                         tool_name=name,
                         args=args_dict,
                         result=result,
-                        latency_ms=int((time.monotonic() - t0) * 1000),
+                        latency_ms=int((time.monotonic() - t_tool) * 1000),
                     )
                 )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@reflexio/server/llm/tools.py` around lines 236-251: The latency being
recorded in ToolLoopTurn.latency_ms currently uses t0 set before
generate_chat_response, so it includes LLM round-trip time; to measure only tool
handler execution reset t0 immediately before calling registry.handle(name,
args_json, ctx) inside the for tc in tool_calls loop (or compute
handler_latency_ms by taking a new timestamp around registry.handle), ensuring
ToolLoopTurn.latency_ms is set from that handler-only delta; alternatively, if
the combined LLM+handler duration is intentional, rename latency_ms to
turn_latency_ms (and update usages) to reflect that it measures the whole turn.

196-211: Document the fallback schema requirement for first field.

The fallback path assumes the schema's first field (via next(iter(type(parsed).model_fields))) is a list of items to dispatch. This implicit contract should be documented in the function docstring to prevent subtle bugs when defining fallback schemas.

📝 Suggested docstring addition
     Args:
         ...
         fallback_schema (type[BaseModel] | None): Pydantic schema for the
             capability-fallback path; required when tool-calling is unsupported.
+            The schema's **first field** must be a list of items, where each
+            item's ``model_dump_json()`` matches the ``fallback_tool_name``
+            tool's args model.
         fallback_tool_name (str | None): Name of the tool each fallback item
             is dispatched against.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@reflexio/server/llm/tools.py` around lines 196 - 211, Update the docstring
for the function containing this fallback dispatch logic (the function that
returns ToolLoopResult and builds ToolLoopTurn entries) to explicitly state the
schema contract: the parsed object’s first model field (accessed via
next(iter(type(parsed).model_fields))) must be a list/iterable of items whose
model_dump_json() matches the fallback tool (fallback_tool_name) args model;
mention that each item will be iterated and passed to registry.handle and that
mismatching schemas will cause runtime errors in the ToolLoopTurn creation.
Include references to parsed, fallback_tool_name, registry.handle, ToolLoopTurn,
and ToolLoopResult so callers defining fallback schemas know the requirement.
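The first-field contract can be made concrete with a small sketch. The class names here are hypothetical stand-ins, and Pydantic's `model_fields` is emulated with a plain dict so the example runs without the dependency:

```python
import json

class FallbackItemArgs:
    """Stand-in for the fallback tool's args model (hypothetical name)."""

    def __init__(self, text: str):
        self.text = text

    def model_dump_json(self) -> str:
        return json.dumps({"text": self.text})

class FallbackPlan:
    # Pydantic v2 exposes declared fields via `model_fields`; emulated here.
    # The implicit contract under review: the FIRST field must be a list
    # whose items serialize to the fallback tool's args model.
    model_fields = {"items": None}

    def __init__(self, items):
        self.items = items

parsed = FallbackPlan([FallbackItemArgs("prefers polars over pandas")])

# The dispatch path described above: take the first declared field name,
# then treat each element of that list as one synthetic tool call.
first_field = next(iter(type(parsed).model_fields))
payloads = [item.model_dump_json() for item in getattr(parsed, first_field)]
```

A fallback schema whose first field is not such a list would break at dispatch time, which is exactly why the docstring addition is worth having.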
reflexio/server/services/extraction/readers.py (1)

47-52: Consider adding type annotation for candidates field.

The candidates field is typed as bare list rather than list[Any] or a more specific type. While this works, explicit typing improves readability and IDE support.

✨ Suggested improvement
 `@dataclass`
 class ReaderCtx:
     """Mutable accumulator passed to tool handlers during one reader run."""

-    candidates: list = field(default_factory=list)
+    candidates: list[Any] = field(default_factory=list)
     finished: bool = False

You'll need to add Any to the imports from typing.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@reflexio/server/services/extraction/readers.py` around lines 47 - 52, The
ReaderCtx dataclass uses a bare list for candidates; update its type annotation
to a parameterized type (e.g., candidates: list[Any] or a more specific type)
and keep the default_factory=list; also add Any to the typing imports so the
annotation resolves (refer to the ReaderCtx class and the candidates field).
reflexio/server/services/search/synthesizers.py (1)

105-111: Document the sentinel candidate_index=-1 for synthesizer-originated flags.

The _flag handler sets candidate_index=-1 for flags raised by synthesizers, distinguishing them from critic flags which use actual indices. This sentinel value should be documented to prevent confusion when processing flags downstream.

📝 Suggested documentation improvement
 def _flag(args: BaseModel, ctx: SynthCtx) -> dict:
-    """Tool handler: raise a cross-entity conflict flag tied to ctx.lane."""
+    """Tool handler: raise a cross-entity conflict flag tied to ctx.lane.
+
+    Synthesizer flags use ``candidate_index=-1`` as a sentinel since they
+    reference IDs rather than positional indices.
+    """
     a = cast(SynthFlagArgs, args)
     ctx.flags.append(
         CrossEntityFlag(candidate_index=-1, reason=f"{a.id}: {a.reason}", lane=ctx.lane)
     )
     return {"flagged": a.id}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@reflexio/server/services/search/synthesizers.py` around lines 105 - 111,
Document that synthesizer-originated flags use the sentinel candidate_index=-1:
update the _flag handler docstring and the CrossEntityFlag (or the flags
collection docs) to state that candidate_index==-1 indicates a
synthesizer-generated flag (as created by _flag with SynthFlagArgs and appended
to ctx.flags), so downstream processors can distinguish these from critic flags
which use real candidate indices; include the sentinel value and its semantic
meaning in the same comment/block where CrossEntityFlag or ctx.flags is defined.
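A downstream consumer honoring the sentinel might look like the following sketch; `CrossEntityFlag` is simplified here to the three fields visible in the diff:

```python
from dataclasses import dataclass

@dataclass
class CrossEntityFlag:
    """Simplified stand-in for the real model (fields taken from the diff).

    candidate_index == -1 is the documented sentinel: the flag originated
    from a synthesizer (which references entity IDs in `reason`), not from
    a critic (which uses real positional candidate indices).
    """
    candidate_index: int
    reason: str
    lane: str

def is_synthesizer_flag(flag: CrossEntityFlag) -> bool:
    return flag.candidate_index == -1

flags = [
    CrossEntityFlag(-1, "pb_42: conflicts with pb_7", "playbook"),
    CrossEntityFlag(3, "duplicate of candidate 0", "playbook"),
]
synth_only = [f for f in flags if is_synthesizer_flag(f)]
```

Without the documented sentinel, code like `flags[f.candidate_index]` would silently index from the end of a list for synthesizer flags.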

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 5c09495f-8226-4140-8538-1bb03e6315b4

📥 Commits

Reviewing files that changed from the base of the PR and between b3a0ec5 and 6b63d95.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (78)
  • pyproject.toml
  • reflexio/models/api_schema/domain/entities.py
  • reflexio/models/config_schema.py
  • reflexio/server/llm/__init__.py
  • reflexio/server/llm/litellm_client.py
  • reflexio/server/llm/model_defaults.py
  • reflexio/server/llm/tools.py
  • reflexio/server/prompt/prompt_bank/playbook_critic/v1.0.0.prompt.md
  • reflexio/server/prompt/prompt_bank/playbook_reader_behavior/v1.0.0.prompt.md
  • reflexio/server/prompt/prompt_bank/playbook_reader_rationale/v1.0.0.prompt.md
  • reflexio/server/prompt/prompt_bank/playbook_reader_trigger/v1.0.0.prompt.md
  • reflexio/server/prompt/prompt_bank/playbook_search_context/v1.0.0.prompt.md
  • reflexio/server/prompt/prompt_bank/playbook_search_direct/v1.0.0.prompt.md
  • reflexio/server/prompt/prompt_bank/playbook_search_temporal/v1.0.0.prompt.md
  • reflexio/server/prompt/prompt_bank/playbook_synthesizer/v1.0.0.prompt.md
  • reflexio/server/prompt/prompt_bank/profile_critic/v1.0.0.prompt.md
  • reflexio/server/prompt/prompt_bank/profile_reader_context/v1.0.0.prompt.md
  • reflexio/server/prompt/prompt_bank/profile_reader_facts/v1.0.0.prompt.md
  • reflexio/server/prompt/prompt_bank/profile_reader_temporal/v1.0.0.prompt.md
  • reflexio/server/prompt/prompt_bank/profile_search_context/v1.0.0.prompt.md
  • reflexio/server/prompt/prompt_bank/profile_search_direct/v1.0.0.prompt.md
  • reflexio/server/prompt/prompt_bank/profile_search_temporal/v1.0.0.prompt.md
  • reflexio/server/prompt/prompt_bank/profile_synthesizer/v1.0.0.prompt.md
  • reflexio/server/prompt/prompt_bank/reconciler/v1.0.0.prompt.md
  • reflexio/server/services/extraction/__init__.py
  • reflexio/server/services/extraction/agentic_extraction_service.py
  • reflexio/server/services/extraction/critics.py
  • reflexio/server/services/extraction/readers.py
  • reflexio/server/services/generation_service.py
  • reflexio/server/services/playbook/playbook_service_utils.py
  • reflexio/server/services/profile/profile_generation_service_utils.py
  • reflexio/server/services/search/__init__.py
  • reflexio/server/services/search/agentic_search_service.py
  • reflexio/server/services/search/search_agents.py
  • reflexio/server/services/search/synthesizers.py
  • reflexio/server/services/storage/sqlite_storage/_base.py
  • reflexio/server/services/storage/sqlite_storage/_playbook.py
  • reflexio/server/services/storage/sqlite_storage/_profiles.py
  • reflexio/server/services/unified_search_service.py
  • reflexio/test_support/llm_mock.py
  • tests/conftest.py
  • tests/eval/__init__.py
  • tests/eval/aggregate.py
  • tests/eval/conftest.py
  • tests/eval/golden_set/extraction/mixed_ttl.yaml
  • tests/eval/golden_set/extraction/polars_vs_pandas.yaml
  • tests/eval/golden_set/extraction/superseded_state.yaml
  • tests/eval/golden_set/search/db_preference.yaml
  • tests/eval/golden_set/search/deadline_context.yaml
  • tests/eval/golden_set/search/superseded_rule.yaml
  • tests/eval/judge.py
  • tests/eval/judge_prompts/extraction_rubric.yaml
  • tests/eval/judge_prompts/search_rubric.yaml
  • tests/eval/test_agentic_vs_classic_extraction_integration.py
  • tests/eval/test_agentic_vs_classic_search_integration.py
  • tests/eval/test_aggregate.py
  • tests/eval/test_judge_unit.py
  • tests/models/api_schema/__init__.py
  • tests/models/api_schema/test_domain_entities.py
  • tests/models/api_schema/test_retriever_schema.py
  • tests/server/llm/test_litellm_client_tool_calls.py
  • tests/server/llm/test_model_roles.py
  • tests/server/llm/test_tools.py
  • tests/server/services/extraction/__init__.py
  • tests/server/services/extraction/test_agentic_extraction_service_integration.py
  • tests/server/services/extraction/test_critics.py
  • tests/server/services/extraction/test_readers.py
  • tests/server/services/playbook/test_structured_playbook_content.py
  • tests/server/services/profile/test_profile_add_item.py
  • tests/server/services/search/__init__.py
  • tests/server/services/search/test_agentic_search_service_integration.py
  • tests/server/services/search/test_search_agents.py
  • tests/server/services/search/test_synthesizers.py
  • tests/server/services/storage/sqlite_storage/__init__.py
  • tests/server/services/storage/sqlite_storage/test_agentic_signals.py
  • tests/server/services/test_agentic_backend_pipeline_integration.py
  • tests/server/services/test_generation_service_dispatcher.py
  • tests/server/services/test_prompt_model_mapping.py

Comment thread reflexio/models/config_schema.py
Comment on lines +395 to +401
def _supersede(args: BaseModel, ctx: ReconcilerCtx) -> dict:
a = cast(SupersedeArgs, args)
tgt = _lane_list(ctx, a.drop_lane)
if not 0 <= a.drop_index < len(tgt):
return {"error": "drop_index out of range"}
tgt.pop(a.drop_index)
return {"superseded": [a.drop_lane, a.drop_index]}

@coderabbitai coderabbitai Bot Apr 23, 2026


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if run_tool_loop processes multiple tool calls per turn that could trigger this
rg -nA 10 'for tc in tool_calls:' reflexio/server/llm/tools.py

Repository: ReflexioAI/reflexio

Length of output: 566


🏁 Script executed:

#!/bin/bash
# Find SupersedeArgs definition and understand the argument structure
rg -nB 5 -A 5 'class SupersedeArgs' reflexio/server/services/extraction/

Repository: ReflexioAI/reflexio

Length of output: 826


🏁 Script executed:

#!/bin/bash
# Check how _lane_list works - does it return a reference to the same list each time?
rg -nB 2 -A 8 'def _lane_list' reflexio/server/services/extraction/critics.py

Repository: ReflexioAI/reflexio

Length of output: 462


🏁 Script executed:

#!/bin/bash
# Check if there's any batching or deduplication of operations in the reconciler context
rg -nA 5 'class ReconcilerCtx' reflexio/server/services/extraction/

Repository: ReflexioAI/reflexio

Length of output: 553


🏁 Script executed:

#!/bin/bash
# Search for any existing tests or comments about multiple supersede operations
rg -i 'supersede.*multiple\|multiple.*supersede' reflexio/

Repository: ReflexioAI/reflexio

Length of output: 45


🏁 Script executed:

#!/bin/bash
# Check how reconciler is called - does it expect/support multiple operations per turn?
rg -nB 5 -A 15 'run_reconciler\|call_reconciler' reflexio/server/services/extraction/ | head -60

Repository: ReflexioAI/reflexio

Length of output: 45


🏁 Script executed:

#!/bin/bash
# Look for tests of reconciler to see if multiple operations per turn are tested
find reflexio -name '*test*.py' -o -name '*_test.py' | xargs rg -l 'supersede\|reconcil' 2>/dev/null | head -5

Repository: ReflexioAI/reflexio

Length of output: 45


🏁 Script executed:

#!/bin/bash
# Check the LLM tool schema/documentation for reconciler to understand how indices are provided
rg -nB 5 -A 20 'RECONCILER_TOOLS\|reconciler.*tools\|_get_tool_definitions' reflexio/server/services/extraction/ | grep -A 20 -B 5 'supersede'

Repository: ReflexioAI/reflexio

Length of output: 45


🏁 Script executed:

#!/bin/bash
# Find where the reconciler tool handlers are registered/called
rg -n 'registry\|_supersede\|_merge\|_accept' reflexio/server/services/extraction/critics.py | head -30

Repository: ReflexioAI/reflexio

Length of output: 45


🏁 Script executed:

#!/bin/bash
# Search for where ReconcilerCtx is instantiated - this shows how reconciler is invoked
rg -nB 5 -A 10 'ReconcilerCtx(' reflexio/server/services/extraction/

Repository: ReflexioAI/reflexio

Length of output: 141


🏁 Script executed:

#!/bin/bash
# Look for any LLM prompting or tool documentation that explains supersede to the model
rg -i 'drop_lane\|keep_lane\|drop_index' reflexio/ --type py | head -20

Repository: ReflexioAI/reflexio

Length of output: 45


🏁 Script executed:

#!/bin/bash
# Find the handler functions and their registration
rg -n 'def _supersede\|def _merge\|def _accept\|def _reject\|def _flag' reflexio/server/services/extraction/critics.py

Repository: ReflexioAI/reflexio

Length of output: 45


🏁 Script executed:

#!/bin/bash
# Search for handler registration or mapping
rg -n 'HANDLERS\|handlers\s*=\|register' reflexio/server/services/extraction/critics.py | head -20

Repository: ReflexioAI/reflexio

Length of output: 45


🏁 Script executed:

#!/bin/bash
# Look at the full reconciler module to understand how it's used
cat -n reflexio/server/services/extraction/critics.py | sed -n '375,420p'

Repository: ReflexioAI/reflexio

Length of output: 1890


🏁 Script executed:

#!/bin/bash
# Search for where these handler functions are actually used/called
rg -n '_supersede\|_merge\|_keep_both\|_accept\|_reject' reflexio/ --type py -A 1 -B 1 | grep -v 'def _' | head -40

Repository: ReflexioAI/reflexio

Length of output: 45


🏁 Script executed:

#!/bin/bash
# Look for tool definitions or handler dictionaries in the file
cat -n reflexio/server/services/extraction/critics.py | sed -n '1,100p'

Repository: ReflexioAI/reflexio

Length of output: 3265


🏁 Script executed:

#!/bin/bash
# Check if there's a tool handler mapping at the end of critics.py
tail -50 reflexio/server/services/extraction/critics.py

Repository: ReflexioAI/reflexio

Length of output: 1918


🏁 Script executed:

#!/bin/bash
# Find RECONCILER_TOOLS definition
rg -n 'RECONCILER_TOOLS' reflexio/server/services/extraction/critics.py

Repository: ReflexioAI/reflexio

Length of output: 142


🏁 Script executed:

#!/bin/bash
# Search the entire file for where RECONCILER_TOOLS is defined
rg -nB 2 -A 20 'RECONCILER_TOOLS\s*=' reflexio/server/services/extraction/critics.py

Repository: ReflexioAI/reflexio

Length of output: 873


Multiple supersede calls in one turn can cause index shift issues.

The _supersede handler uses pop(), which shifts list indices. If the LLM issues multiple supersede calls targeting the same lane in a single turn, subsequent calls may reference stale indices after the first pop(). Bounds checking rejects indices that fall out of range, but a shifted index that is still in range will silently remove the wrong item.

Consider processing all supersede operations in reverse index order, or collecting indices to remove and batch-processing them after the turn completes. Note: The _merge handler documents this issue (lines 412–413) but accepts it for cross-lane cases; _supersede should either match that approach or implement the batching fix.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@reflexio/server/services/extraction/critics.py` around lines 395 - 401, The
_supersede handler currently mutates the target lane list immediately (using
tgt.pop) which makes multiple supersede calls in one turn shift subsequent
indices; update _supersede (and use SupersedeArgs and _lane_list) to collect all
drop_index values for a given drop_lane and apply removals after gathering all
requests in descending index order (or otherwise batch-process removals) so
earlier removals don't change indices for later ones; ensure the function still
validates indices against the original list length and return the same
{"superseded": [...] } shape but derived from the batched operation (match the
reasoning used by _merge if you prefer the documented accept-or-batch approach).
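The batching approach the review suggests can be sketched as below. This is illustrative only (the author notes _supersede was removed in the v2 refactor), but it shows why descending-order removal avoids the shift:

```python
def apply_supersedes(lane_items: list, drop_indices: list[int]) -> list:
    """Batch-apply supersede removals without index-shift bugs (sketch).

    Validate every index against the ORIGINAL list length first, then pop
    in descending order so earlier removals never shift later indices.
    """
    out_of_range = [i for i in drop_indices if not 0 <= i < len(lane_items)]
    if out_of_range:
        raise IndexError(f"drop_index out of range: {out_of_range}")
    for i in sorted(set(drop_indices), reverse=True):
        lane_items.pop(i)
    return lane_items
```

With naive in-order pops, dropping indices 1 and 3 from a four-item lane either hits a now-out-of-range index or removes the wrong item; popping 3 before 1 sidesteps both.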

Collaborator Author


Not addressing — reflexio/server/services/extraction/critics.py was removed entirely as part of the v2 single-loop refactor (_supersede and _merge no longer exist). The agent now drives supersession via the same delete_* / create_* tools the prompt rules describe, with no shared mutable list to index-shift. Comment is moot.


@yilu331, understood — thanks for the clarification. Since reflexio/server/services/extraction/critics.py was removed entirely in the v2 single-loop refactor, the index-shift concern around _supersede is no longer applicable. The new approach using delete_* / create_* tools avoids the shared mutable list altogether, so this comment can be dismissed.

Comment thread tests/eval/golden_set/extraction/mixed_ttl.yaml Outdated
Comment thread tests/eval/judge_prompts/extraction_rubric.yaml Outdated
Comment thread tests/eval/judge_prompts/search_rubric.yaml Outdated
Comment thread tests/eval/test_aggregate.py Outdated
Comment thread tests/eval/test_aggregate.py
Comment thread tests/models/api_schema/test_domain_entities.py
Comment thread tests/server/services/search/test_search_agents.py Outdated
Comment on lines +193 to +216
def test_playbook_search_agent_temporal_includes_archived(
real_client, tool_call_completion
):
"""Temporal intent: status_filter is None so archived items are in scope."""
make_tc, _ = tool_call_completion
storage = MagicMock()
storage.search_user_playbooks.return_value = []
req = MagicMock()
req.user_id = "u1"
agent = PlaybookSearchAgent(
"temporal", client=real_client, prompt_manager=_pm(), storage=storage
)
responses = [
make_tc(
"search_playbooks",
{"query": "x", "top_k": 10, "respect_ttl": False},
),
make_tc("submit_candidates", {"ids": [], "why": "none"}),
]
with patch("litellm.completion", side_effect=responses):
agent.run(query="x", req=req)

sent = storage.search_user_playbooks.call_args.args[0]
assert sent.status_filter is None

@coderabbitai coderabbitai Bot Apr 23, 2026


⚠️ Potential issue | 🟡 Minor

Temporal playbook test also misses a direct TTL assertion.

Like the profile temporal test, this currently verifies archived-scope behavior but not the claimed TTL toggle. Assert respect_ttl=False explicitly.

✅ Suggested assertion addition
     with patch("litellm.completion", side_effect=responses):
         agent.run(query="x", req=req)
 
     sent = storage.search_user_playbooks.call_args.args[0]
+    assert sent.respect_ttl is False
     assert sent.status_filter is None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/server/services/search/test_search_agents.py` around lines 193 - 216,
The temporal playbook test test_playbook_search_agent_temporal_includes_archived
currently checks that status_filter is None but misses asserting the TTL toggle;
after obtaining sent = storage.search_user_playbooks.call_args.args[0] (from the
PlaybookSearchAgent run), add an assertion that sent.respect_ttl is False to
explicitly verify the TTL toggle for the temporal intent.

Collaborator Author


Not addressing — same root cause as the sibling profile-temporal comment. PlaybookSearchAgent and the file holding this test (tests/server/services/search/test_search_agents.py) were removed in the single-SearchAgent refactor (commit 8052df8). No equivalent surface remains to assert against.


@yilu331 — understood, thanks for the context. Since PlaybookSearchAgent and the entire test_search_agents.py file were removed as part of the single-SearchAgent refactor in 8052df8, this comment is moot. I'll disregard it.

yilu331 added 28 commits April 24, 2026 09:40
Extends ModelRole enum with four tool-calling agentic pipeline roles and
adds matching ProviderDefaults fields (defaulting to None on providers
that have not opted into tool-calling). Anthropic and claude-code
providers are populated with concrete defaults; other providers fall
through via _auto_detect_model's priority chain.

Part of the agentic extraction & search rollout
(docs/superpowers/specs/2026-04-22-agentic-extraction-search-design.md).
Introduce ToolCallingChatResponse and wire tools/tool_choice through
_build_completion_params; resolve model_role via resolve_model_name.
Also add openai ProviderDefaults for agentic roles (angle_reader,
critic, synthesizer, reconciler) to fix pre-existing test failure.
Adds ToolLoopTurn/ToolLoopTrace/ToolLoopResult and run_tool_loop that
drives an LLM through a tool loop via tools=/tool_choice= and falls
back to structured output for non-tool-calling providers.
Catch exceptions in the native tool loop and return a terminal error
ToolLoopResult instead of letting them propagate. Makes the 'error'
Literal value actually reachable and adds test coverage.
…schemas

Adds three optional fields to ProfileAddItem, StructuredPlaybookContent,
UserProfile, and UserPlaybook. Fields are nullable and default to None,
so callers and existing data are unchanged. Consumed by agentic
extraction readers to carry evidence and attribution through to storage.
…and user_playbooks

Extends the CREATE TABLE DDL and adds an in-place _migrate_agentic_signals
helper so existing DBs are upgraded on startup. Adds the fields to the
INSERT/UPDATE/SELECT row-mapping sites. Columns are nullable and default
to NULL — classic extraction leaves them blank.
…ories

Adds extraction_backend / search_backend Literal['classic','agentic']
fields to Config (default 'classic', keeps existing behavior). Adds
build_extraction_service / build_search_service factories that dispatch
on those flags. Agentic targets are lazy-imported — Phase 3/4 implement
them, Phase 6 wires the factories into GenerationService.run.

Also adds UnifiedSearchService class wrapper to unified_search_service.py
so the dispatcher factory returns a uniform class-based handle.

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (5)
tests/server/test_logging_timezone.py (1)

31-33: Tighten the direct timestamp assertion to avoid false positives.

Line 32 uses match, which can still pass with trailing junk. For formatTime() output, fullmatch gives a stricter regression guard.

Proposed assertion tweak
-        assert _TZ_PATTERN.match(rendered), f"timestamp missing TZ offset: {rendered!r}"
+        assert _TZ_PATTERN.fullmatch(rendered), f"timestamp missing TZ offset: {rendered!r}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/server/test_logging_timezone.py` around lines 31 - 33, Replace the
loose regex check using _TZ_PATTERN.match(rendered) with a strict full match so
the entire formatted timestamp is validated; update the assertion in the test
(the line that sets rendered = formatter.formatTime(record) and the subsequent
assert) to use _TZ_PATTERN.fullmatch(rendered) instead of match (retain the same
failure message and variables: rendered, formatter.formatTime(record), and
_TZ_PATTERN).
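The match-versus-fullmatch distinction is easy to demonstrate. The pattern below is illustrative, in the spirit of the `_TZ_PATTERN` in the test module, not copied from it:

```python
import re

# A timestamp that must END with an ISO-extended UTC offset like -07:00.
TZ = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} [+-]\d{2}:\d{2}")

clean = "2026-04-24 10:20:51 -07:00"
with_junk = "2026-04-24 10:20:51 -07:00 <garbage>"

# match() only anchors at the start of the string, so trailing junk still
# passes; fullmatch() requires the entire string to conform.
assert TZ.match(with_junk) is not None      # loose check: passes anyway
assert TZ.fullmatch(with_junk) is None      # strict check: catches the junk
assert TZ.fullmatch(clean) is not None
```

That is the regression gap the suggested `fullmatch` swap closes.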
tests/server/services/extraction/test_agentic_v2_e2e.py (1)

207-251: Storage assertion occurs after context manager exit.

Lines 246-248 access request_context.storage after the TemporaryDirectory context manager has exited, at which point the backing SQLite file has been deleted. An already-open connection may still read on POSIX systems, but it's safer to move the assertions inside the context.

♻️ Move assertions inside the context
     with tempfile.TemporaryDirectory() as temp_dir:
         request_context = RequestContext(org_id=org_id, storage_base_dir=temp_dir)
         gs = GenerationService(llm_client=client, request_context=request_context)
         gs.configurator.get_config = MagicMock(return_value=_make_agentic_config())  # type: ignore[method-assign]

         with patch(
             "reflexio.server.services.extraction.agentic_adapter.ExtractionAgent"
         ) as mock_agent_cls:
             request = PublishUserInteractionRequest(...)
             result = gs.run(request)

-    # ExtractionAgent was never instantiated.
-    mock_agent_cls.assert_not_called()
+        # ExtractionAgent was never instantiated.
+        mock_agent_cls.assert_not_called()

-    # No profiles persisted.
-    assert request_context.storage is not None
-    profiles = request_context.storage.get_user_profile(user_id)
-    assert profiles == [], f"expected no profiles; got {profiles}"
+        # No profiles persisted.
+        assert request_context.storage is not None
+        profiles = request_context.storage.get_user_profile(user_id)
+        assert profiles == [], f"expected no profiles; got {profiles}"

-    # Result must not have raised (warnings may be empty or trivial).
-    assert result.request_id is not None
+        # Result must not have raised (warnings may be empty or trivial).
+        assert result.request_id is not None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/server/services/extraction/test_agentic_v2_e2e.py` around lines 207 -
251, In test_e2e_agentic_v2_extraction_agent_not_invoked_for_trivial_session
move the storage-related assertions so they run before the TemporaryDirectory
context exits: after the block that calls gs.run(request) (but still inside the
with tempfile.TemporaryDirectory() as temp_dir: scope) assert
request_context.storage is not None, call
request_context.storage.get_user_profile(user_id) and assert profiles == [], and
also assert result.request_id is not None there; this ensures
RequestContext/storage is still valid when inspected and removes the
post-context checks that currently run after the temp dir is closed.
reflexio/server/services/extraction/tools.py (1)

154-156: _status_from_str can raise KeyError on invalid input.

If an unexpected status string is passed (e.g., from malformed LLM output that bypasses Pydantic validation), the dict lookup will raise KeyError. Consider using .get() with a default or explicit error handling.

🛡️ Suggested defensive fix
 def _status_from_str(s: str) -> Status | None:
-    return {"current": None, "pending": Status.PENDING, "archived": Status.ARCHIVED}[s]
+    mapping = {"current": None, "pending": Status.PENDING, "archived": Status.ARCHIVED}
+    if s not in mapping:
+        raise ValueError(f"Invalid status: {s!r}")
+    return mapping[s]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@reflexio/server/services/extraction/tools.py` around lines 154 - 156,
_status_from_str currently does a direct dict lookup which will raise KeyError
for unexpected strings; change it to handle invalid inputs defensively by using
a safe lookup (e.g., mapping.get(s)) and either return a default (None) or raise
a clear ValueError with context; update the function _status_from_str(s: str) to
explicitly handle unknown keys and reference the Status enum values
(Status.PENDING, Status.ARCHIVED) in the error/return path so malformed LLM
output doesn't propagate a KeyError.
tests/server/services/extraction/test_agentic_adapter.py (1)

131-134: Same os.environ.setdefault leak pattern as e2e tests.

Lines 133, 573, and 687 use the same pattern that leaks API key environment variables across test runs. Consider using patch.dict(os.environ, ...) for isolation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/server/services/extraction/test_agentic_adapter.py` around lines 131 -
134, Replace the leaking os.environ.setdefault calls used before creating
LiteLLMClient/LiteLLMConfig with a scoped environment patch using
unittest.mock.patch.dict to avoid cross-test pollution: wrap the client
instantiation (where LiteLLMClient(LiteLLMConfig(...)) is created) in a with
patch.dict(os.environ, {"ANTHROPIC_API_KEY": "test-key"}, clear=False): block
(or use the pytest monkeypatch fixture to setenv) so the API key is set only for
that test and automatically restored after; update all occurrences where
os.environ.setdefault("ANTHROPIC_API_KEY", ...) is used (including the instances
near the LiteLLMClient creation) to use this scoped approach.
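The scoped alternative the comment recommends can be sketched like this; `build_client` stands in for the `LiteLLMClient(LiteLLMConfig(...))` construction in the tests:

```python
import os
from unittest.mock import patch

def build_client() -> str:
    """Stand-in for the client construction that reads the API key."""
    return os.environ["ANTHROPIC_API_KEY"]

had_key_before = "ANTHROPIC_API_KEY" in os.environ

# patch.dict sets the variable only for the duration of the block and
# restores the original environment on exit. os.environ.setdefault, by
# contrast, persists for the remainder of the test session.
with patch.dict(os.environ, {"ANTHROPIC_API_KEY": "test-key"}, clear=False):
    key_inside = build_client()

had_key_after = "ANTHROPIC_API_KEY" in os.environ
```

Inside pytest, the `monkeypatch.setenv` fixture achieves the same scoped restoration with less ceremony.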
reflexio/server/llm/tools.py (1)

53-55: Guard against duplicate tool registration names.

register() silently overwrites existing entries. A duplicate name should fail fast; otherwise the wrong handler can be invoked with no signal.

Proposed fix
     def register(self, tool: Tool) -> None:
+        if tool.name in self._tools:
+            raise ValueError(f"duplicate tool registration: {tool.name}")
         self._tools[tool.name] = tool
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@reflexio/server/llm/tools.py` around lines 53 - 55, The register method
currently overwrites existing entries in self._tools; modify register(self,
tool: Tool) to guard against duplicate tool names by checking if tool.name
already exists in self._tools and failing fast (e.g., raise a ValueError or
custom DuplicateToolError) with a clear message that includes the duplicate
name; ensure you reference register, self._tools, and tool.name so callers get
an explicit error instead of silent overwrite.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@reflexio/server/__init__.py`:
- Around line 69-76: The docstring for class _TZAwareFormatter is out of sync
with actual output: update the example timestamp and description to reflect the
runtime format emitted by _TZAwareFormatter (it uses an ISO-extended UTC offset
like -07:00 and may include a TZ abbreviation), not the old `-0700` form; edit
the class docstring in reflexio/server/__init__.py (class _TZAwareFormatter
inheriting logging.Formatter) to show a correct sample such as "2026-04-24
10:20:51.238 -07:00 PST" (or note that the offset is ISO-extended and a TZ
abbrev may be appended) and adjust the explanatory text to mention fallback
behavior remains +00:00 when timezone is missing.

In `@reflexio/server/llm/tools.py`:
- Around line 242-301: The per-tool latency uses t0 (turn start) so later tool
calls include model + earlier-tool time; change it to measure each tool's own
start and compute latency from that: inside the for tc in tool_calls loop record
a per-call start (e.g., call_t0 = time.monotonic()) immediately before calling
registry.handle(name, args_json, ctx) and then set ToolLoopTurn.latency_ms to
int((time.monotonic() - call_t0) * 1000) instead of using t0; keep t0 for
turn-level/model latency if needed and update references to t0 only where
turn-level timing is intended (ToolLoopTurn.model/prompt/completion/total tokens
remain unchanged).
- Around line 224-237: The fallback loop currently iterates all items unbounded
and always returns ToolLoopResult(..., finished_reason="finish_tool"); change
the loop in the function that iterates over items to respect the configured
max_steps cap (use the existing max_steps variable or parameter) by tracking a
step counter, break once steps >= max_steps before calling registry.handle for
the next item, and append ToolLoopTurn only for executed steps; set
trace.finished and ToolLoopResult.finished_reason to a distinct value (e.g.,
"max_steps" or "exhausted") when stopped by the cap, and only use "finish_tool"
when all items were processed naturally; reference symbols: items,
fallback_tool_name, registry.handle, ToolLoopTurn, trace, ctx, ToolLoopResult,
and max_steps.

In `@reflexio/server/prompt/prompt_bank/extraction_agent/v1.1.0.prompt.md`:
- Around line 39-44: The prompt uses ambiguous generic calls `delete`/`create`
in the supersession and profile-merge rules; change the wording to require
concrete, entity-specific tool names (e.g. `delete_user_profile` /
`create_user_profile`, `delete_order` / `create_order`, etc.) so the agent
invokes the correct runtime surface. Update the supersession line to instruct
"call delete_<entity> on the stale id, then create_<entity> with the new
content" and the profile-merge line to instruct "call delete_<entity> on each
duplicate, then create_<entity> once with the merged wording" referencing these
concrete patterns (delete_<entity>, create_<entity>, delete_user_profile,
create_user_profile) so callers know to use entity-specific tool names.

In `@tests/server/services/extraction/test_agentic_adapter.py`:
- Around line 519-527: The test constructs a Violation with code "B" but uses
severity "soft" — change the violation code to one that is actually soft (e.g.,
"E", "H", or "K") so the test validates soft-violation handling correctly;
update the Violation instantiation referenced by soft_violation (and the related
CommitResult result_with_soft) to use a soft-code like "E" (keeping
severity="soft" and the same msg/affected_op_indices).

---

Nitpick comments:
In `@reflexio/server/llm/tools.py`:
- Around line 53-55: The register method currently overwrites existing entries
in self._tools; modify register(self, tool: Tool) to guard against duplicate
tool names by checking if tool.name already exists in self._tools and failing
fast (e.g., raise a ValueError or custom DuplicateToolError) with a clear
message that includes the duplicate name; ensure you reference register,
self._tools, and tool.name so callers get an explicit error instead of silent
overwrite.
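A sketch of the fail-fast guard, with a minimal hypothetical `Tool`/`ToolRegistry` in place of the real classes:

```python
from dataclasses import dataclass

@dataclass
class Tool:
    name: str
    description: str = ""

class ToolRegistry:
    def __init__(self):
        self._tools: dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        # fail fast instead of silently overwriting an existing tool
        if tool.name in self._tools:
            raise ValueError(f"duplicate tool name: {tool.name!r}")
        self._tools[tool.name] = tool
```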

In `@reflexio/server/services/extraction/tools.py`:
- Around line 154-156: _status_from_str currently does a direct dict lookup
which will raise KeyError for unexpected strings; change it to handle invalid
inputs defensively by using a safe lookup (e.g., mapping.get(s)) and either
return a default (None) or raise a clear ValueError with context; update the
function _status_from_str(s: str) to explicitly handle unknown keys and
reference the Status enum values (Status.PENDING, Status.ARCHIVED) in the
error/return path so malformed LLM output doesn't propagate a KeyError.

In `@tests/server/services/extraction/test_agentic_adapter.py`:
- Around line 131-134: Replace the leaking os.environ.setdefault calls used
before creating LiteLLMClient/LiteLLMConfig with a scoped environment patch
using unittest.mock.patch.dict to avoid cross-test pollution: wrap the client
instantiation (where LiteLLMClient(LiteLLMConfig(...)) is created) in a with
patch.dict(os.environ, {"ANTHROPIC_API_KEY": "test-key"}, clear=False): block
(or use the pytest monkeypatch fixture to setenv) so the API key is set only for
that test and automatically restored after; update all occurrences where
os.environ.setdefault("ANTHROPIC_API_KEY", ...) is used (including the instances
near the LiteLLMClient creation) to use this scoped approach.
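The scoped-environment pattern, with a hypothetical `make_client` standing in for the `LiteLLMClient(LiteLLMConfig(...))` construction:

```python
import os
from unittest.mock import patch

def make_client() -> str:
    # stand-in for LiteLLMClient(LiteLLMConfig(...)) reading its API key
    return os.environ["ANTHROPIC_API_KEY"]

with patch.dict(os.environ, {"ANTHROPIC_API_KEY": "test-key"}, clear=False):
    key_inside = make_client()
# the prior environment is restored on exit, so nothing leaks between tests
```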

In `@tests/server/services/extraction/test_agentic_v2_e2e.py`:
- Around line 207-251: In
test_e2e_agentic_v2_extraction_agent_not_invoked_for_trivial_session move the
storage-related assertions so they run before the TemporaryDirectory context
exits: after the block that calls gs.run(request) (but still inside the with
tempfile.TemporaryDirectory() as temp_dir: scope) assert request_context.storage
is not None, call request_context.storage.get_user_profile(user_id) and assert
profiles == [], and also assert result.request_id is not None there; this
ensures RequestContext/storage is still valid when inspected and removes the
post-context checks that currently run after the temp dir is closed.

In `@tests/server/test_logging_timezone.py`:
- Around line 31-33: Replace the loose regex check using
_TZ_PATTERN.match(rendered) with a strict full match so the entire formatted
timestamp is validated; update the assertion in the test (the line that sets
rendered = formatter.formatTime(record) and the subsequent assert) to use
_TZ_PATTERN.fullmatch(rendered) instead of match (retain the same failure
message and variables: rendered, formatter.formatTime(record), and _TZ_PATTERN).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 9cc3617b-a8e1-4495-9e14-1529e8b85aa4

📥 Commits

Reviewing files that changed from the base of the PR and between 465c4ed and ac15700.

📒 Files selected for processing (20)
  • reflexio/server/__init__.py
  • reflexio/server/llm/litellm_client.py
  • reflexio/server/llm/tools.py
  • reflexio/server/prompt/prompt_bank/extraction_agent/v1.0.0.prompt.md
  • reflexio/server/prompt/prompt_bank/extraction_agent/v1.1.0.prompt.md
  • reflexio/server/prompt/prompt_bank/extraction_agent/v1.2.0.prompt.md
  • reflexio/server/prompt/prompt_bank/search_agent/v1.0.0.prompt.md
  • reflexio/server/prompt/prompt_bank/search_agent/v1.1.0.prompt.md
  • reflexio/server/services/extraction/agentic_adapter.py
  • reflexio/server/services/extraction/extraction_agent.py
  • reflexio/server/services/extraction/tools.py
  • reflexio/server/services/search/search_agent.py
  • tests/server/llm/test_tools.py
  • tests/server/services/extraction/test_agentic_adapter.py
  • tests/server/services/extraction/test_agentic_v2_e2e.py
  • tests/server/services/extraction/test_extraction_agent.py
  • tests/server/services/extraction/test_tools.py
  • tests/server/services/search/test_search_agent.py
  • tests/server/services/test_prompt_model_mapping.py
  • tests/server/test_logging_timezone.py
✅ Files skipped from review due to trivial changes (1)
  • reflexio/server/prompt/prompt_bank/extraction_agent/v1.2.0.prompt.md
🚧 Files skipped from review as they are similar to previous changes (7)
  • reflexio/server/prompt/prompt_bank/search_agent/v1.0.0.prompt.md
  • reflexio/server/prompt/prompt_bank/extraction_agent/v1.0.0.prompt.md
  • tests/server/services/search/test_search_agent.py
  • tests/server/services/test_prompt_model_mapping.py
  • reflexio/server/services/extraction/agentic_adapter.py
  • tests/server/llm/test_tools.py
  • reflexio/server/services/extraction/extraction_agent.py

yilu331 added 22 commits April 24, 2026 12:28
Three agentic search tool handlers (_handle_search_user_profiles,
_handle_search_user_playbooks, _handle_search_agent_playbooks) built
Search*Request objects declaring search_mode=HYBRID but never computed
a query embedding. Storage's _effective_search_mode logged
'no query embedding provided — falling back to FTS' on every call and
downgraded to FTS-only search — a quality regression vs. the classic
path (unified_search_service.py:151-158 computes the embedding in
Phase A before dispatching).

Fix: new _maybe_embed_query(storage, query) helper reads
storage._get_embedding (already used by classic), swallows AttributeError
and provider failures, returns None for graceful fallback. Each of the
three handlers now passes the computed embedding:
- profile search uses query_embedding= kwarg (matching
  _profiles.py:199-205 signature)
- playbook search methods use options=SearchOptions(...) (matching
  _playbook.py:403-434 signature — the signature asymmetry is a
  pre-existing BaseStorage oddity)

Six regression tests cover: helper fallback (no embedder / raising
embedder / happy path) and per-handler embedding plumbing.
Real sessions surfaced two extraction anti-patterns the v1.2.0 prompt
did not prevent strongly enough:

1. Bundled profiles: 'User is a senior Go engineer and is on-call this
   week' — two atomic facts in one content string. Trapped under a
   single TTL (one loses precision), invariant E can't dedup, clean
   supersession impossible.

2. Profile / playbook content overlap: 'prefers no code review
   scheduling before 10:00 AM during on-call period' leaked into
   profile content AND was (correctly) emitted as a playbook. Two
   stores of the same rule → divergence when one updates, retrieval
   noise, breaks the self-improvement-axis separation.

v1.3.0 changes:
- Narrows the UserProfile scope description from ambiguous 'preferences'
  to concrete 'role, skills, environment, timezone, tools, ongoing
  status' — the LLM was latching onto 'preferences' as license to
  record rule-shaped text.
- Adds explicit Fact vs. Rule disambiguator ('is this something the
  user IS / HAS, or what the agent should DO when X?').
- Adds concrete ✅/❌ examples for both kinds in the scope block.
- New rule 3: 'One fact per profile'. New rule 7: 'No overlap between
  profile and playbook'. Existing rules renumbered.

Old v1.2.0 marked active: false; PROMPT_VERSION_MAP bumped to v1.3.0.
Sanity-render test extended to assert the new anti-pattern text and
rules are present; guards against accidental revert.
load_reflexio_env now stores the resolved .env path in a module-level
global; get_loaded_env_path() exposes it. print_startup_banner gains an
optional config_paths dict that renders as a 'Config' section between
the service URLs and the Logs line. HOME-prefixed paths collapse to
~/... for readability while remaining absolute in logs.

Enables the enterprise run_services to surface Env / Config / Storage
paths in the dev-server banner so operators can confirm which files
the server actually loaded without grepping logs.
…ent for playbooks

Today's extraction prompt (v1.3.0) covers fact-vs-rule separation but
says nothing about the SHAPE of the trigger/content/rationale fields.
Result: inconsistent playbook output — sometimes numbered inline lists
with semicolons, sometimes run-on sentences, sometimes single clauses.
The downstream agent has to re-parse each shape.

Borrows the Agent Skills spec (agentskills.io) where skill description
is the retrieval key and skill body is the on-activation instruction
set — structurally identical to UserPlaybook.trigger and .content.

v1.4.0 adds a new 'Playbook format' section with:

- trigger rules — imperative conditional phrasing, keyword coverage,
  150-300 char budget, explicit narrow/broad anti-patterns.
- content rules — markdown bullet list (or numbered when order is
  load-bearing), verb-led instructions, self-sufficient bullets,
  500-2000 char range.
- rationale rules — one sentence WHY, leave empty rather than restate
  the content.
- Concrete good/bad examples for each field, including the code-review
  playbook the user flagged as unstructured.

Rule 6 (playbook expansion) updated to show bullet-preserving merge.
Zero code/schema changes; purely prompt narrative + examples. One new
sanity-render test asserts the format guidance is present.
PROMPT_VERSION_MAP bumped; v1.3.0 retained with active: false.
…igger

Extension of the prompt v1.4.0 work to the user → agent playbook
aggregation step. The previous v2.0.0 aggregator emitted AgentPlaybook
content as 'human-readable prose' (line 54 of v2.0.0) — single long
sentences. Now that UserPlaybooks come in bullet-shaped from v1.4.0,
the aggregator output was the odd one out.

Problem: a downstream agent retrieving memory sees both UserPlaybooks
(bullets) and AgentPlaybooks (prose) for the same retrieval query.
Two shapes to parse; inconsistent display in any UI that shows both.

v2.1.0 adds the same ## Playbook format section used in the extraction
prompt, adapted for aggregation:

- trigger: consolidate all input triggers into ONE imperative-conditional
  phrase with broad keyword coverage.
- content: take the UNION of bullets across input entries, dedup
  semantically overlapping ones. When inputs are historical prose,
  re-shape to bullets on output (the aggregator is the right place to
  do the upgrade).
- rationale: one sentence across inputs; omit rather than restate content.

All 3 concrete examples at the bottom of the prompt upgraded to use
bullet-shaped content so the LLM pattern-matches on the new shape.

Zero code changes. Zero schema changes. PROMPT_VERSION_MAP bumped
playbook_aggregation v2.0.0 -> v2.1.0. One sanity-render test guards
against silent regression to prose.
The banner only showed dev_server.log, but when operators are debugging
prompt or tool-call issues they need llm_io.log first. llm_io.log is
where every agent turn / tool-call is rendered; dev_server.log is the
general log with the agent SUMMARY lines (elapsed_ms, turns, tools,
usage). Two different files for two different debugging modes.

Banner now renders:

    --------------------------------------------
      Dev log    ~/.reflexio/logs/dev_server.log
      LLM I/O    ~/.reflexio/logs/llm_io.log
    ============================================

HOME is collapsed to ~ for both paths (consistent with the Config
section above) via a small _collapse_home helper — factored out of
the inline expression used earlier in the config_paths loop.
Aligns the granular profile-search endpoint with the rest of the
user_* CRUD surface (get_user_profile, add_user_profile,
delete_user_profile, search_user_playbooks, search_agent_playbooks).
The previous search_profiles was the only outlier.

Affects: Python client method, lib mixin, HTTP route /api/search_user_profiles,
CLI subcommand reflexio user-profiles search-user-profiles, retrieval-latency
benchmark, and all tests + docs. No backward-compat shim — outright
rename per pre-1.0 API discipline.
…laybooks carry provenance

The agentic extraction backend was hardcoding empty strings into
UserProfile.generated_from_request_id and UserPlaybook.request_id at
commit time (tools.py:603, 616). Result: every agentic-extracted entity
had no provenance back to its source publish_interaction request, which
silently broke any downstream consumer that translates retrieved
profiles back to their originating session — most visibly the
LongMemEval recall@K metric.

The classic backend already threads request_id correctly via
service_config.request_id, embedding it into the model at extractor
time. This change mirrors that pattern in the agentic path:

1. ExtractionCtx gains a request_id: str = "" field (default empty for
   backward-compat with test callers that don't have a publish in scope).
2. ExtractionAgent.run accepts request_id: str = "" and threads it into
   ExtractionCtx.
3. AgenticExtractionRunner.run already had request_id as a parameter
   (kept for contract parity); we drop the # noqa: ARG002 and forward
   it to agent.run.
4. apply_plan_op reads ctx.request_id and embeds it into both the
   UserProfile and UserPlaybook constructors.

Tests:

- 3 new unit tests in test_extraction_agent.py covering: profile
  thread, playbook thread, and backward-compat default ("").
- test_agentic_v2_e2e.test_e2e_agentic_v2_full_flow strengthened with
  per-profile and per-playbook provenance assertions against
  result.request_id.

Verified: 93 extraction tests + 2046 full submodule non-e2e tests pass.
Lint clean. Pyright shows only 2 pre-existing import-path warnings in
tools.py unrelated to this change.
Bug: SearchMixin.unified_search hardcoded a call to run_unified_search
regardless of config.search_backend. The AgenticSearchService and the
underlying SearchAgent (server/services/search/) were fully implemented
but unreachable from the public API — setting Config.search_backend to
"agentic" had no observable effect.

Symptoms verified on a 500-question LongMemEval oracle benchmark run:
- recall@5 was identical to 6 decimals between classic_<x>_search and
  agentic_<x>_search cells across every question
- "search_agent elapsed_ms=" log marker count was zero across the
  entire 8.5h run
- search wall-time was ~500ms (hybrid SQL+vector) instead of the
  expected ~15-20s (multi-turn LLM tool loop)

Fix: lib/_search.py::unified_search now reads config.search_backend
and dispatches to AgenticSearchService.search() when it equals
"agentic", or falls through to run_unified_search otherwise.

The dispatcher factory build_search_service() already exists in
generation_service.py:448 but couldn't be reused as-is because
UnifiedSearchService (the classic side) doesn't expose a .search()
method — it's just a container class. Used explicit if/else here
rather than restructuring UnifiedSearchService for parity, which
would be a larger refactor.

Tests:
- New test_dispatches_to_agentic_when_search_backend_agentic asserts
  the AgenticSearchService is constructed and .search() called.
- New test_dispatches_to_classic_when_search_backend_classic asserts
  run_unified_search is invoked and AgenticSearchService is not.
- Existing test_delegation_to_service updated to set
  mock_config.search_backend = "classic" explicitly (was a MagicMock
  attribute which masked the dispatch logic).

Verified end-to-end on 10-question oracle SS-A smoke at concurrency=4:
- search_agent log marker count: 0 → 27
- search wall: 411 ms → 17,435 ms (agentic vs classic retrieve)
- recall@5 cells diverge: agentic_classic 1.000 vs agentic_agentic
  0.700 (was identical pre-fix)
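The dispatch shape described above, with stand-in stubs for both backends (the real signatures in lib/_search.py differ):

```python
from types import SimpleNamespace

class AgenticSearchService:
    """Stand-in for the real service; only .search() matters here."""
    def __init__(self, storage, config):
        self.storage, self.config = storage, config

    def search(self, request):
        return ("agentic", request)

def run_unified_search(storage, config, request):
    return ("classic", request)

def unified_search(config, storage, request):
    # explicit if/else: the classic UnifiedSearchService exposes no
    # .search() method, so there is no common interface to dispatch on
    if getattr(config, "search_backend", "classic") == "agentic":
        return AgenticSearchService(storage, config).search(request)
    return run_unified_search(storage, config, request)

result = unified_search(SimpleNamespace(search_backend="agentic"), None, "q")
```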
Bug uncovered while debugging LongMemEval-oracle SS-A cell
(agentic_agentic R@5 = 0.700 vs agentic_classic R@5 = 1.000):

The v1.1.0 prompt's "Re-query freely. Rephrasing, narrowing, or trying
orthogonal angles is expected" was being interpreted by gpt-5-mini as
"extract keywords and search". 100% of search-tool calls (50/50 in a
10-question smoke) used keyword-bag queries like
"shift rotation Admon Sunday GM social media" instead of the original
natural-language question.

Vector recall depends on natural-language phrasing matching how the
underlying memory was written; bag-of-keywords queries produce poor
text-embedding-3-small embeddings. Classic search uses the literal
question and gets perfect R@5; the SearchAgent was hurting itself by
reformulating before the first retrieval.

v1.2.0 changes:
- New rule 1 (was rule 4): "First call: verbatim. Your first tool call
  MUST pass the user's query VERBATIM as the `query` argument."
- Updated rule 5: "Re-query is for narrowing or orthogonal angles, not
  for paraphrasing into keywords."
- Concrete bad-example callout: "shift rotation Admon Sunday".

Verified on the same 10-question SS-A smoke:
- Turn-1 query distribution: 100% keyword-bag → 95% natural-language.
- agentic_agentic R@5: 0.700 → 0.778 (+0.08).
- agentic_agentic SS-A accuracy: 30.0% → 33.3% (+3.3pp).

Deactivated v1.1.0 (active: false). Bumped PROMPT_VERSION_MAP entry to
v1.2.0 in tests/server/services/test_prompt_model_mapping.py.
…t, cap tool budget to <=3, require explicit grounding and temporal handling, and add concise answer formatting and aggregation rules
Cap the extraction agent's tool-loop at 4 rounds (was 12) and the search
agent's at 3 rounds (was 10), and surface that cap into the prompt as a
template variable so the LLM agent plans accordingly.

Why: per benchmark profiling, both agents routinely use 5-8 rounds when
the productive shape is search → batch creates → finish (3 rounds for
extraction; 1-2 search calls + finish for search). The accumulated
tool-call history dominates input tokens by round 4, so capping early
cuts ~30-50% of per-call wall time.

The prompt change is materially load-bearing:
- Without the {max_steps} variable, the agent doesn't know its budget
  and may fall short of finish() when the loop terminates early.
- With it, the agent batches creates into one turn and reserves the
  last round explicitly for finish().

Floor analysis: extraction needs 3 (search → create → finish); search
needs 2 (search → finish). The 4 / 3 caps leave one optional round each
for follow-up searches.

Tests updated: prompt-render tests pass max_steps so the prompt manager
doesn't reject as missing-variable.
…l phrase matching and session-first heuristics, and reinforce existing grounding and budget rules to reduce rounds and improve temporal/count accuracy
…s and templates, emphasize inspecting top-5 snippets for missing atoms, and reinforce temporal/count handling while preserving core rules
…trengthen snippet inspection checklist, enforce targeted single follow-ups, and emphasize temporal/count handling to reduce reformulation loss and improve recall
…able-item guidance, and a rule to preserve temporal markers; reinforce single-fact profiles and content formatting (markdown bullet list, imperative conditional phrasing).
…phasize date encoding and atomic facts; keep required phrases and tighten playbook/profile separation
…xplicit pre-create checklist and temporal examples, and emphasize atomic facts and playbook formatting
…rules: require copying dates/ids verbatim, prefer enumeration of unique names for counts, add full-metadata follow-up template, inspect top ~10 for counts/temporal queries, and emphasize existing core constraints
…lity

profile_id was a full 36-char str(uuid.uuid4()) — error-prone for smaller
LLMs (gpt-5-mini and below) to copy verbatim from a search result back
into a delete/update tool argument. Switch to 12 hex chars (~2.8e14
unique values via uuid.uuid4().hex[:12]).

Why 12: short enough that LLM transcription is high-fidelity; long
enough that birthday-paradox collisions are vanishingly small at any
realistic per-user scale. PRIMARY KEY constraint catches the rare
collision and fails loudly.

Why only profile_id: playbook ids are INTEGER autoincrements (already
LLM-friendly); interaction_id and request_id aren't passed back as tool
args so their format doesn't affect agent reliability.

Existing UUID profile_ids in storage continue to work — they're still
valid TEXT primary keys. New profiles get the short format. Benchmarks
that wipe + re-ingest will be 100% short-id immediately.

Two call sites updated:
- extraction/tools.py:596 (agentic backend)
- profile/profile_extractor.py:282 (classic backend)

Both go through new helper extraction.tools.new_profile_id().
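A sketch of the helper, matching the commit's description of `uuid.uuid4().hex[:12]`:

```python
import uuid

def new_profile_id() -> str:
    """12 hex chars (16**12 ≈ 2.8e14 values): short enough for reliable
    LLM transcription, long enough that birthday-paradox collisions are
    negligible at realistic per-user scale; the PRIMARY KEY constraint
    catches the rare collision loudly."""
    return uuid.uuid4().hex[:12]
```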

Test brittleness fixes (unrelated to id change but surfaced by Phase 27
iter-2 prompt rewrite): two extraction-agent prompt-content tests were
asserting on specific example strings ("prefers no code review...",
'"When …"') that the proposer is allowed to evolve. Relaxed those to
assert on the load-bearing rule wording instead.
The agentic_search service was coercing missing user_id and agent_version
to empty strings before passing them to SearchAgent. Empty user_id then
flowed through to storage operations — SqliteStorage would query with
WHERE user_id='' (cross-user data leakage potential), DiskStorage would
write to an unintended path. Surface the bug at the boundary instead of
silently degrading.

Addresses CodeRabbit review feedback on PR #34 (Critical).
Source fixes
- run_tool_loop: respect max_steps in fallback path; use per-tool
  monotonic clock for ToolLoopTurn.latency_ms instead of turn-start
  (per-tool timings were inflated by model latency + earlier tools).
- service_utils._format_tool_calls: fall back to repr() when args are
  not JSON-serialisable so the logging path can never raise.
- extraction tools.commit_plan: validate DeleteUserPlaybookOp.id before
  int() coercion and raise a clear TypeError instead of leaking ValueError.
- config_schema._migrate_field_names: strip null extraction_backend /
  search_backend so existing stored configs deserialize with defaults.
- _TZAwareFormatter docstring: update example to match runtime output
  (ISO extended offset + optional TZ abbreviation).
- extraction prompt v1.1.0: use entity-specific tool names in the
  supersession / profile-merge rules (delete_user_profile, etc.).

Eval suite hardening
- judge.py: constrain JudgeScore floats to [0, 1] via Field(ge, le).
- conftest._load: sort golden cases by YAML 'id', fail loudly when a
  case has no id (previously sorted by filename, which silently
  desynced when filenames drifted).
- judge_prompts: align rubric prompts with the JudgeScore schema —
  fold nuance / must-not-violated guidance into the existing
  signal_f1 / answer_correctness fields rather than emitting
  unrequested keys.
- mixed_ttl golden case: use schema-valid TTL values (infinity,
  one_week) instead of 'persistent' / 'short_term'.
- test_aggregate: pytest.approx for float means, bound p95 latency
  asserts top + bottom so overestimation fails.

Test coverage
- test_domain_entities: assert source_span and notes (not just
  reader_angle) on the optional-field positive paths.
- test_tools: assert known_ids actually grew on the search-handler
  populates_known_ids tests.
- test_generation_service_dispatcher: drop ImportError skip so a
  broken AgenticSearchService import surfaces immediately.
- test_agentic_search_service: align org_id between RequestContext
  and the temp_storage fixture; pass agent_version on UnifiedSearchRequest
  to satisfy the API-boundary check.
- test_agentic_v2_e2e: scope OPENAI_API_KEY env mutation to client
  construction via patch.dict so it can't leak between tests.
- test_agentic_adapter: switch the soft-violation case to code 'E'
  (B is hard per invariants.py) so the soft-handling test is real.
@yilu331 yilu331 changed the title feat: add agentic extraction & search pipelines feat: agentic extraction & search pipelines (single-loop v2) Apr 26, 2026
yilu331 added 6 commits April 26, 2026 04:09
The previous commit 5ad98e9 added validation that rejected empty
request.agent_version, surfacing as 500 errors. Only request.user_id has
data-isolation implications (it scopes ALL storage operations);
agent_version only scopes AgentPlaybook reads, which are read-only and
where empty means 'no scope filter, return any agent_playbook'.

The benchmark's client.search() doesn't pass agent_version, so the
strict validation caused 500s on every retrieval call. Keep the
user_id check, drop the agent_version check, default to '' inside
the agent.run() call.
Add 'reflexio setup openai-codex' — a self-contained PKCE OAuth flow
against auth.openai.com that lets reflexio sign in with the user's
ChatGPT subscription and persist tokens to ~/.reflexio/auth/openai-codex.json.

What this replaces: the codex proxy previously read tokens from
~/.openclaw/agents/main/agent/auth-profiles.json, making reflexio depend
on the OpenClaw CLI to refresh OAuth state. Now reflexio owns its own
OAuth flow + token storage, no external CLI required.

What's in this commit (OS submodule):

- reflexio/cli/codex_auth.py — token store + PKCE + browser-callback
  + refresh logic. Pure stdlib (urllib, http.server, secrets, hashlib).
  Tokens persist with 0600 file mode, 0700 dir mode where the
  filesystem supports it.
- reflexio/cli/commands/setup_cmd.py — new 'setup openai-codex'
  subcommand with --no-browser, --timeout, --show, --logout flags.
  --show surfaces email/plan_type/expiry; --logout deletes the token
  file; default flow opens browser, captures the callback on
  localhost:1455, exchanges code for tokens, persists, and warms the
  refresh path so any clock skew is caught at sign-in time.
- tests/cli/test_codex_auth.py — 13 unit tests covering PKCE pair
  generation (RFC 7636 shape, uniqueness), JWT decoding (with and
  without padding, malformed rejection), token-from-OAuth-response
  derivation (account_id / plan_type / email / expires_at, including
  fallback to expires_in when JWT lacks exp), token storage round
  trip + behaviour on missing/malformed files, expiry-with-lead-time
  threshold logic, and authorize-URL parameter completeness.

The OAuth client_id, issuer, redirect URI, and scopes are derived from
codex-rs source code (codex-rs/login/src/server.rs DEFAULT_ISSUER,
codex-rs/login/src/lib.rs CLIENT_ID export) and verified against the
JWT we already had in OpenClaw's auth-profiles.json (shared client_id
== app_EMoamEEZ73f0CkXaXp7hrann).
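PKCE pair generation follows RFC 7636's S256 method; a minimal pure-stdlib sketch (`generate_pkce_pair` is illustrative, not the module's actual API):

```python
import base64
import hashlib
import secrets

def generate_pkce_pair():
    """RFC 7636 S256: code_verifier is high-entropy URL-safe text;
    code_challenge is the unpadded base64url SHA-256 of the verifier."""
    verifier = base64.urlsafe_b64encode(secrets.token_bytes(32)).rstrip(b"=").decode("ascii")
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    challenge = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return verifier, challenge
```

The verifier goes into the token-exchange request; only the challenge appears in the authorize URL, so an intercepted authorization code is useless without the verifier.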
A. T-R schema (typed dates field):
- Add dates_mentioned: list[str] to UserProfile, UserPlaybook, ProfileView,
  UserPlaybookView (and converters).
- Plumb a dates argument through CreateUserProfileArgs / CreateUserPlaybookArgs,
  CreateUserProfileOp / CreateUserPlaybookOp, and apply_plan_op.
- SQLite: add dates_mentioned TEXT column (CREATE TABLE + idempotent ALTER
  TABLE migration), JSON-serialize on INSERT/UPDATE, parse back on row→model,
  append to FTS body so date queries hit profiles/playbooks.
- Disk storage: Pydantic auto-serializes the new optional field.
- Bump extraction_agent prompt to v1.5.0 with explicit guidance to populate
  the typed dates argument alongside the inline ISO date in content.

B. enable_agent_answer toggle (search-only mode):
- Add enable_agent_answer: bool | None = False to UnifiedSearchRequest,
  mirroring the enable_reformulation pattern.
- SearchAgent constructor accepts the flag; when False, ctx.search_answer
  is dropped and the response carries agent_answer=None. SearchResult.answer
  becomes str | None.
- SearchFinishArgs.answer is now Optional so the agent can call finish()
  with no synthesis in search-only mode.
- Bump search_agent prompt to v1.3.0 with a conditional block that branches
  finish() behaviour on enable_agent_answer.
- Default is False everywhere — UnifiedSearchViewResponse already strips
  agent_answer at the API boundary, so callers see no change unless they
  opt in explicitly.

Tests:
- 3 new extraction-agent tests asserting dates thread end-to-end (profile,
  playbook, default-empty backward compat).
- 4 new storage tests (round-trip, default empty, FTS-by-date, migration
  idempotency on a pre-migration DB).
- 3 new search-agent tests (search-only None answer, prompt branches by
  flag in both modes).
- Existing tests that asserted on agent_answer set enable_agent_answer=True
  to keep their original intent.
- PROMPT_VERSION_MAP updated for both agent versions.
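The idempotent ALTER TABLE pattern mentioned above can be sketched like this (`migrate_add_column` is a hypothetical name; the real migration lives in the SQLite backend, and identifiers here are trusted constants, never user input):

```python
import sqlite3

def migrate_add_column(conn, table, column, decl):
    """Add the column only if it's missing, so the migration is safe
    to run on every startup against old and new DBs alike."""
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if column not in existing:
        conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {decl}")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_profiles (profile_id TEXT PRIMARY KEY)")
migrate_add_column(conn, "user_profiles", "dates_mentioned", "TEXT")
migrate_add_column(conn, "user_profiles", "dates_mentioned", "TEXT")  # no-op on rerun
```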
The "one fact per profile" invariant already enforces single-event semantics,
so a single date per fact (rather than a list) is the right shape. Multi-date
events (trip start + trip end, recurring anniversaries) are split into multiple
profiles by the extraction prompt. Benefits over the list:

- Simpler tool-call surface — LLM no longer has to decide whether to wrap a
  single value in [].
- Aligns with atomicity: one fact -> one date.
- Storage stays as a TEXT column; FTS append becomes a single string instead
  of an iteration.
- Future-proof: widening str -> list[str] later is additive; narrowing list
  -> str would be lossy.

Touches schema (UserProfile, UserPlaybook, view models + converters), tool
args (CreateUserProfileArgs, CreateUserPlaybookArgs), plan ops, apply_plan_op,
SQLite (DDL column rename, idempotent ALTER TABLE, INSERT/UPDATE/FTS), the
extraction_agent v1.5.0 prompt examples, and tests for round-trip, FTS,
migration, and the extraction agent thread.

The migration auto-adds `date_mentioned TEXT` on existing DBs. Local dev DBs
that already had the previous `dates_mentioned` column from the prior commit
will retain it dormant (idempotent migration is forward-only); the new column
gets populated going forward.
…table extraction with temporal examples while preserving playbook format and core invariants.
Phase 27 iter 0 baseline on oracle (post-schema-change) showed T-R = 0/6,
identical to the prior baseline before the field existed. The locked decision
rule was: keep iff iter 0 shows T-R >= 1/6. It didn't, so revert.

Diagnostic: recall@5 = 98.7% in iter 0 — retrieval surfaces the right
profiles already, including profiles whose content carries dates inline
("user attended X on YYYY-MM-DD"). The bottleneck is the answer LLM, which
still can't reason temporally even when the dates are reaching it. A typed
field at the storage layer doesn't fix that.

The Section B work (enable_agent_answer toggle, search_agent v1.3.0) stays
intact — it's an independent improvement and isn't tied to the typed field.

Reverts:
- UserProfile / UserPlaybook: drop date_mentioned
- ProfileView / UserPlaybookView: drop date_mentioned
- to_profile_view / to_user_playbook_view: drop the converter line
- CreateUserProfileArgs / CreateUserPlaybookArgs: drop date arg
- CreateUserProfileOp / CreateUserPlaybookOp: drop date field
- _handle_create_user_profile / _handle_create_user_playbook: revert call sites
- apply_plan_op: revert profile/playbook construction
- SQLite schema: drop date_mentioned column from CREATE TABLE
- _migrate_date_mentioned: removed
- INSERT/UPDATE/FTS: drop the column and FTS-append paths
- _row_to_profile / _row_to_user_playbook: drop the field
- extraction_agent: drop v1.5.0 prompt, restore v1.4.0 active=true
- search_agent v1.3.0: drop the date_mentioned mention
- PROMPT_VERSION_MAP: extraction_agent back to v1.4.0
- 7 tests removed (3 extraction agent, 2 storage contract, 2 SQLite)

Phase 28 follow-up: T-R needs answer-LLM-side work, not extraction-side.
Probably a date-aware "given these dated profiles, compute X" prompt for
the answer model, OR session-metadata that the LLM can scan without
relying on profile content.