
feat: agentic extraction & search pipelines (single-loop v2) #34

Open
yilu331 wants to merge 103 commits into main from
feat/agentic-extraction-search

Conversation


@yilu331 yilu331 commented Apr 23, 2026

Summary

Adds an agentic (single-loop, tool-calling) alternative to the classic extraction and search pipelines. Opt-in via Config.extraction_backend="agentic" / search_backend="agentic"; defaults unchanged.
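
The opt-in shape can be sketched as a minimal config (a dataclass stand-in here; the real Config is a schema class, so the field names follow the PR text but everything else is illustrative):

```python
from dataclasses import dataclass
from typing import Literal

Backend = Literal["classic", "agentic"]

@dataclass
class Config:
    # Defaults stay "classic"; agentic is opt-in per backend.
    extraction_backend: Backend = "classic"
    search_backend: Backend = "classic"

# Opt in to both agentic pipelines explicitly:
cfg = Config(extraction_backend="agentic", search_backend="agentic")
```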

  • ExtractionAgent — one adaptive agent loop per enabled (extractor_config × window). Drives run_tool_loop over EXTRACTION_TOOLS (search / get / create / delete user profile + user playbook + finish), accumulates PlanOps on ExtractionCtx.plan, and applies them atomically at commit time through deterministic invariants.
  • SearchAgent — read-only mirror; returns a typed SearchResult(answer, outcome, budget_exceeded, trace). AgenticSearchService.search walks the trace, harvests entity IDs, and populates UnifiedSearchResponse.{profiles,user_playbooks,agent_playbooks} plus the new .agent_answer field.
  • Atomic tool surface mapping 1:1 to existing BaseStorage methods — no new storage API.
  • Invariants at commit time (spec §6): A search-before-create, B delete-known-id-only, D plan-size cap, E no-duplicate-create, F no-duplicate-delete, H source_span present, J scope-match, K deletes-without-creates. Plus a pre-invariant resolver that cancels create + delete-of-tentative-id pairs (self-correction oscillations).
  • 34 eval fixtures + semantic judge: 12 mutation patterns, 18 Supermemory failure modes, 4 agent-loop behaviors. Group 3 runs fully deterministic (mocked-LLM replay) in CI.
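
The trace-harvest step in the SearchAgent bullet can be sketched as follows; the trace tuples and search-tool names here are illustrative assumptions, not the exact internal types:

```python
def harvest_entity_ids(trace):
    """Walk a tool-call trace and collect entity ids per kind.

    Assumes each trace entry is (tool_name, result_rows) with dict rows
    carrying an "id" key; the real trace objects are typed.
    """
    ids = {"profiles": set(), "user_playbooks": set()}
    for tool_name, rows in trace:
        if tool_name == "search_user_profiles":
            ids["profiles"].update(row["id"] for row in rows)
        elif tool_name == "search_user_playbooks":
            ids["user_playbooks"].update(row["id"] for row in rows)
    return ids
```

The service would then fetch each harvested id from storage and populate the corresponding UnifiedSearchResponse lists.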

Diagrams

Classic vs agentic — extraction flow


Classic (extraction_backend=classic) — single-shot LLM per configured extractor, post-hoc LLM-based deduplication:

graph TD
    C1[publish_request] --> C2[pre-flight gate<br/>_cheap_should_run_reject]
    C2 -->|pass| C3[ProfileGenerationService<br/>single-shot LLM per config]
    C2 -->|pass| C4[PlaybookGenerationService<br/>single-shot LLM per config]
    C3 --> C3d[ProfileDeduplicator<br/>LLM-based dedup pass]
    C4 --> C4d[PlaybookDeduplicator<br/>LLM-based dedup pass]
    C3d --> C5[(profiles)]
    C4d --> C6[(user_playbooks)]
    C6 --> C7[PlaybookAggregator<br/>user → agent_playbooks]

Agentic (extraction_backend=agentic) — one adaptive agent loop per extractor × window, plan+commit with deterministic invariants:

graph TD
    A1[publish_request] --> A2[pre-flight gate<br/>_cheap_should_run_reject]
    A2 -->|pass, for each<br/>extractor_config| A3[ExtractionAgent.run<br/>run_tool_loop]
    A3 -->|tool calls| A4{atomic tools}
    A4 -->|read| A4r[search_* · get_*]
    A4 -->|mutate| A4m[create_* · delete_*<br/>→ ctx.plan]
    A4r --> A3
    A4m --> A3
    A3 -->|finish / max_steps| A5[commit_plan]
    A5 --> A6[resolve tentative-id<br/>oscillations]
    A6 --> A7[hard + soft invariants]
    A7 --> A8[apply_plan_op]
    A8 --> A9[(profiles)]
    A8 --> A10[(user_playbooks)]
    A10 --> A11[PlaybookAggregator<br/>unchanged]

    style A5 fill:#d0ecd8
    style A6 fill:#d0ecd8
    style A7 fill:#d0ecd8

Key structural shifts:

  • Deduplication moves from post-hoc LLM passes (classic) to commit-time invariants (agentic). Invariant A (search-before-create) prevents duplicates at the source instead of pruning them after.
  • Persistence moves from direct writes (classic) to a plan+commit stage (agentic). Mutation tools append to ctx.plan; commit_plan applies the plan atomically only after the loop terminates.
  • PlaybookAggregator trigger is preserved identically across both backends.

Why agentic over classic?

The classic pipeline is a single-shot LLM per extractor (+ RAG for search) with a post-hoc LLM-based deduplication pass. It ships a lot of state into one prompt and hopes the model gets everything right in one pass. Five failure modes surfaced in the Supermemory literature (and directly exercised by the Group 2 eval fixtures) illustrate where this breaks — and how the agentic design mitigates each:

| Failure mode | Classic behaviour | Agentic mitigation |
| --- | --- | --- |
| Temporal supersession — "I've moved to Seattle" when an old profile says Austin | LLM emits a new "lives in Seattle" profile; the old one survives unless post-hoc dedup's text-similarity heuristic catches it | Invariant A forces search-before-create. The agent finds the stale profile, deletes it, and creates the new one in the same plan — a true supersession, not a duplicate. |
| Contradiction resolution — "I'm not vegetarian, never was" negating a stored profile | The current window contains no positive fact; the single-shot LLM may emit nothing; the stale profile survives | The agent can actively search for contradicting memory and issue delete ops on profiles the current statement invalidates. |
| Multi-hop reasoning — "what DB tool should I suggest for this user?" requires linking a profile (user is a data scientist) to a playbook (prefer polars for data) | Single-shot RAG returns one ranked list per query; chaining across entity types is not supported | SearchAgent re-queries iteratively: first profiles, then — knowing the user's role — relevant playbooks. |
| Empty-result confabulation — asking about memory that does not exist | The classic LLM may invent plausible answers from training priors | The search prompt instructs the agent to say "no evidence in memory"; tool-trace grounding means every assertion must trace to a retrieved id. |
| Nonmatching-constraint application — user's broad rule ("metric units only") should apply to a query that doesn't lexically match the rule | RAG may miss the rule when current-query vocabulary doesn't overlap | The agent re-queries by intent ("measurement preferences") rather than verbatim, surfaces the rule, applies it. |

Structural benefits beyond the failure modes are detailed in the commit_plan section below (atomic cancellation, commit-time invariants, incremental cost, auditability).

Trade-off honesty: agentic has more LLM turns per run than classic's single-shot. The bet is that looping over a narrow tool surface with commit-time invariants is a net win on quality and correctness vs. a single larger prompt with post-hoc cleanup. Group 1/2 real-LLM eval runs will quantify the trade-off; default-flip from classic → agentic is gated on those numbers.

Agent loop shape

  ExtractionAgent                   SearchAgent
  ───────────────                   ───────────
  for each (extractor × window):    for each query:
    search / get (read)               search / get (read, re-query)
    create / delete (plan op)         finish(answer)
    finish                                │
       │                                  ▼
       ▼                             {answer, trace, outcome}
  commit_plan(ctx):                       │
    resolve tentative-id pairs            ▼
    run hard + soft invariants       harvest ids from trace
    apply surviving ops ──► storage  fetch entities by id
                                     populate UnifiedSearchResponse
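
Both loop shapes above share one primitive. A minimal sketch, assuming a call_llm callable that picks the next tool from the trace (the real run_tool_loop drives native LiteLLM tool-calling with a structured-output fallback):

```python
def run_tool_loop(call_llm, handlers, max_steps):
    """Drive a tool-calling loop until finish() or the step budget runs out."""
    trace = []
    for _ in range(max_steps):
        tool_name, args = call_llm(trace)  # model chooses the next tool call
        trace.append((tool_name, args))
        if tool_name == "finish":
            return {"outcome": "finished", "trace": trace}
        # Read tools return data to the model; mutate tools append plan ops.
        handlers[tool_name](args)
    return {"outcome": "max_steps", "trace": trace}
```

Both agents reuse this skeleton; only the registered handlers and prompts differ.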

commit_plan — the atomic commit stage

The agentic extraction pipeline separates planning from persistence. During the agent loop, mutation tools (create_user_profile, create_user_playbook, delete_user_profile, delete_user_playbook) don't touch storage — they append PlanOps to ExtractionCtx.plan. Once the agent calls finish (or the loop hits max_steps), commit_plan(ctx, storage, outcome) runs a deterministic pipeline:

  1. Oscillation resolver. Scan the plan for matching create + delete-of-tentative-id pairs and cancel both — the agent changed its mind mid-run, so the net effect is zero writes.
  2. Hard invariants (A / B / D / F / J). Violations drop the offending ops entirely:
    • A — create must follow at least one search in the same run (no ungrounded writes).
    • B — delete target must be a known id (surfaced by a prior search/get, or issued as a tentative_id by a create this run).
    • D — plan size capped at 30 ops.
    • F — no duplicate deletes on the same id.
    • J — all ops scoped to the run's user_id / extractor_name.
  3. Soft invariants (E / H / K). Violations are logged but the ops still apply:
    • E — no duplicate creates with identical content.
    • H — every create has a non-empty source_span.
    • K — deletes without a matching create are allowed (user explicitly asked to remove stored memory).
  4. Apply surviving ops atomically via apply_plan_op — each op dispatched to the right BaseStorage method (add_user_profile, delete_profiles_by_ids, save_user_playbooks, delete_user_playbooks_by_ids).
  5. Return CommitResult(applied, violations, outcome) for the caller to surface as warnings.
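
Steps 2-4 can be condensed into a sketch over dict-shaped ops (simplified assumptions; the real pipeline works on typed PlanOps, runs the oscillation resolver first, and checks the full A/B/D/F/J and E/H/K sets):

```python
def commit_plan(plan, known_ids, searched_this_run, cap=30):
    """Deterministic commit: drop hard violations, log soft ones, keep the rest."""
    applied, violations = [], []
    seen_delete_ids = set()
    if len(plan) > cap:                      # D (hard): plan-size cap
        violations.append(("D", None))
        plan = plan[:cap]
    for op in plan:
        if op["kind"] == "create" and not searched_this_run:
            violations.append(("A", op))     # hard: no ungrounded writes
            continue
        if op["kind"] == "delete":
            if op["id"] not in known_ids:
                violations.append(("B", op)) # hard: unknown delete target
                continue
            if op["id"] in seen_delete_ids:
                violations.append(("F", op)) # hard: duplicate delete
                continue
            seen_delete_ids.add(op["id"])
        if op["kind"] == "create" and not op.get("source_span"):
            violations.append(("H", op))     # soft: logged, op still applies
        applied.append(op)
    return {"applied": applied, "violations": violations}
```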

Why plan+commit instead of direct writes?

  • Atomic cancellation. The agent can create-then-delete-tentative (self-correction) without burning a write-then-delete round-trip on the database.
  • Determinism at commit time. LLM non-determinism is confined to the loop. Once the plan is frozen, what survives is a pure function of the plan + invariants — so two runs with the same plan always commit the same ops.
  • Replaces post-hoc deduplication. The classic path runs ProfileDeduplicator / PlaybookDeduplicator after the orchestrator finishes, re-prompting an LLM to find duplicates. The agentic path folds that discipline into the loop via invariant A (search-before-create) — the agent is required to look before it writes, so duplicates don't enter storage in the first place.
  • Uniform failure recovery. max_steps and error outcomes both go through the same commit path. On error the plan is discarded; on max_steps the plan commits as if the agent had called finish — best-effort partial progress per spec §7.
  • Testable in isolation. Invariants are pure functions of ExtractionCtx. 24 unit tests in test_invariants.py exercise every code / severity / drop-set combination without touching storage or LLMs.
  • Auditability. CommitResult.violations is the single source of truth for "what the agent tried but was prevented from doing" — surfaced to callers as warnings and logged at INFO with the affected op indices.
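
The atomic-cancellation bullet is the oscillation resolver from step 1 of the commit stage. Sketched as a pure function over assumed dict-shaped ops (the tentative_id name follows the PR text):

```python
def resolve_oscillations(plan):
    """Cancel create + delete-of-tentative-id pairs so neither hits storage."""
    tentative_creates = {
        op["tentative_id"]: i
        for i, op in enumerate(plan)
        if op["kind"] == "create" and op.get("tentative_id")
    }
    cancelled = set()
    for i, op in enumerate(plan):
        if op["kind"] == "delete" and op.get("id") in tentative_creates:
            cancelled.add(i)                            # the delete
            cancelled.add(tentative_creates[op["id"]])  # its matching create
    return [op for i, op in enumerate(plan) if i not in cancelled]
```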

Architecture iteration history (Phase A → v2 single-loop)

The first round of agentic landed as the spec described: 6 reader agents per lane + 2 critics + a reconciler for extraction; 6 search agents + 2 synthesizers + reconciler for search. Telemetry from the Group 3 fixtures and early benchmark runs showed that complexity wasn't pulling its weight, so this PR also lands the v2 single-loop refactor that strips the architecture down to one extraction agent and one search agent driven by the same run_tool_loop primitive.

What was removed (commits 5f0bcaf, d35da1d, ffc6a71, 8052df8):

  • 6-reader stack + reader prompts (prompt_bank/profile_reader_*, prompt_bank/playbook_reader_*).
  • critics.py — ProfileCritic, PlaybookCritic, Reconciler, and the _supersede / _merge helpers (subsumed by atomic delete + create tools, gated by the same invariants).
  • 6-search-agent stack + 2-synthesizer-stack from search; replaced with one SearchAgent and an answer-trace harvester.
  • ProfileDeduplicator / PlaybookDeduplicator (replaced by invariant A at commit time).

What replaced them (commits bab6e62, 91319de, 690dccc, e5cc383, 17e08ba):

  • extraction_agent.py::ExtractionAgent.run(window, extractor_config) — one tool-calling loop per (extractor × window).
  • search_agent.py::SearchAgent.run(query) — one tool-calling loop per query, returning a typed SearchResult.
  • agentic_adapter.py::AgenticExtractionRunner rewritten for the single-loop shape; BaseGenerationService.run's per-extractor iteration is preserved (multi-extractor workflows still scale linearly).
  • HandlerBundle promoted to a stable contract for tool handlers; search_answer typed end-to-end.

The v2 design spec lives at docs/superpowers/specs/2026-04-24-agentic-v2-single-loop-design.md; the original Phase-A spec stays in-tree for context.

Prompt + tool tuning iterations

Once the v2 architecture stabilised, prompts and tool-loop budgets were tuned against the LongMemEval golden set. The relevant commits:

  • Tighter max_steps budgets + budget-in-prompt — 0abaa50 caps extraction at 4 and search at 3 by default and threads the budget into the prompts so the agent reasons about its remaining tool calls.
  • Search agent v1.2.0 — verbatim-first — 863df90 rewrites the search prompt to copy dates / IDs verbatim, prefer enumeration of unique names for counts, and inspect the top-10 snippets when the query is temporal or count-shaped. Tuning iterations 7b44ae4, 081628b, 3a6c74d, 38c2565, bd47944.
  • Extraction agent v1.1.0 → v1.4.0 — atomic-fact profiles (c462b64), structured trigger + markdown-bullet content for playbooks (c1d02e6), aggregator v2.1.0 (6d5df02), then iter 1–3 tightening (b4d3560, c73852b, 0a046c5) for temporal markers, countable items, and the pre-create checklist.
  • Self-improvement framing — 93efc71 rewrites both system prompts so the agent explicitly reasons about what changed in storage between turns.
  • Per-entity specialisation — 166418f splits extraction tools and prompts per entity kind so the agent's mental model of "what create_* does to which table" is concrete instead of templated.

The reviewer-facing changelog is the prompt-bank diff plus the LongMemEval-S spot-check report on the enterprise PR.

Logging + observability

  • Per-run summary — ac15700 adds an INFO-level summary at the end of each agentic pipeline run (elapsed wall-clock, total turns, prompt + completion tokens, cost, model). Same shape for extraction and search.
  • Tool-call rendering in llm_io.log — c47bde2, 1fa9091, 3a53729 route tool-call requests and responses through format_messages_for_logging so the round-trip is reconstructible from the log.
  • TZ-aware timestamps — 122490b + 95808ea switch logging to ISO 8601 extended offset (-07:00) and append the local TZ abbreviation when tzdata is available; falls back to +00:00 on minimal containers.
  • Uvicorn access log alignment — 465c4ed wires uvicorn.access through the same formatter so request logs share the same timestamp shape as the agent's llm_io.log.
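
A minimal illustration of that timestamp shape (not the PR's formatter; the tzdata abbreviation suffix and uvicorn wiring are omitted):

```python
import logging
from datetime import datetime

class IsoTzFormatter(logging.Formatter):
    """Render record timestamps as ISO 8601 extended with a numeric offset."""

    def formatTime(self, record, datefmt=None):
        # astimezone() attaches the local zone; isoformat() then emits e.g.
        # 2026-04-23T09:15:02.123-07:00 (or +00:00 when no zone is known).
        local = datetime.fromtimestamp(record.created).astimezone()
        return local.isoformat(timespec="milliseconds")
```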

Profile_id shortening (12-hex)

f8ba5b4 shrinks the profile id from a long opaque slug to 12 hex characters. The motivation is LLM tool-call reliability: with the longer slug, gpt-5-mini occasionally truncated or fat-fingered the id when emitting delete_user_profile(id=...), causing invariant B violations and forcing wasted re-search turns. The 12-hex form is short enough to copy reliably and large enough (a 2^48 ≈ 2.8 × 10^14 namespace) to keep collision risk per user negligible. 5926766 standardises the infinity-TTL sentinel to NEVER_EXPIRES_TIMESTAMP so storage layers don't paper over the change with magic numbers.
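
The negligible-collision claim follows from a standard birthday bound. A quick check, with a per-user profile count that is a hypothetical scale rather than a number from this PR:

```python
# 12 hex chars = 48 bits of id space.
NAMESPACE = 16 ** 12  # == 2 ** 48, about 2.8e14

def collision_probability(n, namespace=NAMESPACE):
    """Birthday-bound approximation of P(any collision) among n random ids."""
    return n * (n - 1) / (2 * namespace)

# Even a (hypothetical) 100k profiles for a single user stays far below 1e-4.
per_user_risk = collision_probability(100_000)
```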

API rename: search_profiles → search_user_profiles

Bundled into this PR (commit 9336495). Aligns the granular profile-search endpoint with the rest of the user_* CRUD surface (get_user_profile, add_user_profile, delete_user_profile, search_user_playbooks, search_agent_playbooks). The previous search_profiles was the only outlier.

Affects: Python client method ReflexioClient.search_user_profiles, lib mixin, HTTP route /api/search_user_profiles, CLI subcommand reflexio user-profiles search-user-profiles, retrieval-latency benchmark, all tests + docs. No backward-compat shim — outright rename per pre-1.0 API discipline.

Bug fixes shipped alongside

  • 9a82308 — wires AgenticSearchService into the public /api/search path (was reachable only via the dispatcher).
  • b3e3986 — request_id now threads through ExtractionCtx so agentic-created profiles + playbooks carry the same provenance as classic.
  • 5ad98e9 — AgenticSearchService.search rejects empty user_id / agent_version at the API boundary instead of silently coercing to empty (would have leaked across users on SqliteStorage).
  • fb8f999 — agentic search handlers now compute the query embedding (was missing for the FTS-only path).
  • a8de04f — apply_plan_op derives expiration_timestamp from the requested TTL (previously left at 0 for non-infinity TTLs).
  • e0e9651 — pre-invariant resolver in commit_plan cancels create + delete-of-tentative-id oscillations.
  • 6763330 — structured-output retries: _make_request parse failures burn a retry attempt instead of returning raw content (mitigates MiniMax JSON flakiness).

CodeRabbit review-loop fixes

f7485b3 addresses the substantive comments from the CodeRabbit review:

  • run_tool_loop: respect max_steps in the fallback path; per-tool latency_ms is now measured from a per-tool clock instead of the turn-start clock (fixes inflated late-tool timings).
  • _format_tool_calls: json.dumps(parsed_args) is wrapped with a repr() fallback so the logging path never raises on non-serialisable arguments.
  • commit_plan: validate DeleteUserPlaybookOp.id before int() and surface a clear TypeError for non-numeric ids.
  • _migrate_field_names: strip null extraction_backend / search_backend so existing stored configs migrate cleanly.
  • Eval suite hardening: JudgeScore floats constrained to [0, 1], golden cases sorted by YAML id (not filename), rubrics aligned with the JudgeScore schema, mixed-TTL fixture uses schema-valid enum values.
  • Test coverage: source_span and notes asserted on positive-path domain-entity tests, known_ids asserted on the search-handler tests, soft-violation case in test_agentic_adapter switched to code E (was B, which is hard).

Test plan

  • 1967 non-e2e tests pass; ruff + pyright clean on prod code.
  • 4/4 Group 3 fixtures pass via eval_runner.run_fixture (includes oscillated_self_correction exercising create + delete-of-tentative-id cancellation).
  • test_agentic_v2_e2e verifies the full GenerationService.run flow (pre-filter → per-extractor iteration → ExtractionAgent → commit → PlaybookAggregator trigger).
  • Structured-output parse retry covered by test_litellm_client_unit.py::TestStructuredOutputRetry (malformed-JSON burns a retry; final attempt raises LiteLLMClientError).
  • All CodeRabbit review-loop fixes covered by their corresponding unit tests.
  • Real-LLM Group 1/2 eval run (requires API keys; deferred to nightly CI).
  • LongMemEval-S oracle full-run (in flight on the enterprise PR).

Out of scope

  • Default-flip from classic to agentic — manual decision gated on golden-set eval.
  • Rerun-on-agentic wiring — output_pending_status currently only flows through classic services.


coderabbitai Bot commented Apr 23, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Adds an agentic-v2 tool-calling pipeline for extraction and search (tool registry, tool-loop, agents, plan/invariant system), extends LLM client for tool-calls and cost accounting, adds extraction metadata fields and DB migration, removes legacy deduplicators, updates config/type-checking, and adds extensive eval fixtures and tests.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Configuration & Type-Checking<br/>pyproject.toml, pyrightconfig.json | Add polars to dev deps; add Pyright settings (include/exclude and reportMissingImports="warning"). |
| Config Schema<br/>reflexio/models/config_schema.py | Add extraction_backend and search_backend fields (Literal["classic","agentic"]) defaulting to "classic". |
| LLM Client & Model Defaults<br/>reflexio/server/llm/litellm_client.py, reflexio/server/llm/model_defaults.py, reflexio/server/llm/__init__.py | Introduce ToolCallingChatResponse and StructuredOutputParseError; propagate tools, tool_choice, model_role; compute/log USD cost; add EXTRACTION_AGENT/SEARCH_AGENT roles and provider defaults; re-export ToolCallingChatResponse. |
| Tool-Calling Core<br/>reflexio/server/llm/tools.py | New tool system: Tool, ToolRegistry, OpenAI-compatible tool specs, supports_tool_calling(), run_tool_loop() with native multi-turn tool-calling and structured-output fallback; models for turns/traces/results. |
| Agentic Extraction: Plan, Tools, Invariants<br/>reflexio/server/services/extraction/plan.py, .../invariants.py, .../tools.py | Add plan/op models, ExtractionCtx, Violation/CommitResult, oscillation resolution, commit logic enforcing hard/soft invariants, atomic tool handlers, apply_plan_op, and tool registries. |
| Agent Runners & Adapter<br/>reflexio/server/services/extraction/extraction_agent.py, .../agentic_adapter.py | New ExtractionAgent and AgenticExtractionRunner to run tool-loop extraction, integrate with publish flow, handle warnings/aggregation. |
| Agentic Search & Service<br/>reflexio/server/services/search/search_agent.py, .../agentic_search_service.py, .../plan.py | Add SearchAgent, AgenticSearchService, SearchResult model; harvest IDs from tool-loop trace and fetch/order results in-memory. |
| Service Dispatch & Factories<br/>reflexio/server/services/generation_service.py, reflexio/server/services/unified_search_service.py | Add build_extraction_service and build_search_service; introduce UnifiedSearchService wrapper; dispatch by Config. |
| Logging / Service Utils<br/>reflexio/server/services/service_utils.py, reflexio/server/uvicorn_logging.py, reflexio/server/__init__.py | TZ-aware log formatter, improved LLM I/O formatting to render tool_calls, use AccessFormatter for uvicorn access logs, and helper to format tool-calls for logging. |
| Data Model & Storage Schema<br/>reflexio/models/api_schema/domain/entities.py, .../retriever_schema.py, .../playbook_service_utils.py, .../profile_generation_service_utils.py, reflexio/server/services/storage/sqlite_storage/* | Add optional source_span, notes, reader_angle to profiles/playbooks/structured content; add agent_answer to UnifiedSearchResponse; extend SQLite DDL/mappers and add migration _migrate_agentic_signals() to backfill columns. |
| Deduplication Removal<br/>reflexio/server/services/profile/profile_deduplicator.py, reflexio/server/services/playbook/playbook_deduplicator.py, related callers/tests | Remove profile and playbook deduplicator modules and tests; remove dedup invocation from generation services and helper wrapper is_deduplicator_enabled. |
| Feature Flags & Registry<br/>reflexio/server/site_var/feature_flags.py, reflexio/test_support/llm_model_registry.py | Remove is_deduplicator_enabled() convenience; delete deduplication entries from test model registry. |
| Prompts<br/>reflexio/server/prompt/prompt_bank/... | Add extraction-agent and search-agent prompt templates (multiple versions); delete legacy playbook/profile deduplication prompts. |
| CLI & Misc<br/>reflexio/cli/commands/setup_cmd.py, reflexio/cli/run_services.py, reflexio/lib/_base.py | Adjust Claude Code setup to manage Stop hook and remove expert-only command directory on non-expert installs; small formatting changes. |
| Tests & Eval Framework<br/>tests/eval/*, many tests/server/services/extraction/*, tests/server/services/search/*, etc. | Large addition of eval fixtures (many JSON/YAML), Polars aggregator, judge harness, extensive unit/integration tests for tool-loop, agents, plan/invariants, tools, storage migrations; add test helpers and adjust many tests to new schemas/behaviors; remove dedup-related tests. |
| Test Helpers & Fixtures<br/>reflexio/test_support/llm_mock.py, reflexio/test_support/llm_model_registry.py, tests/conftest.py | Add helpers for crafting tool-call LiteLLM responses and new pytest fixtures for tool-call sequences; cleanup model registry test entries. |

Sequence Diagram(s)

sequenceDiagram
  participant Client as LiteLLMClient
  participant LLM as LiteLLM (provider)
  participant Runner as run_tool_loop
  participant Registry as ToolRegistry
  participant Storage as Storage

  Client->>Runner: run_tool_loop(messages, tools, model_role)
  Runner->>LLM: generate_chat_response(..., tools, tool_choice="auto", model=resolved)
  LLM-->>Runner: response (choices with tool_calls or content)
  alt native tool-calling (tool_calls present)
    Runner->>Registry: parse tool_calls, handle(name,args_json,ctx)
    Registry->>Runner: handler result and ctx updates
    Runner->>Storage: apply_plan_op or fetch entities as needed
    Runner->>LLM: append assistant messages of tool results -> continue loop or finish
  else fallback (no native support)
    Runner->>LLM: generate_chat_response(response_format=fallback_schema)
    LLM-->>Runner: parsed structured schema -> synthetic tool calls
    Runner->>Registry: dispatch synthetic calls to fallback tool
    Registry->>Storage: persist or fetch results
  end
  Runner-->>Client: ToolLoopResult / ToolCallingChatResponse / SearchResult

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

Poem

🐰 I hopped into the code today,
Tools now chatter, plans at play,
Old dedup tricks quietly stray,
Agents fetch and store the way —
Hooray for hops in bytes and hay! 🥕



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 10

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
reflexio/server/services/profile/profile_generation_service_utils.py (1)

75-105: ⚠️ Potential issue | 🟡 Minor

Update ProfileAddItem docstring to match the expanded schema.

The attributes section is now stale after adding source_span, notes, and reader_angle.

Suggested doc fix
     Attributes:
         content (str): The profile content based on content definition
         time_to_live (str): Time to live for the profile - one of: 'one_day', 'one_week', 'one_month', 'one_quarter', 'one_year', 'infinity'
         metadata (str, optional): Metadata extracted for the profile based on metadata definition
+        source_span (str, optional): Verbatim evidence span supporting this profile item
+        notes (str, optional): Extraction caveats/confidence/alternatives
+        reader_angle (str, optional): Reader perspective that surfaced this item
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@reflexio/server/services/profile/profile_generation_service_utils.py` around
lines 75 - 105, Update the ProfileAddItem class docstring so the Attributes
section reflects the expanded schema: add entries for source_span (str | None)
with description "Verbatim excerpt from the source that most directly supports
this profile item", notes (str | None) with description "Free-form extraction
notes — confidence, caveats, or alternative readings", and reader_angle (str |
None) with description "The extraction perspective or reader role that surfaced
this item"; also ensure the existing time_to_live entry lists the current
literal options ("one_day", "one_week", "one_month", "one_quarter", "one_year",
"infinity") and keep content, metadata entries matching their Field descriptions
in the class.
🧹 Nitpick comments (27)
tests/eval/golden_set/extraction/polars_vs_pandas.yaml (1)

14-18: Tighten source-span evidence for the comparative preference.

At Line 17, requiring only "polars" is a bit permissive for a profile that asserts “over pandas for new work.” Consider anchoring to a contrastive phrase so extractors don’t pass with a weaker “user uses polars” span.

♻️ Proposed fixture tightening
-    must_include_in_source_span: "polars"
+    must_include_in_source_span: "don't want agents to suggest pandas for new code"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/eval/golden_set/extraction/polars_vs_pandas.yaml` around lines 14 - 18,
The fixture's evidence requirement is too weak—update the second item's
must_include_in_source_span (the YAML key in the entry whose content starts
"User's team standardized on polars last quarter.") to require a contrastive
phrase such as "polars over pandas" or "polars over pandas for new work" instead
of just "polars" so extractors must capture the comparative preference.
reflexio/server/prompt/prompt_bank/profile_reader_facts/v1.0.0.prompt.md (1)

11-11: Minor wording polish: hyphenate the compound adjective.

Use “third-person” for grammatical clarity in the instruction text.

✏️ Suggested edit
-  - content: one-sentence statement of the fact, written in third person
+  - content: one-sentence statement of the fact, written in third-person
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@reflexio/server/prompt/prompt_bank/profile_reader_facts/v1.0.0.prompt.md` at
line 11, The prompt instruction text currently reads "one-sentence statement of
the fact, written in third person"; update that phrase to hyphenate the compound
adjective as "written in third-person" for grammatical clarity—edit the content
string in prompt_bank/profile_reader_facts/v1.0.0.prompt.md (the line containing
"one-sentence statement of the fact, written in third person") to use
"third-person".
reflexio/server/services/unified_search_service.py (1)

279-297: Class contract is unclear: it doesn’t currently wrap execution.

The docstring says this wraps run_unified_search, but the class currently only stores dependencies. Consider either adding a thin forwarding method or adjusting the docstring to avoid implying executable behavior.

💡 Minimal docstring-only correction
 class UnifiedSearchService:
-    """Class handle for the classic unified search pipeline.
+    """Dependency handle for the classic unified search pipeline.
 
-    Wraps :func:`run_unified_search` so the dispatcher factory can return an
-    object whose ``__class__.__name__`` can be inspected uniformly alongside
-    the agentic search service (Phase 4).
+    Used by dispatcher code that needs a concrete class instance for
+    backend identity/selection alongside the agentic search service.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@reflexio/server/services/unified_search_service.py` around lines 279 - 297,
The class docstring claims UnifiedSearchService "wraps run_unified_search" but
the class only stores dependencies; add a thin forwarding method (e.g., run or
execute) on UnifiedSearchService that calls the top-level run_unified_search
function using the stored self.llm_client and self.request_context (and forwards
any parameters/kwargs) and returns its result, or alternatively update the
docstring to remove the "wraps run_unified_search" line; reference the
UnifiedSearchService class, its __init__, and the run_unified_search function
when making the change.
tests/server/services/playbook/test_structured_playbook_content.py (1)

15-25: Assert notes in the optional-fields test.

notes is provided but not asserted, so a regression on that field could pass unnoticed.

Suggested test fix
 def test_structured_playbook_content_accepts_optional_fields() -> None:
     c = StructuredPlaybookContent(
         trigger="t",
         content="c",
         rationale="r",
         source_span="quote",
         notes="confidence=0.9",
         reader_angle="trigger",
     )
     assert c.source_span == "quote"
+    assert c.notes == "confidence=0.9"
     assert c.reader_angle == "trigger"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/server/services/playbook/test_structured_playbook_content.py` around
lines 15-25: The test test_structured_playbook_content_accepts_optional_fields
creates a StructuredPlaybookContent with notes="confidence=0.9" but never
asserts that field; add an assertion to verify the notes field (e.g., assert
c.notes == "confidence=0.9") in the test to ensure the optional notes property
is checked and protected against regressions.
tests/eval/test_agentic_vs_classic_search_integration.py (1)

22-23: Tighten score assertions to enforce normalized bounds.

Use a full [0.0, 1.0] range check instead of only lower-bound checks.

Suggested test hardening
-    assert c_score.answer_correctness >= 0.0
-    assert a_score.answer_correctness >= 0.0
+    assert 0.0 <= c_score.answer_correctness <= 1.0
+    assert 0.0 <= a_score.answer_correctness <= 1.0
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/eval/test_agentic_vs_classic_search_integration.py` around lines 22 -
23: The assertions for c_score.answer_correctness and a_score.answer_correctness
only check a lower bound; update the test in
test_agentic_vs_classic_search_integration.py to assert both lower and upper
bounds so each score is within [0.0, 1.0] (e.g., replace the two lines
referencing c_score.answer_correctness and a_score.answer_correctness with range
checks that ensure >= 0.0 and <= 1.0). Ensure both the c_score and a_score
variables use the tightened assertions so the test enforces normalized score
bounds.
reflexio/server/prompt/prompt_bank/profile_critic/v1.0.0.prompt.md (1)

9-16: Add an explicit “exactly one action per candidate” rule.

Lines 9-16 list valid actions, but a strict one-action constraint will reduce ambiguous/duplicate tool calls in critic loops.

🛠️ Prompt hardening snippet
 You are a profile critic. Three angle readers (facts / context / temporal) produced
 the candidate profile items below. You must decide, for each one, whether to:
+Emit exactly one action per candidate.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@reflexio/server/prompt/prompt_bank/profile_critic/v1.0.0.prompt.md` around
lines 9-16, add an explicit rule to the action list that each profile
candidate must have exactly one chosen action to avoid ambiguous or duplicate
decisions; update the section that currently lists `accept`, `refine`, `reject`,
and `flag_cross_entity_conflict` to include a one-line rule such as "Choose
exactly one action per candidate — do not select multiple actions" and, where
`refine` is described, clarify that refinement implies final acceptance so it
counts as the single action; ensure the new wording appears alongside the
existing bullet list so reviewers and downstream critic loops enforce a single
action invariant.
reflexio/server/prompt/prompt_bank/profile_search_temporal/v1.0.0.prompt.md (1)

11-14: Make temporal retrieval size explicit for deterministic behavior.

Line 11 omits top_k, so retrieval behavior depends on backend defaults and may drift between providers/configs.

♻️ Suggested prompt tweak
-Use `search_profiles(respect_ttl=false)` to include expired items — they may
+Use `search_profiles(query, top_k=10, respect_ttl=false)` to include expired items — they may
 be the PREVIOUS state of something the query is asking about. Then
 `submit_candidates`, flagging in `why` whether the item is current vs
 superseded.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@reflexio/server/prompt/prompt_bank/profile_search_temporal/v1.0.0.prompt.md`
around lines 11-14: The prompt currently calls
search_profiles(respect_ttl=false) without specifying top_k, making retrieval
size nondeterministic; update the prompt text to call
search_profiles(respect_ttl=false, top_k=<N>) (or a named variable like
top_k_temporal) so the retrieval count is explicit and deterministic, and ensure
any downstream references (e.g., submit_candidates / why) assume that same top_k
value.
tests/eval/test_agentic_vs_classic_extraction_integration.py (1)

16-25: This test currently verifies judge plumbing, not backend comparison.

With identical stubs on lines 18-23, it can't catch extraction quality regressions between classic and agentic, despite the test's current name.

🧪 Suggested clarity + tighter assertions
-def test_agentic_vs_classic_extraction(extraction_case, extraction_judge):
-    """For each golden case, the stubbed judge returns a parseable score."""
+def test_extraction_judge_harness_smoke(extraction_case, extraction_judge):
+    """For each golden case, judge output is parseable (no backend comparison)."""
@@
-    assert c_score.signal_f1 >= 0.0
-    assert a_score.signal_f1 >= 0.0
+    assert 0.0 <= c_score.signal_f1 <= 1.0
+    assert 0.0 <= a_score.signal_f1 <= 1.0
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/eval/test_agentic_vs_classic_extraction_integration.py` around lines 16
- 25: The test test_agentic_vs_classic_extraction currently uses identical
stubbed outputs (classic_out and agentic_out) so it only validates judge
plumbing; replace those empty identical stubs with real extraction outputs or
distinct fixtures produced by the classic and agentic extraction paths (or call
the actual extractor functions) and then assert a meaningful comparison using
extraction_judge.score (e.g., compute c_score and a_score via
extraction_judge.score(expected=extraction_case, actual=<classic_output>) and
extraction_judge.score(expected=extraction_case, actual=<agentic_output>) and
assert a tighter condition such as a_score.signal_f1 >= c_score.signal_f1 or
that the difference is within an acceptable delta) to catch regressions between
the two backends.
reflexio/server/services/storage/sqlite_storage/_base.py (1)

852-868: Guard migration DDL with the storage lock.

Lines 859-868 run PRAGMA/ALTER TABLE/commit without self._lock. Wrapping this block keeps migration behavior consistent with the rest of this class's thread-safety pattern.

Suggested patch
 def _migrate_agentic_signals(self) -> None:
@@
-        for table in ("profiles", "user_playbooks"):
-            cols = {
-                row["name"]
-                for row in self.conn.execute(f"PRAGMA table_info({table})").fetchall()
-            }
-            for col in ("source_span", "notes", "reader_angle"):
-                if col not in cols:
-                    self.conn.execute(f"ALTER TABLE {table} ADD COLUMN {col} TEXT")  # noqa: S608
-                    logger.info("Added %s column to %s", col, table)
-        self.conn.commit()
+        with self._lock:
+            for table in ("profiles", "user_playbooks"):
+                cols = {
+                    row["name"]
+                    for row in self.conn.execute(f"PRAGMA table_info({table})").fetchall()
+                }
+                for col in ("source_span", "notes", "reader_angle"):
+                    if col not in cols:
+                        self.conn.execute(f"ALTER TABLE {table} ADD COLUMN {col} TEXT")  # noqa: S608
+                        logger.info("Added %s column to %s", col, table)
+            self.conn.commit()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@reflexio/server/services/storage/sqlite_storage/_base.py` around lines 852 -
868: The _migrate_agentic_signals method executes PRAGMA/ALTER TABLE/commit
without using the class storage lock, which breaks the class's thread-safety
pattern; wrap the whole migration loop and commit in the existing self._lock
(use the same lock acquisition method used elsewhere in this class, e.g., with
self._lock:) so PRAGMA/ALTER/commit are executed while holding self._lock to
serialize schema changes and database access in _migrate_agentic_signals.
tests/eval/aggregate.py (1)

26-39: Make aggregated output order deterministic.

Add a final sort by backend so weekly reports and diffs are stable run-to-run.

Suggested patch
     return (
         pl.scan_parquet(results_path)
         .group_by("backend")
         .agg(
@@
         )
+        .sort("backend")
         .collect()
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/eval/aggregate.py` around lines 26-39: The aggregated DataFrame
currently returned from the pipeline can vary order between runs; update the
pipeline in tests/eval/aggregate.py so the result is deterministically ordered
by adding a sort on the "backend" column before calling .collect() (i.e., after
the .agg(...) step). Modify the return chain that builds the scan_parquet ->
group_by("backend") -> .agg(...) sequence to include a .sort("backend") prior to
.collect() so weekly reports and diffs are stable.
reflexio/server/prompt/prompt_bank/playbook_reader_trigger/v1.0.0.prompt.md (1)

13-17: Avoid natural-language fallback content for missing behavior.

Using "defer to other rules" as content can look like a real instruction downstream. Prefer a neutral sentinel plus explicit note tag so it’s easier to filter.

Suggested patch
-  - content: the behaviour if stated; else "defer to other rules"
+  - content: the behaviour if stated; else "[UNSPECIFIED_BEHAVIOR]"
@@
-  - notes: confidence and trigger-type tag ("event", "threshold", "keyword")
+  - notes: confidence, trigger-type tag ("event", "threshold", "keyword"),
+    and `behavior_unspecified=true` when content is unspecified
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@reflexio/server/prompt/prompt_bank/playbook_reader_trigger/v1.0.0.prompt.md`
around lines 13-17: The current playbook fields use the natural-language
fallback "defer to other rules" for the content key which can be misinterpreted
downstream; update the prompt schema entries (look for the keys content,
rationale, source_span, notes, reader_angle in v1.0.0.prompt.md) to replace that
phrase with a neutral sentinel token (e.g., "<NO_BEHAVIOR_SPECIFIED>") and add
an explicit note tag in notes (e.g., "note: no-behavior-specified") so consumers
can reliably filter missing behavior without treating the text as an actionable
instruction.
reflexio/server/prompt/prompt_bank/profile_synthesizer/v1.0.0.prompt.md (1)

13-20: Make the tool-call termination contract explicit.

Add a hard rule that rank(ordered_ids) must be called exactly once before finish, and IDs must come only from provided candidates, to reduce loop/output ambiguity.

Suggested patch
 Tools:
   - `rank(ordered_ids)` — emit the final ordered ID list
   - `drop(id, reason)` — exclude a candidate
   - `flag_cross_entity_conflict(id, reason)` — flag contradictions with
     the playbook lane
   - `finish`
 
 Use the `other_lane` summary only for cross-checking coherence.
+You must call `rank(...)` exactly once before `finish`.
+Only include IDs that appear in `CANDIDATES`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@reflexio/server/prompt/prompt_bank/profile_synthesizer/v1.0.0.prompt.md`
around lines 13-20: The prompt must enforce a strict tool-call termination
contract: require that rank(ordered_ids) is invoked exactly once before finish
and that ordered_ids are drawn only from the provided candidate set (no new or
transformed IDs), disallow additional rank calls or other tool calls that change
the candidate set after rank, and ensure finish() is called only after that
single rank call; update the prompt language around the tools (rank, drop,
flag_cross_entity_conflict, finish) to state these rules explicitly so the model
cannot call rank multiple times, cannot call finish before rank, and must not
emit IDs that aren't in the provided candidates.
tests/eval/conftest.py (1)

25-29: Docstring inconsistency: sorted by filename, not id.

The docstring says "sorted by id" but the code sorts by filename via sorted((_GOLDEN / kind).glob("*.yaml")). If files are named to match their IDs (e.g., mixed_ttl.yaml), this works, but the comment should reflect the actual behavior.

📝 Suggested fix
 def _load(kind: str) -> list[dict[str, Any]]:
-    """Load every YAML golden file under ``golden_set/<kind>/`` sorted by id."""
+    """Load every YAML golden file under ``golden_set/<kind>/`` sorted by filename."""
     return [
         yaml.safe_load(p.read_text()) for p in sorted((_GOLDEN / kind).glob("*.yaml"))
     ]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/eval/conftest.py` around lines 25-29: The docstring for _load is
incorrect: it claims results are "sorted by id" while the implementation sorts
the file paths lexicographically; update the _load function's docstring to state
that YAML files under golden_set/<kind>/ are loaded and returned in
filename-sorted (lexicographic) order, and optionally note that filenames should
correspond to IDs if ordering by ID is required; keep the description concise
and reference the existing function name _load.
tests/eval/judge.py (1)

62-66: Template rendering edge case.

Using str.replace() for template substitution is simple but could behave unexpectedly if expected or actual contain the literal strings "{expected}" or "{actual}". For eval test harness purposes with controlled golden cases, this is likely fine, but worth noting.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/eval/judge.py` around lines 62-66: The current substitution on
self.rubric["prompt"] uses chained str.replace on the prompt variable which can
mis-substitute if the values contain literal "{expected}" or "{actual}"; change
the rendering to use a safe templating approach (e.g., Python's
str.format/format_map with explicit placeholders or
string.Template.safe_substitute) so you pass expected and actual as named fields
instead of doing raw replace — update the code that builds prompt (the prompt
variable / self.rubric["prompt"] usage) to use one of these safe template
methods and pass {"expected": expected, "actual": actual}.
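A sketch of the safer rendering with `str.format` follows; the `TEMPLATE` literal is a hypothetical stand-in for `self.rubric["prompt"]`.

```python
# Hypothetical stand-in for self.rubric["prompt"].
TEMPLATE = "Expected:\n{expected}\n\nActual:\n{actual}"


def render(expected: str, actual: str) -> str:
    # str.format fills each placeholder exactly once; brace-like text
    # inside the substituted values is never re-scanned. With chained
    # .replace() calls, an `expected` value containing the literal
    # "{actual}" would be clobbered by the second replace.
    return TEMPLATE.format(expected=expected, actual=actual)


out = render("value containing {actual} literally", "REAL")
```

`string.Template.safe_substitute` is the equivalent choice when unknown placeholders should survive silently rather than raise `KeyError`.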
reflexio/server/services/generation_service.py (1)

396-397: Broaden dispatcher return contracts to match actual runtime branches.

Lines 396 and 434 currently advertise concrete classic services, but both functions can return agentic implementations. Use a shared protocol (or explicit union) so callers/type-checkers see the real contract.

♻️ Proposed refactor
-from typing import TYPE_CHECKING
+from typing import TYPE_CHECKING, Any, Protocol
@@
 if TYPE_CHECKING:
     from reflexio.server.services.unified_search_service import UnifiedSearchService
+
+
+class ExtractionServiceLike(Protocol):
+    def run(self, request: Any) -> Any: ...
+
+
+class SearchServiceLike(Protocol):
+    def search(self, request: Any) -> Any: ...
@@
 def build_extraction_service(
@@
-) -> ProfileGenerationService:
+) -> ExtractionServiceLike:
@@
 def build_search_service(
@@
-) -> UnifiedSearchService:
+) -> SearchServiceLike:

Also applies to: 434-435

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@reflexio/server/services/generation_service.py` around lines 396-397: The
dispatcher functions in generation_service.py currently declare a concrete
return type of ProfileGenerationService but actually may return agentic
implementations; update their return annotations to reflect the true contract by
using a shared Protocol (e.g., define an IProfileGenerationService protocol
covering the methods used by callers) or an explicit Union (e.g.,
Union[ClassicProfileGenerationService, AgenticProfileGenerationService]) and
change the two dispatcher function signatures to return that Protocol/Union
instead of the concrete ProfileGenerationService; ensure imports are added and
update any type hints or docstrings referencing the old concrete type
accordingly.
tests/server/services/test_agentic_backend_pipeline_integration.py (1)

81-87: Minor: Response count comment is slightly misleading.

The comment says "6 readers finish + 2 critics finish = 8 LLM calls" but 10 responses are provided. While providing extras is fine for robustness, the comment could be clearer. Similarly, the search response math: 6 agent submissions + 2×(rank + finish) = 10 calls, which matches the code.

Consider updating the comment to match the actual expectation or explain why extras are provided.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/server/services/test_agentic_backend_pipeline_integration.py` around
lines 81-87, update the misleading inline comments near extract_responses and
search_responses to accurately describe the number of mocked LLM calls produced
by make_tc: for extract_responses note that although 8 calls are expected (6
readers + 2 critics), 10 responses are supplied as extras for robustness, and
for search_responses clarify that the sequence produces 10 calls (6
submit_candidates + 2×(rank + finish)); adjust the comment text next to the
make_tc, extract_responses, and search_responses variables so the math matches
the actual arrays or explicitly state that extras are intentionally included.
reflexio/server/services/extraction/agentic_extraction_service.py (2)

200-210: Blind Exception catch is acceptable here but could be narrowed.

The _safe_result helper intentionally catches all exceptions to prevent one reader failure from killing extraction. This is a reasonable resilience pattern. The static analysis warning (BLE001) can be suppressed with a comment if desired.

For completeness, consider catching (TimeoutError, RuntimeError, ValueError) to avoid masking truly unexpected errors during development, or add a # noqa: BLE001 comment if the broad catch is intentional.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@reflexio/server/services/extraction/agentic_extraction_service.py` around
lines 200-210: The _safe_result helper currently catches all exceptions, which
triggers static-analysis BLE001; either narrow the except to expected failures
(e.g., except (concurrent.futures.TimeoutError, RuntimeError, ValueError) as e)
so truly unexpected errors still surface, or explicitly mark the broad catch as
intentional by adding a suppression comment (e.g., "# noqa: BLE001") above the
except block; update the logger.warning call in _safe_result to still log the
caught exception variable (e) so debugging info is preserved.

174-188: Critics run sequentially despite critic_workers parameter.

The constructor accepts critic_workers (line 97) and stores it (line 104), but the critics are invoked sequentially in _run_profile_critic and _run_playbook_critic rather than in parallel. This differs from the search service's synthesizer parallelization pattern.

If parallelization is intended, wrap the critic calls in a ThreadPoolExecutor. If sequential execution is intentional (e.g., to simplify debugging or due to cross-lane dependencies), consider removing the unused critic_workers parameter or documenting why it's reserved for future use.

Option 1: Parallelize critics (if intended)
+    def _run_critics(
+        self,
+        profile_cands: list[Any],
+        playbook_cands: list[Any],
+    ) -> tuple[list[VettedProfile], list[CrossEntityFlag], list[VettedPlaybook], list[CrossEntityFlag]]:
+        executor = ThreadPoolExecutor(max_workers=self._critic_workers)
+        try:
+            profile_fut = executor.submit(
+                self._run_profile_critic, profile_cands, playbook_cands
+            )
+            playbook_fut = executor.submit(
+                self._run_playbook_critic, playbook_cands, profile_cands
+            )
+            vetted_profiles, profile_flags = profile_fut.result(timeout=30.0)
+            vetted_playbooks, playbook_flags = playbook_fut.result(timeout=30.0)
+        finally:
+            executor.shutdown(wait=False, cancel_futures=True)
+        return vetted_profiles, profile_flags, vetted_playbooks, playbook_flags
Option 2: Remove unused parameter (if sequential is intentional)
     def __init__(
         self,
         *,
         llm_client: LiteLLMClient,
         request_context: RequestContext,
         reader_workers: int = 6,
-        critic_workers: int = 2,
     ) -> None:
         self.client = llm_client
         self.request_context = request_context
         self.storage = request_context.storage
         self.prompt_manager = request_context.prompt_manager
         self._reader_workers = min(reader_workers, 6)
-        self._critic_workers = min(critic_workers, 2)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@reflexio/server/services/extraction/agentic_extraction_service.py` around
lines 174-188: The critics are currently invoked sequentially in
_run_profile_critic and _run_playbook_critic despite the constructor storing
critic_workers; change to run the two critic.review calls in parallel using
ThreadPoolExecutor(max_workers=self.critic_workers) so both ProfileCritic.review
and PlaybookCritic.review run concurrently—create the two critic instances
(ProfileCritic and PlaybookCritic), submit their .review(...) calls (note you
must still call summarize(playbook_cands) for the profile critic and
summarize(profile_cands) for the playbook critic), collect future.result() for
each, and propagate/handle exceptions as before; alternatively, if sequential
behavior is intentional, remove the critic_workers parameter and its storage
from the constructor instead of adding parallelism.
reflexio/server/services/search/agentic_search_service.py (4)

100-107: TODO placeholder for reconciliation.

The cross-entity flags are currently only logged rather than being used to drive reconciliation. The TODO is noted, but ensure this is tracked for Phase 6+ completion.

Would you like me to open an issue to track the search reconciliation implementation?

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@reflexio/server/services/search/agentic_search_service.py` around lines 100 -
107: The code currently collects cross-entity flags into all_flags (from p_flags
+ b_flags) and only logs them instead of invoking reconciliation; replace the
TODO by wiring these flags into the Reconciler path: pass all_flags (or derived
lanes) into the Reconciler call inside the AgenticSearchService (or whichever
function constructs reconciliation) instead of using empty lanes that cause
out-of-range errors, and ensure you handle empty/all_flags cases and surface
errors via logger.error; keep the existing logger.info line for visibility but
add the Reconciler invocation using the same all_flags variable and add
appropriate validation/try-except around Reconciler usage.

211-216: Verify cross-lane summary variable naming.

The variable naming here is slightly confusing but appears correct on close inspection:

  • playbook_other_lane = summary of profile hits (what playbook synthesizer sees as "other lane")
  • profile_other_lane = summary of playbook hits (what profile synthesizer sees as "other lane")

Consider renaming for clarity: profile_synth_sees_playbooks / playbook_synth_sees_profiles or similar, to make the data flow more obvious.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@reflexio/server/services/search/agentic_search_service.py` around lines 211 -
216: The two variables playbook_other_lane and profile_other_lane are correct but
easy to misread (playbook_other_lane holds a summary of profile hits and
profile_other_lane holds playbook hits); rename them to clearer identifiers
(e.g., playbook_synth_sees_profiles and profile_synth_sees_playbooks) and update
all references where summarize is called with profile_batches and
playbook_batches respectively so the names reflect the data flow (symbols to
change: playbook_other_lane, profile_other_lane, summarize usage,
profile_batches, playbook_batches).
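A minimal sketch of the suggested rename; `summarize` and the batch lists are stand-ins for the service's real objects.

```python
def summarize(batches: list[str]) -> str:
    # Stand-in for the service's lane summarizer.
    return " | ".join(batches)


profile_batches = ["profile hit A"]
playbook_batches = ["playbook hit B"]

# Each synthesizer cross-checks against the OTHER lane; the new names
# encode the reader rather than the source lane, so the crossed
# assignment below no longer looks like a bug.
profile_synth_sees_playbooks = summarize(playbook_batches)
playbook_synth_sees_profiles = summarize(profile_batches)
```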

267-285: Potential silent data loss if ID mapping fails.

In _assemble_ranked, if a ranked ID doesn't exist in the id_to_profile/id_to_playbook dicts, it's silently skipped (lines 286-287). This is likely intentional for resilience, but consider logging a warning when ranked IDs can't be resolved, as it may indicate a bug in the agent/synthesizer pipeline.

Add logging for unmapped IDs
+        missing_profiles = [i for i in p_ids if i not in id_to_profile]
+        missing_playbooks = [i for i in b_ids if i not in id_to_playbook]
+        if missing_profiles:
+            logger.warning("ranked profile IDs not found in hits: %s", missing_profiles)
+        if missing_playbooks:
+            logger.warning("ranked playbook IDs not found in hits: %s", missing_playbooks)
         ranked_profiles = [id_to_profile[i] for i in p_ids if i in id_to_profile]
         ranked_playbooks = [id_to_playbook[i] for i in b_ids if i in id_to_playbook]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@reflexio/server/services/search/agentic_search_service.py` around lines 267 -
285: In _assemble_ranked, when resolving ranked IDs against id_to_profile and
id_to_playbook, add a warning log for any ranked ID that cannot be found instead
of silently skipping it: detect missing keys when looking up id_to_profile and
id_to_playbook inside the _assemble_ranked flow and call the service logger
(e.g., self.logger.warn or processLogger.warn used in this module) with a short
message including the missing id and context (e.g., "missing profile for id
{id}" or "missing playbook for id {id}") so operators can trace pipeline issues;
keep behavior unchanged otherwise.

199-202: Consider catching more specific exceptions for agent failures.

The blind Exception catch here is intentional for fault tolerance (preventing one agent failure from killing the pipeline), but it may mask programming errors. Consider catching (TimeoutError, FuturesTimeoutError, RuntimeError) or similar known failure modes, and letting truly unexpected exceptions bubble up during development.

That said, for a production-grade resilient system, the current approach is defensible. This is a low-priority suggestion.

Possible refinement
-            except Exception as e:
+            except (FuturesTimeoutError, RuntimeError, ValueError) as e:
                 logger.warning("search agent failed: %s: %s", type(e).__name__, e)
                 partial = True

Or keep Exception but add explicit re-raise for certain types you want to surface.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@reflexio/server/services/search/agentic_search_service.py` around lines 199 -
202: The current except Exception block around the agent loop (which logs via
logger.warning and sets partial before returning batches, partial) is too broad;
change it to catch only expected failure modes such as TimeoutError,
concurrent.futures.TimeoutError (or FuturesTimeoutError), and RuntimeError so
programming errors can surface, or alternatively keep Exception but explicitly
re-raise the exception types you want surfaced (note that KeyboardInterrupt and
SystemExit already bypass except Exception, since they subclass BaseException
directly) before logging and setting partial; update the
except clause that wraps logger.warning and the partial assignment accordingly
and reference logger.warning, partial, and the surrounding agent processing loop
(the code that returns batches, partial).
reflexio/server/llm/model_defaults.py (1)

219-275: Several providers lack agentic role mappings.

The gemini, deepseek, openrouter, minimax, dashscope, xai, moonshot, and zai providers don't define angle_reader, critic, synthesizer, or reconciler model mappings. While the fallthrough logic in _auto_detect_model handles this gracefully (skipping to the next provider), users relying solely on these providers will get a RuntimeError when attempting to use agentic pipelines.

If this is intentional (requiring anthropic/openai for agentic features), consider adding a comment or logging when agentic roles are unavailable for the detected provider. Otherwise, consider adding reasonable defaults for these providers.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@reflexio/server/llm/model_defaults.py` around lines 219-275: The
ProviderDefaults entries for providers "gemini", "deepseek", "openrouter",
"minimax", "dashscope", "xai", "moonshot", and "zai" are missing agentic role
mappings (angle_reader, critic, synthesizer, reconciler), causing RuntimeError
in agentic pipelines when _auto_detect_model falls back; update each
ProviderDefaults instance in model_defaults.py to either include reasonable
default model strings for those four agentic roles or add a clear comment/log
when those roles are unsupported for the provider, and ensure any change
references the ProviderDefaults constructor and the _auto_detect_model code path
so behavior is explicit for users relying solely on those providers.
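If requiring anthropic/openai for agentic features is intentional, a hedged sketch of the suggested logging guard follows. The role names come from the finding; the real ProviderDefaults shape and _auto_detect_model signature may differ.

```python
import logging

logger = logging.getLogger(__name__)

# Roles the agentic pipelines resolve through provider defaults.
AGENTIC_ROLES = ("angle_reader", "critic", "synthesizer", "reconciler")


def missing_agentic_roles(provider: str, role_models: dict[str, str]) -> list[str]:
    """Return agentic roles the provider lacks, warning when any are missing."""
    missing = [role for role in AGENTIC_ROLES if role not in role_models]
    if missing:
        logger.warning(
            "provider %s lacks agentic models for %s; falling through to next provider",
            provider,
            ", ".join(missing),
        )
    return missing
```

Called just before the fallthrough in _auto_detect_model, this turns an eventual opaque RuntimeError into a traceable log line.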
reflexio/server/llm/tools.py (2)

236-251: Latency measurement includes LLM response time, not just tool execution.

The t0 timestamp is set before the generate_chat_response call (line 217), so latency_ms in the trace includes both LLM round-trip time and tool handler execution. If the intent is to measure only tool execution time, t0 should be reset inside the tool processing loop. If the combined latency is intentional, consider renaming to turn_latency_ms for clarity.

⏱️ Option: Measure only tool handler latency
             for tc in tool_calls:
                 name = tc.function.name
                 args_json = tc.function.arguments
+                t_tool = time.monotonic()
                 result = registry.handle(name, args_json, ctx)
                 try:
                     args_dict = json.loads(args_json or "{}")
                 except json.JSONDecodeError:
                     args_dict = {}
                 trace.turns.append(
                     ToolLoopTurn(
                         tool_name=name,
                         args=args_dict,
                         result=result,
-                        latency_ms=int((time.monotonic() - t0) * 1000),
+                        latency_ms=int((time.monotonic() - t_tool) * 1000),
                     )
                 )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@reflexio/server/llm/tools.py` around lines 236-251: The latency being
recorded in ToolLoopTurn.latency_ms currently uses t0 set before
generate_chat_response, so it includes LLM round-trip time; to measure only tool
handler execution reset t0 immediately before calling registry.handle(name,
args_json, ctx) inside the for tc in tool_calls loop (or compute
handler_latency_ms by taking a new timestamp around registry.handle), ensuring
ToolLoopTurn.latency_ms is set from that handler-only delta; alternatively, if
the combined LLM+handler duration is intentional, rename latency_ms to
turn_latency_ms (and update usages) to reflect that it measures the whole turn.

196-211: Document the fallback schema requirement for first field.

The fallback path assumes the schema's first field (via next(iter(type(parsed).model_fields))) is a list of items to dispatch. This implicit contract should be documented in the function docstring to prevent subtle bugs when defining fallback schemas.

📝 Suggested docstring addition
     Args:
         ...
         fallback_schema (type[BaseModel] | None): Pydantic schema for the
             capability-fallback path; required when tool-calling is unsupported.
+            The schema's **first field** must be a list of items, where each
+            item's ``model_dump_json()`` matches the ``fallback_tool_name``
+            tool's args model.
         fallback_tool_name (str | None): Name of the tool each fallback item
             is dispatched against.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@reflexio/server/llm/tools.py` around lines 196 - 211, Update the docstring
for the function containing this fallback dispatch logic (the function that
returns ToolLoopResult and builds ToolLoopTurn entries) to explicitly state the
schema contract: the parsed object’s first model field (accessed via
next(iter(type(parsed).model_fields))) must be a list/iterable of items whose
model_dump_json() matches the fallback tool (fallback_tool_name) args model;
mention that each item will be iterated and passed to registry.handle and that
mismatching schemas will cause runtime errors in the ToolLoopTurn creation.
Include references to parsed, fallback_tool_name, registry.handle, ToolLoopTurn,
and ToolLoopResult so callers defining fallback schemas know the requirement.
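The first-field contract can be made concrete with a small sketch. The class names here are hypothetical stand-ins, and Pydantic's `model_fields` is emulated with a plain dict so the example runs without the dependency:

```python
import json

class FallbackItemArgs:
    """Stand-in for the fallback tool's args model (hypothetical name)."""

    def __init__(self, text: str):
        self.text = text

    def model_dump_json(self) -> str:
        return json.dumps({"text": self.text})

class FallbackPlan:
    # Pydantic v2 exposes declared fields via `model_fields`; emulated here.
    # The implicit contract under review: the FIRST field must be a list
    # whose items serialize to the fallback tool's args model.
    model_fields = {"items": None}

    def __init__(self, items):
        self.items = items

parsed = FallbackPlan([FallbackItemArgs("prefers polars over pandas")])

# The dispatch path described above: take the first declared field name,
# then treat each element of that list as one synthetic tool call.
first_field = next(iter(type(parsed).model_fields))
payloads = [item.model_dump_json() for item in getattr(parsed, first_field)]
```

A fallback schema whose first field is not such a list would break at dispatch time, which is exactly why the docstring addition is worth having.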
reflexio/server/services/extraction/readers.py (1)

47-52: Consider adding type annotation for candidates field.

The candidates field is typed as bare list rather than list[Any] or a more specific type. While this works, explicit typing improves readability and IDE support.

✨ Suggested improvement
 `@dataclass`
 class ReaderCtx:
     """Mutable accumulator passed to tool handlers during one reader run."""

-    candidates: list = field(default_factory=list)
+    candidates: list[Any] = field(default_factory=list)
     finished: bool = False

You'll need to add Any to the imports from typing.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@reflexio/server/services/extraction/readers.py` around lines 47 - 52, The
ReaderCtx dataclass uses a bare list for candidates; update its type annotation
to a parameterized type (e.g., candidates: list[Any] or a more specific type)
and keep the default_factory=list; also add Any to the typing imports so the
annotation resolves (refer to the ReaderCtx class and the candidates field).
reflexio/server/services/search/synthesizers.py (1)

105-111: Document the sentinel candidate_index=-1 for synthesizer-originated flags.

The _flag handler sets candidate_index=-1 for flags raised by synthesizers, distinguishing them from critic flags which use actual indices. This sentinel value should be documented to prevent confusion when processing flags downstream.

📝 Suggested documentation improvement
 def _flag(args: BaseModel, ctx: SynthCtx) -> dict:
-    """Tool handler: raise a cross-entity conflict flag tied to ctx.lane."""
+    """Tool handler: raise a cross-entity conflict flag tied to ctx.lane.
+
+    Synthesizer flags use ``candidate_index=-1`` as a sentinel since they
+    reference IDs rather than positional indices.
+    """
     a = cast(SynthFlagArgs, args)
     ctx.flags.append(
         CrossEntityFlag(candidate_index=-1, reason=f"{a.id}: {a.reason}", lane=ctx.lane)
     )
     return {"flagged": a.id}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@reflexio/server/services/search/synthesizers.py` around lines 105 - 111,
Document that synthesizer-originated flags use the sentinel candidate_index=-1:
update the _flag handler docstring and the CrossEntityFlag (or the flags
collection docs) to state that candidate_index==-1 indicates a
synthesizer-generated flag (as created by _flag with SynthFlagArgs and appended
to ctx.flags), so downstream processors can distinguish these from critic flags
which use real candidate indices; include the sentinel value and its semantic
meaning in the same comment/block where CrossEntityFlag or ctx.flags is defined.
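A downstream consumer honoring the sentinel might look like the following sketch; `CrossEntityFlag` is simplified here to the three fields visible in the diff:

```python
from dataclasses import dataclass

@dataclass
class CrossEntityFlag:
    """Simplified stand-in for the real model (fields taken from the diff).

    candidate_index == -1 is the documented sentinel: the flag originated
    from a synthesizer (which references entity IDs in `reason`), not from
    a critic (which uses real positional candidate indices).
    """
    candidate_index: int
    reason: str
    lane: str

def is_synthesizer_flag(flag: CrossEntityFlag) -> bool:
    return flag.candidate_index == -1

flags = [
    CrossEntityFlag(-1, "pb_42: conflicts with pb_7", "playbook"),
    CrossEntityFlag(3, "duplicate of candidate 0", "playbook"),
]
synth_only = [f for f in flags if is_synthesizer_flag(f)]
```

Without the documented sentinel, code like `flags[f.candidate_index]` would silently index from the end of a list for synthesizer flags.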

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 5c09495f-8226-4140-8538-1bb03e6315b4

📥 Commits

Reviewing files that changed from the base of the PR and between b3a0ec5 and 6b63d95.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (78)
  • pyproject.toml
  • reflexio/models/api_schema/domain/entities.py
  • reflexio/models/config_schema.py
  • reflexio/server/llm/__init__.py
  • reflexio/server/llm/litellm_client.py
  • reflexio/server/llm/model_defaults.py
  • reflexio/server/llm/tools.py
  • reflexio/server/prompt/prompt_bank/playbook_critic/v1.0.0.prompt.md
  • reflexio/server/prompt/prompt_bank/playbook_reader_behavior/v1.0.0.prompt.md
  • reflexio/server/prompt/prompt_bank/playbook_reader_rationale/v1.0.0.prompt.md
  • reflexio/server/prompt/prompt_bank/playbook_reader_trigger/v1.0.0.prompt.md
  • reflexio/server/prompt/prompt_bank/playbook_search_context/v1.0.0.prompt.md
  • reflexio/server/prompt/prompt_bank/playbook_search_direct/v1.0.0.prompt.md
  • reflexio/server/prompt/prompt_bank/playbook_search_temporal/v1.0.0.prompt.md
  • reflexio/server/prompt/prompt_bank/playbook_synthesizer/v1.0.0.prompt.md
  • reflexio/server/prompt/prompt_bank/profile_critic/v1.0.0.prompt.md
  • reflexio/server/prompt/prompt_bank/profile_reader_context/v1.0.0.prompt.md
  • reflexio/server/prompt/prompt_bank/profile_reader_facts/v1.0.0.prompt.md
  • reflexio/server/prompt/prompt_bank/profile_reader_temporal/v1.0.0.prompt.md
  • reflexio/server/prompt/prompt_bank/profile_search_context/v1.0.0.prompt.md
  • reflexio/server/prompt/prompt_bank/profile_search_direct/v1.0.0.prompt.md
  • reflexio/server/prompt/prompt_bank/profile_search_temporal/v1.0.0.prompt.md
  • reflexio/server/prompt/prompt_bank/profile_synthesizer/v1.0.0.prompt.md
  • reflexio/server/prompt/prompt_bank/reconciler/v1.0.0.prompt.md
  • reflexio/server/services/extraction/__init__.py
  • reflexio/server/services/extraction/agentic_extraction_service.py
  • reflexio/server/services/extraction/critics.py
  • reflexio/server/services/extraction/readers.py
  • reflexio/server/services/generation_service.py
  • reflexio/server/services/playbook/playbook_service_utils.py
  • reflexio/server/services/profile/profile_generation_service_utils.py
  • reflexio/server/services/search/__init__.py
  • reflexio/server/services/search/agentic_search_service.py
  • reflexio/server/services/search/search_agents.py
  • reflexio/server/services/search/synthesizers.py
  • reflexio/server/services/storage/sqlite_storage/_base.py
  • reflexio/server/services/storage/sqlite_storage/_playbook.py
  • reflexio/server/services/storage/sqlite_storage/_profiles.py
  • reflexio/server/services/unified_search_service.py
  • reflexio/test_support/llm_mock.py
  • tests/conftest.py
  • tests/eval/__init__.py
  • tests/eval/aggregate.py
  • tests/eval/conftest.py
  • tests/eval/golden_set/extraction/mixed_ttl.yaml
  • tests/eval/golden_set/extraction/polars_vs_pandas.yaml
  • tests/eval/golden_set/extraction/superseded_state.yaml
  • tests/eval/golden_set/search/db_preference.yaml
  • tests/eval/golden_set/search/deadline_context.yaml
  • tests/eval/golden_set/search/superseded_rule.yaml
  • tests/eval/judge.py
  • tests/eval/judge_prompts/extraction_rubric.yaml
  • tests/eval/judge_prompts/search_rubric.yaml
  • tests/eval/test_agentic_vs_classic_extraction_integration.py
  • tests/eval/test_agentic_vs_classic_search_integration.py
  • tests/eval/test_aggregate.py
  • tests/eval/test_judge_unit.py
  • tests/models/api_schema/__init__.py
  • tests/models/api_schema/test_domain_entities.py
  • tests/models/api_schema/test_retriever_schema.py
  • tests/server/llm/test_litellm_client_tool_calls.py
  • tests/server/llm/test_model_roles.py
  • tests/server/llm/test_tools.py
  • tests/server/services/extraction/__init__.py
  • tests/server/services/extraction/test_agentic_extraction_service_integration.py
  • tests/server/services/extraction/test_critics.py
  • tests/server/services/extraction/test_readers.py
  • tests/server/services/playbook/test_structured_playbook_content.py
  • tests/server/services/profile/test_profile_add_item.py
  • tests/server/services/search/__init__.py
  • tests/server/services/search/test_agentic_search_service_integration.py
  • tests/server/services/search/test_search_agents.py
  • tests/server/services/search/test_synthesizers.py
  • tests/server/services/storage/sqlite_storage/__init__.py
  • tests/server/services/storage/sqlite_storage/test_agentic_signals.py
  • tests/server/services/test_agentic_backend_pipeline_integration.py
  • tests/server/services/test_generation_service_dispatcher.py
  • tests/server/services/test_prompt_model_mapping.py

Comment thread reflexio/models/config_schema.py
Comment on lines +395 to +401
def _supersede(args: BaseModel, ctx: ReconcilerCtx) -> dict:
a = cast(SupersedeArgs, args)
tgt = _lane_list(ctx, a.drop_lane)
if not 0 <= a.drop_index < len(tgt):
return {"error": "drop_index out of range"}
tgt.pop(a.drop_index)
return {"superseded": [a.drop_lane, a.drop_index]}

@coderabbitai coderabbitai Bot Apr 23, 2026


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if run_tool_loop processes multiple tool calls per turn that could trigger this
rg -nA 10 'for tc in tool_calls:' reflexio/server/llm/tools.py

Repository: ReflexioAI/reflexio

Length of output: 566


🏁 Script executed:

#!/bin/bash
# Find SupersedeArgs definition and understand the argument structure
rg -nB 5 -A 5 'class SupersedeArgs' reflexio/server/services/extraction/

Repository: ReflexioAI/reflexio

Length of output: 826


🏁 Script executed:

#!/bin/bash
# Check how _lane_list works - does it return a reference to the same list each time?
rg -nB 2 -A 8 'def _lane_list' reflexio/server/services/extraction/critics.py

Repository: ReflexioAI/reflexio

Length of output: 462


🏁 Script executed:

#!/bin/bash
# Check if there's any batching or deduplication of operations in the reconciler context
rg -nA 5 'class ReconcilerCtx' reflexio/server/services/extraction/

Repository: ReflexioAI/reflexio

Length of output: 553


🏁 Script executed:

#!/bin/bash
# Search for any existing tests or comments about multiple supersede operations
rg -i 'supersede.*multiple\|multiple.*supersede' reflexio/

Repository: ReflexioAI/reflexio

Length of output: 45


🏁 Script executed:

#!/bin/bash
# Check how reconciler is called - does it expect/support multiple operations per turn?
rg -nB 5 -A 15 'run_reconciler\|call_reconciler' reflexio/server/services/extraction/ | head -60

Repository: ReflexioAI/reflexio

Length of output: 45


🏁 Script executed:

#!/bin/bash
# Look for tests of reconciler to see if multiple operations per turn are tested
find reflexio -name '*test*.py' -o -name '*_test.py' | xargs rg -l 'supersede\|reconcil' 2>/dev/null | head -5

Repository: ReflexioAI/reflexio

Length of output: 45


🏁 Script executed:

#!/bin/bash
# Check the LLM tool schema/documentation for reconciler to understand how indices are provided
rg -nB 5 -A 20 'RECONCILER_TOOLS\|reconciler.*tools\|_get_tool_definitions' reflexio/server/services/extraction/ | grep -A 20 -B 5 'supersede'

Repository: ReflexioAI/reflexio

Length of output: 45


🏁 Script executed:

#!/bin/bash
# Find where the reconciler tool handlers are registered/called
rg -n 'registry\|_supersede\|_merge\|_accept' reflexio/server/services/extraction/critics.py | head -30

Repository: ReflexioAI/reflexio

Length of output: 45


🏁 Script executed:

#!/bin/bash
# Search for where ReconcilerCtx is instantiated - this shows how reconciler is invoked
rg -nB 5 -A 10 'ReconcilerCtx(' reflexio/server/services/extraction/

Repository: ReflexioAI/reflexio

Length of output: 141


🏁 Script executed:

#!/bin/bash
# Look for any LLM prompting or tool documentation that explains supersede to the model
rg -i 'drop_lane\|keep_lane\|drop_index' reflexio/ --type py | head -20

Repository: ReflexioAI/reflexio

Length of output: 45


🏁 Script executed:

#!/bin/bash
# Find the handler functions and their registration
rg -n 'def _supersede\|def _merge\|def _accept\|def _reject\|def _flag' reflexio/server/services/extraction/critics.py

Repository: ReflexioAI/reflexio

Length of output: 45


🏁 Script executed:

#!/bin/bash
# Search for handler registration or mapping
rg -n 'HANDLERS\|handlers\s*=\|register' reflexio/server/services/extraction/critics.py | head -20

Repository: ReflexioAI/reflexio

Length of output: 45


🏁 Script executed:

#!/bin/bash
# Look at the full reconciler module to understand how it's used
cat -n reflexio/server/services/extraction/critics.py | sed -n '375,420p'

Repository: ReflexioAI/reflexio

Length of output: 1890


🏁 Script executed:

#!/bin/bash
# Search for where these handler functions are actually used/called
rg -n '_supersede\|_merge\|_keep_both\|_accept\|_reject' reflexio/ --type py -A 1 -B 1 | grep -v 'def _' | head -40

Repository: ReflexioAI/reflexio

Length of output: 45


🏁 Script executed:

#!/bin/bash
# Look for tool definitions or handler dictionaries in the file
cat -n reflexio/server/services/extraction/critics.py | sed -n '1,100p'

Repository: ReflexioAI/reflexio

Length of output: 3265


🏁 Script executed:

#!/bin/bash
# Check if there's a tool handler mapping at the end of critics.py
tail -50 reflexio/server/services/extraction/critics.py

Repository: ReflexioAI/reflexio

Length of output: 1918


🏁 Script executed:

#!/bin/bash
# Find RECONCILER_TOOLS definition
rg -n 'RECONCILER_TOOLS' reflexio/server/services/extraction/critics.py

Repository: ReflexioAI/reflexio

Length of output: 142


🏁 Script executed:

#!/bin/bash
# Search the entire file for where RECONCILER_TOOLS is defined
rg -nB 2 -A 20 'RECONCILER_TOOLS\s*=' reflexio/server/services/extraction/critics.py

Repository: ReflexioAI/reflexio

Length of output: 873


Multiple supersede calls in one turn can cause index shift issues.

The _supersede handler uses pop(), which shifts list indices. If the LLM issues multiple supersede calls targeting the same lane in a single turn, subsequent calls may reference stale indices after the first pop(). Bounds checking rejects indices that fall out of range, but a shifted index that is still in range will silently remove the wrong item.

Consider processing all supersede operations in reverse index order, or collecting indices to remove and batch-processing them after the turn completes. Note: The _merge handler documents this issue (lines 412–413) but accepts it for cross-lane cases; _supersede should either match that approach or implement the batching fix.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@reflexio/server/services/extraction/critics.py` around lines 395 - 401, The
_supersede handler currently mutates the target lane list immediately (using
tgt.pop) which makes multiple supersede calls in one turn shift subsequent
indices; update _supersede (and use SupersedeArgs and _lane_list) to collect all
drop_index values for a given drop_lane and apply removals after gathering all
requests in descending index order (or otherwise batch-process removals) so
earlier removals don't change indices for later ones; ensure the function still
validates indices against the original list length and return the same
{"superseded": [...] } shape but derived from the batched operation (match the
reasoning used by _merge if you prefer the documented accept-or-batch approach).
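The batching approach the review suggests can be sketched as below. This is illustrative only (the author notes _supersede was removed in the v2 refactor), but it shows why descending-order removal avoids the shift:

```python
def apply_supersedes(lane_items: list, drop_indices: list[int]) -> list:
    """Batch-apply supersede removals without index-shift bugs (sketch).

    Validate every index against the ORIGINAL list length first, then pop
    in descending order so earlier removals never shift later indices.
    """
    out_of_range = [i for i in drop_indices if not 0 <= i < len(lane_items)]
    if out_of_range:
        raise IndexError(f"drop_index out of range: {out_of_range}")
    for i in sorted(set(drop_indices), reverse=True):
        lane_items.pop(i)
    return lane_items
```

With naive in-order pops, dropping indices 1 and 3 from a four-item lane either hits a now-out-of-range index or removes the wrong item; popping 3 before 1 sidesteps both.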

Collaborator Author


Not addressing — reflexio/server/services/extraction/critics.py was removed entirely as part of the v2 single-loop refactor (_supersede and _merge no longer exist). The agent now drives supersession via the same delete_* / create_* tools the prompt rules describe, with no shared mutable list to index-shift. Comment is moot.


@yilu331, understood — thanks for the clarification. Since reflexio/server/services/extraction/critics.py was removed entirely in the v2 single-loop refactor, the index-shift concern around _supersede is no longer applicable. The new approach using delete_* / create_* tools avoids the shared mutable list altogether, so this comment can be dismissed.

Comment thread tests/eval/golden_set/extraction/mixed_ttl.yaml Outdated
Comment thread tests/eval/judge_prompts/extraction_rubric.yaml Outdated
Comment thread tests/eval/judge_prompts/search_rubric.yaml Outdated
Comment thread tests/eval/test_aggregate.py Outdated
Comment thread tests/eval/test_aggregate.py
Comment thread tests/models/api_schema/test_domain_entities.py
Comment thread tests/server/services/search/test_search_agents.py Outdated
Comment on lines +193 to +216
def test_playbook_search_agent_temporal_includes_archived(
real_client, tool_call_completion
):
"""Temporal intent: status_filter is None so archived items are in scope."""
make_tc, _ = tool_call_completion
storage = MagicMock()
storage.search_user_playbooks.return_value = []
req = MagicMock()
req.user_id = "u1"
agent = PlaybookSearchAgent(
"temporal", client=real_client, prompt_manager=_pm(), storage=storage
)
responses = [
make_tc(
"search_playbooks",
{"query": "x", "top_k": 10, "respect_ttl": False},
),
make_tc("submit_candidates", {"ids": [], "why": "none"}),
]
with patch("litellm.completion", side_effect=responses):
agent.run(query="x", req=req)

sent = storage.search_user_playbooks.call_args.args[0]
assert sent.status_filter is None

@coderabbitai coderabbitai Bot Apr 23, 2026


⚠️ Potential issue | 🟡 Minor

Temporal playbook test also misses a direct TTL assertion.

Like the profile temporal test, this currently verifies archived-scope behavior but not the claimed TTL toggle. Assert respect_ttl=False explicitly.

✅ Suggested assertion addition
     with patch("litellm.completion", side_effect=responses):
         agent.run(query="x", req=req)
 
     sent = storage.search_user_playbooks.call_args.args[0]
+    assert sent.respect_ttl is False
     assert sent.status_filter is None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/server/services/search/test_search_agents.py` around lines 193 - 216,
The temporal playbook test test_playbook_search_agent_temporal_includes_archived
currently checks that status_filter is None but misses asserting the TTL toggle;
after obtaining sent = storage.search_user_playbooks.call_args.args[0] (from the
PlaybookSearchAgent run), add an assertion that sent.respect_ttl is False to
explicitly verify the TTL toggle for the temporal intent.

Collaborator Author


Not addressing — same root cause as the sibling profile-temporal comment. PlaybookSearchAgent and the file holding this test (tests/server/services/search/test_search_agents.py) were removed in the single-SearchAgent refactor (commit 8052df8). No equivalent surface remains to assert against.


@yilu331 — understood, thanks for the context. Since PlaybookSearchAgent and the entire test_search_agents.py file were removed as part of the single-SearchAgent refactor in 8052df8, this comment is moot. I'll disregard it.

yilu331 added 28 commits April 24, 2026 09:40
Extends ModelRole enum with four tool-calling agentic pipeline roles and
adds matching ProviderDefaults fields (defaulting to None on providers
that have not opted into tool-calling). Anthropic and claude-code
providers are populated with concrete defaults; other providers fall
through via _auto_detect_model's priority chain.

Part of the agentic extraction & search rollout
(docs/superpowers/specs/2026-04-22-agentic-extraction-search-design.md).
Introduce ToolCallingChatResponse and wire tools/tool_choice through
_build_completion_params; resolve model_role via resolve_model_name.
Also add openai ProviderDefaults for agentic roles (angle_reader,
critic, synthesizer, reconciler) to fix pre-existing test failure.
Adds ToolLoopTurn/ToolLoopTrace/ToolLoopResult and run_tool_loop that
drives an LLM through a tool loop via tools=/tool_choice= and falls
back to structured output for non-tool-calling providers.
Catch exceptions in the native tool loop and return a terminal error
ToolLoopResult instead of letting them propagate. Makes the 'error'
Literal value actually reachable and adds test coverage.
…schemas

Adds three optional fields to ProfileAddItem, StructuredPlaybookContent,
UserProfile, and UserPlaybook. Fields are nullable and default to None,
so callers and existing data are unchanged. Consumed by agentic
extraction readers to carry evidence and attribution through to storage.
…and user_playbooks

Extends the CREATE TABLE DDL and adds an in-place _migrate_agentic_signals
helper so existing DBs are upgraded on startup. Adds the fields to the
INSERT/UPDATE/SELECT row-mapping sites. Columns are nullable and default
to NULL — classic extraction leaves them blank.
…ories

Adds extraction_backend / search_backend Literal['classic','agentic']
fields to Config (default 'classic', keeps existing behavior). Adds
build_extraction_service / build_search_service factories that dispatch
on those flags. Agentic targets are lazy-imported — Phase 3/4 implement
them, Phase 6 wires the factories into GenerationService.run.

Also adds UnifiedSearchService class wrapper to unified_search_service.py
so the dispatcher factory returns a uniform class-based handle.

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (5)
tests/server/test_logging_timezone.py (1)

31-33: Tighten the direct timestamp assertion to avoid false positives.

Line 32 uses match, which can still pass with trailing junk. For formatTime() output, fullmatch gives a stricter regression guard.

Proposed assertion tweak
-        assert _TZ_PATTERN.match(rendered), f"timestamp missing TZ offset: {rendered!r}"
+        assert _TZ_PATTERN.fullmatch(rendered), f"timestamp missing TZ offset: {rendered!r}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/server/test_logging_timezone.py` around lines 31 - 33, Replace the
loose regex check using _TZ_PATTERN.match(rendered) with a strict full match so
the entire formatted timestamp is validated; update the assertion in the test
(the line that sets rendered = formatter.formatTime(record) and the subsequent
assert) to use _TZ_PATTERN.fullmatch(rendered) instead of match (retain the same
failure message and variables: rendered, formatter.formatTime(record), and
_TZ_PATTERN).
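The match-versus-fullmatch distinction is easy to demonstrate. The pattern below is illustrative, in the spirit of the `_TZ_PATTERN` in the test module, not copied from it:

```python
import re

# A timestamp that must END with an ISO-extended UTC offset like -07:00.
TZ = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} [+-]\d{2}:\d{2}")

clean = "2026-04-24 10:20:51 -07:00"
with_junk = "2026-04-24 10:20:51 -07:00 <garbage>"

# match() only anchors at the start of the string, so trailing junk still
# passes; fullmatch() requires the entire string to conform.
assert TZ.match(with_junk) is not None      # loose check: passes anyway
assert TZ.fullmatch(with_junk) is None      # strict check: catches the junk
assert TZ.fullmatch(clean) is not None
```

That is the regression gap the suggested `fullmatch` swap closes.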
tests/server/services/extraction/test_agentic_v2_e2e.py (1)

207-251: Storage assertion occurs after context manager exit.

Lines 246-248 access request_context.storage after the TemporaryDirectory context manager has exited, at which point the backing SQLite file has been deleted. An already-open connection may still read on POSIX systems, but it's safer to move the assertions inside the context.

♻️ Move assertions inside the context
     with tempfile.TemporaryDirectory() as temp_dir:
         request_context = RequestContext(org_id=org_id, storage_base_dir=temp_dir)
         gs = GenerationService(llm_client=client, request_context=request_context)
         gs.configurator.get_config = MagicMock(return_value=_make_agentic_config())  # type: ignore[method-assign]

         with patch(
             "reflexio.server.services.extraction.agentic_adapter.ExtractionAgent"
         ) as mock_agent_cls:
             request = PublishUserInteractionRequest(...)
             result = gs.run(request)

-    # ExtractionAgent was never instantiated.
-    mock_agent_cls.assert_not_called()
+        # ExtractionAgent was never instantiated.
+        mock_agent_cls.assert_not_called()

-    # No profiles persisted.
-    assert request_context.storage is not None
-    profiles = request_context.storage.get_user_profile(user_id)
-    assert profiles == [], f"expected no profiles; got {profiles}"
+        # No profiles persisted.
+        assert request_context.storage is not None
+        profiles = request_context.storage.get_user_profile(user_id)
+        assert profiles == [], f"expected no profiles; got {profiles}"

-    # Result must not have raised (warnings may be empty or trivial).
-    assert result.request_id is not None
+        # Result must not have raised (warnings may be empty or trivial).
+        assert result.request_id is not None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/server/services/extraction/test_agentic_v2_e2e.py` around lines 207 -
251, In test_e2e_agentic_v2_extraction_agent_not_invoked_for_trivial_session
move the storage-related assertions so they run before the TemporaryDirectory
context exits: after the block that calls gs.run(request) (but still inside the
with tempfile.TemporaryDirectory() as temp_dir: scope) assert
request_context.storage is not None, call
request_context.storage.get_user_profile(user_id) and assert profiles == [], and
also assert result.request_id is not None there; this ensures
RequestContext/storage is still valid when inspected and removes the
post-context checks that currently run after the temp dir is closed.
reflexio/server/services/extraction/tools.py (1)

154-156: _status_from_str can raise KeyError on invalid input.

If an unexpected status string is passed (e.g., from malformed LLM output that bypasses Pydantic validation), the dict lookup will raise KeyError. Consider using .get() with a default or explicit error handling.

🛡️ Suggested defensive fix
 def _status_from_str(s: str) -> Status | None:
-    return {"current": None, "pending": Status.PENDING, "archived": Status.ARCHIVED}[s]
+    mapping = {"current": None, "pending": Status.PENDING, "archived": Status.ARCHIVED}
+    if s not in mapping:
+        raise ValueError(f"Invalid status: {s!r}")
+    return mapping[s]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@reflexio/server/services/extraction/tools.py` around lines 154 - 156,
_status_from_str currently does a direct dict lookup which will raise KeyError
for unexpected strings; change it to handle invalid inputs defensively by using
a safe lookup (e.g., mapping.get(s)) and either return a default (None) or raise
a clear ValueError with context; update the function _status_from_str(s: str) to
explicitly handle unknown keys and reference the Status enum values
(Status.PENDING, Status.ARCHIVED) in the error/return path so malformed LLM
output doesn't propagate a KeyError.
tests/server/services/extraction/test_agentic_adapter.py (1)

131-134: Same os.environ.setdefault leak pattern as e2e tests.

Lines 133, 573, and 687 use the same pattern that leaks API key environment variables across test runs. Consider using patch.dict(os.environ, ...) for isolation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/server/services/extraction/test_agentic_adapter.py` around lines 131 -
134, Replace the leaking os.environ.setdefault calls used before creating
LiteLLMClient/LiteLLMConfig with a scoped environment patch using
unittest.mock.patch.dict to avoid cross-test pollution: wrap the client
instantiation (where LiteLLMClient(LiteLLMConfig(...)) is created) in a with
patch.dict(os.environ, {"ANTHROPIC_API_KEY": "test-key"}, clear=False): block
(or use the pytest monkeypatch fixture to setenv) so the API key is set only for
that test and automatically restored after; update all occurrences where
os.environ.setdefault("ANTHROPIC_API_KEY", ...) is used (including the instances
near the LiteLLMClient creation) to use this scoped approach.
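The scoped alternative the comment recommends can be sketched like this; `build_client` stands in for the `LiteLLMClient(LiteLLMConfig(...))` construction in the tests:

```python
import os
from unittest.mock import patch

def build_client() -> str:
    """Stand-in for the client construction that reads the API key."""
    return os.environ["ANTHROPIC_API_KEY"]

had_key_before = "ANTHROPIC_API_KEY" in os.environ

# patch.dict sets the variable only for the duration of the block and
# restores the original environment on exit. os.environ.setdefault, by
# contrast, persists for the remainder of the test session.
with patch.dict(os.environ, {"ANTHROPIC_API_KEY": "test-key"}, clear=False):
    key_inside = build_client()

had_key_after = "ANTHROPIC_API_KEY" in os.environ
```

Inside pytest, the `monkeypatch.setenv` fixture achieves the same scoped restoration with less ceremony.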
reflexio/server/llm/tools.py (1)

53-55: Guard against duplicate tool registration names.

register() silently overwrites existing entries. A duplicate name should fail fast; otherwise the wrong handler can be invoked with no signal.

Proposed fix
     def register(self, tool: Tool) -> None:
+        if tool.name in self._tools:
+            raise ValueError(f"duplicate tool registration: {tool.name}")
         self._tools[tool.name] = tool
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@reflexio/server/llm/tools.py` around lines 53 - 55, The register method
currently overwrites existing entries in self._tools; modify register(self,
tool: Tool) to guard against duplicate tool names by checking if tool.name
already exists in self._tools and failing fast (e.g., raise a ValueError or
custom DuplicateToolError) with a clear message that includes the duplicate
name; ensure you reference register, self._tools, and tool.name so callers get
an explicit error instead of silent overwrite.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@reflexio/server/__init__.py`:
- Around line 69-76: The docstring for class _TZAwareFormatter is out of sync
with actual output: update the example timestamp and description to reflect the
runtime format emitted by _TZAwareFormatter (it uses an ISO-extended UTC offset
like -07:00 and may include a TZ abbreviation), not the old `-0700` form; edit
the class docstring in reflexio/server/__init__.py (class _TZAwareFormatter
inheriting logging.Formatter) to show a correct sample such as "2026-04-24
10:20:51.238 -07:00 PST" (or note that the offset is ISO-extended and a TZ
abbrev may be appended) and adjust the explanatory text to mention fallback
behavior remains +00:00 when timezone is missing.

In `@reflexio/server/llm/tools.py`:
- Around line 242-301: The per-tool latency uses t0 (turn start) so later tool
calls include model + earlier-tool time; change it to measure each tool's own
start and compute latency from that: inside the for tc in tool_calls loop record
a per-call start (e.g., call_t0 = time.monotonic()) immediately before calling
registry.handle(name, args_json, ctx) and then set ToolLoopTurn.latency_ms to
int((time.monotonic() - call_t0) * 1000) instead of using t0; keep t0 for
turn-level/model latency if needed and update references to t0 only where
turn-level timing is intended (ToolLoopTurn.model/prompt/completion/total tokens
remain unchanged).
- Around line 224-237: The fallback loop currently iterates all items unbounded
and always returns ToolLoopResult(..., finished_reason="finish_tool"); change
the loop in the function that iterates over items to respect the configured
max_steps cap (use the existing max_steps variable or parameter) by tracking a
step counter, break once steps >= max_steps before calling registry.handle for
the next item, and append ToolLoopTurn only for executed steps; set
trace.finished and ToolLoopResult.finished_reason to a distinct value (e.g.,
"max_steps" or "exhausted") when stopped by the cap, and only use "finish_tool"
when all items were processed naturally; reference symbols: items,
fallback_tool_name, registry.handle, ToolLoopTurn, trace, ctx, ToolLoopResult,
and max_steps.

In `@reflexio/server/prompt/prompt_bank/extraction_agent/v1.1.0.prompt.md`:
- Around line 39-44: The prompt uses ambiguous generic calls `delete`/`create`
in the supersession and profile-merge rules; change the wording to require
concrete, entity-specific tool names (e.g. `delete_user_profile` /
`create_user_profile`, `delete_order` / `create_order`, etc.) so the agent
invokes the correct runtime surface. Update the supersession line to instruct
"call delete_<entity> on the stale id, then create_<entity> with the new
content" and the profile-merge line to instruct "call delete_<entity> on each
duplicate, then create_<entity> once with the merged wording" referencing these
concrete patterns (delete_<entity>, create_<entity>, delete_user_profile,
create_user_profile) so callers know to use entity-specific tool names.

In `@tests/server/services/extraction/test_agentic_adapter.py`:
- Around line 519-527: The test constructs a Violation with code "B" but uses
severity "soft" — change the violation code to one that is actually soft (e.g.,
"E", "H", or "K") so the test validates soft-violation handling correctly;
update the Violation instantiation referenced by soft_violation (and the related
CommitResult result_with_soft) to use a soft-code like "E" (keeping
severity="soft" and the same msg/affected_op_indices).

---

Nitpick comments:
In `@reflexio/server/llm/tools.py`:
- Around line 53-55: The register method currently overwrites existing entries
in self._tools; modify register(self, tool: Tool) to guard against duplicate
tool names by checking if tool.name already exists in self._tools and failing
fast (e.g., raise a ValueError or custom DuplicateToolError) with a clear
message that includes the duplicate name; ensure you reference register,
self._tools, and tool.name so callers get an explicit error instead of silent
overwrite.
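A sketch of the fail-fast guard, with a minimal hypothetical `Tool`/`ToolRegistry` in place of the real classes:

```python
from dataclasses import dataclass

@dataclass
class Tool:
    name: str
    description: str = ""

class ToolRegistry:
    def __init__(self):
        self._tools: dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        # fail fast instead of silently overwriting an existing tool
        if tool.name in self._tools:
            raise ValueError(f"duplicate tool name: {tool.name!r}")
        self._tools[tool.name] = tool
```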

In `@reflexio/server/services/extraction/tools.py`:
- Around line 154-156: _status_from_str currently does a direct dict lookup
which will raise KeyError for unexpected strings; change it to handle invalid
inputs defensively by using a safe lookup (e.g., mapping.get(s)) and either
return a default (None) or raise a clear ValueError with context; update the
function _status_from_str(s: str) to explicitly handle unknown keys and
reference the Status enum values (Status.PENDING, Status.ARCHIVED) in the
error/return path so malformed LLM output doesn't propagate a KeyError.

In `@tests/server/services/extraction/test_agentic_adapter.py`:
- Around line 131-134: Replace the leaking os.environ.setdefault calls used
before creating LiteLLMClient/LiteLLMConfig with a scoped environment patch
using unittest.mock.patch.dict to avoid cross-test pollution: wrap the client
instantiation (where LiteLLMClient(LiteLLMConfig(...)) is created) in a with
patch.dict(os.environ, {"ANTHROPIC_API_KEY": "test-key"}, clear=False): block
(or use the pytest monkeypatch fixture to setenv) so the API key is set only for
that test and automatically restored after; update all occurrences where
os.environ.setdefault("ANTHROPIC_API_KEY", ...) is used (including the instances
near the LiteLLMClient creation) to use this scoped approach.
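The scoped-environment pattern, with a hypothetical `make_client` standing in for the `LiteLLMClient(LiteLLMConfig(...))` construction:

```python
import os
from unittest.mock import patch

def make_client() -> str:
    # stand-in for LiteLLMClient(LiteLLMConfig(...)) reading its API key
    return os.environ["ANTHROPIC_API_KEY"]

with patch.dict(os.environ, {"ANTHROPIC_API_KEY": "test-key"}, clear=False):
    key_inside = make_client()
# the prior environment is restored on exit, so nothing leaks between tests
```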

In `@tests/server/services/extraction/test_agentic_v2_e2e.py`:
- Around line 207-251: In
test_e2e_agentic_v2_extraction_agent_not_invoked_for_trivial_session move the
storage-related assertions so they run before the TemporaryDirectory context
exits: after the block that calls gs.run(request) (but still inside the with
tempfile.TemporaryDirectory() as temp_dir: scope) assert request_context.storage
is not None, call request_context.storage.get_user_profile(user_id) and assert
profiles == [], and also assert result.request_id is not None there; this
ensures RequestContext/storage is still valid when inspected and removes the
post-context checks that currently run after the temp dir is closed.

In `@tests/server/test_logging_timezone.py`:
- Around line 31-33: Replace the loose regex check using
_TZ_PATTERN.match(rendered) with a strict full match so the entire formatted
timestamp is validated; update the assertion in the test (the line that sets
rendered = formatter.formatTime(record) and the subsequent assert) to use
_TZ_PATTERN.fullmatch(rendered) instead of match (retain the same failure
message and variables: rendered, formatter.formatTime(record), and _TZ_PATTERN).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 9cc3617b-a8e1-4495-9e14-1529e8b85aa4

📥 Commits

Reviewing files that changed from the base of the PR and between 465c4ed and ac15700.

📒 Files selected for processing (20)
  • reflexio/server/__init__.py
  • reflexio/server/llm/litellm_client.py
  • reflexio/server/llm/tools.py
  • reflexio/server/prompt/prompt_bank/extraction_agent/v1.0.0.prompt.md
  • reflexio/server/prompt/prompt_bank/extraction_agent/v1.1.0.prompt.md
  • reflexio/server/prompt/prompt_bank/extraction_agent/v1.2.0.prompt.md
  • reflexio/server/prompt/prompt_bank/search_agent/v1.0.0.prompt.md
  • reflexio/server/prompt/prompt_bank/search_agent/v1.1.0.prompt.md
  • reflexio/server/services/extraction/agentic_adapter.py
  • reflexio/server/services/extraction/extraction_agent.py
  • reflexio/server/services/extraction/tools.py
  • reflexio/server/services/search/search_agent.py
  • tests/server/llm/test_tools.py
  • tests/server/services/extraction/test_agentic_adapter.py
  • tests/server/services/extraction/test_agentic_v2_e2e.py
  • tests/server/services/extraction/test_extraction_agent.py
  • tests/server/services/extraction/test_tools.py
  • tests/server/services/search/test_search_agent.py
  • tests/server/services/test_prompt_model_mapping.py
  • tests/server/test_logging_timezone.py
✅ Files skipped from review due to trivial changes (1)
  • reflexio/server/prompt/prompt_bank/extraction_agent/v1.2.0.prompt.md
🚧 Files skipped from review as they are similar to previous changes (7)
  • reflexio/server/prompt/prompt_bank/search_agent/v1.0.0.prompt.md
  • reflexio/server/prompt/prompt_bank/extraction_agent/v1.0.0.prompt.md
  • tests/server/services/search/test_search_agent.py
  • tests/server/services/test_prompt_model_mapping.py
  • reflexio/server/services/extraction/agentic_adapter.py
  • tests/server/llm/test_tools.py
  • reflexio/server/services/extraction/extraction_agent.py

yilu331 added 22 commits April 24, 2026 12:28
Three agentic search tool handlers (_handle_search_user_profiles,
_handle_search_user_playbooks, _handle_search_agent_playbooks) built
Search*Request objects declaring search_mode=HYBRID but never computed
a query embedding. Storage's _effective_search_mode logged
'no query embedding provided — falling back to FTS' on every call and
downgraded to FTS-only search — a quality regression vs. the classic
path (unified_search_service.py:151-158 computes the embedding in
Phase A before dispatching).

Fix: new _maybe_embed_query(storage, query) helper reads
storage._get_embedding (already used by classic), swallows AttributeError
and provider failures, returns None for graceful fallback. Each of the
three handlers now passes the computed embedding:
- profile search uses query_embedding= kwarg (matching
  _profiles.py:199-205 signature)
- playbook search methods use options=SearchOptions(...) (matching
  _playbook.py:403-434 signature — the signature asymmetry is a
  pre-existing BaseStorage oddity)

Six regression tests cover: helper fallback (no embedder / raising
embedder / happy path) and per-handler embedding plumbing.
Real sessions surfaced two extraction anti-patterns the v1.2.0 prompt
did not prevent strongly enough:

1. Bundled profiles: 'User is a senior Go engineer and is on-call this
   week' — two atomic facts in one content string. Trapped under a
   single TTL (one loses precision), invariant E can't dedup, clean
   supersession impossible.

2. Profile / playbook content overlap: 'prefers no code review
   scheduling before 10:00 AM during on-call period' leaked into
   profile content AND was (correctly) emitted as a playbook. Two
   stores of the same rule → divergence when one updates, retrieval
   noise, breaks the self-improvement-axis separation.

v1.3.0 changes:
- Narrows the UserProfile scope description from ambiguous 'preferences'
  to concrete 'role, skills, environment, timezone, tools, ongoing
  status' — the LLM was latching onto 'preferences' as license to
  record rule-shaped text.
- Adds explicit Fact vs. Rule disambiguator ('is this something the
  user IS / HAS, or what the agent should DO when X?').
- Adds concrete ✅/❌ examples for both kinds in the scope block.
- New rule 3: 'One fact per profile'. New rule 7: 'No overlap between
  profile and playbook'. Existing rules renumbered.

Old v1.2.0 marked active: false; PROMPT_VERSION_MAP bumped to v1.3.0.
Sanity-render test extended to assert the new anti-pattern text and
rules are present; guards against accidental revert.
load_reflexio_env now stores the resolved .env path in a module-level
global; get_loaded_env_path() exposes it. print_startup_banner gains an
optional config_paths dict that renders as a 'Config' section between
the service URLs and the Logs line. HOME-prefixed paths collapse to
~/... for readability while remaining absolute in logs.

Enables the enterprise run_services to surface Env / Config / Storage
paths in the dev-server banner so operators can confirm which files
the server actually loaded without grepping logs.
…ent for playbooks

Today's extraction prompt (v1.3.0) covers fact-vs-rule separation but
says nothing about the SHAPE of the trigger/content/rationale fields.
Result: inconsistent playbook output — sometimes numbered inline lists
with semicolons, sometimes run-on sentences, sometimes single clauses.
The downstream agent has to re-parse each shape.

Borrows the Agent Skills spec (agentskills.io) where skill description
is the retrieval key and skill body is the on-activation instruction
set — structurally identical to UserPlaybook.trigger and .content.

v1.4.0 adds a new 'Playbook format' section with:

- trigger rules — imperative conditional phrasing, keyword coverage,
  150-300 char budget, explicit narrow/broad anti-patterns.
- content rules — markdown bullet list (or numbered when order is
  load-bearing), verb-led instructions, self-sufficient bullets,
  500-2000 char range.
- rationale rules — one sentence WHY, leave empty rather than restate
  the content.
- Concrete good/bad examples for each field, including the code-review
  playbook the user flagged as unstructured.

Rule 6 (playbook expansion) updated to show bullet-preserving merge.
Zero code/schema changes; purely prompt narrative + examples. One new
sanity-render test asserts the format guidance is present.
PROMPT_VERSION_MAP bumped; v1.3.0 retained with active: false.
…igger

Extension of the prompt v1.4.0 work to the user → agent playbook
aggregation step. The previous v2.0.0 aggregator emitted AgentPlaybook
content as 'human-readable prose' (line 54 of v2.0.0) — single long
sentences. Now that UserPlaybooks come in bullet-shaped from v1.4.0,
the aggregator output was the odd one out.

Problem: a downstream agent retrieving memory sees both UserPlaybooks
(bullets) and AgentPlaybooks (prose) for the same retrieval query.
Two shapes to parse; inconsistent display in any UI that shows both.

v2.1.0 adds the same ## Playbook format section used in the extraction
prompt, adapted for aggregation:

- trigger: consolidate all input triggers into ONE imperative-conditional
  phrase with broad keyword coverage.
- content: take the UNION of bullets across input entries, dedup
  semantically overlapping ones. When inputs are historical prose,
  re-shape to bullets on output (the aggregator is the right place to
  do the upgrade).
- rationale: one sentence across inputs; omit rather than restate content.

All 3 concrete examples at the bottom of the prompt upgraded to use
bullet-shaped content so the LLM pattern-matches on the new shape.

Zero code changes. Zero schema changes. PROMPT_VERSION_MAP bumped
playbook_aggregation v2.0.0 -> v2.1.0. One sanity-render test guards
against silent regression to prose.
The banner only showed dev_server.log, but when operators are debugging
prompt or tool-call issues they need llm_io.log first. llm_io.log is
where every agent turn / tool-call is rendered; dev_server.log is the
general log with the agent SUMMARY lines (elapsed_ms, turns, tools,
usage). Two different files for two different debugging modes.

Banner now renders:

    --------------------------------------------
      Dev log    ~/.reflexio/logs/dev_server.log
      LLM I/O    ~/.reflexio/logs/llm_io.log
    ============================================

HOME is collapsed to ~ for both paths (consistent with the Config
section above) via a small _collapse_home helper — factored out of
the inline expression used earlier in the config_paths loop.
Aligns the granular profile-search endpoint with the rest of the
user_* CRUD surface (get_user_profile, add_user_profile,
delete_user_profile, search_user_playbooks, search_agent_playbooks).
The previous search_profiles was the only outlier.

Affects: Python client method, lib mixin, HTTP route /api/search_user_profiles,
CLI subcommand reflexio user-profiles search-user-profiles, retrieval-latency
benchmark, and all tests + docs. No backward-compat shim — outright
rename per pre-1.0 API discipline.
…laybooks carry provenance

The agentic extraction backend was hardcoding empty strings into
UserProfile.generated_from_request_id and UserPlaybook.request_id at
commit time (tools.py:603, 616). Result: every agentic-extracted entity
had no provenance back to its source publish_interaction request, which
silently broke any downstream consumer that translates retrieved
profiles back to their originating session — most visibly the
LongMemEval recall@K metric.

The classic backend already threads request_id correctly via
service_config.request_id, embedding it into the model at extractor
time. This change mirrors that pattern in the agentic path:

1. ExtractionCtx gains a request_id: str = "" field (default empty for
   backward-compat with test callers that don't have a publish in scope).
2. ExtractionAgent.run accepts request_id: str = "" and threads it into
   ExtractionCtx.
3. AgenticExtractionRunner.run already had request_id as a parameter
   (kept for contract parity); we drop the # noqa: ARG002 and forward
   it to agent.run.
4. apply_plan_op reads ctx.request_id and embeds it into both the
   UserProfile and UserPlaybook constructors.

Tests:

- 3 new unit tests in test_extraction_agent.py covering: profile
  thread, playbook thread, and backward-compat default ("").
- test_agentic_v2_e2e.test_e2e_agentic_v2_full_flow strengthened with
  per-profile and per-playbook provenance assertions against
  result.request_id.

Verified: 93 extraction tests + 2046 full submodule non-e2e tests pass.
Lint clean. Pyright shows only 2 pre-existing import-path warnings in
tools.py unrelated to this change.
Bug: SearchMixin.unified_search hardcoded a call to run_unified_search
regardless of config.search_backend. The AgenticSearchService and the
underlying SearchAgent (server/services/search/) were fully implemented
but unreachable from the public API — setting Config.search_backend to
"agentic" had no observable effect.

Symptoms verified on a 500-question LongMemEval oracle benchmark run:
- recall@5 was identical to 6 decimals between classic_<x>_search and
  agentic_<x>_search cells across every question
- "search_agent elapsed_ms=" log marker count was zero across the
  entire 8.5h run
- search wall-time was ~500ms (hybrid SQL+vector) instead of the
  expected ~15-20s (multi-turn LLM tool loop)

Fix: lib/_search.py::unified_search now reads config.search_backend
and dispatches to AgenticSearchService.search() when it equals
"agentic", or falls through to run_unified_search otherwise.

The dispatcher factory build_search_service() already exists in
generation_service.py:448 but couldn't be reused as-is because
UnifiedSearchService (the classic side) doesn't expose a .search()
method — it's just a container class. Used explicit if/else here
rather than restructuring UnifiedSearchService for parity, which
would be a larger refactor.

Tests:
- New test_dispatches_to_agentic_when_search_backend_agentic asserts
  the AgenticSearchService is constructed and .search() called.
- New test_dispatches_to_classic_when_search_backend_classic asserts
  run_unified_search is invoked and AgenticSearchService is not.
- Existing test_delegation_to_service updated to set
  mock_config.search_backend = "classic" explicitly (was a MagicMock
  attribute which masked the dispatch logic).

Verified end-to-end on 10-question oracle SS-A smoke at concurrency=4:
- search_agent log marker count: 0 → 27
- search wall: 411 ms → 17,435 ms (agentic vs classic retrieve)
- recall@5 cells diverge: agentic_classic 1.000 vs agentic_agentic
  0.700 (was identical pre-fix)
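The dispatch shape described above, with stand-in stubs for both backends (the real signatures in lib/_search.py differ):

```python
from types import SimpleNamespace

class AgenticSearchService:
    """Stand-in for the real service; only .search() matters here."""
    def __init__(self, storage, config):
        self.storage, self.config = storage, config

    def search(self, request):
        return ("agentic", request)

def run_unified_search(storage, config, request):
    return ("classic", request)

def unified_search(config, storage, request):
    # explicit if/else: the classic UnifiedSearchService exposes no
    # .search() method, so there is no common interface to dispatch on
    if getattr(config, "search_backend", "classic") == "agentic":
        return AgenticSearchService(storage, config).search(request)
    return run_unified_search(storage, config, request)

result = unified_search(SimpleNamespace(search_backend="agentic"), None, "q")
```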
Bug uncovered while debugging LongMemEval-oracle SS-A cell
(agentic_agentic R@5 = 0.700 vs agentic_classic R@5 = 1.000):

The v1.1.0 prompt's "Re-query freely. Rephrasing, narrowing, or trying
orthogonal angles is expected" was being interpreted by gpt-5-mini as
"extract keywords and search". 100% of search-tool calls (50/50 in a
10-question smoke) used keyword-bag queries like
"shift rotation Admon Sunday GM social media" instead of the original
natural-language question.

Vector recall depends on natural-language phrasing matching how the
underlying memory was written; bag-of-keywords queries produce poor
text-embedding-3-small embeddings. Classic search uses the literal
question and gets perfect R@5; the SearchAgent was hurting itself by
reformulating before the first retrieval.

v1.2.0 changes:
- New rule 1 (was rule 4): "First call: verbatim. Your first tool call
  MUST pass the user's query VERBATIM as the `query` argument."
- Updated rule 5: "Re-query is for narrowing or orthogonal angles, not
  for paraphrasing into keywords."
- Concrete bad-example callout: "shift rotation Admon Sunday".

Verified on the same 10-question SS-A smoke:
- Turn-1 query distribution: 100% keyword-bag → 95% natural-language.
- agentic_agentic R@5: 0.700 → 0.778 (+0.08).
- agentic_agentic SS-A accuracy: 30.0% → 33.3% (+3.3pp).

Deactivated v1.1.0 (active: false). Bumped PROMPT_VERSION_MAP entry to
v1.2.0 in tests/server/services/test_prompt_model_mapping.py.
…t, cap tool budget to <=3, require explicit grounding and temporal handling, and add concise answer formatting and aggregation rules
Cap the extraction agent's tool-loop at 4 rounds (was 12) and the search
agent's at 3 rounds (was 10), and surface that cap into the prompt as a
template variable so the LLM agent plans accordingly.

Why: per benchmark profiling, both agents routinely use 5-8 rounds when
the productive shape is search → batch creates → finish (3 rounds for
extraction; 1-2 search calls + finish for search). The accumulated
tool-call history dominates input tokens by round 4, so capping early
cuts ~30-50% of per-call wall time.

The prompt change is materially load-bearing:
- Without the {max_steps} variable, the agent doesn't know its budget
  and may fall short of finish() when the loop terminates early.
- With it, the agent batches creates into one turn and reserves the
  last round explicitly for finish().

Floor analysis: extraction needs 3 (search → create → finish); search
needs 2 (search → finish). The 4 / 3 caps leave one optional round each
for follow-up searches.

Tests updated: prompt-render tests pass max_steps so the prompt manager
doesn't reject as missing-variable.
…l phrase matching and session-first heuristics, and reinforce existing grounding and budget rules to reduce rounds and improve temporal/count accuracy
…s and templates, emphasize inspecting top-5 snippets for missing atoms, and reinforce temporal/count handling while preserving core rules
…trengthen snippet inspection checklist, enforce targeted single follow-ups, and emphasize temporal/count handling to reduce reformulation loss and improve recall
…able-item guidance, and a rule to preserve temporal markers; reinforce single-fact profiles and content formatting (markdown bullet list, imperative conditional phrasing).
…phasize date encoding and atomic facts; keep required phrases and tighten playbook/profile separation
…xplicit pre-create checklist and temporal examples, and emphasize atomic facts and playbook formatting
…rules: require copying dates/ids verbatim, prefer enumeration of unique names for counts, add full-metadata follow-up template, inspect top ~10 for counts/temporal queries, and emphasize existing core constraints
…lity

profile_id was a full 36-char str(uuid.uuid4()) — error-prone for smaller
LLMs (gpt-5-mini and below) to copy verbatim from a search result back
into a delete/update tool argument. Switch to 12 hex chars (~2.8e14
unique values via uuid.uuid4().hex[:12]).

Why 12: short enough that LLM transcription is high-fidelity; long
enough that birthday-paradox collisions are vanishingly small at any
realistic per-user scale. PRIMARY KEY constraint catches the rare
collision and fails loudly.

Why only profile_id: playbook ids are INTEGER autoincrements (already
LLM-friendly); interaction_id and request_id aren't passed back as tool
args so their format doesn't affect agent reliability.

Existing UUID profile_ids in storage continue to work — they're still
valid TEXT primary keys. New profiles get the short format. Benchmarks
that wipe + re-ingest will be 100% short-id immediately.

Two call sites updated:
- extraction/tools.py:596 (agentic backend)
- profile/profile_extractor.py:282 (classic backend)

Both go through new helper extraction.tools.new_profile_id().
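A sketch of the helper, matching the commit's description of `uuid.uuid4().hex[:12]`:

```python
import uuid

def new_profile_id() -> str:
    """12 hex chars (16**12 ≈ 2.8e14 values): short enough for reliable
    LLM transcription, long enough that birthday-paradox collisions are
    negligible at realistic per-user scale; the PRIMARY KEY constraint
    catches the rare collision loudly."""
    return uuid.uuid4().hex[:12]
```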

Test brittleness fixes (unrelated to id change but surfaced by Phase 27
iter-2 prompt rewrite): two extraction-agent prompt-content tests were
asserting on specific example strings ("prefers no code review...",
'"When …"') that the proposer is allowed to evolve. Relaxed those to
assert on the load-bearing rule wording instead.
The agentic_search service was coercing missing user_id and agent_version
to empty strings before passing them to SearchAgent. Empty user_id then
flowed through to storage operations — SqliteStorage would query with
WHERE user_id='' (cross-user data leakage potential), DiskStorage would
write to an unintended path. Surface the bug at the boundary instead of
silently degrading.

Addresses CodeRabbit review feedback on PR #34 (Critical).
Source fixes
- run_tool_loop: respect max_steps in fallback path; use per-tool
  monotonic clock for ToolLoopTurn.latency_ms instead of turn-start
  (per-tool timings were inflated by model latency + earlier tools).
- service_utils._format_tool_calls: fall back to repr() when args are
  not JSON-serialisable so the logging path can never raise.
- extraction tools.commit_plan: validate DeleteUserPlaybookOp.id before
  int() coercion and raise a clear TypeError instead of leaking ValueError.
- config_schema._migrate_field_names: strip null extraction_backend /
  search_backend so existing stored configs deserialize with defaults.
- _TZAwareFormatter docstring: update example to match runtime output
  (ISO extended offset + optional TZ abbreviation).
- extraction prompt v1.1.0: use entity-specific tool names in the
  supersession / profile-merge rules (delete_user_profile, etc.).

Eval suite hardening
- judge.py: constrain JudgeScore floats to [0, 1] via Field(ge, le).
- conftest._load: sort golden cases by YAML 'id', fail loudly when a
  case has no id (previously sorted by filename, which silently
  desynced when filenames drifted).
- judge_prompts: align rubric prompts with the JudgeScore schema —
  fold nuance / must-not-violated guidance into the existing
  signal_f1 / answer_correctness fields rather than emitting
  unrequested keys.
- mixed_ttl golden case: use schema-valid TTL values (infinity,
  one_week) instead of 'persistent' / 'short_term'.
- test_aggregate: pytest.approx for float means, bound p95 latency
  asserts top + bottom so overestimation fails.

Test coverage
- test_domain_entities: assert source_span and notes (not just
  reader_angle) on the optional-field positive paths.
- test_tools: assert known_ids actually grew on the search-handler
  populates_known_ids tests.
- test_generation_service_dispatcher: drop ImportError skip so a
  broken AgenticSearchService import surfaces immediately.
- test_agentic_search_service: align org_id between RequestContext
  and the temp_storage fixture; pass agent_version on UnifiedSearchRequest
  to satisfy the API-boundary check.
- test_agentic_v2_e2e: scope OPENAI_API_KEY env mutation to client
  construction via patch.dict so it can't leak between tests.
- test_agentic_adapter: switch the soft-violation case to code 'E'
  (B is hard per invariants.py) so the soft-handling test is real.
@yilu331 yilu331 changed the title feat: add agentic extraction & search pipelines feat: agentic extraction & search pipelines (single-loop v2) Apr 26, 2026
yilu331 added 6 commits April 26, 2026 04:09
The previous commit 5ad98e9 added validation that rejected empty
request.agent_version, surfacing as 500 errors. Only request.user_id has
data-isolation implications (it scopes ALL storage operations);
agent_version only scopes AgentPlaybook reads, which are read-only and
where empty means 'no scope filter, return any agent_playbook'.

The benchmark's client.search() doesn't pass agent_version, so the
strict validation caused 500s on every retrieval call. Keep the
user_id check, drop the agent_version check, default to '' inside
the agent.run() call.
Add 'reflexio setup openai-codex' — a self-contained PKCE OAuth flow
against auth.openai.com that lets reflexio sign in with the user's
ChatGPT subscription and persist tokens to ~/.reflexio/auth/openai-codex.json.

What this replaces: the codex proxy previously read tokens from
~/.openclaw/agents/main/agent/auth-profiles.json, making reflexio depend
on the OpenClaw CLI to refresh OAuth state. Now reflexio owns its own
OAuth flow + token storage, no external CLI required.

What's in this commit (OS submodule):

- reflexio/cli/codex_auth.py — token store + PKCE + browser-callback
  + refresh logic. Pure stdlib (urllib, http.server, secrets, hashlib).
  Tokens persist with 0600 file mode, 0700 dir mode where the
  filesystem supports it.
- reflexio/cli/commands/setup_cmd.py — new 'setup openai-codex'
  subcommand with --no-browser, --timeout, --show, --logout flags.
  --show surfaces email/plan_type/expiry; --logout deletes the token
  file; default flow opens browser, captures the callback on
  localhost:1455, exchanges code for tokens, persists, and warms the
  refresh path so any clock skew is caught at sign-in time.
- tests/cli/test_codex_auth.py — 13 unit tests covering PKCE pair
  generation (RFC 7636 shape, uniqueness), JWT decoding (with and
  without padding, malformed rejection), token-from-OAuth-response
  derivation (account_id / plan_type / email / expires_at, including
  fallback to expires_in when JWT lacks exp), token storage round
  trip + behaviour on missing/malformed files, expiry-with-lead-time
  threshold logic, and authorize-URL parameter completeness.

The OAuth client_id, issuer, redirect URI, and scopes are derived from
codex-rs source code (codex-rs/login/src/server.rs DEFAULT_ISSUER,
codex-rs/login/src/lib.rs CLIENT_ID export) and verified against the
JWT we already had in OpenClaw's auth-profiles.json (shared client_id
== app_EMoamEEZ73f0CkXaXp7hrann).
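PKCE pair generation follows RFC 7636's S256 method; a minimal pure-stdlib sketch (`generate_pkce_pair` is illustrative, not the module's actual API):

```python
import base64
import hashlib
import secrets

def generate_pkce_pair():
    """RFC 7636 S256: code_verifier is high-entropy URL-safe text;
    code_challenge is the unpadded base64url SHA-256 of the verifier."""
    verifier = base64.urlsafe_b64encode(secrets.token_bytes(32)).rstrip(b"=").decode("ascii")
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    challenge = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return verifier, challenge
```

The verifier goes into the token-exchange request; only the challenge appears in the authorize URL, so an intercepted authorization code is useless without the verifier.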
A. T-R schema (typed dates field):
- Add dates_mentioned: list[str] to UserProfile, UserPlaybook, ProfileView,
  UserPlaybookView (and converters).
- Plumb a dates argument through CreateUserProfileArgs / CreateUserPlaybookArgs,
  CreateUserProfileOp / CreateUserPlaybookOp, and apply_plan_op.
- SQLite: add dates_mentioned TEXT column (CREATE TABLE + idempotent ALTER
  TABLE migration), JSON-serialize on INSERT/UPDATE, parse back on row→model,
  append to FTS body so date queries hit profiles/playbooks.
- Disk storage: Pydantic auto-serializes the new optional field.
- Bump extraction_agent prompt to v1.5.0 with explicit guidance to populate
  the typed dates argument alongside the inline ISO date in content.

B. enable_agent_answer toggle (search-only mode):
- Add enable_agent_answer: bool | None = False to UnifiedSearchRequest,
  mirroring the enable_reformulation pattern.
- SearchAgent constructor accepts the flag; when False, ctx.search_answer
  is dropped and the response carries agent_answer=None. SearchResult.answer
  becomes str | None.
- SearchFinishArgs.answer is now Optional so the agent can call finish()
  with no synthesis in search-only mode.
- Bump search_agent prompt to v1.3.0 with a conditional block that branches
  finish() behaviour on enable_agent_answer.
- Default is False everywhere — UnifiedSearchViewResponse already strips
  agent_answer at the API boundary, so callers see no change unless they
  opt in explicitly.

Tests:
- 3 new extraction-agent tests asserting dates thread end-to-end (profile,
  playbook, default-empty backward compat).
- 4 new storage tests (round-trip, default empty, FTS-by-date, migration
  idempotency on a pre-migration DB).
- 3 new search-agent tests (search-only None answer, prompt branches by
  flag in both modes).
- Existing tests that asserted on agent_answer set enable_agent_answer=True
  to keep their original intent.
- PROMPT_VERSION_MAP updated for both agent versions.
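The idempotent ALTER TABLE pattern mentioned above can be sketched like this (`migrate_add_column` is a hypothetical name; the real migration lives in the SQLite backend, and identifiers here are trusted constants, never user input):

```python
import sqlite3

def migrate_add_column(conn, table, column, decl):
    """Add the column only if it's missing, so the migration is safe
    to run on every startup against old and new DBs alike."""
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if column not in existing:
        conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {decl}")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_profiles (profile_id TEXT PRIMARY KEY)")
migrate_add_column(conn, "user_profiles", "dates_mentioned", "TEXT")
migrate_add_column(conn, "user_profiles", "dates_mentioned", "TEXT")  # no-op on rerun
```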
The "one fact per profile" invariant already enforces single-event semantics,
so a single date per fact (rather than a list) is the right shape. Multi-date
events (trip start + trip end, recurring anniversaries) are split into multiple
profiles by the extraction prompt. Benefits over the list:

- Simpler tool-call surface — LLM no longer has to decide whether to wrap a
  single value in [].
- Aligns with atomicity: one fact -> one date.
- Storage stays as a TEXT column; FTS append becomes a single string instead
  of an iteration.
- Future-proof: widening str -> list[str] later is additive; narrowing list
  -> str would be lossy.

Touches schema (UserProfile, UserPlaybook, view models + converters), tool
args (CreateUserProfileArgs, CreateUserPlaybookArgs), plan ops, apply_plan_op,
SQLite (DDL column rename, idempotent ALTER TABLE, INSERT/UPDATE/FTS), the
extraction_agent v1.5.0 prompt examples, and tests for round-trip, FTS,
migration, and the extraction agent thread.

The migration auto-adds `date_mentioned TEXT` on existing DBs. Local dev DBs
that already had the previous `dates_mentioned` column from the prior commit
will retain it dormant (idempotent migration is forward-only); the new column
gets populated going forward.
…table extraction with temporal examples while preserving playbook format and core invariants.
Phase 27 iter 0 baseline on oracle (post-schema-change) showed T-R = 0/6,
identical to the prior baseline before the field existed. The locked decision
rule was: keep iff iter 0 shows T-R >= 1/6. It didn't, so revert.

Diagnostic: recall@5 = 98.7% in iter 0 — retrieval surfaces the right
profiles already, including profiles whose content carries dates inline
("user attended X on YYYY-MM-DD"). The bottleneck is the answer LLM, which
still can't reason temporally even when the dates are reaching it. A typed
field at the storage layer doesn't fix that.

The Section B work (enable_agent_answer toggle, search_agent v1.3.0) stays
intact — it's an independent improvement and isn't tied to the typed field.

Reverts:
- UserProfile / UserPlaybook: drop date_mentioned
- ProfileView / UserPlaybookView: drop date_mentioned
- to_profile_view / to_user_playbook_view: drop the converter line
- CreateUserProfileArgs / CreateUserPlaybookArgs: drop date arg
- CreateUserProfileOp / CreateUserPlaybookOp: drop date field
- _handle_create_user_profile / _handle_create_user_playbook: revert call sites
- apply_plan_op: revert profile/playbook construction
- SQLite schema: drop date_mentioned column from CREATE TABLE
- _migrate_date_mentioned: removed
- INSERT/UPDATE/FTS: drop the column and FTS-append paths
- _row_to_profile / _row_to_user_playbook: drop the field
- extraction_agent: drop v1.5.0 prompt, restore v1.4.0 active=true
- search_agent v1.3.0: drop the date_mentioned mention
- PROMPT_VERSION_MAP: extraction_agent back to v1.4.0
- 7 tests removed (3 extraction agent, 2 storage contract, 2 SQLite)

Phase 28 follow-up: T-R needs answer-LLM-side work, not extraction-side.
Probably a date-aware "given these dated profiles, compute X" prompt for
the answer model, OR session-metadata that the LLM can scan without
relying on profile content.