ProfSynapse · May 5, 2026
diff --git a/.agents/skills/case-studies/configs/agentic_search_rubrics.yaml b/.agents/skills/case-studies/configs/agentic_search_rubrics.yaml
@@ -16,8 +16,8 @@
 # TOOL AGNOSTICISM
 #   These rubrics are intentionally generic. They evaluate BEHAVIOR — what was
 #   searched for, what was opened, what was claimed — not the names or schemas
-#   of specific tools. Whether your pipeline uses searchManager, a RAG
-#   retriever, grep, an MCP tool, or any other search interface, the quality
+#   of specific tools. Whether your pipeline uses a configured search
+#   interface, a RAG retriever, an MCP tool, or any other retrieval surface, the quality
 #   criteria are the same:
 #
 #     - Did the search terms target the right vocabulary?
@@ -47,7 +47,7 @@
 name: Search Term Quality
 description: >
   Evaluates whether the search terms the model chose would actually surface
-  the target documents in a keyword or grep-style search. Penalizes terms
+  the target documents in the configured retrieval surface. Penalizes terms
   that are too broad, too narrow, or that parrot the user's exact phrasing
   when the underlying documents use different vocabulary.
 scope: tool_calls

diff --git a/.agents/skills/case-studies/configs/agentic_search_scenario.yaml b/.agents/skills/case-studies/configs/agentic_search_scenario.yaml
@@ -4,9 +4,9 @@
 # ║  BEFORE USING: Replace placeholder tool names with YOUR tools  ║
 # ║                                                                ║
 # ║  Placeholders:                                                 ║
-# ║    YOUR_SEARCH_TOOL  → e.g., grep, ripgrep, searchAPI, etc.   ║
-# ║    YOUR_READ_TOOL    → e.g., cat, readFile, fetchDoc, etc.    ║
-# ║    YOUR_LIST_TOOL    → e.g., ls, listFiles, dirList, etc.     ║
+# ║    YOUR_SEARCH_TOOL  → configured retrieval action            ║
+# ║    YOUR_READ_TOOL    → configured document read/fetch action   ║
+# ║    YOUR_LIST_TOOL    → configured list/browse action           ║
 # ║                                                                ║
 # ║  The three-stage pattern (search → read → answer) is the      ║
 # ║  constant. The actual tools are YOUR choice.                   ║

diff --git a/.agents/skills/case-studies/reference/agentic-search-pipeline.md b/.agents/skills/case-studies/reference/agentic-search-pipeline.md
@@ -6,7 +6,7 @@ A complete walkthrough of how to teach a language model to act as a RAG agent
 
 > **YOUR TOOLS ARE NOT OUR TOOLS**
 >
-> This case study describes a **three-stage pattern**: Search, Select, Answer. The tool names used below (`YOUR_SEARCH_TOOL`, `YOUR_READ_TOOL`, `YOUR_LIST_TOOL`) are **placeholders**. Your system might use `grep`, `ripgrep`, a vector DB query, `cat`, `readFile`, an HTTP API, or something else entirely.
+> This case study describes a **three-stage pattern**: Search, Select, Answer. The tool names used below (`YOUR_SEARCH_TOOL`, `YOUR_READ_TOOL`, `YOUR_LIST_TOOL`) are **placeholders**. Your system might use a configured keyword search, a vector DB query, a document reader, an HTTP API, or something else entirely.
 >
 > **Before you proceed past Stage 1, you must know:**
 > 1. What tool does "search" in your system?
@@ -23,7 +23,7 @@ A complete walkthrough of how to teach a language model to act as a RAG agent
 
 The model must learn a three-stage behavior loop:
 
-1. **Search** — given a question, generate search terms that would surface target documents in a keyword/grep-style search. The model must reason about vocabulary: what words would the answer contain?
+1. **Search** — given a question, generate search terms that would surface target documents in the configured retrieval surface. The model must reason about vocabulary: what words would the answer contain?
 2. **Selective Read** — from the search results, pick the relevant documents and skip distractors. Not everything returned is useful. The model must show judgment.
 3. **Grounded Answer** — answer using ONLY content from the documents it read. No hallucination. If the docs don't contain the answer, say so.
 
@@ -65,9 +65,9 @@ Define your tool mapping:
 
 | Stage | Placeholder | Your Tool | Example |
 |-------|------------|-----------|---------|
-| Search | `YOUR_SEARCH_TOOL` | ___________ | `grep`, `searchContent`, `vector_query` |
-| Read | `YOUR_READ_TOOL` | ___________ | `cat`, `read`, `fetchDocument` |
-| List | `YOUR_LIST_TOOL` | ___________ | `ls`, `list`, `listDirectory` |
+| Search | `YOUR_SEARCH_TOOL` | ___________ | configured keyword search, vector query |
+| Read | `YOUR_READ_TOOL` | ___________ | configured document read/fetch |
+| List | `YOUR_LIST_TOOL` | ___________ | configured list/browse action |
 
 Once you know your tools, create the tool schema and environment execution config that match your system. The scenario YAML references these tools by name — they must match exactly.
 

diff --git a/.agents/skills/case-studies/reference/pipeline-comparison.md b/.agents/skills/case-studies/reference/pipeline-comparison.md
@@ -1,160 +1,98 @@
 # Pipeline Comparison: Tool Calling vs Essay Style
 
-Side-by-side comparison of how the two training pipelines differ at each stage of the universal pipeline.
+Side-by-side comparison of how two common training pipelines differ at each
+stage of the universal pipeline.
 
 ---
 
-## Stage 1: Define the Capability
+## Stage 1: Define The Capability
 
 | Aspect | Tool Calling | Essay Style |
 |--------|-------------|-------------|
-| **Source of truth** | `tool-schemas.json` — JSON schema per tool | Essay corpus in `Meditations on Alignment/` |
-| **Format spec** | OpenAI function calling format with `tool_calls` field | Markdown outline with title, overview, sections, tone, themes |
-| **Behavioral spec** | 6 YAML behavior rubrics (intellectual humility, verification, etc.) | 2 quality rubrics (brainstorm quality, outline quality) |
-| **"Correct" defined as** | Right tool + right arguments + right context | Accurate structure + dialectical challenge + specific details |
-| **Key constraint** | Context object with 4 required fields (`workspaceId`, `sessionId`, `memory`, `goal`), `memory` never empty | Second person address, 4-6 sections matching essay, no generic headings |
+| Source of truth | Configured tool schemas and response format specs | Reference corpus |
+| Format spec | Structured tool-call response shape | Markdown/prose structure |
+| Behavioral spec | Scenario rubrics and environment gates | Quality rubrics |
+| "Correct" defined as | Right action, right arguments, right state handling | Accurate structure, voice, and content |
+| Key constraint | Schema-owned required fields and action payloads | Corpus-owned tone and structure |
 
 ---
 
 ## Stage 2: Create Training Data
 
 | Aspect | Tool Calling | Essay Style |
 |--------|-------------|-------------|
-| **Scenario type** | `type: tool` — template-based generation | `type: docs_based` — derived from source documents |
-| **Input to generator** | Tool schema + prompt template | Actual essay content (`{doc_content}`) |
-| **User turn** | Natural language request ("Create a folder for Q4") | Reverse-engineered brainstorm (messy, fragmented, 75-200 words) |
-| **Assistant turn** | Tool call with `<thinking>` block + arguments | Structured outline with dialectical opening |
-| **System prompt** | Yes — provides session/workspace context | No — model learns to outline from brainstorm alone |
-| **Data sources** | Handcrafted seeds + SynthChat + self-play (3 sources) | SynthChat docs-based generation (1 primary source) |
-| **Scaling strategy** | More prompt templates × temperature variations | More essays in corpus × per-doc variations |
-| **Generation command** | `--scenarios tools` | `--docs "path/" --scenarios essay_outline --per-doc 1` |
-
-### Template vs Docs: The Key Difference
-
-**Tool calling** uses prompt templates that can generate infinite variations:
-```yaml
-# Each run generates different user requests and tool calls
-prompts:
-  user: "Generate a natural user request that would require creating a folder."
-  assistant: "Generate assistant response with useTools call using storageManager.createFolder."
-```
-
-**Essay style** is anchored to real documents — each essay produces one training example:
-```yaml
-# Each essay file → one brainstorm + outline pair
-prompts:
-  user_system: |
-    You have access to the finished essay below.
-    <finished_essay>{doc_content}</finished_essay>
-    Imagine what the author's messy thoughts looked like BEFORE writing.
-```
+| Scenario type | Environment-backed or schema-backed tool scenarios | Docs-based scenarios |
+| Input to generator | Tool schema, environment config, and prompt template | Source document content |
+| User turn | Natural language task request | Reverse-engineered brainstorm or request |
+| Assistant turn | Configured tool-call payload, then a text reply once the task is complete | Structured natural-language response |
+| System prompt | Optional and deployment-aligned | Often omitted or minimal |
+| Scaling strategy | More scenarios, environments, and action paths | More source documents and variations |
+
+### Template vs Docs
+
+Tool-calling data usually uses templates plus generated environments so the
+same behavior appears across many states and surface details. Essay-style data
+is usually anchored to reference documents so the model learns the target voice,
+structure, and specificity from real examples.
 
 ---
 
-## Stage 3: Validate & Improve
+## Stage 3: Validate And Improve
 
 | Aspect | Tool Calling | Essay Style |
 |--------|-------------|-------------|
-| **Primary validation** | Deterministic schema checking (`validate_syngen.py`) | LLM-judged rubric scoring (SynthChat validate) |
-| **What's checkable automatically** | JSON structure, required fields, ID patterns, tool existence | Format presence (has overview? has tone?), section count |
-| **What needs human review** | Edge cases in tool selection logic | Voice accuracy, dialectical quality, specificity |
-| **Improvement mechanism** | Schema fix scripts + SynthChat improve | SynthChat improve with quality rubrics |
-| **Common structural errors** | Missing context fields, wrong tool name, empty `memory` | Generic headings, too many sections, third-person address |
-| **Validation command** | `python3 .skills/synethetic-data-generation/scripts/validate_syngen.py FILE` | `python -m SynthChat.run validate -i FILE --rubrics essay_*` |
-
-### Validation Confidence
-
-```
-Tool Calling:   [██████████████████████████████] 95% automated
-Essay Style:    [█████████████░░░░░░░░░░░░░░░░░] 45% automated
-                └── Structure checks  ─┘└── Voice, quality need human judgment ─┘
-```
+| Primary validation | Deterministic schema/environment checks | LLM-judged rubric scoring |
+| Automatically checkable | JSON shape, required fields, action existence, environment state | Section count and required blocks |
+| Human review focus | Edge cases in action choice and recovery behavior | Voice, specificity, and judgment |
+| Improvement mechanism | Config/rubric-driven repair loops | Rubric-driven improvement |
+| Common structural errors | Missing fields, malformed action payloads, wrong action order | Generic headings, bland content, missing sections |
 
 ---
 
 ## Stage 4: Train
 
 | Aspect | Tool Calling | Essay Style |
 |--------|-------------|-------------|
-| **SFT dataset** | Positive tool call examples (all labels true or absent) | Positive outline examples |
-| **SFT learning target** | Tool call syntax, context object, `<thinking>` blocks | Outline structure, dialectical opening, section format |
-| **KTO negative sources** | High-temp self-play errors, wrong tool selection, missing fields | Systematic degradation (generic headings, bland affirmation, bloat) |
-| **KTO learning target** | Prefer clarification > blind action, prefer complete context > lazy context | Prefer specific > generic, prefer dialectical > affirmative, prefer concise > bloated |
-| **Training commands** | Identical — same trainers, same flags, different `--local-file` | Identical — same trainers, same flags, different `--local-file` |
-| **Typical dataset size** | 1000-3000 examples | 50-200 examples (limited by corpus size) |
-
-### The Training Commands Are Identical
-
-```bash
-# Tool calling SFT
-python train_sft.py --model-size 7b --local-file ../../Datasets/tools_sft.jsonl
-
-# Essay style SFT
-python train_sft.py --model-size 7b --local-file ../../Datasets/essay_outlines.jsonl
+| SFT target | Structural shape and simple one-turn usage | Target response form and style |
+| GRPO target | Multi-step action behavior in an environment | Usually less central unless a verifier exists |
+| KTO negative sources | Invalid actions, missing context, unsafe behavior | Generic or low-quality variants |
+| Dataset size | Often hundreds to thousands | Often tens to hundreds |
 
-# Same command. Different data. Different capability.
-```
+The trainer commands can be identical. The capability changes because the
+dataset, reward surface, and evals change.
 
 ---
 
 ## Stage 5: Evaluate
 
 | Aspect | Tool Calling | Essay Style |
 |--------|-------------|-------------|
-| **Evaluation type** | Schema match + behavior check | Rubric scoring by judge LLM |
-| **PASS criteria** | Correct tool + correct arguments + good behavior | All outline parts present + specific + dialectical |
-| **WARN criteria** | Right tool but suboptimal behavior | Structure present but generic or bland |
-| **FAIL criteria** | Wrong tool, missing tool, or error | Missing major outline sections, completely off-topic |
-| **Key metrics** | `schema_pass_rate`, `behavior_pass_rate`, `by_tag` | Rubric dimension scores (structure, specificity, voice, dialectic) |
-| **Comparison baseline** | Base model tool calling accuracy | Base model outline quality |
+| Evaluation type | Schema, environment, and behavior checks | Rubric scoring by judge LLM |
+| PASS criteria | Correct configured response shape and successful task completion | Required parts present and high quality |
+| WARN criteria | Completed task with inefficient or noisy path | Structure present but generic |
+| FAIL criteria | Malformed payload, wrong action, failed environment task | Missing major sections or off-topic |
+| Key metrics | Schema, environment, and behavior pass rates, broken down by tag | Rubric dimension scores |
 
 ---
 
 ## Stage 6: Iterate
 
 | Aspect | Tool Calling | Essay Style |
 |--------|-------------|-------------|
-| **Failure signal** | Schema validation errors, wrong tool counts | Low rubric dimension scores |
-| **Fix strategy** | Generate more examples for weak tools/behaviors | Add more essays with the failing characteristics |
-| **Dataset expansion** | Easy — more prompt templates, more self-play | Harder — need more source essays or essay variations |
-| **Convergence speed** | Fast — deterministic validation, clear pass/fail | Slower — subjective quality, harder to measure improvement |
+| Failure signal | Schema errors, failing environment traces, gaps in action coverage | Low rubric dimension scores |
+| Fix strategy | Add scenarios for weak behaviors and tighten config gates | Add source documents or targeted variants |
+| Dataset expansion | Usually easy if environments are generated | Limited by corpus and quality review |
+| Convergence speed | Faster when deterministic validation is strong | Slower when quality is subjective |
 
 ---
 
-## Decision Guide: Which Pattern to Follow?
+## Decision Guide
 
-Use the **tool calling pattern** when:
-- Output is structured data (JSON, function calls, API requests)
-- "Correct" is binary (right tool or wrong tool)
-- You can define a schema that covers all valid outputs
-- You need high volume (1000+ examples)
-- Validation can be mostly automated
-
-Use the **essay style pattern** when:
-- Output is natural language (prose, outlines, summaries, creative text)
-- "Correct" is a quality spectrum (good, mediocre, bad)
-- Quality is defined by rubrics, not schemas
-- You have a reference corpus to learn from
-- Voice, tone, and style matter
-
-### Hybrid Capabilities
-
-Some capabilities blend both patterns:
-- **Tool call with rich explanation** — tool calling format + natural language quality
-- **Structured report generation** — template structure + prose quality
-- **Code generation with comments** — syntax correctness + explanation quality
-
-For hybrids, use tool calling validation for structure and essay-style rubrics for prose quality.
-
----
+Use the tool-calling pattern when the output is structured, correctness can be
+checked against schemas or environment state, and you need high-volume examples.
 
-## Summary Table
+Use the essay-style pattern when the output is natural language, quality is a
+spectrum, and a reference corpus defines the target behavior.
 
-| Pipeline Stage | Tool Calling | Essay Style | Shared |
-|---------------|-------------|-------------|--------|
-| Define | JSON schemas + behavior rubrics | Essay corpus + quality rubrics | Rubric-driven quality definition |
-| Generate | Template scenarios + self-play | Docs-based scenarios | SynthChat engine |
-| Validate | Schema validation (deterministic) | Rubric scoring (LLM-judged) | SynthChat validate/improve |
-| Train | SFT → KTO → GRPO | SFT → KTO | Same trainers, same commands |
-| Evaluate | Schema + behavior tests | Rubric dimension scoring | Same evaluator framework |
-| Iterate | More templates, more self-play | More essays, more degradation variants | Failure analysis → targeted generation |
+Hybrid capabilities can use deterministic structure validation for the tool or
+template portion and rubric scoring for prose quality.