@@ -16,8 +16,8 @@
# TOOL AGNOSTICISM
# These rubrics are intentionally generic. They evaluate BEHAVIOR — what was
# searched for, what was opened, what was claimed — not the names or schemas
# of specific tools. Whether your pipeline uses searchManager, a RAG
# retriever, grep, an MCP tool, or any other search interface, the quality
# of specific tools. Whether your pipeline uses a configured search
# interface, a RAG retriever, an MCP tool, or any other retrieval surface, the quality
# criteria are the same:
#
# - Did the search terms target the right vocabulary?
@@ -47,7 +47,7 @@
name: Search Term Quality
description: >
Evaluates whether the search terms the model chose would actually surface
the target documents in a keyword or grep-style search. Penalizes terms
the target documents in the configured retrieval surface. Penalizes terms
that are too broad, too narrow, or that parrot the user's exact phrasing
when the underlying documents use different vocabulary.
scope: tool_calls
@@ -4,9 +4,9 @@
# ║ BEFORE USING: Replace placeholder tool names with YOUR tools ║
# ║ ║
# ║ Placeholders: ║
# ║ YOUR_SEARCH_TOOL → e.g., grep, ripgrep, searchAPI, etc.
# ║ YOUR_READ_TOOL → e.g., cat, readFile, fetchDoc, etc.
# ║ YOUR_LIST_TOOL → e.g., ls, listFiles, dirList, etc.
# ║ YOUR_SEARCH_TOOL → configured retrieval action
# ║ YOUR_READ_TOOL → configured document read/fetch action
# ║ YOUR_LIST_TOOL → configured list/browse action
# ║ ║
# ║ The three-stage pattern (search → read → answer) is the ║
# ║ constant. The actual tools are YOUR choice. ║
10 changes: 5 additions & 5 deletions .agents/skills/case-studies/reference/agentic-search-pipeline.md
@@ -6,7 +6,7 @@ A complete walkthrough of how to teach a language model to act as a RAG agent

> **YOUR TOOLS ARE NOT OUR TOOLS**
>
> This case study describes a **three-stage pattern**: Search, Select, Answer. The tool names used below (`YOUR_SEARCH_TOOL`, `YOUR_READ_TOOL`, `YOUR_LIST_TOOL`) are **placeholders**. Your system might use `grep`, `ripgrep`, a vector DB query, `cat`, `readFile`, an HTTP API, or something else entirely.
> This case study describes a **three-stage pattern**: Search, Select, Answer. The tool names used below (`YOUR_SEARCH_TOOL`, `YOUR_READ_TOOL`, `YOUR_LIST_TOOL`) are **placeholders**. Your system might use a configured keyword search, a vector DB query, a document reader, an HTTP API, or something else entirely.
>
> **Before you proceed past Stage 1, you must know:**
> 1. What tool does "search" in your system?
@@ -23,7 +23,7 @@ A complete walkthrough of how to teach a language model to act as a RAG agent

The model must learn a three-stage behavior loop:

1. **Search** — given a question, generate search terms that would surface target documents in a keyword/grep-style search. The model must reason about vocabulary: what words would the answer contain?
1. **Search** — given a question, generate search terms that would surface target documents in the configured retrieval surface. The model must reason about vocabulary: what words would the answer contain?
2. **Selective Read** — from the search results, pick the relevant documents and skip distractors. Not everything returned is useful. The model must show judgment.
3. **Grounded Answer** — answer using ONLY content from the documents it read. No hallucination. If the docs don't contain the answer, say so.

@@ -65,9 +65,9 @@ Define your tool mapping:

| Stage | Placeholder | Your Tool | Example |
|-------|------------|-----------|---------|
| Search | `YOUR_SEARCH_TOOL` | ___________ | `grep`, `searchContent`, `vector_query` |
| Read | `YOUR_READ_TOOL` | ___________ | `cat`, `read`, `fetchDocument` |
| List | `YOUR_LIST_TOOL` | ___________ | `ls`, `list`, `listDirectory` |
| Search | `YOUR_SEARCH_TOOL` | ___________ | configured keyword search, vector query |
| Read | `YOUR_READ_TOOL` | ___________ | configured document read/fetch |
| List | `YOUR_LIST_TOOL` | ___________ | configured list/browse action |

Once you know your tools, create the tool schema and environment execution config that match your system. The scenario YAML references these tools by name — they must match exactly.
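
Because the scenario YAML and the tool schema must agree on names exactly, a quick pre-flight check can catch mismatches before generation. The following is a minimal sketch, not part of this repository: `find_unknown_tools` and the tool names are hypothetical, and it assumes you have already loaded both name lists from your own config files.

```python
def find_unknown_tools(scenario_tools, schema_tools):
    """Return scenario tool names that have no exact match in the schema.

    Matching is case-sensitive on purpose: the names must match exactly.
    """
    known = set(schema_tools)
    return sorted(name for name in set(scenario_tools) if name not in known)


# Hypothetical names standing in for YOUR_SEARCH_TOOL, YOUR_READ_TOOL, YOUR_LIST_TOOL.
schema_tools = ["search_docs", "read_doc", "list_docs"]
scenario_tools = ["search_docs", "Read_Doc", "list_docs"]  # case mismatch on purpose

unknown = find_unknown_tools(scenario_tools, schema_tools)
print(unknown)  # ['Read_Doc']
```

An empty result means every tool referenced by the scenarios exists in the schema under the same name.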

160 changes: 49 additions & 111 deletions .agents/skills/case-studies/reference/pipeline-comparison.md
@@ -1,160 +1,98 @@
# Pipeline Comparison: Tool Calling vs Essay Style

Side-by-side comparison of how the two training pipelines differ at each stage of the universal pipeline.
Side-by-side comparison of how two common training pipelines differ at each
stage of the universal pipeline.

---

## Stage 1: Define the Capability
## Stage 1: Define The Capability

| Aspect | Tool Calling | Essay Style |
|--------|-------------|-------------|
| **Source of truth** | `tool-schemas.json` — JSON schema per tool | Essay corpus in `Meditations on Alignment/` |
| **Format spec** | OpenAI function calling format with `tool_calls` field | Markdown outline with title, overview, sections, tone, themes |
| **Behavioral spec** | 6 YAML behavior rubrics (intellectual humility, verification, etc.) | 2 quality rubrics (brainstorm quality, outline quality) |
| **"Correct" defined as** | Right tool + right arguments + right context | Accurate structure + dialectical challenge + specific details |
| **Key constraint** | Context object with 4 required fields (`workspaceId`, `sessionId`, `memory`, `goal`), `memory` never empty | Second person address, 4-6 sections matching essay, no generic headings |
| Source of truth | Configured tool schemas and response format specs | Reference corpus |
| Format spec | Structured tool-call response shape | Markdown/prose structure |
| Behavioral spec | Scenario rubrics and environment gates | Quality rubrics |
| Correct defined as | Right action, right arguments, right state handling | Accurate structure, voice, and content |
| Key constraint | Schema-owned required fields and action payloads | Corpus-owned tone and structure |

---

## Stage 2: Create Training Data

| Aspect | Tool Calling | Essay Style |
|--------|-------------|-------------|
| **Scenario type** | `type: tool` — template-based generation | `type: docs_based` — derived from source documents |
| **Input to generator** | Tool schema + prompt template | Actual essay content (`{doc_content}`) |
| **User turn** | Natural language request ("Create a folder for Q4") | Reverse-engineered brainstorm (messy, fragmented, 75-200 words) |
| **Assistant turn** | Tool call with `<thinking>` block + arguments | Structured outline with dialectical opening |
| **System prompt** | Yes — provides session/workspace context | No — model learns to outline from brainstorm alone |
| **Data sources** | Handcrafted seeds + SynthChat + self-play (3 sources) | SynthChat docs-based generation (1 primary source) |
| **Scaling strategy** | More prompt templates × temperature variations | More essays in corpus × per-doc variations |
| **Generation command** | `--scenarios tools` | `--docs "path/" --scenarios essay_outline --per-doc 1` |

### Template vs Docs: The Key Difference

**Tool calling** uses prompt templates that can generate infinite variations:
```yaml
# Each run generates different user requests and tool calls
prompts:
user: "Generate a natural user request that would require creating a folder."
assistant: "Generate assistant response with useTools call using storageManager.createFolder."
```

**Essay style** is anchored to real documents — each essay produces one training example:
```yaml
# Each essay file → one brainstorm + outline pair
prompts:
user_system: |
You have access to the finished essay below.
<finished_essay>{doc_content}</finished_essay>
Imagine what the author's messy thoughts looked like BEFORE writing.
```
| Scenario type | Environment-backed or schema-backed tool scenarios | Docs-based scenarios |
| Input to generator | Tool schema, environment config, and prompt template | Source document content |
| User turn | Natural language task request | Reverse-engineered brainstorm or request |
| Assistant turn | Configured tool-call payload, then text when complete | Structured natural-language response |
| System prompt | Optional and deployment-aligned | Often omitted or minimal |
| Scaling strategy | More scenarios, environments, and action paths | More source documents and variations |

### Template vs Docs

Tool-calling data usually uses templates plus generated environments so the
same behavior appears across many states and surface details. Essay-style data
is usually anchored to reference documents so the model learns the target voice,
structure, and specificity from real examples.

---

## Stage 3: Validate & Improve
## Stage 3: Validate And Improve

| Aspect | Tool Calling | Essay Style |
|--------|-------------|-------------|
| **Primary validation** | Deterministic schema checking (`validate_syngen.py`) | LLM-judged rubric scoring (SynthChat validate) |
| **What's checkable automatically** | JSON structure, required fields, ID patterns, tool existence | Format presence (has overview? has tone?), section count |
| **What needs human review** | Edge cases in tool selection logic | Voice accuracy, dialectical quality, specificity |
| **Improvement mechanism** | Schema fix scripts + SynthChat improve | SynthChat improve with quality rubrics |
| **Common structural errors** | Missing context fields, wrong tool name, empty `memory` | Generic headings, too many sections, third-person address |
| **Validation command** | `python3 .skills/synethetic-data-generation/scripts/validate_syngen.py FILE` | `python -m SynthChat.run validate -i FILE --rubrics essay_*` |

### Validation Confidence

```
Tool Calling: [██████████████████████████████] 95% automated
Essay Style: [█████████████░░░░░░░░░░░░░░░░░] 45% automated
└── Structure checks ─┘└── Voice, quality need human judgment ─┘
```
| Primary validation | Deterministic schema/environment checks | LLM-judged rubric scoring |
| Automatically checkable | JSON shape, required fields, action existence, environment state | Section count and required blocks |
| Human review focus | Edge cases in action choice and recovery behavior | Voice, specificity, and judgment |
| Improvement mechanism | Config/rubric-driven repair loops | Rubric-driven improvement |
| Common structural errors | Missing fields, malformed action payloads, wrong action order | Generic headings, bland content, missing sections |
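
The deterministic side of the table (JSON shape, required fields, action existence) can be illustrated with a small checker. This is a sketch under stated assumptions: `ACTION_SCHEMA`, the `action`/`arguments` payload keys, and the action names are hypothetical and should be replaced by whatever your configured tool schema defines.

```python
import json

# Hypothetical schema: action name -> required argument names.
ACTION_SCHEMA = {
    "search": {"query"},
    "read": {"doc_id"},
}


def check_payload(raw):
    """Return a list of structural errors for one tool-call payload (empty if valid)."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(payload, dict):
        return ["payload is not an object"]

    action = payload.get("action")
    if not isinstance(action, str) or action not in ACTION_SCHEMA:
        return [f"unknown action: {action!r}"]

    args = payload.get("arguments")
    if not isinstance(args, dict):
        return ["arguments is not an object"]

    # Report every required field that the payload does not supply.
    missing = sorted(ACTION_SCHEMA[action] - set(args))
    return [f"missing field: {name}" for name in missing]
```

Checks like these give a clear pass/fail per example. Voice, specificity, and judgment still need rubric scoring or human review.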

---

## Stage 4: Train

| Aspect | Tool Calling | Essay Style |
|--------|-------------|-------------|
| **SFT dataset** | Positive tool call examples (all labels true or absent) | Positive outline examples |
| **SFT learning target** | Tool call syntax, context object, `<thinking>` blocks | Outline structure, dialectical opening, section format |
| **KTO negative sources** | High-temp self-play errors, wrong tool selection, missing fields | Systematic degradation (generic headings, bland affirmation, bloat) |
| **KTO learning target** | Prefer clarification > blind action, prefer complete context > lazy context | Prefer specific > generic, prefer dialectical > affirmative, prefer concise > bloated |
| **Training commands** | Identical — same trainers, same flags, different `--local-file` | Identical — same trainers, same flags, different `--local-file` |
| **Typical dataset size** | 1000-3000 examples | 50-200 examples (limited by corpus size) |

### The Training Commands Are Identical

```bash
# Tool calling SFT
python train_sft.py --model-size 7b --local-file ../../Datasets/tools_sft.jsonl

# Essay style SFT
python train_sft.py --model-size 7b --local-file ../../Datasets/essay_outlines.jsonl
| SFT target | Structural shape and simple one-turn usage | Target response form and style |
| GRPO target | Multi-step action behavior in an environment | Usually less central unless a verifier exists |
| KTO negative sources | Invalid actions, missing context, unsafe behavior | Generic or low-quality variants |
| Dataset size | Often hundreds to thousands | Often tens to hundreds |

# Same command. Different data. Different capability.
```
The trainer commands can be identical. The capability changes because the
dataset, reward surface, and evals change.

---

## Stage 5: Evaluate

| Aspect | Tool Calling | Essay Style |
|--------|-------------|-------------|
| **Evaluation type** | Schema match + behavior check | Rubric scoring by judge LLM |
| **PASS criteria** | Correct tool + correct arguments + good behavior | All outline parts present + specific + dialectical |
| **WARN criteria** | Right tool but suboptimal behavior | Structure present but generic or bland |
| **FAIL criteria** | Wrong tool, missing tool, or error | Missing major outline sections, completely off-topic |
| **Key metrics** | `schema_pass_rate`, `behavior_pass_rate`, `by_tag` | Rubric dimension scores (structure, specificity, voice, dialectic) |
| **Comparison baseline** | Base model tool calling accuracy | Base model outline quality |
| Evaluation type | Schema, environment, and behavior checks | Rubric scoring by judge LLM |
| PASS criteria | Correct configured response shape and successful task completion | Required parts present and high quality |
| WARN criteria | Completed task with inefficient or noisy path | Structure present but generic |
| FAIL criteria | Malformed payload, wrong action, failed environment task | Missing major sections or off-topic |
| Key metrics | Schema pass, environment pass, behavior pass, by-tag rates | Rubric dimension scores |
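
The tool-calling PASS/WARN/FAIL criteria can be read as a small decision function. The sketch below assumes that an "inefficient or noisy path" is approximated by exceeding a step budget. `grade`, its parameters, and the budget are hypothetical and do not describe an evaluator in this repository.

```python
def grade(schema_ok, task_completed, steps_taken, step_budget):
    """Map one evaluated episode to PASS, WARN, or FAIL.

    FAIL: malformed payload or the environment task was not completed.
    WARN: task completed, but the path used more steps than the budget.
    PASS: well-formed payloads and task completed within the budget.
    """
    if not schema_ok or not task_completed:
        return "FAIL"
    if steps_taken > step_budget:
        return "WARN"
    return "PASS"


print(grade(True, True, 3, 5))   # PASS
print(grade(True, True, 9, 5))   # WARN
print(grade(False, True, 3, 5))  # FAIL
```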

---

## Stage 6: Iterate

| Aspect | Tool Calling | Essay Style |
|--------|-------------|-------------|
| **Failure signal** | Schema validation errors, wrong tool counts | Low rubric dimension scores |
| **Fix strategy** | Generate more examples for weak tools/behaviors | Add more essays with the failing characteristics |
| **Dataset expansion** | Easy — more prompt templates, more self-play | Harder — need more source essays or essay variations |
| **Convergence speed** | Fast — deterministic validation, clear pass/fail | Slower — subjective quality, harder to measure improvement |
| Failure signal | Schema errors, environment traces, wrong action coverage | Low rubric dimensions |
| Fix strategy | Add scenarios for weak behaviors and tighten config gates | Add source documents or targeted variants |
| Dataset expansion | Usually easy if environments are generated | Limited by corpus and quality review |
| Convergence speed | Faster when deterministic validation is strong | Slower when quality is subjective |

---

## Decision Guide: Which Pattern to Follow?
## Decision Guide

Use the **tool calling pattern** when:
- Output is structured data (JSON, function calls, API requests)
- "Correct" is binary (right tool or wrong tool)
- You can define a schema that covers all valid outputs
- You need high volume (1000+ examples)
- Validation can be mostly automated

Use the **essay style pattern** when:
- Output is natural language (prose, outlines, summaries, creative text)
- "Correct" is a quality spectrum (good, mediocre, bad)
- Quality is defined by rubrics, not schemas
- You have a reference corpus to learn from
- Voice, tone, and style matter

### Hybrid Capabilities

Some capabilities blend both patterns:
- **Tool call with rich explanation** — tool calling format + natural language quality
- **Structured report generation** — template structure + prose quality
- **Code generation with comments** — syntax correctness + explanation quality

For hybrids, use tool calling validation for structure and essay-style rubrics for prose quality.

---
Use the tool-calling pattern when output is structured, correctness can be
checked against schemas or environment state, and you need high-volume examples.

## Summary Table
Use the essay-style pattern when the output is natural language, quality is a
spectrum, and a reference corpus defines the target behavior.

| Pipeline Stage | Tool Calling | Essay Style | Shared |
|---------------|-------------|-------------|--------|
| Define | JSON schemas + behavior rubrics | Essay corpus + quality rubrics | Rubric-driven quality definition |
| Generate | Template scenarios + self-play | Docs-based scenarios | SynthChat engine |
| Validate | Schema validation (deterministic) | Rubric scoring (LLM-judged) | SynthChat validate/improve |
| Train | SFT → KTO → GRPO | SFT → KTO | Same trainers, same commands |
| Evaluate | Schema + behavior tests | Rubric dimension scoring | Same evaluator framework |
| Iterate | More templates, more self-play | More essays, more degradation variants | Failure analysis → targeted generation |
Hybrid capabilities can use deterministic structure validation for the tool or
template portion and rubric scoring for prose quality.
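
For a hybrid capability, the two validation styles can be chained: run the deterministic structure check first, then score the prose only when the structure passes. This is a minimal sketch. `validate_hybrid`, the sample keys, and the `0.7` threshold are assumptions, and `score_prose` is a stand-in callable for whatever judge LLM or rubric scorer your pipeline uses.

```python
def validate_hybrid(sample, required_keys, prose_key, score_prose, min_score=0.7):
    """Check structure deterministically, then apply a rubric score to the prose."""
    missing = [key for key in required_keys if key not in sample]
    if missing:
        # Structure failed: skip the (usually more expensive) rubric call.
        return {"passed": False, "reason": "missing keys: " + ", ".join(missing)}

    score = score_prose(sample[prose_key])
    if score < min_score:
        return {"passed": False, "reason": f"prose score {score:.2f} below {min_score:.2f}"}
    return {"passed": True, "reason": "ok"}


def toy_scorer(text):
    """Placeholder scorer: longer explanations score higher, capped at 1.0."""
    return min(len(text.split()) / 10, 1.0)


sample = {
    "action": "search",
    "arguments": {"query": "refund policy"},
    "explanation": "Searching the policy docs because the user asked about refund windows.",
}
result = validate_hybrid(sample, ["action", "arguments", "explanation"], "explanation", toy_scorer)
print(result)
```

Running the structure check first keeps rubric calls for samples that are already well formed.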