Overview
An audit of the AIL spec (spec/core/, spec/runner/) against the current v0.2 implementation reveals a significant number of specced-but-unimplemented features, plus several functional modalities that are absent from the spec entirely yet critical, given how LLMs are used in practice and what research shows improves accuracy, recall, and capability.
This issue provides a priority ranking by external value potential (considering both weakness mitigation and pipeline application value), followed by identification of missing modalities.
Priority Ranking by External Value
Tier 1 — Without these, pipelines are toys
1. Iterative refinement loops (NOT IN SPEC)
This is the single biggest gap. The spec is purely linear — steps run in sequence — but the dominant pattern in real-world agent use is generate → test → fix → retest loops; every effective coding agent today works this way. on_result branching can fake a single retry, but there is no bounded loop construct (max_iterations, exit condition) to express how people actually use LLMs for anything non-trivial.
2. Error handling — on_error (§16)
Without retry/fallback, any production pipeline is fragile. LLM calls fail — rate limits, network errors, malformed output, context window overflow. This is table stakes for anyone running AIL unattended. The spec already has the right design (continue, retry + max_retries, abort_pipeline); it just needs implementation.
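A minimal sketch of the specced design — the action names (continue, retry, abort_pipeline) and max_retries come from §16, but the exact field shape shown here is an assumption:

```yaml
- id: generate-fix
  prompt: "Fix the failing tests."
  on_error:
    action: retry          # per §16; alternatives: continue, abort_pipeline
    max_retries: 3         # bound the retry loop
```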
3. Condition expressions (§12.2)
always/never is a binary switch. The entire value proposition of a pipeline over a single prompt is conditional flow: "if tests fail, fix them; if the diff is large, request review." Without real conditions, AIL is a sequential prompt chain with extra YAML. The named conditions (if_code_changed, if_files_modified) plus general expressions are what make pipelines adaptive.
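Illustrative sketch only — the named conditions are from §12.1/§12.2, but the key name and expression syntax shown here are assumptions, not confirmed spec syntax:

```yaml
- id: fix-tests
  when: if_code_changed                           # named condition (§12.1)
  prompt: "Tests failed; fix them."

- id: request-review
  when: "{{ step.diff.lines_changed }} > 500"     # general expression (§12.2)
  prompt: "This diff is large; summarize it for human review."
```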
4. Parallel execution (§21)
Running independent steps sequentially is pure waste — lint and test don't depend on each other. More importantly, this unlocks the ensemble/comparison patterns (Tier 2). Even basic grouped parallelism would cut pipeline latency significantly.
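Even the basic grouped form would cover the lint/test case. A hypothetical grouping syntax (the parallel: key is an assumption; §21 leaves the shape open):

```yaml
- id: checks
  parallel:                 # run children concurrently, join before next step
    - id: lint
      context:
        shell: "cargo clippy"
    - id: test
      context:
        shell: "cargo test"
```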
Tier 2 — Differentiation from "just run Claude"
5. Multi-provider routing and comparison (§21, D-020)
This is where AIL becomes genuinely more capable than a single agent session. Three research-backed patterns:
- Model cascading: try the cheap/fast model first, escalate to expensive model if confidence is low. Research shows 50-80% cost reduction with minimal quality loss.
- Ensemble voting: multiple models answer, select the consensus or best answer. Improves reliability on factual/reasoning tasks.
- Cross-model verification: one model generates, a different model critiques. Reduces single-model blind spots.
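A cascading sketch under assumed syntax — the provider aliases, the when: key, and the judge-score extraction are all illustrative, not specced:

```yaml
# Hypothetical cascade: cheap model drafts, a judge scores, escalate on low score
- id: draft
  provider: cheap-model              # alias per §15; names illustrative
  prompt: "Answer the question."
- id: judge
  prompt: "Score the draft 1-10 against the rubric. Reply with the number only."
- id: escalate
  when: "{{ step.judge.response }} < 7"   # assumed expression syntax
  provider: frontier-model
  prompt: "Answer the question."
```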
6. Structured I/O schemas (§21)
Validate step outputs against a JSON Schema, with automatic retry on validation failure. This is the standard pattern for reliable LLM pipelines — generate, validate, retry. Without it, downstream steps receive garbage and propagate errors silently.
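A sketch of what this could look like — output_schema: and on_schema_failure: are hypothetical key names; §21 does not fix a shape:

```yaml
- id: extract-errors
  prompt: "List all compile errors as JSON."
  output_schema:                 # hypothetical: validate response against JSON Schema
    type: object
    properties:
      errors:
        type: array
        items: { type: string }
    required: [errors]
  on_schema_failure:             # hypothetical: retry with the validation error fed back
    action: retry
    max_retries: 2
```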
7. Evaluation / self-critique steps (NOT IN SPEC)
Research strongly supports LLM-as-judge patterns: have a model score or accept/reject output against rubric criteria. This is fundamentally different from on_result: contains: (string matching). A first-class evaluate: step type that takes criteria and returns a structured verdict would make pipelines self-correcting. Constitutional AI, chain-of-verification, and self-consistency all depend on this primitive.
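A sketch of the proposed step type — evaluate: and every key inside it are hypothetical, since this is not in the spec:

```yaml
- id: review
  evaluate:                          # proposed step type, not in spec
    target: "{{ step.draft.response }}"
    criteria:
      - "No fabricated APIs"
      - "Every test referenced actually exists"
    verdict: accept_reject           # structured verdict: accept/reject plus reasons
```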
8. before:/then: chains (§5.7, §5.10)
Composability. Lets you attach pre/post processing to any step without flattening everything into the main sequence. This is how reusable step fragments work.
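A sketch of the chained form — the before:/then: keys are specced (§5.10, §5.7), but the exact nesting shown is an assumption:

```yaml
- id: implement
  prompt: "Implement the feature."
  before:                        # §5.10: runs before the step
    - id: baseline
      context:
        shell: "git diff --stat"
  then:                          # §5.7: runs after the step
    - id: verify
      context:
        shell: "cargo test"
```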
Tier 3 — Production readiness
9. FROM inheritance (§7)
Reusable pipeline templates. Without this, every project copies boilerplate. Important for ecosystem growth but not for individual pipeline value.
10. Cost/token budgets (NOT IN SPEC)
No concept of per-step or per-run token limits, cost caps, or budget-aware model selection. In production, runaway costs from long context or retry loops are a real operational risk. A budget: field per step or per pipeline would enable model cascading and prevent cost blowouts.
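A sketch of the proposed field — budget: and its sub-keys are hypothetical, since no budget concept exists in the spec:

```yaml
budget:                      # hypothetical pipeline-level cap
  max_cost_usd: 2.50
  max_tokens: 200000

steps:
  - id: summarize
    budget:                  # hypothetical per-step cap; exceeding it aborts or escalates
      max_tokens: 8000
    prompt: "Summarize the diff."
```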
11. Safety guardrails (§21)
Enterprise adoption requires allowlists, blocklists, and output filtering. Not exciting, but a hard gate for certain buyers.
12. Dry run mode (§21)
Essential for pipeline development. Render resolved prompts without making LLM calls. Cheap to implement, high quality-of-life value.
13. Observability — OpenTelemetry (§21)
Production debugging. The turn log is good for audit; OpenTelemetry is good for understanding latency, cost, and failure patterns in real time.
14. HITL modify action (§13.2) + richer human-in-the-loop
pause_for_human is a no-op. The real pattern is richer: a human reviews the output, provides feedback, and the pipeline incorporates that feedback and continues. Approval gates should also display the relevant context.
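A sketch of a richer gate — the modify action is from §13.2, but display:, actions:, and on_modify: are hypothetical:

```yaml
- id: review-gate
  pause_for_human:
    display: "{{ step.draft.response }}"   # hypothetical: show context at the gate
    actions: [approve, reject, modify]     # modify per §13.2
    on_modify:
      inject_as: human_feedback            # hypothetical: expose feedback to later steps
```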
Tier 4 — Ecosystem / long-term
15. Skill parameters & built-in modules (§14)
16. Named pipelines (§10)
17. Pipeline registry & versioning (§21)
18. Plugin extensibility (§21)
19. Self-modifying pipelines (§21) — interesting but risky; low trust without strong safety primitives first
Major Missing Modalities (Not in Spec)
These are not in the spec at all and represent gaps relative to how LLMs actually work and what research shows matters:
A. Loops / Bounded Iteration
The most critical gap. Every effective agent system today uses generate-test-fix cycles. The spec has no loop primitive. You need something like:
```yaml
- id: fix-loop
  loop:
    max_iterations: 5
    exit_when: "{{ step.test.exit_code }} == 0"
    steps:
      - id: fix
        prompt: "Fix the failing tests..."
      - id: test
        context:
          shell: "cargo test"
```
Without this, AIL can't express the most common agent pattern in existence.
B. Sampling Parameter Control
No way to set temperature, top-p, or reasoning mode per step. Research is clear: low temperature for code/factual tasks, higher for brainstorming/exploration. Best-of-N sampling (generate N candidates, pick the best) is a proven technique. Extended thinking / chain-of-thought toggles per step matter too — some steps need deep reasoning, others need fast responses.
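A per-step sketch — the sampling: block and every key in it are hypothetical, since the spec has no sampling controls:

```yaml
- id: brainstorm
  sampling:                    # hypothetical per-step block
    temperature: 1.0           # high temperature for exploration
    top_p: 0.95
    n: 5                       # best-of-N: generate 5 candidates, pick the best
  prompt: "Propose five refactoring strategies."

- id: implement
  sampling:
    temperature: 0.2           # low temperature for code
    reasoning: extended        # hypothetical extended-thinking toggle
  prompt: "Implement the chosen strategy."
```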
C. Retrieval / Context Injection
context: shell: and context: mcp: exist, but there's no first-class retrieval primitive — "search these files/this vector store for content relevant to the prompt and inject it." RAG is the dominant accuracy-improvement technique. A context: retrieve: or context: search: type would be high value.
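A sketch of the proposed context type — retrieve: and its sub-keys are hypothetical, since no retrieval primitive exists in the spec:

```yaml
- id: answer
  context:
    retrieve:                        # proposed primitive, not in spec
      source: "docs/**/*.md"         # files or a vector store
      query: "{{ pipeline.input }}"
      top_k: 8                       # inject the 8 most relevant chunks
  prompt: "Answer using the retrieved context."
```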
D. Caching / Memoization
If a context: shell: step returns the same result as last run (e.g., cargo test with no code changes), reuse it. If a prompt+context hash matches a prior run, optionally reuse the LLM response. This saves significant cost and latency in iterative development, where you re-run pipelines frequently with small changes.
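A sketch of the proposed field — cache: and the hash/files helpers are hypothetical:

```yaml
- id: test
  context:
    shell: "cargo test"
  cache:                                   # hypothetical
    key: "{{ hash(files('src/**')) }}"     # reuse result when sources are unchanged
    ttl: 1h                                # expire cached results after an hour
```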
E. Output Transformation / Extraction
No way to extract structured data from a step's response before passing it to the next step. In practice you often need: "take the JSON from step A's response, extract the errors array, pass each error to step B." A transform: or extract: directive (regex, JSONPath, or LLM-based extraction) would bridge steps cleanly.
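A sketch of the proposed directive — extract:, the JSONPath selector, and for_each: fan-out are all hypothetical:

```yaml
- id: analyze
  prompt: "List compile errors as JSON."
  extract:                                  # hypothetical directive
    errors: "$.errors[*]"                   # JSONPath into the response
- id: fix
  for_each: "{{ step.analyze.errors }}"     # assumed fan-out syntax
  prompt: "Fix this error: {{ item }}"
```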
F. Confidence / Uncertainty Signals
Modern LLM APIs expose logprobs. Research on selective prediction shows you can use confidence scores to decide whether to accept output, retry with more context, or escalate to a better model. No spec concept exists for this. It ties directly into model cascading (Tier 2, #5).
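A sketch of how this could hook into on_result — the confidence threshold key and escalation shape are hypothetical:

```yaml
- id: classify
  prompt: "Classify the ticket."
  on_result:
    confidence_below: 0.8            # hypothetical: threshold on logprob-derived confidence
    then:
      - id: classify-strong
        provider: frontier-model     # escalate to a stronger model
        prompt: "Classify the ticket."
```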
Recommended Build Order
If maximizing external value per unit of effort:
| Order | Feature | Type |
|-------|---------|------|
| 1 | Loops / bounded iteration | NEW — spec + implement |
| 2 | on_error with retry (§16) | Specced — implement |
| 3 | Condition expressions (§12.2) | Specced — implement |
| 4 | Evaluation / critique steps | NEW — spec + implement |
| 5 | Parallel execution (§21) | Specced — implement |
| 6 | Structured I/O schemas (§21) | Specced — implement |
| 7 | Multi-provider routing (§21) | Specced — implement |
| 8 | Sampling parameter control | NEW — spec + implement |
Items 1–3 are foundational. Items 4–8 are what would make AIL genuinely better than running a single agent session.
Full Inventory: Specced but Unimplemented
For completeness, here is every specced feature that is not yet implemented:
| Feature | Spec Section | Status |
|---------|--------------|--------|
| skill: step type | §6 | Rejected at parse time |
| before: chains | §5.10 | Deferred |
| then: chains | §5.7 | Deferred |
| FROM inheritance | §7 | Deferred |
| Hook operations | §7.2 | Deferred |
| Hook ordering (onion model) | §8 | Deferred |
| Named pipelines | §10 | Deferred |
| Condition expressions | §12.2 | Deferred |
| Named conditions beyond always/never | §12.1 | Partial |
| Provider string format & aliases | §15 | Deferred |
| HITL modify action | §13.2 | Deferred |
| on_error handling | §16 | Deferred |
| Built-in modules | §14 | Deferred |
| materialize --expand-pipelines | §17 | Partial |
| Clean session creation | §4.4, §22 | Deferred |
| Full step.\<id\>.turns[] access | §22 | Deferred |
| Parallel step execution | §21 | Planned |
| Multi-provider comparison | §21 | Planned (D-020) |
| Structured I/O schemas | §21 | Planned |
| Remote FROM targets | §21 | Planned |
| Native LLM provider support | §21 | Planned |
| Self-modifying pipelines | §21 | Planned |
| Direct MCP tool invocation (Mode 2) | §21 | Exploratory |
| Safety guardrails | §21 | Exploratory |
| Dry run mode | §21 | Planned |
| Step output visibility (display:) | §21 | Planned |
| Template variable fallbacks | §21 | Planned |
| Observability block | §21 | Exploratory |
| Plugin extensibility (x- prefix) | §21 | Exploratory |
| Pipeline registry & versioning | §21 | Planned |