Overview
An audit of the AIL spec (spec/core/, spec/runner/) against the current v0.2 implementation reveals a significant number of specced-but-unimplemented features, plus several functional modalities that are absent from the spec entirely yet critical, given how LLMs are used in practice and what research shows improves accuracy, recall, and capability.
This issue provides a priority ranking by external value potential (considering both weakness mitigation and pipeline application value), followed by identification of missing modalities.
Priority Ranking by External Value
Tier 1 — Without these, pipelines are toys
1. Iterative refinement loops (NOT IN SPEC)
This is the single biggest gap. The spec is purely linear — steps run in sequence — but the dominant pattern in real-world agent use is generate → test → fix → retest loops; every effective coding agent today works this way. on_result branching can fake a single retry, but there is no bounded loop construct (max_iterations, exit condition) to express how people actually use LLMs for anything non-trivial.
2. Error handling — on_error (§16)
Without retry/fallback, any production pipeline is fragile. LLM calls fail — rate limits, network errors, malformed output, context window overflow. This is table stakes for anyone running AIL unattended. The spec already has the right design (continue, retry + max_retries, abort_pipeline); it just needs implementation.
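A minimal sketch of the specced design — the action names (continue, retry, abort_pipeline) and max_retries come from §16, but the exact field shape shown here is an assumption:

```yaml
- id: generate-fix
  prompt: "Fix the failing tests."
  on_error:
    action: retry          # per §16; alternatives: continue, abort_pipeline
    max_retries: 3         # bound the retry loop
```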
3. Condition expressions (§12.2)
always/never is a binary switch. The entire value proposition of a pipeline over a single prompt is conditional flow: "if tests fail, fix them; if the diff is large, request review." Without real conditions, AIL is a sequential prompt chain with extra YAML. The named conditions (if_code_changed, if_files_modified) plus general expressions are what make pipelines adaptive.
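Illustrative sketch only — the named conditions are from §12.1/§12.2, but the key name and expression syntax shown here are assumptions, not confirmed spec syntax:

```yaml
- id: fix-tests
  when: if_code_changed                           # named condition (§12.1)
  prompt: "Tests failed; fix them."

- id: request-review
  when: "{{ step.diff.lines_changed }} > 500"     # general expression (§12.2)
  prompt: "This diff is large; summarize it for human review."
```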
4. Parallel execution (§21)
Running independent steps sequentially is pure waste — lint and test don't depend on each other. More importantly, this unlocks the ensemble/comparison patterns (Tier 2). Even basic grouped parallelism would cut pipeline latency significantly.
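Even the basic grouped form would cover the lint/test case. A hypothetical grouping syntax (the parallel: key is an assumption; §21 leaves the shape open):

```yaml
- id: checks
  parallel:                 # run children concurrently, join before next step
    - id: lint
      context:
        shell: "cargo clippy"
    - id: test
      context:
        shell: "cargo test"
```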
Tier 2 — Differentiation from "just run Claude"
5. Multi-provider routing and comparison (§21, D-020)
This is where AIL becomes genuinely more capable than a single agent session. Three research-backed patterns:
- Model cascading: try the cheap/fast model first, escalate to expensive model if confidence is low. Research shows 50-80% cost reduction with minimal quality loss.
- Ensemble voting: multiple models answer, select the consensus or best answer. Improves reliability on factual/reasoning tasks.
- Cross-model verification: one model generates, a different model critiques. Reduces single-model blind spots.
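A cascading sketch under assumed syntax — the provider aliases, the when: key, and the judge-score extraction are all illustrative, not specced:

```yaml
# Hypothetical cascade: cheap model drafts, a judge scores, escalate on low score
- id: draft
  provider: cheap-model              # alias per §15; names illustrative
  prompt: "Answer the question."
- id: judge
  prompt: "Score the draft 1-10 against the rubric. Reply with the number only."
- id: escalate
  when: "{{ step.judge.response }} < 7"   # assumed expression syntax
  provider: frontier-model
  prompt: "Answer the question."
```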
6. Structured I/O schemas (§21)
Validate step outputs against a JSON Schema, with automatic retry on validation failure. This is the standard pattern for reliable LLM pipelines — generate, validate, retry. Without it, downstream steps receive garbage and propagate errors silently.
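A sketch of what this could look like — output_schema: and on_schema_failure: are hypothetical key names; §21 does not fix a shape:

```yaml
- id: extract-errors
  prompt: "List all compile errors as JSON."
  output_schema:                 # hypothetical: validate response against JSON Schema
    type: object
    properties:
      errors:
        type: array
        items: { type: string }
    required: [errors]
  on_schema_failure:             # hypothetical: retry with the validation error fed back
    action: retry
    max_retries: 2
```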
7. Evaluation / self-critique steps (NOT IN SPEC)
Research strongly supports LLM-as-judge patterns: have a model score or accept/reject output against rubric criteria. This is fundamentally different from on_result: contains: (string matching). A first-class evaluate: step type that takes criteria and returns a structured verdict would make pipelines self-correcting. Constitutional AI, chain-of-verification, and self-consistency all depend on this primitive.
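A sketch of the proposed step type — evaluate: and every key inside it are hypothetical, since this is not in the spec:

```yaml
- id: review
  evaluate:                          # proposed step type, not in spec
    target: "{{ step.draft.response }}"
    criteria:
      - "No fabricated APIs"
      - "Every test referenced actually exists"
    verdict: accept_reject           # structured verdict: accept/reject plus reasons
```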
8. before:/then: chains (§5.7, §5.10)
Composability. Lets you attach pre/post processing to any step without flattening everything into the main sequence. This is how reusable step fragments work.
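A sketch of the chained form — the before:/then: keys are specced (§5.10, §5.7), but the exact nesting shown is an assumption:

```yaml
- id: implement
  prompt: "Implement the feature."
  before:                        # §5.10: runs before the step
    - id: baseline
      context:
        shell: "git diff --stat"
  then:                          # §5.7: runs after the step
    - id: verify
      context:
        shell: "cargo test"
```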
Tier 3 — Production readiness
9. FROM inheritance (§7)
Reusable pipeline templates. Without this, every project copies boilerplate. Important for ecosystem growth but not for individual pipeline value.
10. Cost/token budgets (NOT IN SPEC)
No concept of per-step or per-run token limits, cost caps, or budget-aware model selection. In production, runaway costs from long context or retry loops are a real operational risk. A budget: field per step or per pipeline would enable model cascading and prevent cost blowouts.
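A sketch of the proposed field — budget: and its sub-keys are hypothetical, since no budget concept exists in the spec:

```yaml
budget:                      # hypothetical pipeline-level cap
  max_cost_usd: 2.50
  max_tokens: 200000

steps:
  - id: summarize
    budget:                  # hypothetical per-step cap; exceeding it aborts or escalates
      max_tokens: 8000
    prompt: "Summarize the diff."
```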
11. Safety guardrails (§21)
Enterprise adoption requires allowlists, blocklists, and output filtering. Not exciting, but a hard gate for certain buyers.
12. Dry run mode (§21)
Essential for pipeline development. Render resolved prompts without making LLM calls. Cheap to implement, high quality-of-life value.
13. Observability — OpenTelemetry (§21)
Production debugging. The turn log is good for audit; OpenTelemetry is good for understanding latency, cost, and failure patterns in real time.
14. HITL modify action (§13.2) + richer human-in-the-loop
pause_for_human is a no-op. The real pattern is richer: a human reviews the output, provides feedback, and the pipeline incorporates that feedback and continues. Approval gates should also display the relevant context.
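A sketch of a richer gate — the modify action is from §13.2, but display:, actions:, and on_modify: are hypothetical:

```yaml
- id: review-gate
  pause_for_human:
    display: "{{ step.draft.response }}"   # hypothetical: show context at the gate
    actions: [approve, reject, modify]     # modify per §13.2
    on_modify:
      inject_as: human_feedback            # hypothetical: expose feedback to later steps
```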
Tier 4 — Ecosystem / long-term
15. Skill parameters & built-in modules (§14)
16. Named pipelines (§10)
17. Pipeline registry & versioning (§21)
18. Plugin extensibility (§21)
19. Self-modifying pipelines (§21) — interesting but risky; low trust without strong safety primitives first
Major Missing Modalities (Not in Spec)
These are not in the spec at all and represent gaps relative to how LLMs actually work and what research shows matters:
A. Loops / Bounded Iteration
The most critical gap. Every effective agent system today uses generate-test-fix cycles. The spec has no loop primitive. You need something like:
```yaml
- id: fix-loop
  loop:
    max_iterations: 5
    exit_when: "{{ step.test.exit_code }} == 0"
    steps:
      - id: fix
        prompt: "Fix the failing tests..."
      - id: test
        context:
          shell: "cargo test"
```
Without this, AIL can't express the most common agent pattern in existence.
B. Sampling Parameter Control
No way to set temperature, top-p, or reasoning mode per step. Research is clear: low temperature for code/factual tasks, higher for brainstorming/exploration. Best-of-N sampling (generate N candidates, pick the best) is a proven technique. Extended thinking / chain-of-thought toggles per step matter too — some steps need deep reasoning, others need fast responses.
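A per-step sketch — the sampling: block and every key in it are hypothetical, since the spec has no sampling controls:

```yaml
- id: brainstorm
  sampling:                    # hypothetical per-step block
    temperature: 1.0           # high temperature for exploration
    top_p: 0.95
    n: 5                       # best-of-N: generate 5 candidates, pick the best
  prompt: "Propose five refactoring strategies."

- id: implement
  sampling:
    temperature: 0.2           # low temperature for code
    reasoning: extended        # hypothetical extended-thinking toggle
  prompt: "Implement the chosen strategy."
```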
C. Retrieval / Context Injection
context: shell: and context: mcp: exist, but there's no first-class retrieval primitive — "search these files/this vector store for content relevant to the prompt and inject it." RAG is the dominant accuracy-improvement technique. A context: retrieve: or context: search: type would be high value.
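A sketch of the proposed context type — retrieve: and its sub-keys are hypothetical, since no retrieval primitive exists in the spec:

```yaml
- id: answer
  context:
    retrieve:                        # proposed primitive, not in spec
      source: "docs/**/*.md"         # files or a vector store
      query: "{{ pipeline.input }}"
      top_k: 8                       # inject the 8 most relevant chunks
  prompt: "Answer using the retrieved context."
```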
D. Caching / Memoization
If a context: shell: step returns the same result as last run (e.g., cargo test with no code changes), reuse it. If a prompt+context hash matches a prior run, optionally reuse the LLM response. This saves significant cost and latency in iterative development, where you re-run pipelines frequently with small changes.
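A sketch of the proposed field — cache: and the hash/files helpers are hypothetical:

```yaml
- id: test
  context:
    shell: "cargo test"
  cache:                                   # hypothetical
    key: "{{ hash(files('src/**')) }}"     # reuse result when sources are unchanged
    ttl: 1h                                # expire cached results after an hour
```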
E. Output Transformation / Extraction
No way to extract structured data from a step's response before passing it to the next step. In practice you often need: "take the JSON from step A's response, extract the errors array, pass each error to step B." A transform: or extract: directive (regex, JSONPath, or LLM-based extraction) would bridge steps cleanly.
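A sketch of the proposed directive — extract:, the JSONPath selector, and for_each: fan-out are all hypothetical:

```yaml
- id: analyze
  prompt: "List compile errors as JSON."
  extract:                                  # hypothetical directive
    errors: "$.errors[*]"                   # JSONPath into the response
- id: fix
  for_each: "{{ step.analyze.errors }}"     # assumed fan-out syntax
  prompt: "Fix this error: {{ item }}"
```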
F. Confidence / Uncertainty Signals
Modern LLM APIs expose logprobs. Research on selective prediction shows you can use confidence scores to decide whether to accept output, retry with more context, or escalate to a better model. No spec concept exists for this. It ties directly into model cascading (Tier 2, #5).
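A sketch of how this could hook into on_result — the confidence threshold key and escalation shape are hypothetical:

```yaml
- id: classify
  prompt: "Classify the ticket."
  on_result:
    confidence_below: 0.8            # hypothetical: threshold on logprob-derived confidence
    then:
      - id: classify-strong
        provider: frontier-model     # escalate to a stronger model
        prompt: "Classify the ticket."
```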
Recommended Build Order
If maximizing external value per unit of effort:
| Order | Feature | Type |
|-------|---------|------|
| 1 | Loops / bounded iteration | NEW — spec + implement |
| 2 | on_error with retry (§16) | Specced — implement |
| 3 | Condition expressions (§12.2) | Specced — implement |
| 4 | Evaluation / critique steps | NEW — spec + implement |
| 5 | Parallel execution (§21) | Specced — implement |
| 6 | Structured I/O schemas (§21) | Specced — implement |
| 7 | Multi-provider routing (§21) | Specced — implement |
| 8 | Sampling parameter control | NEW — spec + implement |
Items 1–3 are foundational. Items 4–8 are what would make AIL genuinely better than running a single agent session.
Full Inventory: Specced but Unimplemented
For completeness, here is every specced feature that is not yet implemented:
| Feature | Spec Section | Status |
|---------|--------------|--------|
| skill: step type | §6 | Rejected at parse time |
| before: chains | §5.10 | Deferred |
| then: chains | §5.7 | Deferred |
| FROM inheritance | §7 | Deferred |
| Hook operations | §7.2 | Deferred |
| Hook ordering (onion model) | §8 | Deferred |
| Named pipelines | §10 | Deferred |
| Condition expressions | §12.2 | Deferred |
| Named conditions beyond always/never | §12.1 | Partial |
| Provider string format & aliases | §15 | Deferred |
| HITL modify action | §13.2 | Deferred |
| on_error handling | §16 | Deferred |
| Built-in modules | §14 | Deferred |
| materialize --expand-pipelines | §17 | Partial |
| Clean session creation | §4.4, §22 | Deferred |
| Full step.\<id\>.turns[] access | §22 | Deferred |
| Parallel step execution | §21 | Planned |
| Multi-provider comparison | §21 | Planned (D-020) |
| Structured I/O schemas | §21 | Planned |
| Remote FROM targets | §21 | Planned |
| Native LLM provider support | §21 | Planned |
| Self-modifying pipelines | §21 | Planned |
| Direct MCP tool invocation (Mode 2) | §21 | Exploratory |
| Safety guardrails | §21 | Exploratory |
| Dry run mode | §21 | Planned |
| Step output visibility (display:) | §21 | Planned |
| Template variable fallbacks | §21 | Planned |
| Observability block | §21 | Exploratory |
| Plugin extensibility (x- prefix) | §21 | Exploratory |
| Pipeline registry & versioning | §21 | Planned |