Add Burbot PESA runtime by shouc · Pull Request #63 · berabuddies/puffer

shouc · 2026-04-28T21:37:17Z

Summary

Implements Burbot-only PESA runtime behavior: contract-scored plan graph scheduling, recoverable verifier/repair expansion, epoch-aware duplicate handling, and background Bash support through Puffer tools without changing shared Puffer behavior.
Removes Burbot runtime reliance on natural-language/tool-string parsing for planning equivalence, repair binding, and failure classification. Burbot now uses direct semantic contract slots and structured JSON output only; imported command intent extractors and regex-bound repair rules are ignored/dropped in Burbot.
Adds retryable model-provider handling that records 503/transport failures as inconclusive model availability, schedules a contract-declared wait only when no useful frontier exists, and blocks repeated same-epoch provider retry waits so outages stall cleanly instead of spinning forever.
Adds structural progress evidence for unknown-side-effect model actions: state epochs no longer advance just because an opaque command exited successfully, and empty-output model terminal actions become no_progress failures without parsing shell text.
Adds a Burbot OpenAI-compatible Chat Completions fallback for API-key-authenticated endpoints that do not support the Responses API, and lets OPENAI_BASE_URL override the configured OpenAI base URL for env API-key runs. Verified against DeepSeek deepseek-v4-pro via https://api.deepseek.com.
Keeps egg as a bounded optimizer for read-only structured intent slices only; it no longer optimizes Bash command text.
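
The structural-progress rule from the Summary (a successful exit code alone does not advance the state epoch, and empty output becomes a no_progress failure) might look roughly like the sketch below. This is an illustration under assumptions, not Burbot's code: `ActionResult`, `Progress`, `classify`, and `next_epoch` are hypothetical names, and the idea that a recorded structural witness is what advances the epoch is an assumption. Only the `no_progress` label comes from the description.

```rust
/// Hypothetical record of one model-proposed action with unknown side effects.
#[derive(Debug, Clone, PartialEq)]
pub struct ActionResult {
    /// Process exit status; deliberately NOT used as progress evidence.
    pub exit_code: i32,
    /// Number of output bytes; only the length is inspected, never the text.
    pub output_len: usize,
    /// Assumption: true when a structural witness (e.g. a changed file digest) was recorded.
    pub state_changed: bool,
}

/// Hypothetical progress classification.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Progress {
    /// Structural evidence of a state change exists.
    Advanced,
    /// The action produced output but no structural evidence of change.
    Unchanged,
    /// The no_progress failure: no structural change and empty output.
    NoProgress,
}

/// Classify an action from structural fields only; no shell text is parsed.
pub fn classify(r: &ActionResult) -> Progress {
    if r.state_changed {
        Progress::Advanced
    } else if r.output_len == 0 {
        Progress::NoProgress
    } else {
        Progress::Unchanged
    }
}

/// The epoch advances only on structural progress, never on exit code alone.
pub fn next_epoch(epoch: u64, r: &ActionResult) -> u64 {
    match classify(r) {
        Progress::Advanced => epoch + 1,
        Progress::Unchanged | Progress::NoProgress => epoch,
    }
}
```

In this sketch an opaque command that exits 0 with some output but no witness leaves the epoch unchanged, which is what keeps epoch-aware duplicate handling from treating a repeated no-op as new state.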

Verification

cargo fmt --package puffer-burbot
cargo test -p puffer-burbot (123 passed)
cargo test -p puffer-burbot llm::tests:: (23 passed)
cargo build -p puffer-burbot --bin burbot --target x86_64-unknown-linux-musl
git diff --check
DeepSeek smoke probe through Burbot: deepseek-v4-pro used https://api.deepseek.com/v1/chat/completions and returned {"burbot":"ok"}.
Earlier cargo test --workspace was attempted, but puffer-core had pre-existing snapshot/config/workflow failures and one LSP shutdown test hung; the run was terminated after failures were already present.
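
For the DeepSeek smoke probe above, the base-URL override and endpoint construction can be sketched as follows. This is a minimal sketch under stated assumptions: `resolve_base_url` and `chat_completions_url` are hypothetical helpers, and the rule of appending `/v1` only when the base does not already end in it is inferred from the probe URL, not taken from Burbot's source.

```rust
/// Pick the base URL: a non-empty environment override wins over the configured value.
/// `env_override` stands in for the value of OPENAI_BASE_URL, if set.
pub fn resolve_base_url(env_override: Option<&str>, configured: &str) -> String {
    let chosen = match env_override {
        Some(v) if !v.trim().is_empty() => v.trim(),
        _ => configured,
    };
    // Normalize so later joins never produce a double slash.
    chosen.trim_end_matches('/').to_string()
}

/// Build the Chat Completions endpoint for a base URL.
/// Assumption: a base without a trailing `/v1` gets `/v1` inserted.
pub fn chat_completions_url(base: &str) -> String {
    let base = base.trim_end_matches('/');
    if base.ends_with("/v1") {
        format!("{}/chat/completions", base)
    } else {
        format!("{}/v1/chat/completions", base)
    }
}
```

A caller would typically pass `std::env::var("OPENAI_BASE_URL").ok().as_deref()` as the first argument, so an unset variable falls through to the configured URL.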

TB2 status with gpt-5.4-mini/local Codex auth

burbot-gpt54mini-structural-20260428: distribution-search reward 1.0, code-from-image reward 1.0, cobol-modernization reward 0.0, feal-linear-cryptanalysis stalled/reward 0.0, kv-store-grpc stalled/reward 0.0.
burbot-gpt54mini-structural2-20260428 after prompt tightening: cobol-modernization, feal-linear-cryptanalysis, and kv-store-grpc all stalled/reward 0.0, dominated by repeated OpenAI 503 proposal failures after the initial cheap observation.
burbot-gpt54mini-unsolved-runtimefix-20260429 after provider-retry liveness fixes: 0/3 solved, all three stalled cleanly after repeated OpenAI 503s; no infinite retry loop.
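
The "stalled cleanly, no infinite retry loop" behavior in the last run can be illustrated with a small gate that allows at most one provider-retry wait per state epoch. This is a hedged sketch, not the actual runtime: `RetryGate`, `Decision`, and `is_retryable` are hypothetical names, and modeling a transport failure as a missing HTTP status is an assumption.

```rust
use std::collections::HashSet;

/// Hypothetical scheduler decision after a retryable provider failure.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Decision {
    /// Useful frontier work exists; run it instead of waiting.
    UseFrontier,
    /// Nothing useful to run; take the one contract-declared wait for this epoch.
    Wait,
    /// A wait was already taken in this epoch; stall instead of spinning.
    Stall,
}

/// Tracks which epochs have already consumed their provider-retry wait.
#[derive(Default)]
pub struct RetryGate {
    waited_epochs: HashSet<u64>,
}

impl RetryGate {
    /// Decide what to do after an inconclusive (503 or transport) model failure.
    pub fn on_provider_failure(&mut self, epoch: u64, useful_frontier: usize) -> Decision {
        if useful_frontier > 0 {
            return Decision::UseFrontier;
        }
        // `insert` returns false when the epoch was already recorded.
        if self.waited_epochs.insert(epoch) {
            Decision::Wait
        } else {
            Decision::Stall
        }
    }
}

/// Assumption: `None` models a transport failure that produced no HTTP status.
pub fn is_retryable(status: Option<u16>) -> bool {
    matches!(status, None | Some(503))
}
```

Because the gate is keyed by epoch, a later epoch (one reached through real structural progress) gets a fresh wait, while a provider outage within a single epoch ends in a stall after one wait.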

TB2 status with DeepSeek V4 Pro via OpenAI-compatible connector

Tag: burbot-deepseek-v4-pro-unsolved-20260429
Harness summary: 0/3 solved because all three were agent failures under Harbor accounting.
Reward-level result: cobol-modernization reward 1.0, feal-linear-cryptanalysis reward 0.0, kv-store-grpc reward 1.0.
cobol-modernization and kv-store-grpc reached passing verifier state but hit AgentTimeoutError at 900s before Burbot declared verified completion.
feal-linear-cryptanalysis stalled with open_actions=0; trace showed model response-read failures and one structurally invalid fallback proposal with an extra id field.
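
The "structurally invalid fallback proposal with an extra id field" failure is the kind of thing a strict slot check rejects. A minimal sketch of such a check, assuming hypothetical slot names (`intent`, `target`, `expects`) and a hypothetical `check_proposal` helper that receives the top-level keys of an already-parsed JSON object:

```rust
/// Hypothetical set of contract slots a proposal may contain.
const ALLOWED_SLOTS: [&str; 3] = ["intent", "target", "expects"];

/// Return Ok(()) when every key is an allowed slot, otherwise the unexpected keys
/// in their original order. Only key names are compared; no values or free text are parsed.
pub fn check_proposal<'a>(keys: &[&'a str]) -> Result<(), Vec<&'a str>> {
    let extra: Vec<&'a str> = keys
        .iter()
        .copied()
        .filter(|k| !ALLOWED_SLOTS.contains(k))
        .collect();
    if extra.is_empty() {
        Ok(())
    } else {
        Err(extra)
    }
}
```

With serde-based parsing, the same effect is usually obtained by putting `#[serde(deny_unknown_fields)]` on the proposal struct. Whether Burbot does this is not stated in the PR.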

Remaining risks

Burbot can now use DeepSeek's OpenAI-compatible Chat Completions endpoint, but DeepSeek model latency plus Burbot's verifier/repair sequencing can exceed the Harbor 900s agent timeout even when the task state is already correct.
Burbot still needs better completion liveness: when hidden/verifier-equivalent evidence is already sufficient, it should declare completion faster instead of continuing to add verifier/repair candidates until the benchmark times out.
The standalone eval harness still has substring-based output expectations; these are not used by runtime planning, repair, failure, or equivalence decisions.

Captures the working state from the Terminal Bench 2.0 evaluation session: forced goal-check, yolo mode, frontier-free verifier context, harness refactor (no python3-in-container), model policy, artifact review, dependencies/filesystem witness/observation modules, plus a detailed architecture writeup in crates/puffer-burbot/burbot.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

shouc and others added 5 commits April 28, 2026 14:35

Add Burbot PESA runtime

6c83725

Implement structural PESA runtime for Burbot

69aaf3e

Fix Burbot PESA retry liveness

6f0f4a7

Add Burbot OpenAI-compatible chat fallback

55cbe35

shouc closed this May 19, 2026

shouc wants to merge 5 commits into master from feature/burbot-pesa-runtime
