Add Burbot PESA runtime #63

Closed
shouc wants to merge 5 commits into master from feature/burbot-pesa-runtime

shouc commented Apr 28, 2026

Summary

  • Implements Burbot-only PESA runtime behavior: contract-scored plan graph scheduling, recoverable verifier/repair expansion, epoch-aware duplicate handling, and background Bash support through Puffer tools without changing shared Puffer behavior.
  • Removes Burbot runtime reliance on natural-language/tool-string parsing for planning equivalence, repair binding, and failure classification. Burbot now uses direct semantic contract slots and structured JSON output only; imported command intent extractors and regex-bound repair rules are ignored/dropped in Burbot.
  • Adds retryable model-provider handling that records 503/transport failures as inconclusive model availability, schedules a contract-declared wait only when no useful frontier exists, and blocks repeated same-epoch provider retry waits so outages stall cleanly instead of spinning forever.
  • Adds structural progress evidence for unknown-side-effect model actions: state epochs no longer advance just because an opaque command exited successfully, and empty-output model terminal actions become no_progress failures without parsing shell text.
  • Adds a Burbot OpenAI-compatible Chat Completions fallback for API-key auth endpoints that do not support the Responses API, and lets OPENAI_BASE_URL override the configured OpenAI base URL for env API-key runs. Verified against DeepSeek deepseek-v4-pro via https://api.deepseek.com.
  • Keeps egg as a bounded optimizer for read-only structured intent slices only; it no longer optimizes Bash command text.
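
The structural progress rule from the summary (an opaque command that exits successfully does not advance the state epoch by itself, and an empty-output terminal action counts as no_progress) can be pictured with the minimal sketch below. This is not Burbot's actual code: `ActionOutcome`, `Verdict`, `classify`, `next_epoch`, and the `witnessed_change` field are hypothetical names chosen for illustration, and the sketch assumes some structural witness of a state change is available.

```rust
/// Hypothetical structured record of one model-proposed action whose side
/// effects are unknown. Only structural fields are inspected; the command
/// text and its output are never parsed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ActionOutcome {
    pub exit_code: i32,
    /// Length of the captured output in bytes (the content is not read).
    pub output_len: usize,
    /// Assumed structural evidence that state changed (for example, a
    /// filesystem witness). Hypothetical field.
    pub witnessed_change: bool,
    /// Whether the model marked this as its terminal action.
    pub terminal: bool,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Verdict {
    Advanced,
    Unchanged,
    NoProgress,
    Failed,
}

/// Classify an outcome from structure alone.
pub fn classify(outcome: &ActionOutcome) -> Verdict {
    if outcome.exit_code != 0 {
        Verdict::Failed
    } else if outcome.witnessed_change {
        Verdict::Advanced
    } else if outcome.terminal && outcome.output_len == 0 {
        // Empty-output terminal action: treated as a no_progress failure.
        Verdict::NoProgress
    } else {
        // Exit code 0 alone is not evidence of progress.
        Verdict::Unchanged
    }
}

/// The epoch moves forward only on witnessed change.
pub fn next_epoch(epoch: u64, verdict: Verdict) -> u64 {
    match verdict {
        Verdict::Advanced => epoch + 1,
        _ => epoch,
    }
}
```

Under these assumptions, a successful command with output but no witness leaves the epoch where it was, so epoch-aware duplicate handling still sees the same epoch.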

Verification

  • cargo fmt --package puffer-burbot
  • cargo test -p puffer-burbot (123 passed)
  • cargo test -p puffer-burbot llm::tests:: (23 passed)
  • cargo build -p puffer-burbot --bin burbot --target x86_64-unknown-linux-musl
  • git diff --check
  • DeepSeek smoke probe through Burbot: deepseek-v4-pro used https://api.deepseek.com/v1/chat/completions and returned {"burbot":"ok"}.
  • Earlier cargo test --workspace was attempted, but puffer-core had pre-existing snapshot/config/workflow failures and one LSP shutdown test hung; the run was terminated after failures were already present.
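
The endpoint seen in the DeepSeek smoke probe is consistent with a resolution step like the sketch below, in which an `OPENAI_BASE_URL` value takes precedence over the configured base URL and the Chat Completions path is appended. The function names are hypothetical, and the rule for handling a base URL that already ends in `/v1` is an assumption rather than something the PR states.

```rust
/// Pick the base URL: a non-empty override wins over the configured value.
/// Hypothetical helper; trailing slashes are trimmed for consistency.
pub fn resolve_base_url(env_override: Option<&str>, configured: &str) -> String {
    match env_override.map(str::trim) {
        Some(value) if !value.is_empty() => value.trim_end_matches('/').to_string(),
        _ => configured.trim_end_matches('/').to_string(),
    }
}

/// Build the Chat Completions URL. Assumption: a base that already ends in
/// `/v1` is used as-is, otherwise `/v1` is inserted.
pub fn chat_completions_url(base: &str) -> String {
    let base = base.trim_end_matches('/');
    if base.ends_with("/v1") {
        format!("{}/chat/completions", base)
    } else {
        format!("{}/v1/chat/completions", base)
    }
}

/// Read the override from the process environment.
pub fn base_url_from_env(configured: &str) -> String {
    resolve_base_url(std::env::var("OPENAI_BASE_URL").ok().as_deref(), configured)
}
```

With `OPENAI_BASE_URL=https://api.deepseek.com`, this sketch yields `https://api.deepseek.com/v1/chat/completions`, matching the URL reported in the probe.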

TB2 status with gpt-5.4-mini/local Codex auth

  • burbot-gpt54mini-structural-20260428: distribution-search reward 1.0, code-from-image reward 1.0, cobol-modernization reward 0.0, feal-linear-cryptanalysis stalled/reward 0.0, kv-store-grpc stalled/reward 0.0.
  • burbot-gpt54mini-structural2-20260428 after prompt tightening: cobol-modernization, feal-linear-cryptanalysis, and kv-store-grpc all stalled/reward 0.0, dominated by repeated OpenAI 503 proposal failures after the initial cheap observation.
  • burbot-gpt54mini-unsolved-runtimefix-20260429 after provider-retry liveness fixes: 0/3 solved, all three stalled cleanly after repeated OpenAI 503s; no infinite retry loop.
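
The clean stalls in the last run follow from the provider-retry rule described in the summary: a 503 or transport failure is inconclusive, a wait is scheduled only when the frontier is empty, and a second wait in the same epoch is refused. A minimal sketch of that gating logic follows. `RetryGate`, `Decision`, and `ProviderFailure` are hypothetical names, not Burbot's types, and treating every other HTTP status as a hard failure is an assumption.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ProviderFailure {
    Http(u16),
    Transport,
}

/// 503 and transport errors are treated as inconclusive availability.
pub fn is_retryable(failure: ProviderFailure) -> bool {
    matches!(failure, ProviderFailure::Http(503) | ProviderFailure::Transport)
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Decision {
    /// Other useful work exists, so no wait is scheduled.
    ContinueFrontier,
    /// Schedule one wait for the provider in this epoch.
    WaitForProvider,
    /// A wait already happened in this epoch; stop instead of spinning.
    Stall,
    /// Not a retryable failure (assumed to be handled elsewhere).
    Fail,
}

#[derive(Debug, Default)]
pub struct RetryGate {
    last_wait_epoch: Option<u64>,
}

impl RetryGate {
    pub fn decide(&mut self, failure: ProviderFailure, epoch: u64, frontier_len: usize) -> Decision {
        if !is_retryable(failure) {
            return Decision::Fail;
        }
        if frontier_len > 0 {
            return Decision::ContinueFrontier;
        }
        if self.last_wait_epoch == Some(epoch) {
            return Decision::Stall;
        }
        self.last_wait_epoch = Some(epoch);
        Decision::WaitForProvider
    }
}
```

Because the gate is keyed on the state epoch, a new wait becomes possible only after something has structurally changed, which is what turns a sustained outage into a stall rather than an infinite retry loop.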

TB2 status with DeepSeek V4 Pro via OpenAI-compatible connector

  • Tag: burbot-deepseek-v4-pro-unsolved-20260429
  • Harness summary: 0/3 solved because all three were agent failures under Harbor accounting.
  • Reward-level result: cobol-modernization reward 1.0, feal-linear-cryptanalysis reward 0.0, kv-store-grpc reward 1.0.
  • cobol-modernization and kv-store-grpc reached passing verifier state but hit AgentTimeoutError at 900s before Burbot declared verified completion.
  • feal-linear-cryptanalysis stalled with open_actions=0; trace showed model response-read failures and one structurally invalid fallback proposal with an extra id field.

Remaining risks

  • Burbot can now use DeepSeek's OpenAI-compatible Chat Completions endpoint, but DeepSeek model latency plus Burbot's verifier/repair sequencing can exceed the Harbor 900s agent timeout even when the task state is already correct.
  • Burbot still needs better completion liveness: when hidden/verifier-equivalent evidence is already sufficient, it should declare completion faster instead of continuing to add verifier/repair candidates until the benchmark times out.
  • The standalone eval harness still has substring-based output expectations; this is not used by runtime planning/repair/failure/equivalence decisions.
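
One possible shape for the completion-liveness improvement named above is a check that runs before each verifier/repair expansion. The sketch below is only an illustration of the idea, not a proposal taken from the PR: `next_step`, `Next`, and the `reserve` budget are hypothetical, and how "sufficient evidence" is computed is left out entirely.

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Next {
    /// Evidence is already sufficient: declare completion now.
    DeclareComplete,
    /// Evidence is missing and there is still time to look for it.
    ExpandCandidates,
    /// Too little time remains to start another candidate.
    StopExpanding,
}

/// Decide what to do before adding another verifier/repair candidate.
/// `reserve` is a hypothetical safety margin kept free before the timeout.
pub fn next_step(
    evidence_sufficient: bool,
    elapsed: Duration,
    timeout: Duration,
    reserve: Duration,
) -> Next {
    if evidence_sufficient {
        return Next::DeclareComplete;
    }
    // saturating_sub avoids a panic if elapsed has already passed the timeout.
    let remaining = timeout.saturating_sub(elapsed);
    if remaining <= reserve {
        Next::StopExpanding
    } else {
        Next::ExpandCandidates
    }
}
```

With the 900s Harbor timeout, a call such as `next_step(true, Duration::from_secs(300), Duration::from_secs(900), Duration::from_secs(60))` returns `Next::DeclareComplete` instead of queuing further candidates.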

shouc and others added 5 commits April 28, 2026 14:35
Captures the working state from the Terminal Bench 2.0 evaluation session:
forced goal-check, yolo mode, frontier-free verifier context, harness
refactor (no python3-in-container), model policy, artifact review,
dependencies/filesystem witness/observation modules, plus a detailed
architecture writeup in crates/puffer-burbot/burbot.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
shouc closed this May 19, 2026