Add Burbot PESA runtime #63
Closed
shouc wants to merge 5 commits into
Captures the working state from the Terminal Bench 2.0 evaluation session: forced goal-check, yolo mode, frontier-free verifier context, harness refactor (no python3-in-container), model policy, artifact review, dependencies/filesystem witness/observation modules, plus a detailed architecture writeup in crates/puffer-burbot/burbot.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
- `no_progress` failures without parsing shell text.
- `OPENAI_BASE_URL` overrides the configured OpenAI base URL for env API-key runs. Verified against DeepSeek `deepseek-v4-pro` via `https://api.deepseek.com`.

Verification
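A minimal sketch of how an `OPENAI_BASE_URL` override might resolve to the chat-completions endpoint reported in this PR. The helper names (`resolve_base_url`, `chat_completions_url`), the fallback default, and the rule of appending `/v1` when the base lacks it are assumptions for illustration, not taken from the puffer-burbot source.

```rust
/// Hypothetical helper: a non-empty `OPENAI_BASE_URL` value wins over
/// the configured base URL; otherwise the configured value is used.
fn resolve_base_url(env_override: Option<&str>, configured: &str) -> String {
    match env_override.map(str::trim) {
        Some(value) if !value.is_empty() => value.to_string(),
        _ => configured.to_string(),
    }
}

/// Hypothetical helper: build the chat-completions endpoint from a base URL.
/// Assumption: `/v1` is appended only when the base does not already end
/// with it, so both `https://host` and `https://host/v1` are accepted.
fn chat_completions_url(base: &str) -> String {
    let base = base.trim_end_matches('/');
    if base.ends_with("/v1") {
        format!("{base}/chat/completions")
    } else {
        format!("{base}/v1/chat/completions")
    }
}

fn main() {
    // Read the override from the environment, if present.
    let env_value = std::env::var("OPENAI_BASE_URL").ok();
    // Assumed configured default for illustration only.
    let base = resolve_base_url(env_value.as_deref(), "https://api.openai.com/v1");
    println!("{}", chat_completions_url(&base));
}
```

Under these assumptions, a base of `https://api.deepseek.com` maps to `https://api.deepseek.com/v1/chat/completions`, which matches the endpoint named in the verification notes.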
- `cargo fmt --package puffer-burbot`
- `cargo test -p puffer-burbot` (123 passed)
- `cargo test -p puffer-burbot llm::tests::` (23 passed)
- `cargo build -p puffer-burbot --bin burbot --target x86_64-unknown-linux-musl`
- `git diff --check`
- `deepseek-v4-pro` used `https://api.deepseek.com/v1/chat/completions` and returned `{"burbot":"ok"}`.
- `cargo test --workspace` was attempted, but puffer-core had pre-existing snapshot/config/workflow failures and one LSP shutdown test hung; the run was terminated after failures were already present.

TB2 status with gpt-5.4-mini/local Codex auth
- `burbot-gpt54mini-structural-20260428`: distribution-search reward 1.0, code-from-image reward 1.0, cobol-modernization reward 0.0, feal-linear-cryptanalysis stalled/reward 0.0, kv-store-grpc stalled/reward 0.0.
- `burbot-gpt54mini-structural2-20260428` after prompt tightening: cobol-modernization, feal-linear-cryptanalysis, and kv-store-grpc all stalled/reward 0.0, dominated by repeated OpenAI 503 proposal failures after the initial cheap observation.
- `burbot-gpt54mini-unsolved-runtimefix-20260429` after provider-retry liveness fixes: 0/3 solved, all three stalled cleanly after repeated OpenAI 503s; no infinite retry loop.

TB2 status with DeepSeek V4 Pro via OpenAI-compatible connector
- `burbot-deepseek-v4-pro-unsolved-20260429`: `AgentTimeoutError` at 900s before Burbot declared verified completion.
- `open_actions=0`; trace showed model response-read failures and one structurally invalid fallback proposal with an extra `id` field.

Remaining risks