
Claude/review model sync g vjuw #290

Open
PetrAnto wants to merge 416 commits into cloudflare:main from
PetrAnto:claude/review-model-sync-gVjuw

Conversation

@PetrAnto

No description provided.

PetrAnto and others added 30 commits February 9, 2026 15:34
…CBrlK

test(openrouter): add comprehensive briefing-aggregator tests for Pha…
- Split /start and /help into separate messages
- /start: friendly welcome explaining 7 capabilities (Chat, Vision,
  Tools, Images, Reasoning, JSON, Briefing) with quick-start tips
- /help: full command reference with all 12 tools listed individually,
  grouped sections (Core, Costs, Briefing, Image Gen, Checkpoints,
  Models, Tools, Prefixes, Vision)
- Add TEST_PROTOCOL.md: 39-step manual test checklist covering basics,
  model switching, all tool types, vision, JSON mode, reasoning,
  image gen, briefing, bug regressions, and session management
- Update briefing-aggregator tests for new help message format

https://claude.ai/code/session_01NbL359VJGJE4Xsg5tTVR8u

feat(telegram): rewrite /help and /start, add manual test protocol

Model catalog cleanup:
- Remove mimo (xiaomi/mimo-v2-flash:free) — free period ended Jan 2026
- Remove llama405free — deprecated, not in OpenRouter free collection
- Remove nemofree (mistral-nemo:free) — no longer in free collection
- Fix opus cost: $15/$75 → $5/$25 (actual OpenRouter price)
- Fix qwenthink maxContext: 131072 → 262144

Checkpoint preview feature:
- Add getCheckpointConversation() to storage — reads messages from R2
- /save <name> now generates an AI summary of the conversation content
  using /auto model, showing what was discussed and accomplished
- Falls back gracefully to metadata-only if summary fails

Update TEST_PROTOCOL.md with checkpoint summary test (#35)

https://claude.ai/code/session_01NbL359VJGJE4Xsg5tTVR8u

fix(models): remove dead models, fix prices; feat(telegram): checkpoint summary
…uter

- Add xiaomi/mimo-v2-flash as paid model ($0.10/$0.30)
- Add /syncmodels command to fetch free models from OpenRouter API at runtime
- Dynamic models system: DYNAMIC_MODELS map with registerDynamicModels(),
  getAllModels(), getModel() that checks dynamic before static
- R2 persistence for synced models (survives redeploys)
- Auto-load dynamic models from R2 on handler init
- Update /help with /syncmodels documentation
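The dynamic-before-static lookup order can be sketched as below. The names (DYNAMIC_MODELS, registerDynamicModels, getModel, getAllModels) come from the commit message; the ModelDef field shapes and the static catalog entry are illustrative assumptions.

```typescript
// Sketch of the dynamic-models registry: runtime-synced models are
// checked before the static catalog. Field shapes are assumptions.
interface ModelDef {
  id: string;          // OpenRouter model ID
  maxContext: number;
}

const STATIC_MODELS: Record<string, ModelDef> = {
  mimo: { id: "xiaomi/mimo-v2-flash", maxContext: 131072 },
};

const DYNAMIC_MODELS = new Map<string, ModelDef>();

function registerDynamicModels(models: Record<string, ModelDef>): void {
  for (const [alias, def] of Object.entries(models)) {
    DYNAMIC_MODELS.set(alias, def);
  }
}

// Dynamic entries shadow static ones with the same alias.
function getModel(alias: string): ModelDef | undefined {
  return DYNAMIC_MODELS.get(alias) ?? STATIC_MODELS[alias];
}

function getAllModels(): Record<string, ModelDef> {
  return { ...STATIC_MODELS, ...Object.fromEntries(DYNAMIC_MODELS) };
}
```

Because the dynamic map is consulted first, a re-synced alias can override a stale static entry without a redeploy.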

https://claude.ai/code/session_01NbL359VJGJE4Xsg5tTVR8u
Rewrite /syncmodels from auto-add-all to an interactive Telegram
inline keyboard picker:

- Fetches free models from OpenRouter API
- Shows new models (not in catalog) and stale models (no longer free)
  with context size, vision support, and model IDs
- Toggle buttons (☐/☑) to select which models to add/remove
- Validate button applies all selections at once
- Cancel button discards without changes

Supporting changes:
- Add blocked models mechanism (BLOCKED_ALIASES set in models.ts)
  so stale models can be hidden at runtime via getModel()/getAllModels()
- Add editMessageWithButtons to TelegramBot for updating message
  text + inline keyboard in a single API call
- Update storage.ts to persist blocked list alongside dynamic models
- Fix /pick button: mimo is now paid, not free

https://claude.ai/code/session_01NbL359VJGJE4Xsg5tTVR8u

Claude/test briefing aggregator c brl k
When a free model hits 429/503 rate limits during a DO task, the
processor now automatically rotates to the next free tool-supporting
model and continues from the same iteration. Cycles through all
free models (qwencoderfree, pony, trinitymini, devstral, gptoss,
phi4reason) before giving up.

Also fixes "No response generated" — when a model returns empty
content after tool calls, the processor now nudges it up to 2 times
with a follow-up message before accepting the empty result.

Changes:
- task-processor.ts: free model rotation on 429/503 errors, empty
  content retry with MAX_EMPTY_RETRIES=2, use task.modelAlias
  instead of request.modelAlias for rotation support
- models.ts: add getFreeToolModels() helper
- handler.ts: add /syncreset command to clean up stale auto-synced
  dynamic models from R2
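The rotation described above can be sketched as follows. The model list and the 429/503 trigger come from the commit message; the selection logic (skip the current model and any already-tried model) is an assumption about how the real task-processor picks the next candidate.

```typescript
// Sketch of rotating to the next free tool-capable model on a
// rate-limit error during a DO task.
const FREE_TOOL_MODELS = [
  "qwencoderfree", "pony", "trinitymini", "devstral", "gptoss", "phi4reason",
];

const RETRYABLE_STATUSES = new Set([429, 503]);

function shouldRotate(status: number): boolean {
  return RETRYABLE_STATUSES.has(status);
}

// Returns the next untried free model, or undefined when all are exhausted.
function nextFreeModel(current: string, tried: Set<string>): string | undefined {
  return FREE_TOOL_MODELS.find((m) => m !== current && !tried.has(m));
}
```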

https://claude.ai/code/session_01NbL359VJGJE4Xsg5tTVR8u

feat(task-processor): free model rotation + empty response retry
Cloudflare Workers are stateless — the in-memory syncSessions Map was
lost between requests, making all toggle buttons non-functional.

Now sync sessions are stored in R2 (saveSyncSession/loadSyncSession/
deleteSyncSession) so button callbacks work across Worker invocations.
Also changed selectedAdd/selectedRemove from Set to string[] for JSON
serialization compatibility.
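The Set-to-array change matters because of how JSON.stringify treats a Set. A minimal demonstration (the SyncSession shape is an assumption):

```typescript
// JSON.stringify serializes a Set as "{}", so a Set-based selection was
// silently lost when the session was written to R2. Plain string arrays
// round-trip correctly.
interface SyncSession {
  selectedAdd: string[];
  selectedRemove: string[];
}

const session: SyncSession = { selectedAdd: ["sonnet46"], selectedRemove: [] };
const roundTripped: SyncSession = JSON.parse(JSON.stringify(session));

// A Set, by contrast, loses its members on serialization:
const lossy = JSON.stringify(new Set(["sonnet46"])); // "{}"
```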

https://claude.ai/code/session_01NbL359VJGJE4Xsg5tTVR8u

fix(telegram): persist sync sessions in R2 instead of in-memory Map
Non-tool models (like auto-synced free models) were routed through
the Worker's direct path, which has a 10s timeout. Slow models like
DeepSeek R1 would silently time out with no response.

Changes:
- handler.ts: Always route through Durable Object when available,
  regardless of tool support. Worker fallback only when DO is not
  configured.
- task-processor.ts: Conditionally inject tools based on model's
  supportsTools flag. Non-tool models go through DO but without
  tool definitions — they get unlimited time, checkpointing, and
  auto-resume for free.

https://claude.ai/code/session_01NbL359VJGJE4Xsg5tTVR8u
The original deepseek/deepseek-r1:free endpoint was removed from
OpenRouter ("No endpoints found" error). Update to the newer
deepseek/deepseek-r1-0528:free which is still available.

https://claude.ai/code/session_01NbL359VJGJE4Xsg5tTVR8u

fix(models): update deepfree to deepseek-r1-0528 (old endpoint dead)

Free models cost nothing, so they get 50 auto-resume attempts instead of
10, letting complex tasks grind through rate limits and timeouts. Paid
models keep the 10-attempt limit to avoid burning credits on stuck tasks.
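The limit selection reduces to a small helper. The constants come from the commit message; detecting a free model by its ":free" OpenRouter ID suffix is an assumption.

```typescript
// Sketch of the per-cost auto-resume limit. The ":free" suffix check is
// an assumption (OpenRouter free variants carry that suffix).
const MAX_RESUMES_FREE = 50;
const MAX_RESUMES_PAID = 10;

function maxAutoResumes(modelId: string): number {
  return modelId.endsWith(":free") ? MAX_RESUMES_FREE : MAX_RESUMES_PAID;
}
```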

https://claude.ai/code/session_01NbL359VJGJE4Xsg5tTVR8u

feat(task-processor): dynamic auto-resume limits (50x free, 10x paid)
Add two new tools for code modification capabilities:

1. github_create_pr: Creates a branch, commits file changes (create/update/delete),
   and opens a PR using the GitHub Git Data API. Supports up to 20 files, 1MB total.
   Auto-prefixes branches with bot/ to avoid conflicts. Full input validation
   (owner/repo format, path traversal, branch names, content size).

2. sandbox_exec: Executes shell commands in a Cloudflare Sandbox container for
   complex refactors needing build/test. Runs commands sequentially with fail-fast
   behavior, configurable timeout (5-300s), and dangerous command blocking.
   Injects GitHub token as env vars for git/gh CLI auth.

Also extends ToolContext with SandboxLike interface, wires sandbox through
TelegramHandler, and updates /help and /status commands. Adds 30 new tests
covering validation, API mocking, error handling, and edge cases.

https://claude.ai/code/session_01E4joY3pFyYfTxVZegqe52P
feat(tools): add github_create_pr and sandbox_exec tools
Extract structured metadata (tools used, model, iterations, success/failure,
category, duration) after each completed DO task and store in R2. Before new
tasks, inject relevant past patterns into the system prompt to improve future
tool selection and execution strategy.

New: src/openrouter/learnings.ts — extraction, storage, retrieval
New: src/openrouter/learnings.test.ts — 36 tests
Modified: task-processor.ts — learning extraction on completion/failure
Modified: handler.ts — learning injection into system prompt

AI: Claude Opus 4.6 (Session: 018gmCDcuBJqs9ffrrDHHBBd)

https://claude.ai/code/session_018gmCDcuBJqs9ffrrDHHBBd
Add gap tests identified in test protocol:
- categorizeTask: tie-breaking, duplicates, all-github-tools
- extractLearning: empty message, zero duration/iterations, auto-timestamp
- storeLearning: write error propagation, updatedAt, key format per user
- loadLearnings: R2 get() throw, key verification
- getRelevantLearnings: null history, category mismatch, no-bonus-without-base,
  short word filtering, case insensitivity, combined scoring, partial vs exact
- formatLearningsForPrompt: multi-tool display, leading newlines, duration
  boundaries (0s, 59999ms, 60000ms)

AI: Claude Opus 4.6 (Session: 018gmCDcuBJqs9ffrrDHHBBd)

https://claude.ai/code/session_018gmCDcuBJqs9ffrrDHHBBd
1. GLM supportsTools: add missing flag so glmfree uses tools
   instead of hallucinating (models.ts)

2. 402 error handling: fail fast on quota exceeded, rotate to
   free model if possible, show helpful message (task-processor.ts)

3. Cross-task context: store last task summary in R2, inject into
   next task's system prompt (expires after 1h) to prevent
   "I haven't seen your website" amnesia (learnings.ts, handler.ts)

4. Elapsed time cap: 15min for free models, 30min for paid,
   prevents runaway auto-resume loops (task-processor.ts)

5. Tool-intent detection: warn users when message needs tools
   but model doesn't support them, suggest alternatives
   (models.ts, handler.ts)

6. Parallel tool-call prompt: stronger instruction for models
   with parallelCalls flag to batch tool calls (handler.ts)

Tests: 447 total (33 new — 22 models, 11 learnings)

https://claude.ai/code/session_018gmCDcuBJqs9ffrrDHHBBd
Auto-resume counter was persisting across different tasks because
processTask() inherited autoResumeCount from any previous task in DO
storage. Now only inherits when resuming the SAME task (matching taskId).

Reverted supportsTools on glmfree — live testing confirmed GLM 4.5 Air
free tier doesn't generate tool_calls (answers from training data with
0 unique tools). Paid GLM 4.7 still has tools enabled.

https://claude.ai/code/session_018gmCDcuBJqs9ffrrDHHBBd
Includes the complete system prompt reflecting all 14 tools,
tool usage guidelines, and response style for Telegram.
README explains R2 bucket structure and upload instructions.

https://claude.ai/code/session_018gmCDcuBJqs9ffrrDHHBBd
docs(r2): add storia-orchestrator skill prompt for R2 bucket
…mmands

- /start now shows inline keyboard with 8 feature categories (Coding,
  Research, Images, Tools, Vision, Reasoning, Pick Model, All Commands)
- Each button sends a detailed guide for that feature with actionable
  examples and model recommendations
- Back to Menu and Pick Model buttons for navigation
- Added setMyCommands to TelegramBot class, registered 12 commands
  during /setup so Telegram shows the correct command menu
- Enhanced R2 skill prompt with Storia identity, model recommendations,
  stronger tool-first behavior, and better response style guidelines

https://claude.ai/code/session_018gmCDcuBJqs9ffrrDHHBBd
claude and others added 30 commits February 23, 2026 08:43
- GLOBAL_ROADMAP: add Model Sync section (MS.1-6), 2 changelog entries, update project overview
- WORK_STATUS: add MS.1-6 tasks, update test count (1227), sprint velocity (57 tasks), strikethrough completed priorities
- next_prompt: add MS.1-6 to Recently Completed, update timestamp

https://claude.ai/code/session_01V82ZPEL4WPcLtvGC6szgt5
…queries

Route simple queries (weather, greetings, crypto) to GPT-4o Mini for
lower latency when user is on default 'auto' model. Explicit model
choices via /use are never overridden.

- routeByComplexity() in src/openrouter/model-router.ts
- FAST_MODEL_CANDIDATES: mini > flash > haiku (ordered by cost)
- autoRoute user preference (default: true, toggle via /autoroute)
- Logging: [ModelRouter] on every routing decision
- /status shows auto-route state
- 15 new tests (1242 total)
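The routing rule above can be sketched as a single function. The candidate order and the "explicit /use choices are never overridden" rule come from the commit message; the keyword-and-length heuristic for "simple" is an assumption, not the repo's actual classifier.

```typescript
// Hedged sketch of routeByComplexity(): simple one-shot queries go to a
// cheap fast model only when the user is on the default 'auto' model.
const FAST_MODEL_CANDIDATES = ["mini", "flash", "haiku"]; // ordered by cost

// Assumed heuristic: short messages matching simple-query patterns.
const SIMPLE_PATTERNS = [/\bweather\b/i, /^(hi|hello|hey)\b/i, /\b(btc|crypto|price)\b/i];

function routeByComplexity(message: string, userModel: string, autoRoute: boolean): string {
  if (userModel !== "auto" || !autoRoute) return userModel; // explicit choice wins
  const simple = message.length < 120 && SIMPLE_PATTERNS.some((p) => p.test(message));
  return simple ? FAST_MODEL_CANDIDATES[0] : userModel;
}
```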

https://claude.ai/code/session_01V82ZPEL4WPcLtvGC6szgt5
- GLOBAL_ROADMAP: mark 7B.2 complete, add changelog entry, update dependency graph
- WORK_STATUS: add 7B.2 task, update test count (1242), sprint velocity (58)
- next_prompt: advance to 7B.3 Pre-fetching Context as next task

https://claude.ai/code/session_01V82ZPEL4WPcLtvGC6szgt5
Extract file paths from user messages and pre-fetch them from GitHub
in parallel with the first LLM call. When the model calls
github_read_file, the content is already in the prefetch cache.

- extractFilePaths() regex extraction with false-positive filtering
- extractGitHubContext() finds owner/repo from system prompt or message
- startFilePrefetch() in task-processor fires GitHub reads in parallel
- Prefetch cache checked in executeToolWithCache() for github_read_file
- Export githubReadFile from tools.ts for direct pre-fetch use
- 31 new tests (1273 total)
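A minimal sketch of the extraction step: regex-match path-like tokens and filter obvious false positives. The exact pattern and filters in the repo's extractFilePaths() may differ; this only illustrates the shape of the approach.

```typescript
// Sketch of extractFilePaths(): pull path-like tokens from a user
// message, dropping URLs and bare domains. Filters are assumptions.
function extractFilePaths(message: string): string[] {
  const candidates = message.match(/[\w./-]+\.[a-z]{1,4}\b/gi) ?? [];
  return candidates.filter(
    (p) =>
      p.includes("/") &&                 // require a directory component
      !p.startsWith("http") &&           // not a URL
      !/\.(com|org|net|io)$/i.test(p),   // not a bare domain
  );
}
```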

https://claude.ai/code/session_01V82ZPEL4WPcLtvGC6szgt5
- GLOBAL_ROADMAP: mark 7B.3 complete, add changelog entry, update dependency graph
- WORK_STATUS: add 7B.3 task, update test count (1273), sprint velocity (59)
- next_prompt: advance to 7A.4 Structured Step Decomposition as next task

https://claude.ai/code/session_01V82ZPEL4WPcLtvGC6szgt5

Replace free-form plan phase prompt with STRUCTURED_PLAN_PROMPT that requests
JSON {steps: [{action, files, description}]}. parseStructuredPlan() uses 3-tier
parsing: code block → raw JSON → free-form file extraction fallback.
prefetchPlanFiles() pre-loads all referenced files at plan→work transition,
merging into existing prefetch cache. 26 new tests (1299 total).
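The tiered parsing can be sketched as below. The step shape and the code-block-then-raw-JSON order come from the commit message; the third tier (free-form file extraction) is omitted here, and the error handling is an assumption.

```typescript
// Sketch of parseStructuredPlan(): fenced code block first, then raw
// JSON; the real free-form fallback tier is omitted from this sketch.
interface PlanStep { action: string; files: string[]; description: string; }

function parseStructuredPlan(text: string): PlanStep[] | null {
  // Tier 1 candidate: JSON inside a fenced code block; tier 2: the raw text.
  const fenced = text.match(/```(?:json)?\s*([\s\S]*?)```/);
  for (const candidate of [fenced?.[1], text]) {
    if (!candidate) continue;
    try {
      const parsed = JSON.parse(candidate.trim());
      if (Array.isArray(parsed?.steps)) return parsed.steps as PlanStep[];
    } catch { /* fall through to the next tier */ }
  }
  return null;
}
```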

https://claude.ai/code/session_01V82ZPEL4WPcLtvGC6szgt5
feat(quality): 7A.4 Structured Step Decomposition — JSON plan steps with file pre-loading

After plan→work transition, awaits all prefetch promises and injects
[FILE: path] blocks directly into conversation context. Model sees
files already loaded and skips github_read_file calls, reducing
typical multi-file tasks from ~8 iterations to 3-4.

- awaitAndFormatPrefetchedFiles() in step-decomposition.ts
- Binary detection, 8KB/file truncation, 50KB total cap
- Also injects user-message prefetch files (7B.3 fallback path)
- 13 new tests (1312 total), typecheck clean

https://claude.ai/code/session_01V82ZPEL4WPcLtvGC6szgt5
feat(perf): 7B.4 Reduce Iteration Count — inject pre-loaded files into context

At work→review transition, scans tool results for unacknowledged
mutation errors, test failures, missing PRs, and unverified claims.
If failures found, injects details and gives model one retry iteration
before proceeding to review phase.

- shouldVerify() + verifyWorkPhase() in cove-verification.ts
- Smart "0 failed" exclusion to avoid false positives
- coveRetried flag limits to single retry
- 24 new tests (1336 total), typecheck clean

https://claude.ai/code/session_01V82ZPEL4WPcLtvGC6szgt5
feat(quality): 7A.1 CoVe Verification Loop — post-work verification with retry

- github_create_pr description now explains read-modify-write update workflow
  (read with github_read_file → modify → pass COMPLETE content with action "update")
- github_read_file description mentions 50KB limit
- LARGE_FILE_THRESHOLD raised: 300→500 lines, 15→30KB (tools support 50KB,
  previous thresholds were overly conservative for modern models)
- Orchestra run prompt gets "How to Update Existing Files" section
- Orchestra run prompt gets "Step 4.5: HANDLE PARTIAL FAILURES" section
  for logging blocked/partial tasks in WORK_LOG.md and ROADMAP.md
- Orchestra redo prompt gets matching update workflow + failure handling
- 12 new tests (1348 total), typecheck clean

Fixes issues observed in real bot conversations where models incorrectly
claimed they couldn't edit existing files or silently gave up on large files.

https://claude.ai/code/session_01V82ZPEL4WPcLtvGC6szgt5
fix(orchestra+tools): improve tool descriptions + add partial failure handling
…messages

Replace generic "Thinking..." with rich real-time progress updates in Telegram:

- formatProgressMessage() builds phase-aware strings with emoji labels:
  📋 Planning, 🔨 Working, 🔍 Reviewing, 🔄 Verifying
- humanizeToolName() maps 16 tool names to readable labels
  ("github_read_file" → "Reading", "sandbox_exec" → "Running commands")
- extractToolContext() extracts display info from tool args
  (file paths, URLs, commands, PR titles, search queries)
- estimateCurrentStep() shows plan step progress (step 2/5: Add JWT)
- shouldSendUpdate() throttle gate (15s interval)
- sendProgressUpdate() helper wired into task-processor iteration loop
- Both parallel and sequential tool execution paths update progress
- 44 new tests (1392 total), typecheck clean

Example progress messages:
  ⏳ 🔨 Reading: src/App.tsx (12s)
  ⏳ 🔨 Working (step 2/5: Add JWT validation) (iter 4, 6 tools, 35s)
  ⏳ 🔨 Running commands: npm test (48s)
  ⏳ 🔄 Verifying results… (1m30s)
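The throttle gate above is the simplest piece to sketch: progress edits only go out when at least 15 seconds have passed since the last one, so Telegram isn't spammed with message edits on every iteration. The timestamp-passing signature is an assumption.

```typescript
// Sketch of shouldSendUpdate(): a 15-second throttle gate between
// progress-message edits.
const UPDATE_INTERVAL_MS = 15_000;

function shouldSendUpdate(lastSentAt: number, now: number): boolean {
  return now - lastSentAt >= UPDATE_INTERVAL_MS;
}
```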

https://claude.ai/code/session_01V82ZPEL4WPcLtvGC6szgt5
…streaming

Add onToolCallReady callback to parseSSEStream that fires when a
tool_call is complete during SSE streaming. createSpeculativeExecutor()
starts PARALLEL_SAFE tools immediately while the model continues
generating. Task-processor checks speculative cache before executing,
reusing pre-computed results and saving 2-10s per multi-tool iteration.

Detection: fires on new tool_call index (previous done) and on
finish_reason='tool_calls' (all done). Safety: only PARALLEL_SAFE_TOOLS,
max 5 speculative, 30s timeout. 19 new tests (1411 total).

All Phase 7 (Performance & Quality Engine) now complete — 10/10 tasks.

https://claude.ai/code/session_01V82ZPEL4WPcLtvGC6szgt5
Routes the review phase to a different model than the worker for
independent verification. A "fresh pair of eyes" catches hallucinated
claims, incomplete answers, and unacknowledged tool errors that
self-review misses.

- New reviewer.ts: model selection (cross-family), context building,
  response parsing (approve/revise)
- Reviewer candidates: Sonnet > Grok > Gemini Pro > Mini > Flash
- Eligibility: mutation tools, 3+ tool calls, or 3+ iterations
- Falls back to same-model review when no reviewer available or call fails
- Progress shows reviewer model: "⏳ 🔍 Reviewing (sonnet)…"
- Attribution footer: "🔍 Reviewed by Claude Sonnet 4.5"
- 47 new tests (1458 total), typecheck clean

https://claude.ai/code/session_01V82ZPEL4WPcLtvGC6szgt5
Compares curated models against live OpenRouter catalog to detect:
- Models removed from OpenRouter (deprecated upstream)
- Pricing changes for existing curated models
- New models from tracked families (anthropic, google, openai, etc.)

Available via Telegram /synccheck and GET /api/admin/models/check.

https://claude.ai/code/session_01K2mQTABDGY7DnnposPdDjw
After /syncall, auto-synced models get hyphenated aliases like
"claude-sonnet-46" but users try "sonnet46" or "claudesonnet".
getModel() only did exact key lookups, so these all failed.

Added fuzzy fallback with 4 passes:
1. Normalized exact (strip hyphens/dots)
2. Suffix match ("sonnet46" → "claude-sonnet-46")
3. Prefix match ("claudesonnet" → "claude-sonnet-46")
4. Model ID match ("gpt4o" → openai/gpt-4o)

Also stores canonical alias in /use handler so subsequent lookups
are always exact matches.
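The 4-pass fallback can be sketched as a single chain. The pass order comes from the commit message; the normalization (lowercase, strip hyphens and dots) is an assumption consistent with the examples above, and the ALIASES table is illustrative.

```typescript
// Sketch of the fuzzy alias fallback: normalized exact, then suffix,
// then prefix, then model-ID substring match.
const ALIASES: Record<string, string> = {
  "claude-sonnet-46": "anthropic/claude-sonnet-4.6",
  gpt4o: "openai/gpt-4o",
};

const norm = (s: string) => s.toLowerCase().replace(/[-.]/g, "");

function resolveAlias(query: string): string | undefined {
  const q = norm(query);
  const keys = Object.keys(ALIASES);
  return (
    keys.find((k) => norm(k) === q) ??               // 1. normalized exact
    keys.find((k) => norm(k).endsWith(q)) ??         // 2. suffix match
    keys.find((k) => norm(k).startsWith(q)) ??       // 3. prefix match
    keys.find((k) => norm(ALIASES[k]).includes(q))   // 4. model ID match
  );
}
```

Storing the resolved canonical alias back on /use, as the commit does, means the fuzzy path only pays its cost once per user.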

https://claude.ai/code/session_01K2mQTABDGY7DnnposPdDjw
…heck

/models changes:
- Add "AUTO-SYNCED HIGHLIGHTS" section showing top 2 flagship models
  per major provider (Anthropic, Google, OpenAI, etc.)
- Filter value tier sections to curated-only (prevents 300+ models
  flooding the listing)
- Sonnet 4.6, Opus 4.1, etc. now visible in /models

/synccheck changes:
- Group models by family with count, show top 4 per family (flagship first)
- Collapse older/variant models into "+N older/variant" summary
- Show auto-sync alias (→ /claude-sonnet-46) for each model
- Add note that models are usable via /use after /syncall

https://claude.ai/code/session_01K2mQTABDGY7DnnposPdDjw
…lock

- Bump openclaw 2026.2.3 → 2026.2.6-3 in Dockerfile (upstream PR cloudflare#204)
- Add redactWsPayload() to sanitize sensitive fields (api_key, token,
  auth, etc.) from WebSocket debug logs (upstream PR cloudflare#206)
- Add container-level lock file to prevent concurrent R2 sync operations,
  with 5-min stale lock cleanup (upstream PRs cloudflare#199, cloudflare#202)
- Add logging.test.ts for redaction utilities

https://claude.ai/code/session_01K2mQTABDGY7DnnposPdDjw
MiniMax API rejects requests with reasoning disabled — error:
"Reasoning is mandatory for this endpoint and cannot be disabled."

Change from 'configurable' to 'fixed' so getReasoningParam() returns
undefined (no reasoning param sent), letting MiniMax handle it natively.

https://claude.ai/code/session_01K2mQTABDGY7DnnposPdDjw
…ated PR claims

Models (especially Grok) can claim "PR #3 created successfully" when
github_create_pr actually failed with guardrail violations. This adds
three layers of protection:

Fix 2: Tag github_create_pr errors with unmistakable ❌ PR NOT CREATED
banner + "Do NOT claim a PR was created" instruction in tool result.

Fix 3: validateOrchestraResult() cross-references parsed ORCHESTRA_RESULT
against all tool outputs — if failure patterns found (Destructive update
blocked, INCOMPLETE REFACTOR, DATA FABRICATION, etc.) with no matching
success evidence, flags as phantom PR and clears the URL.

Fix 1: Post-execution PR verification via GitHub API — after all parsing,
if a PR URL survives, verify it actually exists (GET /repos/.../pulls/N).
Non-fatal on network errors, but catches any edge case the other layers miss.

https://claude.ai/code/session_01K2mQTABDGY7DnnposPdDjw
5 representative tasks testing each 7B optimization:
- Task A: Simple chat → 7B.2 model routing (< 5s, fast model)
- Task B: Multi-tool → 7B.1 speculative execution (< 20s, 2 tools/1 iter)
- Task C: GitHub read → 7B.3+7B.4 prefetch+injection (< 30s, ≤ 3 iter)
- Task D: Orchestra → all optimizations end-to-end (< 3min, ≤ 15 iter)
- Task E: Reasoning → 7B.5 streaming feedback (first update < 3s)

Includes pass/conditional/fail criteria and comparison notes.

https://claude.ai/code/session_01K2mQTABDGY7DnnposPdDjw
…res conversation length

Two bugs found during Phase 7B.6 benchmark:

1. extractUserQuestion() iterated forward and returned the FIRST user
   message. In multi-turn conversations the reviewer evaluated the
   assistant's answer against the wrong question (e.g. "capital of
   France" instead of "read README.md and summarize"). Fixed by
   iterating backwards. Also skips 7B.4 file-injection blocks.

2. Model routing used classifyTaskComplexity(msg, conversationLength)
   which gates on conversationLength >= 3 → 'complex', preventing
   simple messages from routing to fast models in longer conversations.
   Fixed by passing conversationLength=0 for routing decisions so only
   message content determines complexity.
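The backward-iteration fix from bug 1 can be sketched as below. The "[FILE:" prefix check for skipping injected file blocks is an assumption about how the 7B.4 injection marks its messages.

```typescript
// Sketch of the fixed extractUserQuestion(): walk backwards so the
// reviewer evaluates the LAST user message, skipping injected file blocks.
interface ChatMessage { role: "user" | "assistant" | "system"; content: string; }

function extractUserQuestion(messages: ChatMessage[]): string | undefined {
  for (let i = messages.length - 1; i >= 0; i--) {
    const m = messages[i];
    if (m.role === "user" && !m.content.startsWith("[FILE:")) return m.content;
  }
  return undefined;
}
```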

https://claude.ai/code/session_01K2mQTABDGY7DnnposPdDjw
…ol-limit fixes

Bug 1: startTime reset on every auto-resume — each processTask() call
created a new TaskState with startTime=Date.now(), so the elapsed time
cap (15min free / 30min paid) never triggered across resumes. Fix:
preserve startTime from the original task when resuming.

Bug 2: elapsed time cap only checked when task appears stuck — the
alarm handler returned early ("still active") before reaching the
elapsed check. Fix: move elapsed check before the "still active"
early return so it fires regardless of task activity.

Bug 3: no total tool call limit — a model could make unlimited tool
calls across its lifetime. Fix: add MAX_TOTAL_TOOLS_FREE=50 and
MAX_TOTAL_TOOLS_PAID=100 with a nudge message when exceeded.

Also adds defense-in-depth elapsed check in the main processTask loop.

These bugs caused a 2-file GitHub read to take 46 minutes with 8
auto-resumes and 29 tool calls instead of stopping at the time cap.
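The two caps reduce to a pair of checks. The constants come from the commit message; the TaskState fields and the isFree flag are assumptions about the DO's stored shape.

```typescript
// Sketch of the elapsed-time cap and total tool-call limit.
const ELAPSED_CAP_FREE_MS = 15 * 60 * 1000;
const ELAPSED_CAP_PAID_MS = 30 * 60 * 1000;
const MAX_TOTAL_TOOLS_FREE = 50;
const MAX_TOTAL_TOOLS_PAID = 100;

interface TaskState { startTime: number; totalToolCalls: number; isFree: boolean; }

// startTime must be preserved across auto-resumes (bug 1) for this to fire.
function exceededElapsedCap(task: TaskState, now: number): boolean {
  const cap = task.isFree ? ELAPSED_CAP_FREE_MS : ELAPSED_CAP_PAID_MS;
  return now - task.startTime >= cap;
}

function exceededToolLimit(task: TaskState): boolean {
  const limit = task.isFree ? MAX_TOTAL_TOOLS_FREE : MAX_TOTAL_TOOLS_PAID;
  return task.totalToolCalls >= limit;
}
```

Running both checks in the alarm handler before the "still active" early return (bug 2) is what makes the cap fire regardless of task activity.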

https://claude.ai/code/session_01K2mQTABDGY7DnnposPdDjw