diff --git a/README.md b/README.md
index 83684e3c..d6b0d0a1 100644
--- a/README.md
+++ b/README.md
@@ -4,7 +4,7 @@ English | [中文](docs/i18n/README.CN.MD) | [한국어](docs/i18n/README.KO.MD)
-
+
@@ -39,7 +39,8 @@ Other install methods: [pip install](#alternative-install-with-pip) | [uv instal
## 🔥🔥🔥 News (Pacific Time)
-- June 5, 2026 (latest, **v3.05.82**): **User-controllable token/cost budgets** — `/budget $5` / `/budget 200k` / `/budget daily $20` cap spend per session or per day, enforced before each model call; on hit the session auto-saves and you're shown how to `/resume` or raise the cap and continue (warns at ≥80%/95%; `--budget` sets it at startup). Details: [docs/guides/features.md](docs/guides/features.md) · [docs/news.md](docs/news.md).
+- June 6, 2026 (latest, **v3.5.82**): **macOS install reliably puts `cheetahclaws` on PATH, and local Ollama models that emit tool calls as text now actually execute them** (two fixes from issue #131). **(1) Install/PATH on macOS:** the installer `source`s the dedicated venv it creates, which made the post-install `command -v cheetahclaws` check succeed *inside the script's own shell* — so it reported "on PATH" and **skipped the entire rc-file step**, leaving `~/.zshrc` untouched and the binary unreachable in new terminals. It now symlinks only the `cheetahclaws` entry point into `~/.local/bin` (pipx-style, so the venv's `python`/`pip` don't shadow yours), creates `~/.zshrc` / `.bash_profile` if missing, and appends `~/.local/bin` to PATH there — without trusting the venv-polluted `command -v` (`scripts/install.sh`). **(2) Ollama tool calls:** `stream_ollama` only read Ollama's structured `message.tool_calls` field, while the cloud path already recovers calls a model emits as **text**, so Qwen-coder / Gemma / Mistral over Ollama produced "tool-calling-style chat" that streamed as plain text and never ran — the model seemed to "just keep talking." `stream_ollama` now mirrors the cloud path's interceptor: it buffers from the first `<tool_call>` / `<|tool_call|>` / `[TOOL_CALLS]` marker (so raw markup never reaches the user) and parses it into real tool calls at end-of-stream (`providers.py`). Details: [docs/guides/usage.md](docs/guides/usage.md#usage-open-source-models-local) · [docs/guides/faq.md](docs/guides/faq.md) · [docs/news.md](docs/news.md).
+- June 5, 2026 (**v3.5.82**): **User-controllable token/cost budgets** — `/budget $5` / `/budget 200k` / `/budget daily $20` cap spend per session or per day, enforced before each model call; on hit the session auto-saves and you're shown how to `/resume` or raise the cap and continue (warns at ≥80%/95%; `--budget` sets it at startup). Details: [docs/guides/features.md](docs/guides/features.md) · [docs/news.md](docs/news.md).
- June 5, 2026: **Adaptive Markdown streaming — live output stays correct on every device** by auto-selecting a per-device tier (`live` in-place redraw on capable terminals incl. modern SSH emulators, append-only `commit` for SSH/Apple Terminal/pipes/CJK text so frames never duplicate, `plain` fallback); also ships a visual `/context` usage grid and a 1M context window for `deepseek-v4-flash`. Details: [docs/guides/features.md](docs/guides/features.md) · [docs/news.md](docs/news.md).
- June 4, 2026 (**v3.05.81**): **Claude-Code-style quiet output** hides per-tool execution and shows one summary line per turn (on by default), with a live spinner timer + token estimate and a `✻ Worked for…` footer; `/verbose` overrides, toggle with `/quiet`. Details: [docs/guides/features.md](docs/guides/features.md) · [docs/news.md](docs/news.md).
- June 4, 2026: **Context-window override** — `/config context_window=` sets the context length that drives the prompt `%`, `/context`, the compaction trigger, and the output cap consistently (distinct from `max_tokens`; read live, no restart). Details: [docs/guides/reference.md](docs/guides/reference.md) · [docs/news.md](docs/news.md).
@@ -76,25 +77,25 @@ CheetahClaws: **A Fast** and **Easy-to-Use** Python native Agent Harness Infrast
### Demos
-

+
Task execution in the terminal
-

+
Web UI: browser chat — sidebar, tool cards, approval prompts, Markdown streaming
-

+
Autonomous trading agent
-> More animated demos (code review, `/research`, `/brainstorm`, `/lab`, Telegram/WeChat/Slack bridges) live in [`docs/media/`](docs/media/).
+> More animated demos (code review, `/research`, `/brainstorm`, `/lab`, Telegram/WeChat/Slack bridges) live in [`docs/media/`](https://github.com/SafeRL-Lab/cheetahclaws/tree/main/docs/media/).
---
@@ -208,7 +209,7 @@ Claude Code is a powerful, production-grade AI coding assistant — but its sour
| `phi4` · `gemma3` · `codellama` | 14B · 4–27B · 7–34B | Reasoning / open / code | `ollama pull phi4` |
| `llava` · `llama3.2-vision` | 7–13B · 11B | **Vision** | `ollama pull llava` |
-> **Tool calling** needs a function-calling model — recommended: `qwen2.5-coder`, `llama3.3`, `mistral`, `phi4`. Reasoning models (`deepseek-r1`, `qwen3`, `gemma4`) stream native `<think>` blocks; enable with `/verbose` + `/thinking`.
+> **Tool calling** needs a function-calling model — recommended: `qwen2.5-coder`, `llama3.3`, `mistral`, `phi4`. Models that emit tool calls as **text** (`<tool_call>…</tool_call>`, `[TOOL_CALLS]…`) instead of Ollama's structured field are auto-recovered, so they execute tools out of the box rather than just chatting about it. Reasoning models (`deepseek-r1`, `qwen3`, `gemma4`) stream native `<think>` blocks; enable with `/verbose` + `/thinking`.
---
@@ -478,8 +479,11 @@ A few common questions — the **full FAQ** is in [docs/guides/faq.md](docs/guid
/mcp add git uvx mcp-server-git # or create .mcp.json in your project, then /mcp reload
```
-**Q: Tool calls don't work with my local Ollama model.**
-Not all models support function calling — use `qwen2.5-coder`, `llama3.3`, `mistral`, or `phi4`.
+**Q: Tool calls don't work with my local Ollama model (it just keeps describing what it would do instead of doing it).**
+CheetahClaws now auto-recovers tool calls that local models emit as **text** (`<tool_call>…</tool_call>`, `[TOOL_CALLS]…`) instead of in Ollama's structured field, so most function-calling models execute tools out of the box. For best reliability use a tool-calling model — `qwen2.5-coder`, `llama3.3`, `mistral`, or `phi4`. Small models are also weaker at agentic tool use than cloud models, so expect them to need clearer, more concrete prompts.
+
+**Q: After installing on macOS, `cheetahclaws: command not found` and no `~/.zshrc` was created.**
+Reload your shell first: `source ~/.zshrc` (zsh) or `source ~/.bash_profile` (bash). The installer creates `~/.zshrc` if missing, symlinks the binary into `~/.local/bin`, and adds it to PATH. If you installed an older version, either re-run the installer or add this line yourself: `echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.zshrc && source ~/.zshrc`.
**Q: How do I connect to a remote GPU server running vLLM?**
```
diff --git a/docs/guides/faq.md b/docs/guides/faq.md
index de14e15f..d3084294 100644
--- a/docs/guides/faq.md
+++ b/docs/guides/faq.md
@@ -67,15 +67,19 @@ For stdio servers with env-based auth:
## Models & providers
-**Q: Tool calls don't work with my local Ollama model.**
+**Q: Tool calls don't work with my local Ollama model (it just keeps describing what it would do instead of doing it).**
-Not all models support function calling. Use one of the recommended tool-calling models: `qwen2.5-coder`, `llama3.3`, `mistral`, or `phi4`.
+CheetahClaws now auto-recovers tool calls that local models emit as **text** — `<tool_call>…</tool_call>` (Qwen/Hermes), `<|tool_call|>…` (Gemma), `[TOOL_CALLS]…` (Mistral) — instead of in Ollama's structured `message.tool_calls` field. Previously those were streamed as chat and never executed, which is why the model seemed to "keep talking." Most function-calling models now execute tools out of the box.
+
+For best reliability use one of the recommended tool-calling models. Small local models are also weaker at agentic tool use than cloud models, so give them clear, concrete prompts (a path, a filename, an exact command):
```bash
ollama pull qwen2.5-coder
cheetahclaws --model ollama/qwen2.5-coder
```
+If a model returns `500` on the first tool-enabled request, it has no tool template — CheetahClaws falls back to chat-only (a yellow `[warn]` is printed). Pull one of the models above instead.
+
**Q: How do I connect to a remote GPU server running vLLM?**
```
@@ -130,6 +134,22 @@ uv tool install ".[all]"
After that, just run `cheetahclaws` from any directory. To update after pulling changes, run `uv tool install ".[all]" --reinstall`. For a minimal install, use `uv tool install .` and add extras as needed.
+**Q: After installing on macOS I get `cheetahclaws: command not found`, and `~/.zshrc` was never created.**
+
+Open a new terminal, or reload your current shell first:
+
+```bash
+source ~/.zshrc # zsh (macOS default)
+source ~/.bash_profile # bash on macOS
+```
+
+On macOS the installer creates a dedicated virtual environment (`~/.cheetahclaws-venv`), symlinks the `cheetahclaws` entry point into `~/.local/bin`, creates `~/.zshrc` if it's missing, and appends `~/.local/bin` to your `PATH` there. (It links only the one binary rather than putting the whole venv on `PATH`, so your own `python`/`pip` aren't shadowed.) If you installed an older build that skipped this, either re-run the installer or add it yourself:
+
+```bash
+echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.zshrc
+source ~/.zshrc
+```
+
## Voice
**Q: How do I set up voice input?**
diff --git a/docs/guides/usage.md b/docs/guides/usage.md
index a76803a9..527438d7 100644
--- a/docs/guides/usage.md
+++ b/docs/guides/usage.md
@@ -218,6 +218,18 @@ Then use any model from the list:
cheetahclaws --model ollama/
```
+**If a local model "just keeps talking" instead of editing files / running commands:**
+that means it emitted its tool calls as text rather than as structured calls.
+CheetahClaws auto-recovers the common text formats — `<tool_call>…</tool_call>`
+(Qwen/Hermes), `<|tool_call|>…` (Gemma), and `[TOOL_CALLS]…` (Mistral) — so they
+now execute. For best results pick a function-calling model (`qwen2.5-coder`,
+`llama3.3`, `mistral`, `phi4`) and give concrete prompts (a path, a filename, an
+exact command). Small local models are inherently weaker at agentic tool use than
+cloud models, so they may still need more explicit instructions. If a model has no
+tool template at all, the first tool-enabled request returns `500` and CheetahClaws
+falls back to chat-only mode (a yellow `[warn]` is printed) — pull one of the
+recommended models instead.
+
---
### Option B — LM Studio
diff --git a/docs/news.md b/docs/news.md
index 28994ef9..25f266fb 100644
--- a/docs/news.md
+++ b/docs/news.md
@@ -3,8 +3,9 @@
## 🔥🔥🔥 News (Pacific Time)
-- June 5, 2026 (**v3.05.82**) (latest): **User-controllable token / cost budgets — set a spend cap; on hit the session auto-saves and you can resume or raise it.** The quota engine (`quota.py`: per-session + per-day token/cost counters, enforced before each model call) already existed but had no friendly surface — you had to know four config keys (`session_token_budget` / `session_cost_budget` / `daily_token_budget` / `daily_cost_budget`) and there was no way to see how close you were, no warning before the wall, and the hard stop printed a bare `[Quota exceeded]`. This adds the UX layer on top of the unchanged engine: a **`/budget`** command — no args shows usage vs every budget as colored bars + percentages; **`/budget $5`** sets a session **cost** cap (the `$` means USD), **`/budget 200k`** a session **token** cap (parses `200k` / `1.5m` / `200000`), **`/budget daily $20`** / **`/budget daily 2m`** the daily caps, and **`/budget clear`** removes all. A **`--budget $5`** / **`--budget 200k`** startup flag sets the session cap at launch. **Proximity warnings** fire at the end of any turn that crosses **≥80%** (yellow) / **≥95%** (red) of a cap, so the wall never arrives by surprise. **On hit** the agent now yields a `QuotaPause` event (instead of a plain text line): the REPL **auto-saves the session** (`session_latest.json` + daily backup, the same path `/resume` reads) and prints a friendly next-steps block — raise the **same** cap or remove it (`/budget clear`) then resend, or restart later and `/resume`. So a long task that runs out of budget is never lost: you analyze, adjust, and continue. 
**Tight enforcement (no surprise overshoot):** the check projects the next request's *input* (`compaction.estimate_tokens`) and stops *before* the call if it would cross the cap, and clamps that call's `max_tokens` to the remaining headroom (`quota.output_room`) — so a single tool-heavy turn can't blow 40k→49k past the budget the way a pure "already-spent ≥ limit" check let it. **One budget per scope:** setting a cap *replaces* the other unit for that scope (`/budget $5` after `/budget 200k` switches the session cap to cost rather than stacking), so a leftover token cap can't silently keep blocking after you switch to a `$` cap. **Unit-matched hint:** `QuotaExceeded` / `QuotaPause` carry which cap broke (`key`/`scope`/`unit`/`limit`), so the "raise it" suggestion is in the *right* unit — a token cap shows `/budget 40k`, a daily cost cap shows `/budget daily $40` — instead of a generic `$` amount that wouldn't lift a token cap. New helpers `quota.parse_budget` / `fmt_amount` / `usage_vs_limits` / `warnings` / `output_room`; command in `commands/core.py:cmd_budget`; `QuotaPause` in `agent.py`; REPL handling + `--budget` in `cheetahclaws.py`; 42-case `tests/test_budget.py` (isolated quota dir, incl. a regression that the hint matches the breached unit and that switching units clears the stale cap). The daemon's conservative `serve`-mode defaults (200k tok / $2 per session, 2M / $20 per day) are unchanged — interactive stays unlimited by default, the server stays guard-railed. See [docs/guides/features.md](guides/features.md) · [docs/guides/reference.md](guides/reference.md).
-- June 5, 2026 (**v3.05.82**): **Adaptive Markdown streaming — live output that stays correct on every device.** In-place Rich Live redraw is great on capable terminals but breaks elsewhere: it was disabled wholesale over SSH (so SSH users got raw tokens with no formatting), and where it *did* run it could leave **duplicate or stale frames** — on macOS Terminal (which can't erase above the scroll boundary), over laggy network PTYs, or with **wide CJK / emoji text** whose display width a naive line-count gets wrong. The renderer now selects a **streaming tier per device** in `ui.render.auto_stream_mode(config)`: **`live`** — full in-place redraw, only on terminals known to handle cursor-up (local TTYs, and modern emulators *even over SSH*: iTerm2, WezTerm, Windows Terminal, VSCode, kitty, Alacritty, Ghostty, detected via `TERM_PROGRAM` / `TERM` / `WT_SESSION` / `KITTY_WINDOW_ID` / `ALACRITTY_WINDOW_ID` / `WEZTERM_PANE`); **`commit`** — **append-only progressive Markdown**, the safe default for unknown-SSH / Apple Terminal / pipes / non-TTY, where each completed block (split on blank lines, respecting open code fences so a fenced block renders atomically) is rendered and printed **permanently** and the cursor is **never moved**, making a duplicate frame structurally impossible regardless of terminal, latency, or character width; **`plain`** — raw tokens, only when `rich` is unavailable. The append-only floor is provably duplication-free; `live` is progressive enhancement on top. Override with **`/config stream_mode=live|commit|plain`** (legacy boolean **`/config rich_live=true|false`** still works → `live`/`commit`). 
Implemented in `ui/render.py` (`set_stream_mode` / `auto_stream_mode` / `_safe_commit_point` / `_commit_stream` / `_commit_flush`), wired in at REPL start in `cheetahclaws.py`, with a 26-case test suite in `tests/test_stream_modes.py` (device routing, code-fence-aware block boundaries, append-only commit, and a regression asserting commit mode emits **zero** cursor sequences even on a TTY with CJK text). Two related UX items shipped alongside: **`/context` is now a visual grid** — a Claude-Code-style 20×10 cell grid of context-window usage, colored and broken down by category (system prompt / system tools / memory files / skills / messages / free space) with per-category token counts and percentages, adapting to the model's real context window and falling back to `#`/`.` on non-UTF-8 terminals (`commands/core.py:cmd_context`); and **`deepseek-v4-flash` is registered at its 1M context window** in `providers._MODEL_CONTEXT_LIMITS` (overriding the 128K deepseek provider default, which still applies to `deepseek-chat` / `deepseek-v4-pro`), so the prompt `%`, `/context`, and the compaction trigger all reflect the true 1M window. See [docs/guides/features.md](guides/features.md) · [docs/guides/reference.md](guides/reference.md).
+- June 6, 2026 (**v3.5.82**) (latest): **macOS install reliably puts `cheetahclaws` on PATH, and local Ollama models that emit tool calls as text now actually execute them.** Two fixes reported in issue #131. **(1) Install / PATH on macOS.** On macOS the installer creates a dedicated venv (`~/.cheetahclaws-venv`) and `source`s it, so the post-install verification `if command -v cheetahclaws` succeeded *inside the script's own activated shell* — it printed "cheetahclaws is on PATH" and **short-circuited past the entire rc-file block**, including the `touch ~/.zshrc` that was supposed to create the file. Result: `~/.zshrc` was never created/updated, and in a fresh terminal (no venv active) the binary was unreachable, so users had to hunt for the install location by hand. The verification step no longer trusts the venv-polluted `command -v`: it confirms the binary at the expected `BIN_DIR`, then (for venv installs) **symlinks only the `cheetahclaws` entry point into `~/.local/bin`** — pipx-style, so the venv's `python`/`pip` never get prepended to PATH and can't shadow the user's own — creates the right rc file if missing (`~/.zshrc` for zsh, `~/.bash_profile` for bash on macOS, `config.fish` for fish), and appends the exposure dir to PATH there. The fish branch now also writes fish (`set -gx PATH …`) syntax instead of `export`, and the reload hint points bash-on-macOS at `.bash_profile` (`scripts/install.sh`). **(2) Ollama tool calls (the "model just keeps talking" bug).** The Ollama streaming path (`stream_ollama`) only read tool calls from Ollama's structured `message.tool_calls` field, whereas the OpenAI-compatible cloud path (`stream_openai_compat`) *also* recovers tool calls a model emits as **text** via `_find_native_tool_marker` + `_extract_native_tool_calls`. 
Many local models — Qwen-coder, Gemma, Mistral — emit calls as `<tool_call>{…}</tool_call>` / `<|tool_call|>…` / `[TOOL_CALLS][…]` inside `content`; on the Ollama path that markup was streamed straight to the screen as chat and never executed, so the agent loop saw no tool calls and ended the turn — exactly the reported "tool-calling-style chat that never runs." `stream_ollama` now mirrors the cloud path: when a native marker appears in the streamed content it **buffers from that point** (so the user never sees raw markup), and at end-of-stream parses the buffer into real tool calls (falling back to surfacing the buffered text if parsing fails, so nothing is silently swallowed). Note: Ollama's native `/api/chat` does not accept a `tool_choice` parameter, so the fix is the text-format recovery, not a request-param change. Existing provider + cache-token suites stay green. See [docs/guides/usage.md](guides/usage.md#usage-open-source-models-local) · [docs/guides/faq.md](guides/faq.md).
+- June 5, 2026 (**v3.5.82**): **User-controllable token / cost budgets — set a spend cap; on hit the session auto-saves and you can resume or raise it.** The quota engine (`quota.py`: per-session + per-day token/cost counters, enforced before each model call) already existed but had no friendly surface — you had to know four config keys (`session_token_budget` / `session_cost_budget` / `daily_token_budget` / `daily_cost_budget`) and there was no way to see how close you were, no warning before the wall, and the hard stop printed a bare `[Quota exceeded]`. This adds the UX layer on top of the unchanged engine: a **`/budget`** command — no args shows usage vs every budget as colored bars + percentages; **`/budget $5`** sets a session **cost** cap (the `$` means USD), **`/budget 200k`** a session **token** cap (parses `200k` / `1.5m` / `200000`), **`/budget daily $20`** / **`/budget daily 2m`** the daily caps, and **`/budget clear`** removes all. A **`--budget $5`** / **`--budget 200k`** startup flag sets the session cap at launch. **Proximity warnings** fire at the end of any turn that crosses **≥80%** (yellow) / **≥95%** (red) of a cap, so the wall never arrives by surprise. **On hit** the agent now yields a `QuotaPause` event (instead of a plain text line): the REPL **auto-saves the session** (`session_latest.json` + daily backup, the same path `/resume` reads) and prints a friendly next-steps block — raise the **same** cap or remove it (`/budget clear`) then resend, or restart later and `/resume`. So a long task that runs out of budget is never lost: you analyze, adjust, and continue. **Tight enforcement (no surprise overshoot):** the check projects the next request's *input* (`compaction.estimate_tokens`) and stops *before* the call if it would cross the cap, and clamps that call's `max_tokens` to the remaining headroom (`quota.output_room`) — so a single tool-heavy turn can't blow 40k→49k past the budget the way a pure "already-spent ≥ limit" check let it. 
**One budget per scope:** setting a cap *replaces* the other unit for that scope (`/budget $5` after `/budget 200k` switches the session cap to cost rather than stacking), so a leftover token cap can't silently keep blocking after you switch to a `$` cap. **Unit-matched hint:** `QuotaExceeded` / `QuotaPause` carry which cap broke (`key`/`scope`/`unit`/`limit`), so the "raise it" suggestion is in the *right* unit — a token cap shows `/budget 40k`, a daily cost cap shows `/budget daily $40` — instead of a generic `$` amount that wouldn't lift a token cap. New helpers `quota.parse_budget` / `fmt_amount` / `usage_vs_limits` / `warnings` / `output_room`; command in `commands/core.py:cmd_budget`; `QuotaPause` in `agent.py`; REPL handling + `--budget` in `cheetahclaws.py`; 42-case `tests/test_budget.py` (isolated quota dir, incl. a regression that the hint matches the breached unit and that switching units clears the stale cap). The daemon's conservative `serve`-mode defaults (200k tok / $2 per session, 2M / $20 per day) are unchanged — interactive stays unlimited by default, the server stays guard-railed. See [docs/guides/features.md](guides/features.md) · [docs/guides/reference.md](guides/reference.md).
+- June 5, 2026 (**v3.5.82**): **Adaptive Markdown streaming — live output that stays correct on every device.** In-place Rich Live redraw is great on capable terminals but breaks elsewhere: it was disabled wholesale over SSH (so SSH users got raw tokens with no formatting), and where it *did* run it could leave **duplicate or stale frames** — on macOS Terminal (which can't erase above the scroll boundary), over laggy network PTYs, or with **wide CJK / emoji text** whose display width a naive line-count gets wrong. The renderer now selects a **streaming tier per device** in `ui.render.auto_stream_mode(config)`: **`live`** — full in-place redraw, only on terminals known to handle cursor-up (local TTYs, and modern emulators *even over SSH*: iTerm2, WezTerm, Windows Terminal, VSCode, kitty, Alacritty, Ghostty, detected via `TERM_PROGRAM` / `TERM` / `WT_SESSION` / `KITTY_WINDOW_ID` / `ALACRITTY_WINDOW_ID` / `WEZTERM_PANE`); **`commit`** — **append-only progressive Markdown**, the safe default for unknown-SSH / Apple Terminal / pipes / non-TTY, where each completed block (split on blank lines, respecting open code fences so a fenced block renders atomically) is rendered and printed **permanently** and the cursor is **never moved**, making a duplicate frame structurally impossible regardless of terminal, latency, or character width; **`plain`** — raw tokens, only when `rich` is unavailable. The append-only floor is provably duplication-free; `live` is progressive enhancement on top. Override with **`/config stream_mode=live|commit|plain`** (legacy boolean **`/config rich_live=true|false`** still works → `live`/`commit`). 
Implemented in `ui/render.py` (`set_stream_mode` / `auto_stream_mode` / `_safe_commit_point` / `_commit_stream` / `_commit_flush`), wired in at REPL start in `cheetahclaws.py`, with a 26-case test suite in `tests/test_stream_modes.py` (device routing, code-fence-aware block boundaries, append-only commit, and a regression asserting commit mode emits **zero** cursor sequences even on a TTY with CJK text). Two related UX items shipped alongside: **`/context` is now a visual grid** — a Claude-Code-style 20×10 cell grid of context-window usage, colored and broken down by category (system prompt / system tools / memory files / skills / messages / free space) with per-category token counts and percentages, adapting to the model's real context window and falling back to `#`/`.` on non-UTF-8 terminals (`commands/core.py:cmd_context`); and **`deepseek-v4-flash` is registered at its 1M context window** in `providers._MODEL_CONTEXT_LIMITS` (overriding the 128K deepseek provider default, which still applies to `deepseek-chat` / `deepseek-v4-pro`), so the prompt `%`, `/context`, and the compaction trigger all reflect the true 1M window. See [docs/guides/features.md](guides/features.md) · [docs/guides/reference.md](guides/reference.md).
- June 4, 2026 (**v3.05.81**): **Claude-Code-style quiet output — hide tool execution, show one summary line per turn.** Long analysis turns used to scroll the terminal with a `⚙ Bash(...)` line and a `✓ → N lines (… chars)` line for *every* tool call, and the permission prompt dumped the entire inline script (e.g. a 60-line `python3 << 'PYEOF'` heredoc). A new **quiet mode (on by default)** suppresses the per-tool lines — the spinner conveys live activity and a single summary line is emitted at the tool→text boundary, sitting just above the reply (`Read 2 files, ran 3 shell commands`), the way Claude Code does. Errors and denials still surface so a mid-turn failure is never silent. In quiet mode the **permission prompt also collapses** a multi-line command to one line (`Run: python3 << 'PYEOF' … (+59 行)`) instead of printing the whole script. `/verbose` overrides quiet (full per-tool lines + inputs + token counts); toggle with **`/quiet`**, or launch with **`--show-tools`** (alias `--no-quiet`). The startup banner gains an **`Output: quiet` / `Output: full`** line so the active mode is visible at a glance. **Live status line:** the spinner now shows elapsed time plus a running output-token estimate (`Thinking… (7s · ↓ 435 tokens)`) — char-based, since providers only report real usage at the end — and each quiet turn closes with a real-usage footer **`✻ Worked for 7.2s · ↑ 1.2k · ↓ 435`** built from the true `TurnDone` counts. Implemented in `ui/render.py` (turn-level tool accumulator + `turn_summary_line()`, spinner token meter, `print_turn_stats()`), wired through the REPL event loop in `cheetahclaws.py`, with the `/quiet` toggle in `commands/config_cmd.py`. See [docs/guides/features.md](guides/features.md).
- June 4, 2026: **Context-window override — the prompt % and compaction now follow a settable context length.** The prompt's context-usage `%` (and the compaction trigger) derive from the model's context window, which previously could only be a hardcoded provider default — and `max_tokens` (the OUTPUT cap) doesn't change it, so `/config max_tokens=…` left the `%` unchanged (a common point of confusion). New per-session key **`context_window`** (`/config context_window=`, `0` = model default) overrides it, kept deliberately distinct from `max_tokens`. A single parser (`providers.context_window_override`) feeds the prompt `%`, `/context`, the compaction trigger, **and** the per-call output-token cap, so all four stay consistent; it is bidirectional — a smaller value forces earlier compaction, a larger value corrects a stale default. The value is read live each prompt, so switching model **or** `context_window` updates the `%` with no restart. `/config` warns when the value exceeds the model's real window (which would disable compaction and let the API reject oversized prompts). No-op when unset, so existing behavior is unchanged. See [docs/guides/reference.md](guides/reference.md).
- June 4, 2026: **Rich Live streaming — long responses stay live via a bounded tail window.** Large streamed responses that would overflow the terminal's redraw area could leave duplicate or stale frames behind on some emulators (macOS Terminal, etc.), because Rich Live redraws the whole accumulated output in place and the cursor can't reach content that has scrolled into the scrollback. Building on the per-response fallback from PR #133, Rich Live now keeps the live region **bounded to the viewport**: a short response is shown in full, but once it would overflow, only the **last screenful of rendered lines (a tail window) is redrawn** — so the Live region can never exceed the terminal and cannot leave stale frames. The complete output is committed once when the response finishes (including on Ctrl-C, since the REPL flushes on interrupt), so the head that scrolled out of the window is never lost. Plain streaming is kept only as a safety net (precise render failed, or the terminal is too small to bound a window). A cheap per-line wrap estimate short-circuits the expensive full `render_lines()` measurement while a response stays well under the limit, so normal responses pay no extra Markdown re-render per chunk. Adds focused tests covering full-frame streaming, the full→tail transition, tail-window commit-on-flush, real `Segments` rendering, and both safety-net fallbacks. See [docs/guides/features.md](guides/features.md).
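Reviewer sketch (not part of the patch): the text formats named in the June 6 entry could be turned into call dicts roughly as follows. This is a minimal sketch under stated assumptions, not the body of `_extract_native_tool_calls`: the function name `parse_text_tool_calls`, the regexes, the `call_N` ids, and the `name`/`arguments` payload keys are all hypothetical. Only the output shape (`id` / `name` / `input`) is taken from the `tool_calls.append(...)` line in the `providers.py` hunk. The Gemma `<|tool_call|>` format is left out because its payload layout is not shown in this diff.

```python
import json
import re
from typing import Any, Dict, List

# Assumed payload layouts (hypothetical, for illustration only):
#   Hermes/Qwen: <tool_call>{"name": ..., "arguments": {...}}</tool_call>
#   Mistral:     [TOOL_CALLS][{"name": ..., "arguments": {...}}, ...]
_HERMES = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)
_MISTRAL = re.compile(r"\[TOOL_CALLS\]\s*(\[.*\])", re.DOTALL)


def parse_text_tool_calls(text: str) -> List[Dict[str, Any]]:
    """Parse text-format tool calls into {"id", "name", "input"} dicts."""
    objs: List[Any] = []

    # Each Hermes-style block holds one JSON object.
    for match in _HERMES.finditer(text):
        try:
            objs.append(json.loads(match.group(1)))
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of raising

    # A Mistral-style marker is followed by one JSON array of calls.
    match = _MISTRAL.search(text)
    if match:
        try:
            parsed = json.loads(match.group(1))
        except json.JSONDecodeError:
            parsed = []
        if isinstance(parsed, list):
            objs.extend(parsed)

    calls = []
    for obj in objs:
        if isinstance(obj, dict) and "name" in obj:
            calls.append({
                "id": f"call_{len(calls)}",
                "name": obj["name"],
                "input": obj.get("arguments", {}),
            })
    return calls
```

An empty result signals "nothing parseable", which is the case where the patch surfaces the buffered text to the user instead of dropping it.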
diff --git a/providers.py b/providers.py
index df7537d6..9237013d 100644
--- a/providers.py
+++ b/providers.py
@@ -1342,6 +1342,14 @@ def _make_request(p):
text = ""
tool_buf: dict = {}
+ # Native tool-call interceptor. Many local models (Qwen, Gemma, Mistral)
+ # emit tool calls as plain text in `content` — e.g. `<tool_call>{...}</tool_call>`
+ # or `[TOOL_CALLS][...]` — instead of Ollama's structured `message.tool_calls`
+ # field. Without this we stream that markup as chat and never execute the
+ # tool: the classic "the local model just keeps talking" symptom. Buffer from
+ # the first marker so the user never sees raw markup, then parse at end-of-stream.
+ native_tool_buffering = False
+ native_tool_buffer = ""
try:
    resp_cm = urllib.request.urlopen(req)
@@ -1384,10 +1392,26 @@ def _make_request(p):
if "thinking" in msg and msg["thinking"]:
    yield ThinkingChunk(msg["thinking"])
- if "content" in msg and msg["content"]:
-     text += msg["content"]
-     yield TextChunk(msg["content"])
-
+ if msg.get("content"):
+     new = msg["content"]
+     if not native_tool_buffering:
+         # Detect a native tool-call marker, even if split across
+         # streamed chunks, by scanning the joined accumulated text.
+         joined = text + new
+         marker_idx = _find_native_tool_marker(joined)
+         if marker_idx is not None and marker_idx >= len(text):
+             split = marker_idx - len(text)
+             if split > 0:
+                 text += new[:split]
+                 yield TextChunk(new[:split])
+             native_tool_buffering = True
+             native_tool_buffer = new[split:]
+         else:
+             text += new
+             yield TextChunk(new)
+     else:
+         native_tool_buffer += new
+
# Handle native ollama tools format which mirrors OpenAI
for tc in msg.get("tool_calls", []):
    fn = tc.get("function", {})
@@ -1404,6 +1428,18 @@ def _make_request(p):
    v = tool_buf[idx]
    tool_calls.append({"id": v["id"], "name": v["name"], "input": v["input"]})
+ # Fallback: the model emitted its tool calls as text rather than in the
+ # structured `tool_calls` field. Parse the buffered markup into real calls.
+ if native_tool_buffering:
+     native_calls = _extract_native_tool_calls(native_tool_buffer)
+     if native_calls:
+         tool_calls.extend(native_calls)
+     else:
+         # Couldn't parse — surface the buffer as text rather than swallow it,
+         # so the user sees something instead of a silent stall.
+         text += native_tool_buffer
+         yield TextChunk(native_tool_buffer)
+
# Ollama doesn't return exact token counts via livestream easily until "done",
# but we can do a rough estimate or 0, cheetahclaws handles zero gracefully
yield AssistantTurn(text, tool_calls, 0, 0, 0, 0)
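
Review note: the buffering logic above can be exercised in isolation with a small standalone sketch. It is not the patch's implementation. `MARKERS`, `find_marker`, and `stream_with_interceptor` are hypothetical stand-ins (the `<tool_call>` marker in particular is an assumption; only `[TOOL_CALLS]` appears in the comment above), and the behaviour of `_find_native_tool_marker` is not reproduced here. The sketch shows one way to keep a marker that is split across chunks from leaking to the user: hold back any trailing text that could still be the start of a marker.

```python
from typing import Iterable, Iterator, Optional, Tuple

# Assumed marker set, for illustration only.
MARKERS = ("<tool_call>", "[TOOL_CALLS]")


def find_marker(s: str) -> Optional[int]:
    """Index of the earliest complete marker in `s`, or None."""
    hits = [i for i in (s.find(m) for m in MARKERS) if i != -1]
    return min(hits) if hits else None


def holdback_len(s: str) -> int:
    """Length of the longest suffix of `s` that is a proper prefix of a marker."""
    best = 0
    for m in MARKERS:
        for k in range(min(len(m) - 1, len(s)), 0, -1):
            if s.endswith(m[:k]):
                best = max(best, k)
                break
    return best


def stream_with_interceptor(chunks: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield ("text", s) events as they become safe to show, then one ("buffer", s)."""
    pending = ""                   # text not yet shown (possible marker prefix)
    buffer: Optional[str] = None   # set once a marker has been seen
    for new in chunks:
        if buffer is not None:
            buffer += new          # everything after the marker is buffered
            continue
        pending += new
        idx = find_marker(pending)
        if idx is not None:
            if idx > 0:
                yield ("text", pending[:idx])
            buffer = pending[idx:]
            pending = ""
            continue
        keep = holdback_len(pending)
        emit = pending[: len(pending) - keep]
        if emit:
            yield ("text", emit)
        pending = pending[len(pending) - keep:]
    if buffer is None and pending:
        yield ("text", pending)    # held-back text turned out not to be a marker
    yield ("buffer", buffer or "")
```

With the chunks `"Sure. <tool"` and `"_call>{...}</tool_call>"`, the sketch shows `"Sure. "` and buffers the whole marker, so no partial markup reaches the user. A parser such as `_extract_native_tool_calls` would then run on the final buffer.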
diff --git a/pyproject.toml b/pyproject.toml
index b5d0aebd..74a74989 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -4,11 +4,35 @@ build-backend = "setuptools.build_meta"
[project]
name = "cheetahclaws"
-version = "3.05.82"
+version = "3.5.82"
description = "CheetahClaws: An Extensible, Python-Native Agent System for Autonomous Multi-Model Workflows"
readme = "README.md"
requires-python = ">=3.10"
license = { text = "Apache-2.0" }
+authors = [{ name = "SAIL Lab (Safe AI and Robot Learning Lab)", email = "gushangding@gmail.com" }]
+maintainers = [{ name = "SAIL Lab (Safe AI and Robot Learning Lab)", email = "gushangding@gmail.com" }]
+keywords = [
+ "ai", "agent", "llm", "claude", "openai", "gemini", "deepseek",
+ "coding-assistant", "cli", "terminal", "autonomous-agents",
+ "multi-model", "mcp", "tool-use", "ollama",
+]
+classifiers = [
+ "Development Status :: 5 - Production/Stable",
+ "Environment :: Console",
+ "Intended Audience :: Developers",
+ "Intended Audience :: Science/Research",
+ "License :: OSI Approved :: Apache Software License",
+ "Operating System :: POSIX :: Linux",
+ "Operating System :: MacOS",
+ "Programming Language :: Python :: 3",
+ "Programming Language :: Python :: 3.10",
+ "Programming Language :: Python :: 3.11",
+ "Programming Language :: Python :: 3.12",
+ "Programming Language :: Python :: 3.13",
+ "Topic :: Scientific/Engineering :: Artificial Intelligence",
+ "Topic :: Software Development :: Code Generators",
+ "Topic :: Utilities",
+]
dependencies = [
"anthropic>=0.40.0",
"openai>=1.30.0",
@@ -30,6 +54,12 @@ litellm = ["litellm>=1.60.0,<2.0.0"]
qq = ["qq-botpy>=1.2.1"]
all = ["sounddevice", "Pillow", "prompt_toolkit>=3.0.43", "playwright", "pymupdf", "openpyxl", "pytesseract", "yfinance>=0.2.30", "rank-bm25>=0.2.2", "sqlalchemy>=2.0", "bcrypt>=4.0", "PyJWT>=2.8.0", "litellm>=1.60.0,<2.0.0", "qq-botpy>=1.2.1"]
+[project.urls]
+Homepage = "https://cheetahclaws.github.io/"
+Repository = "https://github.com/SafeRL-Lab/cheetahclaws"
+Issues = "https://github.com/SafeRL-Lab/cheetahclaws/issues"
+Documentation = "https://github.com/SafeRL-Lab/cheetahclaws/tree/main/docs"
+
[project.scripts]
cheetahclaws = "cheetahclaws:main"
diff --git a/scripts/install.sh b/scripts/install.sh
index ed3ff21c..312e14bc 100755
--- a/scripts/install.sh
+++ b/scripts/install.sh
@@ -178,24 +178,63 @@ fi
ok "CheetahClaws installed"
-# ── Verify installation & add to PATH ─────────────────────────────────
-# Determine where the binary lives
+# ── Verify installation & expose on PATH for future shells ────────────
+# Determine where pip/venv placed the entry point.
if [ "$USE_VENV" = true ]; then
BIN_DIR="$VENV_DIR/bin"
else
BIN_DIR="$PIP_BIN"
fi
-if command -v cheetahclaws &>/dev/null; then
- ok "cheetahclaws is on PATH"
-elif [ -f "$BIN_DIR/cheetahclaws" ]; then
+# Decide which directory to put on PATH (EXPOSE_DIR).
+# - venv install: symlink ONLY the cheetahclaws entry point into
+# ~/.local/bin. Putting the whole venv/bin on PATH would shadow the
+# user's python/pip in every new shell — pipx avoids this the same way.
+# - user install: PIP_BIN itself is the directory to expose.
+# NOTE: we deliberately do NOT short-circuit on `command -v cheetahclaws`.
+# When we installed into a venv that this script just `source`d, the binary
+# is on PATH for THIS shell only — not for the new shells the user opens.
+# Trusting it here was the bug that left .zshrc untouched on macOS.
+EXPOSE_DIR=""
+if [ -f "$BIN_DIR/cheetahclaws" ]; then
+ if [ "$USE_VENV" = true ]; then
+ LOCAL_BIN="$HOME/.local/bin"
+ mkdir -p "$LOCAL_BIN"
+ ln -sf "$BIN_DIR/cheetahclaws" "$LOCAL_BIN/cheetahclaws"
+ EXPOSE_DIR="$LOCAL_BIN"
+ ok "Linked cheetahclaws into $LOCAL_BIN"
+ else
+ EXPOSE_DIR="$BIN_DIR"
+ fi
+elif command -v cheetahclaws &>/dev/null; then
+ # pip put it somewhere already on PATH — expose that directory.
+ EXPOSE_DIR="$(dirname "$(command -v cheetahclaws)")"
+else
+ warn "cheetahclaws binary not found at $BIN_DIR — you may need to add pip's bin directory to PATH manually."
+fi
+
+if [ -n "$EXPOSE_DIR" ]; then
+ # Pick the rc file the user's login shell actually reads, creating it if
+ # missing (macOS ships no default .zshrc, so a fresh zsh user has none).
SHELL_RC=""
+ IS_FISH=false
CURRENT_SH="$(basename "${SHELL:-bash}")"
if [ "$CURRENT_SH" = "zsh" ]; then
SHELL_RC="$HOME/.zshrc"
- touch "$SHELL_RC" # ensure it exists on macOS
+ touch "$SHELL_RC"
elif [ "$CURRENT_SH" = "fish" ]; then
SHELL_RC="$HOME/.config/fish/config.fish"
+ IS_FISH=true
+ mkdir -p "$(dirname "$SHELL_RC")"
+ touch "$SHELL_RC"
+ elif [ "$CURRENT_SH" = "bash" ]; then
+ # macOS bash loads .bash_profile for login shells; Linux loads .bashrc.
+ if [ "$PLATFORM" = "macos" ]; then
+ SHELL_RC="$HOME/.bash_profile"
+ else
+ SHELL_RC="$HOME/.bashrc"
+ fi
+ touch "$SHELL_RC"
elif [ -f "$HOME/.bashrc" ]; then
SHELL_RC="$HOME/.bashrc"
elif [ -f "$HOME/.bash_profile" ]; then
@@ -203,16 +242,26 @@ elif [ -f "$BIN_DIR/cheetahclaws" ]; then
fi
if [ -n "$SHELL_RC" ]; then
- if ! grep -q "$BIN_DIR" "$SHELL_RC" 2>/dev/null; then
- echo "" >> "$SHELL_RC"
- echo "# CheetahClaws" >> "$SHELL_RC"
- echo "export PATH=\"$BIN_DIR:\$PATH\"" >> "$SHELL_RC"
- ok "Added $BIN_DIR to PATH in $SHELL_RC"
+ if ! grep -q "$EXPOSE_DIR" "$SHELL_RC" 2>/dev/null; then
+ if [ "$IS_FISH" = true ]; then
+ {
+ echo ""
+ echo "# CheetahClaws"
+ echo "set -gx PATH \"$EXPOSE_DIR\" \$PATH"
+ } >> "$SHELL_RC"
+ else
+ {
+ echo ""
+ echo "# CheetahClaws"
+ echo "export PATH=\"$EXPOSE_DIR:\$PATH\""
+ } >> "$SHELL_RC"
+ fi
+ ok "Added $EXPOSE_DIR to PATH in $SHELL_RC"
+ else
+ ok "PATH already configured in $SHELL_RC"
fi
fi
- export PATH="$BIN_DIR:$PATH"
-else
- warn "cheetahclaws not found on PATH — you may need to add pip's bin directory manually."
+ export PATH="$EXPOSE_DIR:$PATH"
fi
# ── Print version ──────────────────────────────────────────────────────
@@ -231,6 +280,8 @@ if [ "$CURRENT_SHELL" = "zsh" ]; then
RELOAD_CMD="source ~/.zshrc"
elif [ "$CURRENT_SHELL" = "fish" ]; then
RELOAD_CMD="source ~/.config/fish/config.fish"
+elif [ "$CURRENT_SHELL" = "bash" ] && [ "$PLATFORM" = "macos" ]; then
+ RELOAD_CMD="source ~/.bash_profile"
else
RELOAD_CMD="source ~/.bashrc"
fi