Commit af19b8a

dhruvbatra (Dhruv Batra) and claude authored
yutori template: align with Yutori n1.5 recommendations and fix runtime/type issues (#166)
## Summary

PR #159 did the n1 → n1.5 model upgrade, but a few runtime bugs slipped through and the template never fully aligned with the public Yutori reference SDK. This PR fixes both, in both the TS and Python templates.

Implementation cross-checked against the public [yutori-ai/yutori-sdk-python](https://github.com/yutori-ai/yutori-sdk-python) reference (`yutori/navigator/*`).

### Runtime / correctness bugs

- **Key handling rewritten.** n1.5 emits **lowercase** key names (`enter`, `up`, `pageup`) and supports **sequential presses** (`down down down enter`; see [Key Space](https://docs.yutori.com/reference/n1-5#key-space)). The previous `KEY_MAP` was keyed on PascalCase Playwright-style names so it never matched, and the handler sent the entire space-separated expression as a single key. Replaced with a **lowercase → XKeysym** map (matching the names Kernel's `press_key` accepts, such as `Return`, `Page_Up`, `Ctrl`, per `cmd/browsers_test.go:1512`) and a `parseKeyExpression` / `_parse_key_expression` that issues one `press_key` call per sequential combo. Applies to `key_press`, `hold_key`, click `modifier`, and scroll `modifier`.
- **Coordinates now clamped to [0, dim-1]** after denormalizing from the n1.5 1000×1000 space. A boundary value of `1000` previously mapped to pixel `1280` on a 1280×800 viewport, one pixel outside the valid range. Matches the public SDK's `denormalize_coordinates` default clamp.
- **TS template now type-checks cleanly.** The old `tsconfig.json` `extends`ed `../tsconfig.base.json` (which `kernel create` doesn't copy), `index.ts:61` had a `ChatCompletionMessageParam[]` signature mismatch, and `import sharp from 'sharp'` needed `esModuleInterop`. Inlined the full tsconfig like every other TS template, fixed the function signature, and added `esModuleInterop: true`. Verified with `npx tsc --noEmit` on a fresh scaffold (previously produced 12 errors).
- **Python `pyproject.toml`** description still said `"n1 Computer Use"`; it now says `"n1.5 Computer Use"`.

### Alignment with the public reference SDK

- **WebP quality 80 → 30** for screenshots. Kernel `capture_screenshot` returns PNGs, and the public SDK sets `DEFAULT_WEBP_QUALITY_FOR_PNG = 30` ([images.py](https://github.com/yutori-ai/yutori-sdk-python/blob/main/yutori/navigator/images.py)). Lossless PNG sources tolerate aggressive WebP compression with no visible degradation, and the payload savings matter on long multi-step trajectories.
- **`formatTaskWithContext` / `_format_task_with_context`** appends location, timezone, current date/time, and weekday to the initial task. Mirrors [`format_task_with_context`](https://github.com/yutori-ai/yutori-sdk-python/blob/main/yutori/navigator/context.py). Threaded through new optional `user_timezone` / `user_location` payload fields.
- **`formatStopAndSummarize` / `_format_stop_and_summarize`**: when the loop hits `maxIterations` without a final answer, it makes one extra screenshot + summary call so callers get a usable result instead of empty content. Mirrors the SDK's reference loop behavior.
- **Scroll amount scaling**: Kernel's `delta_y` is a wheel-event repeat count (not pixels), so the previous 1:1 forwarding produced very small scrolls. The templates now multiply by `SCROLL_NOTCHES_PER_AMOUNT = 4`, closer to Yutori's documented "1 unit ≈ 10% of viewport height".
- **`maxIterations` 50 → 100** to match the public SDK example default.

### Docs / scaffold

- README headline example now matches the canonical "list team member names from yutori.com" task from the public docs. The magnitasks Kanban demo is kept below as an advanced example. README also documents the new optional payload fields.
- `pkg/create/templates.go` `InvokeCommand` for both TS and Python uses the same canonical example, so the `kernel create` next-steps hint matches the README.

## Test plan

- [x] `make build && make test` pass; `go vet ./...` clean.
- [x] Scaffolded a fresh TS app from the updated template; `npx tsc --noEmit` passes (previously failed: missing `tsconfig.base`, type-mismatch in `index.ts:61`, sharp default import).
- [x] `kernel deploy` from the fresh scaffold deploys with **no** TypeScript warnings (previously deploy logs included `Cannot read file '/boot-node/tsconfig.base.json'` and the `ChatCompletionMessageParam[]` error).
- [x] `kernel invoke` runs end-to-end against `https://www.yutori.com` / "list team member names": agent navigates to the team page and identifies team members. (Note: yutori.com's parallax team UI is genuinely tricky for any CUA; the iteration count is a UI artifact, not a template regression. n1.5 emits `mouse_move` actions to trigger the hover-reveal cards, which the new code handles.)
- [x] *Reviewer item*: ideally also test `key_press` Enter, sequential presses (`down down enter`), `pageup`/`pagedown`, and shift-click `modifier` to exercise the new key map paths end-to-end. I couldn't reach a trajectory that emitted those during the smoke test.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Updates core Yutori agent loop/tool translation logic (keys, coordinates, scrolling, iteration behavior), which can change automation behavior and failure modes, but is limited to templates and not security-sensitive code.
>
> **Overview**
> Improves the Yutori n1.5 computer-use templates (TS + Python) to better match Yutori’s reference recommendations and fix several runtime/correctness issues.
>
> The sampling loops now **add timezone/location/date context** to the initial task, **raise `maxIterations` to 100**, and if the loop times out without a final answer they perform a **stop-and-summarize** follow-up call so invocations return a usable result.
>
> The computer tools were adjusted for n1.5 behavior: **lowercase/sequential key expressions are parsed and mapped to Kernel keysyms**, normalized coordinates are **clamped** when denormalized to viewport pixels, scroll `amount` is **scaled** to Kernel wheel ticks, screenshot WebP quality is reduced for smaller payloads, and `goto_url` now normalizes missing schemes.
>
> Docs and scaffolding were updated: README examples and `kernel create` invoke hints now use the `yutori.com` “team member names” task, new optional payload fields (`user_timezone`, `user_location`) are documented/exposed, Python metadata is corrected to n1.5, and the TS template inlines a standalone `tsconfig.json` and fixes type issues (including `esModuleInterop`).
>
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit 24ae144. Bugbot is set up for automated code reviews on this repo. Configure [here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

---------

Co-authored-by: Dhruv Batra <dbatra@Dhruvs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
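
The key-handling change described in the summary can be illustrated with a small sketch. This is not the template's actual `_parse_key_expression`; the map below is a hypothetical subset (only `Return`, `Page_Up`, and `Ctrl` are named in the commit message, the other entries are assumed), and the `+` separator for combos is also an assumption.

```python
# Hypothetical subset of a lowercase -> X keysym map. Only Return, Page_Up,
# and Ctrl are named in the commit message; the rest are assumed here.
KEYSYM_MAP = {
    "enter": "Return",
    "pageup": "Page_Up",
    "pagedown": "Page_Down",
    "up": "Up",
    "down": "Down",
    "ctrl": "Ctrl",
    "shift": "Shift",
}


def parse_key_expression(expression: str) -> list[str]:
    """Split a space-separated expression into one mapped combo per press.

    Each whitespace-separated token is one sequential press. Inside a token,
    "+" (an assumed separator) joins keys that are held together. Names not
    in the map are passed through unchanged.
    """
    combos: list[str] = []
    for combo in expression.split():
        keys = [KEYSYM_MAP.get(key.lower(), key) for key in combo.split("+")]
        combos.append("+".join(keys))
    return combos


# One entry per press_key call the handler would issue.
print(parse_key_expression("down down down enter"))
print(parse_key_expression("ctrl+pageup"))
```

Under these assumptions, `down down down enter` yields four separate presses instead of one unmatched key name, which is the behavior the old PascalCase `KEY_MAP` could not provide.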
1 parent b67d5ee commit af19b8a

11 files changed

Lines changed: 544 additions & 208 deletions

pkg/create/templates.go

Lines changed: 2 additions & 2 deletions
@@ -219,7 +219,7 @@ var Commands = map[string]map[string]DeployConfig{
 		TemplateYutoriComputerUse: {
 			EntryPoint: "index.ts",
 			NeedsEnvFile: true,
-			InvokeCommand: `kernel invoke ts-yutori-cua cua-task --payload '{"query": "Navigate to https://example.com and describe the page"}'`,
+			InvokeCommand: `kernel invoke ts-yutori-cua cua-task --payload '{"query": "Navigate to https://www.yutori.com and list the team member names."}'`,
 		},
 		TemplateTzafonComputerUse: {
 			EntryPoint: "index.ts",
@@ -271,7 +271,7 @@ var Commands = map[string]map[string]DeployConfig{
 		TemplateYutoriComputerUse: {
 			EntryPoint: "main.py",
 			NeedsEnvFile: true,
-			InvokeCommand: `kernel invoke python-yutori-cua cua-task --payload '{"query": "Navigate to https://example.com and describe the page"}'`,
+			InvokeCommand: `kernel invoke python-yutori-cua cua-task --payload '{"query": "Navigate to https://www.yutori.com and list the team member names."}'`,
 		},
 		TemplateTzafonComputerUse: {
 			EntryPoint: "main.py",

pkg/templates/python/yutori/README.md

Lines changed: 12 additions & 0 deletions
@@ -21,6 +21,18 @@ kernel deploy main.py --env-file .env

 ## Usage

+```bash
+kernel invoke python-yutori-cua cua-task --payload '{"query": "Navigate to https://www.yutori.com and list the team member names."}'
+```
+
+Optional payload fields:
+
+- `record_replay` (bool) — capture a video of the session (paid plans only).
+- `kiosk` (bool) — launch the browser without address bar / tabs ([see below](#kiosk-mode)).
+- `user_timezone` (IANA, e.g. `"America/New_York"`) and `user_location` (free text, e.g. `"New York, NY, US"`) — appended to the task message so the model has accurate temporal/locational grounding.
+
+More involved example (Kanban drag-and-drop):
+
 ```bash
 kernel invoke python-yutori-cua cua-task --payload '{"query": "Go to https://www.magnitasks.com, Click the Tasks option in the left-side bar, and drag the 5 items in the To Do and In Progress columns to the Done section of the Kanban board. You are done successfully when the items are dragged to Done. Do not click into the items."}'
 ```
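
For scripted invocations, the `--payload` JSON can be generated rather than hand-quoted, which avoids shell-escaping mistakes once the optional fields are included. The following is a minimal sketch that uses only the command name and payload fields shown in the README hunk above; `build_invoke_command` is a hypothetical helper, not part of the template.

```python
import json
import shlex


def build_invoke_command(payload: dict) -> str:
    """Return a shell-safe `kernel invoke` command line for the given payload.

    json.dumps produces the JSON document; shlex.quote wraps it so quotes and
    spaces inside the query survive the shell unchanged.
    """
    base = ["kernel", "invoke", "python-yutori-cua", "cua-task", "--payload"]
    return " ".join(base) + " " + shlex.quote(json.dumps(payload))


payload = {
    "query": "Navigate to https://www.yutori.com and list the team member names.",
    "user_timezone": "America/New_York",
    "user_location": "New York, NY, US",
}
print(build_invoke_command(payload))
```

The printed line can be pasted into a POSIX shell as-is; quoting for other shells would need a different helper.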

pkg/templates/python/yutori/loop.py

Lines changed: 100 additions & 15 deletions
@@ -11,9 +11,14 @@
 @see https://docs.yutori.com/reference/n1-5
 """

+from __future__ import annotations
+
 import copy
 import json
+import platform
+from datetime import datetime
 from typing import Any, Optional
+from zoneinfo import ZoneInfo, ZoneInfoNotFoundError

 from kernel import Kernel
 from openai import OpenAI
@@ -26,6 +31,8 @@
 DISABLED_TOOLS = ["extract_elements", "find", "set_element_value", "execute_js"]
 TOOL_SET = "browser_tools_core-20260403"

+NAVIGATOR_COORDINATE_SCALE = 1000
+
 # Screenshot-trimming defaults mirror Yutori's reference loop:
 # https://github.com/yutori-ai/yutori-sdk-python/blob/main/yutori/navigator/payload.py
 # Trimming is size-triggered — we only drop old screenshots when the payload
@@ -42,12 +49,14 @@ async def sampling_loop(
     kernel: Kernel,
     session_id: str,
     max_completion_tokens: int = 4096,
-    max_iterations: int = 50,
+    max_iterations: int = 100,
     viewport_width: int = 1280,
     viewport_height: int = 800,
     kiosk_mode: bool = False,
+    user_timezone: str = "America/Los_Angeles",
+    user_location: str = "San Francisco, CA, US",
 ) -> dict[str, Any]:
-    """Run the n1 sampling loop until the model stops calling tools or max iterations."""
+    """Run the n1.5 sampling loop until the model stops calling tools or max iterations."""
     client = OpenAI(
         api_key=api_key,
         base_url="https://api.yutori.com/v1",
@@ -57,7 +66,12 @@ async def sampling_loop(

     initial_screenshot = await computer_tool.screenshot()

-    user_content: list[dict[str, Any]] = [{"type": "text", "text": task}]
+    # Append location/timezone/current-date context to the task — mirrors Yutori's
+    # format_task_with_context helper and helps the model with date-sensitive
+    # judgments. https://github.com/yutori-ai/yutori-sdk-python/blob/main/yutori/navigator/context.py
+    task_with_context = _format_task_with_context(task, user_timezone, user_location)
+
+    user_content: list[dict[str, Any]] = [{"type": "text", "text": task_with_context}]
     if initial_screenshot.get("base64_image"):
         user_content.append({
             "type": "image_url",
@@ -171,15 +185,81 @@ async def sampling_loop(
                     "content": result.get("output", "OK"),
                 })

-    if iteration >= max_iterations:
-        print("Max iterations reached")
+    # If the loop exhausted iterations, prompt the model for a final summary so
+    # the caller gets a usable answer instead of empty content. Mirrors Yutori's
+    # format_stop_and_summarize helper.
+    if iteration >= max_iterations and not final_answer:
+        print("Max iterations reached — requesting summary")
+        try:
+            final_screenshot = await computer_tool.screenshot()
+            stop_content: list[dict[str, Any]] = [
+                {"type": "text", "text": _format_stop_and_summarize(task)}
+            ]
+            if final_screenshot.get("base64_image"):
+                stop_content.append({
+                    "type": "image_url",
+                    "image_url": {
+                        "url": f"data:image/webp;base64,{final_screenshot['base64_image']}"
+                    },
+                })
+            conversation_messages.append({"role": "user", "content": stop_content})
+
+            summary_messages, _ = _trimmed_for_request(conversation_messages)
+            summary_response = client.chat.completions.create(
+                model=model,
+                messages=summary_messages,
+                max_completion_tokens=max_completion_tokens,
+                temperature=0.3,
+                extra_body={"tool_set": TOOL_SET, "disable_tools": DISABLED_TOOLS},
+            )
+            summary = summary_response.choices[0].message if summary_response.choices else None
+            if summary:
+                conversation_messages.append(summary.model_dump(exclude_none=True))
+                final_answer = summary.content or None
+        except Exception as summary_error:
+            print(f"Stop-and-summarize call failed: {summary_error}")

     return {
         "messages": conversation_messages,
         "final_answer": final_answer,
     }


+def _format_task_with_context(task: str, user_timezone: str, user_location: str) -> str:
+    """Append location, timezone, and current date/time to the task message."""
+    for timezone_name in [user_timezone, "America/Los_Angeles", "UTC"]:
+        try:
+            tz = ZoneInfo(timezone_name)
+            tz_label = timezone_name
+            break
+        except (ZoneInfoNotFoundError, ValueError, OSError):
+            continue
+    else:
+        return task
+
+    now = datetime.now(tz)
+    day_fmt = "%#d" if platform.system() == "Windows" else "%-d"
+    context = "\n".join([
+        f"User's location: {user_location}",
+        f"User's timezone: {tz_label}",
+        f"Current Date: {now.strftime(f'%B {day_fmt}, %Y')}",
+        f"Current Time: {now.strftime('%H:%M:%S %Z')}",
+        f"Today is: {now.strftime('%A')}",
+    ])
+    return f"{task}\n\n{context}"
+
+
+def _format_stop_and_summarize(task: str) -> str:
+    return (
+        f"Stop here. "
+        f"Summarize your current progress and list in detail all the findings "
+        f"relevant to the given task:\n{task}\n"
+        f"Provide URLs for all relevant results you find and return them in your response. "
+        f"If there is no specific URL for a result, "
+        f"cite the page URL that the information was found on."
+    )
+
+
 def _trimmed_for_request(
     messages: list[dict[str, Any]],
 ) -> tuple[list[dict[str, Any]], int]:
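
The timezone fallback in `_format_task_with_context` above uses a `for ... else`: the `else` branch runs only when no candidate zone could be loaded, in which case the task is returned without context. The same control flow can be written as a small helper that returns early. This is a rough sketch rather than the template's code; `resolve_timezone` and `FALLBACK_TIMEZONES` are hypothetical names.

```python
from typing import Optional
from zoneinfo import ZoneInfo, ZoneInfoNotFoundError

# Same fallback order as the hunk above: requested zone, then LA, then UTC.
FALLBACK_TIMEZONES = ("America/Los_Angeles", "UTC")


def resolve_timezone(requested: str) -> Optional[tuple[str, ZoneInfo]]:
    """Return (label, zone) for the first loadable candidate, or None.

    None corresponds to the for/else branch in the template, where the task
    is returned unchanged because no timezone data was available.
    """
    for name in (requested, *FALLBACK_TIMEZONES):
        try:
            return name, ZoneInfo(name)
        except (ZoneInfoNotFoundError, ValueError, OSError):
            continue
    return None


# An unknown zone falls through to the first fallback that loads.
print(resolve_timezone("Mars/Olympus_Mons"))
```

On systems without an IANA database (and without the `tzdata` package) every lookup fails, so callers should be prepared for `None`.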
@@ -263,17 +343,22 @@ def _scale_coordinates(action: N15Action, viewport_width: int, viewport_height:
     scaled = dict(action)

     if "coordinates" in scaled and scaled["coordinates"]:
-        coords = scaled["coordinates"]
-        scaled["coordinates"] = [
-            round((coords[0] / 1000) * viewport_width),
-            round((coords[1] / 1000) * viewport_height),
-        ]
+        scaled["coordinates"] = _denormalize(scaled["coordinates"], viewport_width, viewport_height)

     if "start_coordinates" in scaled and scaled["start_coordinates"]:
-        coords = scaled["start_coordinates"]
-        scaled["start_coordinates"] = [
-            round((coords[0] / 1000) * viewport_width),
-            round((coords[1] / 1000) * viewport_height),
-        ]
+        scaled["start_coordinates"] = _denormalize(scaled["start_coordinates"], viewport_width, viewport_height)

     return scaled
+
+
+def _denormalize(coords: list[int] | tuple[int, int], width: int, height: int) -> list[int]:
+    """Map [0, 1000] coordinates to viewport pixels and clamp to [0, dim-1].
+
+    Clamping prevents a boundary value like 1000 from landing one pixel outside
+    the viewport on a 1280x800 display.
+    """
+    raw_x = round((coords[0] / NAVIGATOR_COORDINATE_SCALE) * width)
+    raw_y = round((coords[1] / NAVIGATOR_COORDINATE_SCALE) * height)
+    x = max(0, min(width - 1, raw_x))
+    y = max(0, min(height - 1, raw_y))
+    return [x, y]
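
As a quick check of the clamp, here is a standalone copy of the `_denormalize` arithmetic exercised on boundary inputs. The public name `denormalize` and the sample coordinates are chosen for illustration only.

```python
NAVIGATOR_COORDINATE_SCALE = 1000


def denormalize(coords, width, height):
    """Map [0, 1000] coordinates to pixels, clamped to [0, dim - 1]."""
    raw_x = round((coords[0] / NAVIGATOR_COORDINATE_SCALE) * width)
    raw_y = round((coords[1] / NAVIGATOR_COORDINATE_SCALE) * height)
    x = max(0, min(width - 1, raw_x))
    y = max(0, min(height - 1, raw_y))
    return [x, y]


# Without the clamp, (1000, 1000) would map to (1280, 800), which is one
# pixel past the last valid pixel (1279, 799) of a 1280x800 viewport.
print(denormalize([1000, 1000], 1280, 800))  # [1279, 799]
print(denormalize([500, 500], 1280, 800))    # [640, 400]
print(denormalize([0, 0], 1280, 800))        # [0, 0]
```

Out-of-range inputs such as negative values are pulled back to the nearest edge by the same `max`/`min` pair.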

pkg/templates/python/yutori/main.py

Lines changed: 26 additions & 13 deletions
@@ -6,10 +6,15 @@
 from session import KernelBrowserSession


-class QueryInput(TypedDict):
-    query: str
+class _QueryInputOptional(TypedDict, total=False):
     record_replay: Optional[bool]
     kiosk: Optional[bool]
+    user_timezone: Optional[str]
+    user_location: Optional[str]
+
+
+class QueryInput(_QueryInputOptional):
+    query: str


 class QueryOutput(TypedDict):
@@ -37,6 +42,9 @@ async def cua_task(
         payload: An object containing:
             - query: The task/query string to process
             - record_replay: Optional boolean to enable video replay recording
+            - kiosk: Optional boolean to launch in kiosk mode
+            - user_timezone: Optional IANA tz (e.g. "America/New_York")
+            - user_location: Optional free-text location for model context

     Returns:
         A dictionary containing:
@@ -57,24 +65,29 @@ async def cua_task(
     ) as session:
         print("Kernel browser live view url:", session.live_view_url)

-        loop_result = await sampling_loop(
-            model="n1.5-latest",
-            task=payload["query"],
-            api_key=str(api_key),
-            kernel=session.kernel,
-            session_id=str(session.session_id),
-            viewport_width=session.viewport_width,
-            viewport_height=session.viewport_height,
-            kiosk_mode=kiosk_mode,
-        )
+        loop_kwargs: dict = {
+            "model": "n1.5-latest",
+            "task": payload["query"],
+            "api_key": str(api_key),
+            "kernel": session.kernel,
+            "session_id": str(session.session_id),
+            "viewport_width": session.viewport_width,
+            "viewport_height": session.viewport_height,
+            "kiosk_mode": kiosk_mode,
+        }
+        if payload.get("user_timezone"):
+            loop_kwargs["user_timezone"] = payload["user_timezone"]
+        if payload.get("user_location"):
+            loop_kwargs["user_location"] = payload["user_location"]
+
+        loop_result = await sampling_loop(**loop_kwargs)

         final_answer = loop_result.get("final_answer")
         messages = loop_result.get("messages", [])

         if final_answer:
             result = final_answer
         else:
-            # Extract last assistant message
             result = _extract_last_assistant_message(messages)

         return {
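
Two patterns in this file work together: the `total=False` base class makes every field except `query` optional at the type level, and the conditional `loop_kwargs` entries let `sampling_loop` fall back to its own defaults when a field is missing or empty. Below is a condensed sketch of that interaction; `build_loop_kwargs` and `sampling_loop_stub` are hypothetical stand-ins for the real code.

```python
from typing import Any, Optional, TypedDict


class _QueryInputOptional(TypedDict, total=False):
    user_timezone: Optional[str]
    user_location: Optional[str]


class QueryInput(_QueryInputOptional):
    query: str  # the only required key


def build_loop_kwargs(payload: QueryInput) -> dict[str, Any]:
    """Forward optional fields only when present and non-empty."""
    kwargs: dict[str, Any] = {"task": payload["query"]}
    for key in ("user_timezone", "user_location"):
        if payload.get(key):
            kwargs[key] = payload[key]
    return kwargs


def sampling_loop_stub(
    task: str,
    user_timezone: str = "America/Los_Angeles",
    user_location: str = "San Francisco, CA, US",
) -> dict[str, str]:
    """Stand-in for sampling_loop that just echoes the resolved arguments."""
    return {"task": task, "user_timezone": user_timezone, "user_location": user_location}


# Omitted fields keep the callee's defaults rather than being passed as None.
print(sampling_loop_stub(**build_loop_kwargs({"query": "demo"})))
```

Passing `None` explicitly would override the defaults, which is why the optional keys are left out of the kwargs instead.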

pkg/templates/python/yutori/pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 [project]
 name = "python-yutori-cua"
 version = "0.1.0"
-description = "Kernel reference app for Yutori n1 Computer Use"
+description = "Kernel reference app for Yutori n1.5 Computer Use"
 requires-python = ">=3.9"
 dependencies = [
     "openai>=1.58.0",
