WebVisionKit is not DOM automation. Students do not search the page with selectors or depend on a site's internal HTML structure. Instead, an app receives rendered browser frames, reasons about those pixels, and acts through browser input primitives such as click, drag, scroll, and keyboard input.
That model is deliberate. It keeps the student problem focused on vision, state, and control instead of page-specific DOM details.
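To make "reasons about those pixels" concrete, here is a minimal sketch of pixel-based target finding. It assumes a frame has already been decoded into rows of (r, g, b) tuples. The function name find_red_centroid and the color thresholds are illustrative and are not part of webvisionkit.

```python
from typing import List, Optional, Tuple

Pixel = Tuple[int, int, int]


def find_red_centroid(frame: List[List[Pixel]]) -> Optional[Tuple[float, float]]:
    """Return the (x, y) centroid of strongly red pixels, or None if there are none."""
    xs, ys = [], []
    for y, row in enumerate(frame):
        for x, (r, g, b) in enumerate(row):
            # Illustrative threshold: high red, low green and blue.
            if r > 200 and g < 80 and b < 80:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None
    return sum(xs) / len(xs), sum(ys) / len(ys)


# Tiny synthetic 4x4 frame: white background with a 2x2 red block.
WHITE, RED = (255, 255, 255), (255, 0, 0)
demo_frame = [[WHITE] * 4 for _ in range(4)]
for yy in (1, 2):
    for xx in (2, 3):
        demo_frame[yy][xx] = RED

print(find_red_centroid(demo_frame))  # (2.5, 1.5)
```

Nothing in this sketch touches the DOM: the only input is the image, and the only output is a frame-pixel coordinate that could later drive a click or drag.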
- Google Chrome runs on the host with DevTools remote debugging enabled.
- ./launch.bash prepares Chrome, validates Docker connectivity, and launches the selected app.
- The container runs the webvisionkit runtime package and the student app.
- Chrome DevTools Protocol (CDP) delivers screencast frames into the runtime.
- The app returns observations and browser actions, and the runtime forwards those actions back to Chrome.
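The final step in the list above can be pictured as CDP input messages. The sketch below builds the Input.dispatchMouseEvent payloads that a single left click typically corresponds to. The helper name click_messages and the message ids are assumptions for illustration; the runtime's actual dispatch code may differ.

```python
import json
from typing import Dict, List


def click_messages(x: float, y: float, first_id: int = 1) -> List[Dict]:
    """Build press and release CDP messages for one left click at (x, y).

    CDP expects x and y in CSS pixels relative to the viewport.
    """
    messages = []
    for offset, event_type in enumerate(("mousePressed", "mouseReleased")):
        messages.append({
            "id": first_id + offset,
            "method": "Input.dispatchMouseEvent",
            "params": {
                "type": event_type,
                "x": x,
                "y": y,
                "button": "left",
                "clickCount": 1,
            },
        })
    return messages


for message in click_messages(320, 240):
    print(json.dumps(message))
```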
Student apps live under:
apps/<name>/app.py
Each app must export:
app = BrowserApp(...)
That keeps each project self-contained and easy to launch through ./launch.bash.
WebVisionKit apps can start on:
- about:blank
- External URLs such as https://example.com
- Bundled game://... targets such as game://input-lab or game://simple_drag
Bundled game://... targets are resolved by the launcher before the container starts. They are a launcher feature, not a special browser action API.
context.frame_width and context.frame_height describe the delivered frame size that your app receives in on_frame(...). Browser inputs are specified in that same frame-pixel space.
Internally, the runtime maps those frame pixels back to the browser's CSS viewport before dispatching input events. That means students can reason in the same coordinate system they see in the image, while the runtime handles the browser-space conversion.
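The conversion can be sketched as a simple linear scale. This is a minimal illustration, assuming the frame covers the whole viewport with no offset. The function name frame_to_css and the example sizes are hypothetical, and the runtime's real mapping may account for details this sketch ignores.

```python
from typing import Tuple


def frame_to_css(
    frame_x: float,
    frame_y: float,
    frame_size: Tuple[int, int],
    viewport_size: Tuple[int, int],
) -> Tuple[float, float]:
    """Scale a frame-pixel coordinate into CSS viewport pixels."""
    frame_w, frame_h = frame_size
    viewport_w, viewport_h = viewport_size
    return frame_x * viewport_w / frame_w, frame_y * viewport_h / frame_h


# The center of a 1280x642 frame maps to the center of a 1920x963 viewport.
print(frame_to_css(640, 321, (1280, 642), (1920, 963)))  # (960.0, 481.5)
```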
Chrome itself is launched maximized by the launcher, so the visible browser window can be larger than the frame size your app sees. Screencast delivery is a separate step: WebVisionKit passes MAX_WIDTH and MAX_HEIGHT into the Chrome DevTools screencast request as maximum bounds for the delivered frame.
That is why a larger browser window can still produce a smaller delivered frame such as 1280x642. The real viewport can be larger, but the screencast frame is scaled to fit within the configured max bounds while preserving its aspect ratio.
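The arithmetic behind that scaling can be sketched as follows. The 1920x963 viewport and the 1280x720 caps are assumed values chosen to reproduce the 1280x642 example, not documented defaults. The sketch also assumes frames are never upscaled, and Chrome's exact rounding may differ.

```python
from typing import Tuple


def fit_within(viewport: Tuple[int, int], max_size: Tuple[int, int]) -> Tuple[int, int]:
    """Scale viewport down to fit inside max_size, preserving aspect ratio."""
    width, height = viewport
    max_w, max_h = max_size
    scale = 1.0  # never upscale in this sketch
    # A cap of 0 is treated as "no limit" on that axis.
    if max_w > 0:
        scale = min(scale, max_w / width)
    if max_h > 0:
        scale = min(scale, max_h / height)
    return round(width * scale), round(height * scale)


print(fit_within((1920, 963), (1280, 720)))  # (1280, 642)
```

Here the width cap is the binding constraint (1280 / 1920 is smaller than 720 / 963), so the height shrinks by the same factor and ends up below its own cap.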
If you want the runtime to omit those caps, run with:
MAX_WIDTH=0 MAX_HEIGHT=0 APP_NAME=frame_report ./launch.bash
If you want a larger capped delivery size, run with:
MAX_WIDTH=1920 MAX_HEIGHT=1080 APP_NAME=frame_report ./launch.bash
This is adjustable today through launcher and runtime environment variables. It is not currently configurable through BrowserApp(...) or another public Python sizing API in webvisionkit.
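One way such environment variables could translate into a CDP Page.startScreencast request is sketched below. The helper name screencast_params, the jpeg format choice, and the handling of unset variables are assumptions for illustration; only maxWidth and maxHeight as optional screencast parameters come from CDP itself.

```python
import os
from typing import Dict, Mapping, Union


def screencast_params(env: Mapping[str, str] = os.environ) -> Dict[str, Union[str, int]]:
    """Build Page.startScreencast params, omitting any cap set to 0."""
    params: Dict[str, Union[str, int]] = {"format": "jpeg"}  # assumed format
    for env_name, cdp_name in (("MAX_WIDTH", "maxWidth"), ("MAX_HEIGHT", "maxHeight")):
        # Assumption: an unset variable is treated like 0 (no cap) in this sketch.
        value = int(env.get(env_name, "0"))
        if value > 0:
            params[cdp_name] = value
    return params


print(screencast_params({"MAX_WIDTH": "1920", "MAX_HEIGHT": "1080"}))
print(screencast_params({"MAX_WIDTH": "0", "MAX_HEIGHT": "0"}))
```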
The repo ships a few small reference apps that show real runtime usage:
- frame_report: minimal observation app for smoke runs and debugging
- screenshot_capture: periodic screenshot writer
- simple_drag: small CV example that drags a red block into a green goal
- interaction_showcase: broader browser-input demo for game://input-lab