HyperAgent is a browser automation SDK that uses LLM-powered agents to execute tasks on web pages. It provides both imperative page methods (page.ai(), page.extract()) and a programmatic task execution API.
HyperAgent Class (src/agent/index.ts)
The main class that orchestrates everything:
class HyperAgent<T extends BrowserProviders = "Local"> {
// Core methods
async executeTask(task: string, params?: TaskParams, initPage?: Page): Promise<TaskOutput>
async executeTaskAsync(task: string, params?: TaskParams, initPage?: Page): Promise<Task>
// Page management
async getCurrentPage(): Promise<Page>
async newPage(): Promise<HyperPage>
async getPages(): Promise<HyperPage[]>
// Browser lifecycle
async initBrowser(): Promise<Browser>
async closeAgent(): Promise<void>
}
HyperPage Interface (src/agent/index.ts:567-605)
Enhanced Playwright Page with AI methods:
interface HyperPage extends Page {
// Execute a task on this page
ai(task: string, params?: TaskParams): Promise<TaskOutput>
// Execute task asynchronously (non-blocking)
aiAsync(task: string, params?: TaskParams): Promise<Task>
// Extract structured data
extract<T>(
task?: string,
outputSchema?: z.AnyZodObject,
params?: TaskParams
): Promise<T | string>
}
Key Implementation Details:
- page.ai() → calls agent.executeTask(task, params, page) (lines 569-570)
- page.extract() → wraps executeTask() with extraction-specific prompts (lines 573-603)
  - Adds maxSteps: 2 by default for extractions
  - Prepends extraction instructions to the task
  - Parses JSON output if outputSchema provided
Main Task Loop (src/agent/tools/agent.ts:105-306)
runAgentTask()
├── 1. Get DOM State (getDom)
│ ├── Inject JavaScript into page
│ ├── Find interactive elements
│ ├── Draw numbered overlay (canvas)
│ └── Capture screenshot with overlay
│
├── 2. Build Agent Messages (buildAgentStepMessages)
│ ├── System prompt
│ ├── Task description
│ ├── Previous steps context
│ ├── DOM representation (text)
│ └── Screenshot (base64 image)
│
├── 3. Invoke LLM (llm.invokeStructured)
│ ├── Request structured output (Zod schema)
│ └── Get list of actions to execute
│
├── 4. Execute Actions (runAction)
│ ├── For each action in list
│ ├── Run action handler
│ └── Wait 2 seconds between actions
│
└── 5. Repeat until complete/cancelled/maxSteps
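The loop above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code in src/agent/tools/agent.ts: `LoopDeps`, `runLoop`, and the injected `wait` function are hypothetical names, and the real loop also builds messages with screenshots and records step output.

```typescript
// Hypothetical stand-ins for getDom, llm.invokeStructured, and runAction.
type Action = { type: string; params: Record<string, unknown> };
type LoopStatus = "completed" | "cancelled" | "maxStepsReached";

interface LoopDeps {
  getDom: () => Promise<string>;
  invokeLLM: (task: string, dom: string, history: Action[][]) => Promise<Action[]>;
  runAction: (action: Action) => Promise<void>;
  isCancelled: () => boolean;
  wait: (ms: number) => Promise<void>;
}

async function runLoop(
  task: string,
  maxSteps: number,
  deps: LoopDeps
): Promise<{ status: LoopStatus; steps: number }> {
  const history: Action[][] = [];
  for (let step = 0; step < maxSteps; step++) {
    if (deps.isCancelled()) return { status: "cancelled", steps: step };
    const dom = await deps.getDom(); // 1. capture DOM state
    const actions = await deps.invokeLLM(task, dom, history); // 2-3. build prompt, ask the LLM
    for (const action of actions) { // 4. execute each returned action
      if (action.type === "complete") return { status: "completed", steps: step + 1 };
      await deps.runAction(action);
      await deps.wait(2000); // the tree lists a 2 second pause between actions
    }
    history.push(actions);
  }
  return { status: "maxStepsReached", steps: maxSteps }; // 5. step budget exhausted
}
```

Injecting the dependencies keeps the control flow testable without a browser or a model; only the termination conditions (complete, cancelled, maxSteps) are taken from the description above.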
Location: src/agent/tools/agent.ts:132-291
Entry Point: getDom(page) (src/context-providers/dom/index.ts:5-18)
export const getDom = async (page: Page): Promise<DOMState | null> => {
const result = await page.evaluate(buildDomViewJs) as DOMStateRaw;
return {
elements: Map<number, InteractiveElement>,
domState: string, // Text representation
screenshot: string // Base64 PNG with overlays
};
};
Build DOM View (src/context-providers/dom/build-dom-view.ts:54-130)
Process:
1. Find Interactive Elements (find-interactive-elements.ts:4-63)
   - Traverse entire DOM including Shadow DOM and iframes
   - Check each element with isInteractiveElem(element)
   - Returns InteractiveElement[] with metadata
2. Render Highlights Offscreen (highlight.ts:105-222)
   - Create OffscreenCanvas with viewport dimensions
   - Draw colored rectangles around each interactive element
   - Draw numbered labels (1, 2, 3...) on each element
   - Return ImageBitmap
3. Composite Screenshot (agent.ts:33-42)
   const compositeScreenshot = async (page: Page, overlay: string) => {
     const screenshot = await page.screenshot({ type: "png" });
     // Overlay numbered boxes onto base screenshot using Jimp
     baseImage.composite(overlayImage, 0, 0);
     return buffer.toString("base64");
   };
4. Build Text Representation (build-dom-view.ts:78-123)
   [1]<button id="submit" class="btn-primary">Submit Form</button>
   [2]<input type="text" name="email" placeholder="Enter email">
   Some text between elements
   [3]<a href="/pricing">View Pricing</a>
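The text representation in the last step can be approximated as follows. This is a sketch only: `InteractiveElementInfo`, `renderElement`, and `renderDomState` are hypothetical names, and the real build-dom-view.ts also interleaves non-interactive text between elements, which is omitted here.

```typescript
// Hypothetical, simplified view of an interactive element.
interface InteractiveElementInfo {
  tag: string;
  attributes: Record<string, string>;
  text: string;
}

// Render one element as [index]<tag attrs>text</tag>; elements without text get no closing tag.
function renderElement(index: number, el: InteractiveElementInfo): string {
  const attrs = Object.keys(el.attributes)
    .map((name) => ` ${name}="${el.attributes[name]}"`)
    .join("");
  const open = `[${index}]<${el.tag}${attrs}>`;
  return el.text ? `${open}${el.text}</${el.tag}>` : open;
}

// Join all indexed elements, one per line, in map insertion order.
function renderDomState(elements: Map<number, InteractiveElementInfo>): string {
  const lines: string[] = [];
  elements.forEach((el, index) => lines.push(renderElement(index, el)));
  return lines.join("\n");
}
```

The index printed in brackets is the same key used in the elements map, which is what lets the LLM refer to an element by number alone.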
Output Structure:
interface DOMState {
elements: Map<number, InteractiveElement> // index → element mapping
domState: string // [idx]<tag>text</tag> format
screenshot: string // base64 PNG with overlays
}
Available Actions (src/agent/actions/)
| Action | Purpose | Key Parameters | Location |
|---|---|---|---|
| clickElement | Click an element | index: number | click-element.ts |
| inputText | Fill input field | index: number, text: string | input-text.ts |
| extract | Extract data | objective: string | extract.ts |
| goToUrl | Navigate to URL | url: string | go-to-url.ts |
| selectOption | Select dropdown | index: number, option: string | select-option.ts |
| scroll | Scroll page | direction: "up" \| "down" | scroll.ts |
| keyPress | Press keyboard key | key: string | key-press.ts |
| complete | End task | output?: string | complete.ts |
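One way to model the table above in TypeScript is a discriminated union keyed on the action type. This is an illustrative sketch rather than the SDK's actual types: `AgentAction` and `describeAction` are hypothetical names, and the `{ type, params }` shape is assumed from the `{ type: "clickElement", params: { index: 5 } }` example elsewhere in this document.

```typescript
// Hypothetical union mirroring the action table; parameter names come from its "Key Parameters" column.
type AgentAction =
  | { type: "clickElement"; params: { index: number } }
  | { type: "inputText"; params: { index: number; text: string } }
  | { type: "extract"; params: { objective: string } }
  | { type: "goToUrl"; params: { url: string } }
  | { type: "selectOption"; params: { index: number; option: string } }
  | { type: "scroll"; params: { direction: "up" | "down" } }
  | { type: "keyPress"; params: { key: string } }
  | { type: "complete"; params: { output?: string } };

// Exhaustive switch: the compiler flags any action type that is not handled.
function describeAction(action: AgentAction): string {
  switch (action.type) {
    case "clickElement":
      return `click element ${action.params.index}`;
    case "inputText":
      return `type "${action.params.text}" into element ${action.params.index}`;
    case "extract":
      return `extract: ${action.params.objective}`;
    case "goToUrl":
      return `navigate to ${action.params.url}`;
    case "selectOption":
      return `select "${action.params.option}" in element ${action.params.index}`;
    case "scroll":
      return `scroll ${action.params.direction}`;
    case "keyPress":
      return `press ${action.params.key}`;
    case "complete":
      return `complete${action.params.output ? `: ${action.params.output}` : ""}`;
  }
}
```

A dispatcher that runs the matching handler could use the same switch, with each branch receiving correctly typed params.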
Action Execution (src/agent/tools/agent.ts:71-103)
Click Element Example (click-element.ts:18-57)
run: async function (ctx: ActionContext, action: ClickElementActionType) {
const { index } = action;
const locator = getLocator(ctx, index); // Get element by index
await locator.scrollIntoViewIfNeeded({ timeout: 2500 });
await locator.waitFor({ state: "visible", timeout: 2500 });
await waitForElementToBeEnabled(locator, 2500);
await waitForElementToBeStable(locator, 2500);
await locator.click({ force: true });
return { success: true, message: `Clicked element with index ${index}` };
}
Element Selection: (actions/utils.ts)
export const getLocator = (ctx: ActionContext, index: number): Locator | null => {
const element = ctx.domState.elements.get(index);
if (!element) return null;
return ctx.page.locator(element.cssPath); // Use CSS path selector
};
1. User calls page.ai("click the login button")
2. → agent.executeTask(task, params, page) (index.ts:569)
3. → runAgentTask() starts task loop (agent.ts:105)
4. → getDom(page) extracts DOM + screenshot (agent.ts:155)
   - Injects JS to find interactive elements
   - Draws numbered overlays
   - Composites screenshot
5. → buildAgentStepMessages() creates LLM prompt (agent.ts:201)
6. → llm.invokeStructured() gets action plan (agent.ts:220)
7. → Execute actions (agent.ts:253-275)
   - LLM returns: { type: "clickElement", params: { index: 5 } }
   - runAction() calls ClickElementActionDefinition.run()
   - Gets locator for element 5
   - Clicks element via Playwright
8. → Repeat loop or mark complete

1. User calls page.extract("product prices", PriceSchema)
2. → Wraps task: "You have to perform an extraction on the current page..." (index.ts:586-590)
3. → Sets maxSteps: 2 (extractions are quick) (index.ts:581)
4. → Adds outputSchema to actions (index.ts:584)
5. → executeTask() runs normal agent loop
6. → LLM returns structured output matching schema
7. → Parse JSON and return typed result (index.ts:592)
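The extraction flow above amounts to a thin wrapper around executeTask(). The sketch below shows that shape under stated assumptions: `extractSketch`, `SchemaLike` (a stand-in for a Zod schema), and the simplified `TaskParams`/`TaskOutput` types are hypothetical, and only the opening of the real prompt is reproduced because the source quotes no more than that.

```typescript
// Hypothetical shapes; the real TaskParams and TaskOutput carry more fields.
type TaskParams = { maxSteps?: number; outputSchema?: unknown };
type TaskOutput = { output: string };
type ExecuteTask = (task: string, params: TaskParams) => Promise<TaskOutput>;

// Stand-in for a Zod schema: anything with a parse() that validates or throws.
interface SchemaLike<T> {
  parse(value: unknown): T;
}

// Only the opening of the real prompt is quoted in the source; the rest is not reproduced here.
const EXTRACT_PREFIX = "You have to perform an extraction on the current page.";

async function extractSketch<T>(
  executeTask: ExecuteTask,
  task: string,
  schema?: SchemaLike<T>,
  params: TaskParams = {}
): Promise<T | string> {
  const result = await executeTask(`${EXTRACT_PREFIX}\n${task}`, {
    ...params,
    maxSteps: params.maxSteps ?? 2, // default of 2 steps, caller may override
    outputSchema: schema,
  });
  // With a schema, parse the JSON output and validate it; otherwise return the raw string.
  return schema ? schema.parse(JSON.parse(result.output)) : result.output;
}
```

Returning `T | string` matches the extract<T>() signature shown earlier: callers that omit the schema get the model's raw text back.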
The extract action is different from page.extract():
Location: src/agent/actions/extract.ts
run: async (ctx: ActionContext, action: ExtractActionType) => {
const { objective } = action; // objective is used in the prompt below
// Get page HTML
const content = await ctx.page.content();
const markdown = await parseMarkdown(content);
// Take screenshot via CDP
const cdpSession = await ctx.page.context().newCDPSession(ctx.page);
const screenshot = await cdpSession.send("Page.captureScreenshot");
// Call LLM with markdown + screenshot
const response = await ctx.llm.invoke([{
role: "user",
content: [
{ type: "text", text: `Extract: "${objective}"\n\n${markdown}` },
{ type: "image", url: `data:image/png;base64,${screenshot.data}` }
]
}]);
return { success: true, message: `Extracted: ${content}` };
}
This is an action the agent can choose during task execution, not the page-level method.
Strengths:
- ✅ Simple index-based selection (LLM just says "5")
- ✅ Visual feedback in screenshots
- ✅ Works well with vision models
Weaknesses:
- ❌ Screenshot required every step (slow)
- ❌ Screenshot → LLM → token cost is high
- ❌ Numbered overlay can occlude important UI
- ❌ Full DOM traversal every step (no caching)
- ❌ Large token counts (screenshot + DOM text)
Performance:
- ~8,000-15,000 tokens per step
- ~1,500-3,000ms per action
- No caching mechanism
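To make these figures concrete, the arithmetic for a multi-step task can be written out. This is a back-of-the-envelope sketch that only multiplies the ranges quoted above; `estimateTask` is a hypothetical helper, and it assumes one action per step unless told otherwise.

```typescript
interface Range {
  min: number;
  max: number;
}

// Figures quoted above; treat them as rough estimates, not measurements made here.
const TOKENS_PER_STEP: Range = { min: 8000, max: 15000 };
const MS_PER_ACTION: Range = { min: 1500, max: 3000 };

const scale = (range: Range, factor: number): Range => ({
  min: range.min * factor,
  max: range.max * factor,
});

// Total tokens scale with steps; total action time scales with the number of actions run.
function estimateTask(steps: number, actionsPerStep = 1): { tokens: Range; actionMs: Range } {
  return {
    tokens: scale(TOKENS_PER_STEP, steps),
    actionMs: scale(MS_PER_ACTION, steps * actionsPerStep),
  };
}

const tenSteps = estimateTask(10);
console.log(tenSteps.tokens); // { min: 80000, max: 150000 }
console.log(tenSteps.actionMs); // { min: 15000, max: 30000 }
```

A ten-step task therefore lands at roughly 80,000 to 150,000 tokens and 15 to 30 seconds of action time under these figures.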
Interactive Element Detection (src/context-providers/dom/elem-interactive.ts)
Current Rules:
isInteractiveElem(element: HTMLElement): { isInteractive: boolean, reason?: string }
Checks (in order):
1. Native interactive tags: button, a[href], input, select, textarea
2. ARIA roles: button, link, tab, checkbox, menuitem
3. Event listeners: data-has-interactive-listener="true" (injected)
4. Contenteditable elements
5. Elements with onclick attribute
6. Cursor style: cursor: pointer
7. Custom detection for common patterns
Ignored Elements:
- Hidden elements (display: none, visibility: hidden)
- Zero-dimension elements
- Disabled elements
- Script and style tags
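The rules above can be expressed as an ordered chain of checks. The sketch below runs on a plain data description of an element rather than a live HTMLElement so it works outside a browser. `ElementFacts` and `isInteractiveSketch` are hypothetical, applying the ignore rules before the positive checks is an assumption, and the "custom detection" step is left out.

```typescript
// Hypothetical, browser-free description of an element.
interface ElementFacts {
  tag: string; // lowercase tag name
  attributes: Record<string, string>;
  style: { display: string; visibility: string; cursor: string };
  width: number;
  height: number;
  disabled: boolean;
}
type Verdict = { isInteractive: boolean; reason?: string };

const NATIVE_TAGS = ["button", "input", "select", "textarea"];
const ARIA_ROLES = ["button", "link", "tab", "checkbox", "menuitem"];
const no: Verdict = { isInteractive: false };
const yes = (reason: string): Verdict => ({ isInteractive: true, reason });

function isInteractiveSketch(el: ElementFacts): Verdict {
  // Ignored elements are filtered first (this ordering is an assumption).
  if (el.tag === "script" || el.tag === "style") return no;
  if (el.style.display === "none" || el.style.visibility === "hidden") return no;
  if (el.width === 0 || el.height === 0) return no;
  if (el.disabled) return no;

  // Positive checks, in the order listed above.
  if (NATIVE_TAGS.indexOf(el.tag) !== -1) return yes("native tag");
  if (el.tag === "a" && "href" in el.attributes) return yes("native tag");
  const role = el.attributes["role"];
  if (role !== undefined && ARIA_ROLES.indexOf(role) !== -1) return yes("aria role");
  if (el.attributes["data-has-interactive-listener"] === "true") return yes("event listener");
  if (el.attributes["contenteditable"] === "true") return yes("contenteditable");
  if ("onclick" in el.attributes) return yes("onclick attribute");
  if (el.style.cursor === "pointer") return yes("cursor: pointer");
  return no;
}
```

Returning a reason alongside the boolean, as the real signature does, makes it easier to see why an element was highlighted when inspecting debug output.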
Prompt Construction (src/agent/messages/builder.ts)
Message Structure:
[
{ role: "system", content: SYSTEM_PROMPT },
{ role: "user", content: [
{ type: "text", text: "Task: click login button\n\nDOMState:\n[1]<button>..." },
{ type: "image", url: "data:image/png;base64,..." }
]},
{ role: "assistant", content: "..." }, // Previous step
{ role: "user", content: "..." }, // Previous action results
// ... more history ...
{ role: "user", content: [ // Current step
{ type: "text", text: "Current DOM:\n..." },
{ type: "image", url: "..." }
]}
]
Variable Management (src/agent/index.ts:174-202)
interface HyperVariable {
key: string;
value: string;
description?: string;
}
// API
agent.addVariable({ key: "email", value: "user@example.com" })
agent.getVariable("email")
agent.deleteVariable("email")
Usage in Actions:
// In inputText action:
text = text.replace(`<<${variable.key}>>`, variable.value);
// Agent can use: inputText(5, "<<email>>") → "user@example.com"
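The substitution step can be sketched as a standalone function. `substituteVariables` is a hypothetical helper, not the code in input-text.ts. It uses split/join so that every occurrence of a placeholder is replaced, since `String.prototype.replace` with a string pattern replaces only the first match.

```typescript
interface HyperVariable {
  key: string;
  value: string;
  description?: string;
}

// Replace every <<key>> placeholder with the matching variable's value.
function substituteVariables(text: string, variables: HyperVariable[]): string {
  let result = text;
  for (const variable of variables) {
    // split/join replaces all occurrences and needs no regex escaping of the key.
    result = result.split(`<<${variable.key}>>`).join(variable.value);
  }
  return result;
}

const vars: HyperVariable[] = [{ key: "email", value: "user@example.com" }];
console.log(substituteVariables("<<email>>", vars)); // user@example.com
```

Placeholders with no matching variable are left untouched, so a missing variable shows up literally in the typed text instead of silently becoming an empty string.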
- LocalBrowserProvider (default)
  - Uses patchright (Playwright fork with anti-detection)
  - Runs locally
- HyperbrowserProvider
  - Cloud-based browser service
  - Remote CDP connection
Selection: (index.ts:85-94)
new HyperAgent({
browserProvider: "Local" | "Hyperbrowser",
localConfig: { ... },
hyperbrowserConfig: { ... }
})
Model Context Protocol Support (src/agent/mcp/)
Purpose: Connect external tools as custom actions
await agent.initializeMCPClient({
servers: [{
id: "filesystem",
command: "npx",
args: ["-y", "@modelcontextprotocol/server-filesystem"]
}]
});
How it works:
1. MCP server exposes tools
2. Tools converted to AgentActionDefinition
3. Registered with agent
4. LLM can invoke MCP tools as actions
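The conversion and registration steps might look roughly like this. Everything here is an assumption made for illustration: `McpTool`, `CallTool`, `ActionSketch`, the `serverId_toolName` naming scheme, and `registerMcpTools` are hypothetical and do not reflect the actual types in src/agent/mcp/ or the MCP client library.

```typescript
// Hypothetical minimal shapes for an MCP tool and an agent action.
interface McpTool {
  name: string;
  description: string;
}
type CallTool = (name: string, args: Record<string, unknown>) => Promise<string>;
interface ActionResult {
  success: boolean;
  message: string;
}
interface ActionSketch {
  type: string;
  description: string;
  run: (params: Record<string, unknown>) => Promise<ActionResult>;
}

// Wrap one MCP tool so the agent can invoke it like any other action.
function toActionDefinition(serverId: string, tool: McpTool, callTool: CallTool): ActionSketch {
  return {
    type: `${serverId}_${tool.name}`, // naming scheme is an assumption
    description: tool.description,
    run: async (params) => {
      try {
        return { success: true, message: await callTool(tool.name, params) };
      } catch (error) {
        return { success: false, message: `MCP tool ${tool.name} failed: ${String(error)}` };
      }
    },
  };
}

// Register every tool from a server in the agent's action registry.
function registerMcpTools(
  registry: Map<string, ActionSketch>,
  serverId: string,
  tools: McpTool[],
  callTool: CallTool
): void {
  for (const tool of tools) {
    const action = toActionDefinition(serverId, tool, callTool);
    registry.set(action.type, action);
  }
}
```

Converting tool failures into a `success: false` result, rather than throwing, lets the LLM see the error on the next step and choose a different action.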
Debug Output (src/agent/tools/agent.ts:112-148)
When debug: true:
debug/
└── {taskId}/
├── step-0/
│ ├── elems.txt # DOM text representation
│ ├── screenshot.png # Composite screenshot
│ ├── msgs.json # LLM messages
│ └── stepOutput.json # Action results
├── step-1/
└── taskOutput.json # Final output
Strengths:
- ✅ Simple API (page.ai(), page.extract())
- ✅ Flexible action system
- ✅ Multi-step task execution
- ✅ MCP integration
- ✅ Variable substitution
Weaknesses:
- ❌ Screenshot required every step
- ❌ No DOM caching
- ❌ No action caching
- ❌ High token usage (8K-15K per step)
- ❌ Slow actions (1.5-3s each)
- ❌ Numbered overlay can occlude important UI
- ❌ Full DOM may miss semantic meaning
- ❌ No accessibility tree
- ❌ No self-healing on failure
- ❌ Single-attempt actions (no retry logic)
| Component | File Path | Key Lines |
|---|---|---|
| Main Agent Class | src/agent/index.ts | 37-606 |
| Task Execution Loop | src/agent/tools/agent.ts | 105-306 |
| DOM Extraction | src/context-providers/dom/index.ts | 5-18 |
| Build DOM View | src/context-providers/dom/build-dom-view.ts | 54-130 |
| Find Elements | src/context-providers/dom/find-interactive-elements.ts | 4-63 |
| Canvas Overlay | src/context-providers/dom/highlight.ts | 105-222 |
| Click Action | src/agent/actions/click-element.ts | 18-57 |
| Input Text Action | src/agent/actions/input-text.ts | 16-37 |
| Extract Action | src/agent/actions/extract.ts | 16-104 |
| System Prompt | src/agent/messages/system-prompt.ts | - |
| Message Builder | src/agent/messages/builder.ts | - |
Based on Stagehand and Skyvern analysis, key opportunities:
1. Adopt Accessibility Tree (Stagehand approach)
   - 3-4x token reduction
   - Better semantic understanding
   - No screenshot required for actions
2. Implement Caching (Stagehand approach)
   - Action cache (instruction+URL → selector)
   - LLM cache (prompt → response)
   - 20-30x speed improvement for cached actions
3. Hybrid Visual Approach (Skyvern approach)
   - DOM injection for element IDs (no overlay)
   - Bounding boxes only when needed
   - Keep visual feedback but reduce occlusion
4. Self-Healing (Stagehand approach)
   - Re-observe on failure
   - Multiple selector strategies
   - Retry logic with different approaches
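As an illustration of the action cache idea (instruction+URL → selector), here is a minimal in-memory sketch. It is not Stagehand's implementation and not part of HyperAgent: `ActionCache`, `resolveSelector`, and the key normalization are hypothetical choices made for this example.

```typescript
// Hypothetical in-memory cache mapping (instruction, URL) to a previously resolved selector.
class ActionCache {
  private entries = new Map<string, string>();

  private key(instruction: string, url: string): string {
    return `${url}::${instruction.trim().toLowerCase()}`;
  }

  get(instruction: string, url: string): string | undefined {
    return this.entries.get(this.key(instruction, url));
  }

  set(instruction: string, url: string, selector: string): void {
    this.entries.set(this.key(instruction, url), selector);
  }

  // Drop a stale entry, for example after the cached selector fails to match.
  invalidate(instruction: string, url: string): void {
    this.entries.delete(this.key(instruction, url));
  }
}

// Return the cached selector if present; otherwise ask the LLM once and remember the answer.
async function resolveSelector(
  cache: ActionCache,
  instruction: string,
  url: string,
  askLLM: (instruction: string) => Promise<string>
): Promise<string> {
  const cached = cache.get(instruction, url);
  if (cached !== undefined) return cached;
  const selector = await askLLM(instruction);
  cache.set(instruction, url, selector);
  return selector;
}
```

A cache hit skips the DOM capture and LLM call entirely, which is where the speedup for repeated actions would come from. Calling invalidate() after a failed action and resolving again gives a simple form of the re-observe step listed under self-healing.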
See improvement-plan.md for detailed implementation strategy.