HyperAgent Current State Analysis

Overview

HyperAgent is a browser automation SDK that uses LLM-powered agents to execute tasks on web pages. It provides both imperative page methods (page.ai(), page.extract()) and a programmatic task execution API.

Core Architecture

1. Entry Points & Public API

HyperAgent Class (src/agent/index.ts)

The main class. It owns the browser lifecycle, manages pages, and runs tasks:

class HyperAgent<T extends BrowserProviders = "Local"> {
  // Core methods
  async executeTask(task: string, params?: TaskParams, initPage?: Page): Promise<TaskOutput>
  async executeTaskAsync(task: string, params?: TaskParams, initPage?: Page): Promise<Task>

  // Page management
  async getCurrentPage(): Promise<Page>
  async newPage(): Promise<HyperPage>
  async getPages(): Promise<HyperPage[]>

  // Browser lifecycle
  async initBrowser(): Promise<Browser>
  async closeAgent(): Promise<void>
}

HyperPage Interface (src/agent/index.ts:567-605)

Enhanced Playwright Page with AI methods:

interface HyperPage extends Page {
  // Execute a task on this page
  ai(task: string, params?: TaskParams): Promise<TaskOutput>

  // Execute task asynchronously (non-blocking)
  aiAsync(task: string, params?: TaskParams): Promise<Task>

  // Extract structured data
  extract<T>(
    task?: string,
    outputSchema?: z.AnyZodObject,
    params?: TaskParams
  ): Promise<T | string>
}

Key Implementation Details:

page.ai() → calls agent.executeTask(task, params, page) (lines 569-570)
page.extract() → wraps executeTask() with extraction-specific prompts (lines 573-603)
- Sets maxSteps: 2 by default for extractions
- Prepends extraction instructions to the task
- Parses the JSON output if an outputSchema is provided
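
The wrapping that page.extract() performs can be sketched roughly as follows. This is a minimal sketch, not HyperAgent's actual code: `buildExtractRequest`, `parseExtractOutput`, and the `Validator` type are hypothetical names, the prompt text is abbreviated, and a plain validator function stands in for the Zod schema so the example has no dependencies.

```typescript
// Hypothetical stand-ins for illustration; not HyperAgent's real types.
interface ExtractParams {
  maxSteps?: number;
}
type Validator<T> = (value: unknown) => T;

// Build the task string and params that would be handed to executeTask().
function buildExtractRequest(task: string | undefined, params: ExtractParams = {}) {
  const instructions = "You have to perform an extraction on the current page.";
  return {
    // Extraction instructions are prepended to the user's task.
    task: task ? `${instructions} Extract: ${task}` : instructions,
    // maxSteps defaults to 2 unless the caller overrides it.
    params: { ...params, maxSteps: params.maxSteps ?? 2 },
  };
}

// Parse the agent's string output as JSON only when a validator (schema) is given.
function parseExtractOutput<T>(output: string, validate?: Validator<T>): T | string {
  if (!validate) return output;
  return validate(JSON.parse(output));
}
```

Without a validator the raw string is returned, which mirrors the `Promise<T | string>` return type of extract().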

2. Task Execution Flow

Main Task Loop (src/agent/tools/agent.ts:105-306)

runAgentTask()
  ├── 1. Get DOM State (getDom)
  │   ├── Inject JavaScript into page
  │   ├── Find interactive elements
  │   ├── Draw numbered overlay (canvas)
  │   └── Capture screenshot with overlay
  │
  ├── 2. Build Agent Messages (buildAgentStepMessages)
  │   ├── System prompt
  │   ├── Task description
  │   ├── Previous steps context
  │   ├── DOM representation (text)
  │   └── Screenshot (base64 image)
  │
  ├── 3. Invoke LLM (llm.invokeStructured)
  │   ├── Request structured output (Zod schema)
  │   └── Get list of actions to execute
  │
  ├── 4. Execute Actions (runAction)
  │   ├── For each action in list
  │   ├── Run action handler
  │   └── Wait 2 seconds between actions
  │
  └── 5. Repeat until complete/cancelled/maxSteps

Location: src/agent/tools/agent.ts:132-291
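
The loop above can be sketched as a small function with its collaborators injected. This is an illustrative sketch under stated assumptions, not the real runAgentTask(): `LoopDeps`, `LoopResult`, and `runTaskLoop` are hypothetical names, message building is folded into `invokeLlm`, and cancellation is omitted.

```typescript
// Hypothetical shapes for illustration; not HyperAgent's real types.
type AgentAction = { type: string; params?: Record<string, unknown> };

interface LoopDeps {
  getDom(): Promise<string | null>; // step 1
  invokeLlm(task: string, dom: string, previous: string[]): Promise<AgentAction[]>; // steps 2-3
  runAction(action: AgentAction): Promise<string>; // step 4
  sleep(ms: number): Promise<void>;
}

interface LoopResult {
  status: "completed" | "failed" | "maxStepsReached";
  steps: number;
  previous: string[];
}

async function runTaskLoop(task: string, deps: LoopDeps, maxSteps = 10): Promise<LoopResult> {
  const previous: string[] = [];
  for (let step = 0; step < maxSteps; step++) {
    const dom = await deps.getDom();
    if (dom === null) return { status: "failed", steps: step, previous };

    const actions = await deps.invokeLlm(task, dom, previous);
    for (const action of actions) {
      // The "complete" action ends the task.
      if (action.type === "complete") {
        return { status: "completed", steps: step + 1, previous };
      }
      previous.push(await deps.runAction(action));
      await deps.sleep(2000); // fixed pause between actions
    }
  }
  return { status: "maxStepsReached", steps: maxSteps, previous };
}
```

Injecting `sleep` keeps the fixed 2-second pause visible while letting tests skip it.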

3. DOM State Extraction

Current Implementation: Visual DOM with Canvas Overlay

Entry Point: `getDom(page)` (src/context-providers/dom/index.ts:5-18)

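export const getDom = async (page: Page): Promise<DOMState | null> => {
  // Run the injected DOM-view script inside the page
  const result = (await page.evaluate(buildDomViewJs)) as DOMStateRaw;
  // Simplified: the raw result is converted into a DOMState with these fields
  //   elements:   Map<number, InteractiveElement>
  //   domState:   string (text representation)
  //   screenshot: string (base64 PNG with overlays)
  return toDomState(result); // hypothetical helper standing in for the conversion
};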

Build DOM View (src/context-providers/dom/build-dom-view.ts:54-130)

Process:

Find Interactive Elements (find-interactive-elements.ts:4-63)
- Traverse entire DOM including Shadow DOM and iframes
- Check each element with isInteractiveElem(element)
- Returns InteractiveElement[] with metadata
Render Highlights Offscreen (highlight.ts:105-222)
- Create OffscreenCanvas with viewport dimensions
- Draw colored rectangles around each interactive element
- Draw numbered labels (1, 2, 3...) on each element
- Return ImageBitmap

Composite Screenshot (agent.ts:33-42)

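const compositeScreenshot = async (page: Page, overlay: string) => {
  const screenshot = await page.screenshot({ type: "png" });
  // Assumes `overlay` is a base64-encoded PNG and the Jimp 1.x promise API
  const baseImage = await Jimp.read(screenshot);
  const overlayImage = await Jimp.read(Buffer.from(overlay, "base64"));
  // Overlay numbered boxes onto base screenshot using Jimp
  baseImage.composite(overlayImage, 0, 0);
  const buffer = await baseImage.getBuffer("image/png");
  return buffer.toString("base64");
};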

Build Text Representation (build-dom-view.ts:78-123)

[1]<button id="submit" class="btn-primary">Submit Form</button>
[2]<input type="text" name="email" placeholder="Enter email">
Some text between elements
[3]<a href="/pricing">View Pricing</a>

Output Structure:

interface DOMState {
  elements: Map<number, InteractiveElement>  // index → element mapping
  domState: string                          // [idx]<tag>text</tag> format
  screenshot: string                        // base64 PNG with overlays
}
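
To make the output structure concrete, here is a minimal sketch of how an index map and the `[idx]<tag>text</tag>` text view could be built from a list of found elements. `ElementInfo` and `buildDomState` are hypothetical, simplified names; the sketch omits the screenshot, always emits a closing tag, and does not interleave plain text between elements.

```typescript
// Hypothetical, simplified element shape; the real InteractiveElement has more fields.
interface ElementInfo {
  tag: string;
  attributes: Record<string, string>;
  text: string;
}

// Build the index → element map and the "[idx]<tag ...>text</tag>" text view.
function buildDomState(found: ElementInfo[]) {
  const elements = new Map<number, ElementInfo>();
  const lines: string[] = [];
  found.forEach((el, i) => {
    const index = i + 1; // labels start at 1, matching the overlay numbers
    elements.set(index, el);
    const attrs = Object.keys(el.attributes)
      .map((key) => ` ${key}="${el.attributes[key]}"`)
      .join("");
    lines.push(`[${index}]<${el.tag}${attrs}>${el.text}</${el.tag}>`);
  });
  return { elements, domState: lines.join("\n") };
}
```

Because the same index appears in the text view and in the map, an action that names index 2 can be resolved back to its element with `elements.get(2)`.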

4. Action System

Available Actions (src/agent/actions/)

| Action | Purpose | Key Parameters | Location |
| --- | --- | --- | --- |
| `clickElement` | Click an element | `index: number` | click-element.ts |
| `inputText` | Fill input field | `index: number, text: string` | input-text.ts |
| `extract` | Extract data | `objective: string` | extract.ts |
| `goToUrl` | Navigate to URL | `url: string` | go-to-url.ts |
| `selectOption` | Select dropdown | `index: number, option: string` | select-option.ts |
| `scroll` | Scroll page | `direction: "up"\|"down"` | scroll.ts |
| `keyPress` | Press keyboard key | `key: string` | key-press.ts |
| `complete` | End task | `output?: string` | complete.ts |
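
One way to picture how an action chosen by the LLM reaches its handler is a registry keyed by action type. The sketch below is an assumption about the general pattern, not the code in src/agent/actions/: `ActionDefinition`, `registerAction`, and `runRegisteredAction` are hypothetical names, and the handlers are synchronous stubs.

```typescript
// Hypothetical, simplified shapes for illustration.
interface ActionResult {
  success: boolean;
  message: string;
}
interface ActionDefinition {
  type: string;
  run(params: Record<string, unknown>): ActionResult;
}

const registry = new Map<string, ActionDefinition>();

function registerAction(def: ActionDefinition): void {
  registry.set(def.type, def);
}

registerAction({
  type: "clickElement",
  run: (params) => ({ success: true, message: `Clicked element with index ${params["index"]}` }),
});
registerAction({
  type: "complete",
  run: (params) => ({ success: true, message: String(params["output"] ?? "Task complete") }),
});

// Look up the handler by type; unknown types fail instead of throwing.
function runRegisteredAction(action: { type: string; params?: Record<string, unknown> }): ActionResult {
  const def = registry.get(action.type);
  if (!def) return { success: false, message: `Unknown action: ${action.type}` };
  return def.run(action.params ?? {});
}
```

Returning a failed result for an unknown type lets the loop report the problem back to the LLM on the next step rather than aborting the task.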

Action Execution (src/agent/tools/agent.ts:71-103)

Click Element Example (click-element.ts:18-57)

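run: async function (ctx: ActionContext, action: ClickElementActionType) {
  const { index } = action;
  const locator = getLocator(ctx, index);  // Get element by index
  // getLocator returns null when the index is not in the current DOM state
  if (!locator) {
    return { success: false, message: `Element with index ${index} not found` };
  }

  // Wait (up to 2.5 s each) until the element is in view, visible, enabled, and stable
  await locator.scrollIntoViewIfNeeded({ timeout: 2500 });
  await locator.waitFor({ state: "visible", timeout: 2500 });
  await waitForElementToBeEnabled(locator, 2500);
  await waitForElementToBeStable(locator, 2500);

  // force: true skips Playwright's own actionability checks
  await locator.click({ force: true });
  return { success: true, message: `Clicked element with index ${index}` };
}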

Element Selection: (actions/utils.ts)

export const getLocator = (ctx: ActionContext, index: number): Locator | null => {
  const element = ctx.domState.elements.get(index);
  if (!element) return null;
  return ctx.page.locator(element.cssPath);  // Use CSS path selector
};

5. Key Workflows

Workflow 1: `page.ai("click the login button")`

User calls page.ai("click the login button")
→ agent.executeTask(task, params, page) (index.ts:569)
→ runAgentTask() starts task loop (agent.ts:105)
→ getDom(page) extracts DOM + screenshot (agent.ts:155)
- Injects JS to find interactive elements
- Draws numbered overlays
- Composites screenshot
→ buildAgentStepMessages() creates LLM prompt (agent.ts:201)
→ llm.invokeStructured() gets action plan (agent.ts:220)
→ Execute actions (agent.ts:253-275)
- LLM returns: { type: "clickElement", params: { index: 5 } }
- runAction() calls ClickElementActionDefinition.run()
- Gets locator for element 5
- Clicks element via Playwright
→ Repeat loop or mark complete

Workflow 2: `page.extract("product prices", schema)`

User calls page.extract("product prices", PriceSchema)
→ Wraps task: "You have to perform an extraction on the current page..." (index.ts:586-590)
→ Sets maxSteps: 2 (extractions are quick) (index.ts:581)
→ Adds outputSchema to actions (index.ts:584)
→ executeTask() runs normal agent loop
→ LLM returns structured output matching schema
→ Parse JSON and return typed result (index.ts:592)

Workflow 3: Extract Action (Internal)

The extract action is different from page.extract():

Location: src/agent/actions/extract.ts

run: async (ctx: ActionContext, action: ExtractActionType) => {
  const { objective } = action;

  // Get page HTML and convert it to markdown
  const content = await ctx.page.content();
  const markdown = await parseMarkdown(content);

  // Take screenshot via CDP (base64 data in screenshot.data)
  const cdpSession = await ctx.page.context().newCDPSession(ctx.page);
  const screenshot = await cdpSession.send("Page.captureScreenshot");

  // Call LLM with markdown + screenshot
  const response = await ctx.llm.invoke([{
    role: "user",
    content: [
      { type: "text", text: `Extract: "${objective}"\n\n${markdown}` },
      { type: "image", url: `data:image/png;base64,${screenshot.data}` }
    ]
  }]);

  // Report the LLM's answer, not the raw page HTML held in `content`
  // (assumes the response exposes its text as `response.content`)
  return { success: true, message: `Extracted: ${response.content}` };
}

This is an action the agent can choose during task execution, not the page-level method.

6. DOM State Representation

Current Approach: Visual DOM + Numbered Overlay

Strengths:

✅ Simple index-based selection (LLM just says "5")
✅ Visual feedback in screenshots
✅ Works well with vision models

Weaknesses:

❌ Screenshot required every step (slow)
❌ Every screenshot is sent to the LLM, which is costly in tokens
❌ Numbered overlay can occlude important UI
❌ Full DOM traversal every step (no caching)
❌ Large token counts (screenshot + DOM text)

Performance:

~8,000-15,000 tokens per step
~1,500-3,000ms per action
No caching mechanism

7. Element Discovery

Interactive Element Detection (src/context-providers/dom/elem-interactive.ts)

Current Rules:

isInteractiveElem(element: HTMLElement): { isInteractive: boolean, reason?: string }

Checks (in order):

1. Native interactive tags: button, a[href], input, select, textarea
2. ARIA roles: button, link, tab, checkbox, menuitem
3. Event listeners: data-has-interactive-listener="true" (injected)
4. Contenteditable elements
5. Elements with onclick attribute
6. Cursor style: cursor: pointer
7. Custom detection for common patterns

Ignored Elements:

Hidden elements (display: none, visibility: hidden)
Zero-dimension elements
Disabled elements
Script and style tags
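
The checks and the ignore list can be sketched against a plain-object description of an element, so the example runs without a browser. This is an assumption-laden sketch, not the contents of elem-interactive.ts: `ElementInfo` and its fields are hypothetical, the function is named `isInteractive` to keep it distinct from the real `isInteractiveElem`, and the custom pattern detection is left out.

```typescript
// Hypothetical plain-object description of an element.
interface ElementInfo {
  tag: string;
  attributes: Record<string, string>;
  cursor?: string;
  hidden?: boolean;
  width: number;
  height: number;
}

const NATIVE_TAGS = ["button", "input", "select", "textarea"];
const INTERACTIVE_ROLES = ["button", "link", "tab", "checkbox", "menuitem"];

function isInteractive(el: ElementInfo): { isInteractive: boolean; reason?: string } {
  // Ignored elements are rejected before any positive check.
  if (el.hidden || el.width === 0 || el.height === 0 || "disabled" in el.attributes) {
    return { isInteractive: false };
  }
  if (el.tag === "script" || el.tag === "style") return { isInteractive: false };

  // Positive checks, in the order listed above.
  if (NATIVE_TAGS.indexOf(el.tag) !== -1 || (el.tag === "a" && "href" in el.attributes)) {
    return { isInteractive: true, reason: "native tag" };
  }
  const role = el.attributes["role"];
  if (role !== undefined && INTERACTIVE_ROLES.indexOf(role) !== -1) {
    return { isInteractive: true, reason: "aria role" };
  }
  if (el.attributes["data-has-interactive-listener"] === "true") {
    return { isInteractive: true, reason: "event listener" };
  }
  if (el.attributes["contenteditable"] === "true") {
    return { isInteractive: true, reason: "contenteditable" };
  }
  if ("onclick" in el.attributes) return { isInteractive: true, reason: "onclick" };
  if (el.cursor === "pointer") return { isInteractive: true, reason: "cursor pointer" };
  return { isInteractive: false };
}
```

Returning the first matching reason keeps the result easy to log when debugging why an element was or was not numbered.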

8. Message Building

Prompt Construction (src/agent/messages/builder.ts)

Message Structure:

[
  { role: "system", content: SYSTEM_PROMPT },
  { role: "user", content: [
    { type: "text", text: "Task: click login button\n\nDOMState:\n[1]<button>..." },
    { type: "image", url: "data:image/png;base64,..." }
  ]},
  { role: "assistant", content: "..." },  // Previous step
  { role: "user", content: "..." },       // Previous action results
  // ... more history ...
  { role: "user", content: [             // Current step
    { type: "text", text: "Current DOM:\n..." },
    { type: "image", url: "..." }
  ]}
]
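
A simplified sketch of assembling such a message list follows. It is not the code in builder.ts: `buildStepMessages`, `AgentMessage`, and `PreviousStep` are hypothetical names, and the first user message carries only the task text rather than the first step's DOM state and screenshot.

```typescript
// Hypothetical message shapes modelled on the structure shown above.
type ContentPart = { type: "text"; text: string } | { type: "image"; url: string };
interface AgentMessage {
  role: "system" | "user" | "assistant";
  content: string | ContentPart[];
}
interface PreviousStep {
  assistantOutput: string;
  actionResults: string;
}

function buildStepMessages(
  systemPrompt: string,
  task: string,
  previousSteps: PreviousStep[],
  domState: string,
  screenshotBase64: string
): AgentMessage[] {
  const messages: AgentMessage[] = [
    { role: "system", content: systemPrompt },
    { role: "user", content: `Task: ${task}` },
  ];
  // Each earlier step contributes the model's output and the action results.
  for (const step of previousSteps) {
    messages.push({ role: "assistant", content: step.assistantOutput });
    messages.push({ role: "user", content: step.actionResults });
  }
  // The current step always ends the list: DOM text plus screenshot.
  messages.push({
    role: "user",
    content: [
      { type: "text", text: `Current DOM:\n${domState}` },
      { type: "image", url: `data:image/png;base64,${screenshotBase64}` },
    ],
  });
  return messages;
}
```

The list grows by two messages per completed step, which is one reason token usage climbs over a long task.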

9. Variable System

Variable Management (src/agent/index.ts:174-202)

interface HyperVariable {
  key: string;
  value: string;
  description?: string;
}

// API
agent.addVariable({ key: "email", value: "user@example.com" })
agent.getVariable("email")
agent.deleteVariable("email")

Usage in Actions:

// In inputText action:
text = text.replace(`<<${variable.key}>>`, variable.value);
// Agent can use: inputText(5, "<<email>>") → "user@example.com"
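
Note that `String.prototype.replace` with a string pattern replaces only the first occurrence. A minimal sketch of substituting every placeholder for every variable is shown below; `substituteVariables` is a hypothetical helper, and whether the real action loops over variables this way is an assumption.

```typescript
interface HyperVariable {
  key: string;
  value: string;
  description?: string;
}

// Replace every <<key>> placeholder with the matching variable's value.
function substituteVariables(text: string, variables: HyperVariable[]): string {
  let result = text;
  for (const variable of variables) {
    // split/join replaces all occurrences of the placeholder.
    result = result.split(`<<${variable.key}>>`).join(variable.value);
  }
  return result;
}
```

Placeholders with no matching variable are left untouched, which makes a missing variable easy to spot in the typed text.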

10. Browser Provider Architecture

Supported Providers:

LocalBrowserProvider (default)
- Uses patchright (Playwright fork with anti-detection)
- Runs locally
HyperbrowserProvider
- Cloud-based browser service
- Remote CDP connection

Selection: (index.ts:85-94)

new HyperAgent({
  browserProvider: "Local", // or "Hyperbrowser"
  localConfig: { ... },
  hyperbrowserConfig: { ... }
})

11. MCP Integration

Model Context Protocol Support (src/agent/mcp/)

Purpose: Connect external tools as custom actions

await agent.initializeMCPClient({
  servers: [{
    id: "filesystem",
    command: "npx",
    args: ["-y", "@modelcontextprotocol/server-filesystem"]
  }]
});

How it works:

1. MCP server exposes tools
2. Tools are converted to AgentActionDefinition
3. The resulting actions are registered with the agent
4. LLM can invoke MCP tools as actions
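
The conversion step can be pictured with the sketch below. All shapes here are hypothetical and simplified: `McpTool` is not the MCP SDK's type, `ActionDefinition` is not HyperAgent's AgentActionDefinition, the tool call is synchronous, and the `serverId_toolName` naming scheme is an assumption made to avoid name collisions.

```typescript
// Hypothetical, simplified shapes for illustration only.
interface McpTool {
  name: string;
  description: string;
  call(args: Record<string, unknown>): string;
}
interface ActionDefinition {
  type: string;
  description: string;
  run(params: Record<string, unknown>): { success: boolean; message: string };
}

// Wrap one MCP tool so the agent can treat it like any other action.
function toActionDefinition(serverId: string, tool: McpTool): ActionDefinition {
  return {
    type: `${serverId}_${tool.name}`, // assumed naming scheme
    description: tool.description,
    run: (params) => {
      try {
        return { success: true, message: tool.call(params) };
      } catch (err) {
        // A failing tool becomes a failed action result instead of an exception.
        return { success: false, message: `Tool failed: ${String(err)}` };
      }
    },
  };
}
```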

12. Debug Mode

Debug Output (src/agent/tools/agent.ts:112-148)

When debug: true:

debug/
  └── {taskId}/
      ├── step-0/
      │   ├── elems.txt              # DOM text representation
      │   ├── screenshot.png         # Composite screenshot
      │   ├── msgs.json              # LLM messages
      │   └── stepOutput.json        # Action results
      ├── step-1/
      └── taskOutput.json            # Final output
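
For scripts that post-process debug output, the layout above maps to paths like these. `debugPaths` is a hypothetical helper; only the directory and file names come from the layout shown, and the `debug/` root is assumed to be relative to the working directory.

```typescript
// Hypothetical helper that mirrors the debug directory layout.
function debugPaths(taskId: string, step: number) {
  const taskDir = `debug/${taskId}`;
  const stepDir = `${taskDir}/step-${step}`;
  return {
    elems: `${stepDir}/elems.txt`, // DOM text representation
    screenshot: `${stepDir}/screenshot.png`, // composite screenshot
    msgs: `${stepDir}/msgs.json`, // LLM messages
    stepOutput: `${stepDir}/stepOutput.json`, // action results
    taskOutput: `${taskDir}/taskOutput.json`, // final output
  };
}
```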

13. Summary: Current vs Desired State

What Works Well:

✅ Simple API (page.ai(), page.extract())
✅ Flexible action system
✅ Multi-step task execution
✅ MCP integration
✅ Variable substitution

Performance Bottlenecks:

❌ Screenshot required every step
❌ No DOM caching
❌ No action caching
❌ High token usage (8K-15K per step)
❌ Slow actions (1.5-3s each)

Accuracy Issues:

❌ Numbered overlay can occlude important UI
❌ Raw DOM text may miss semantic meaning
❌ No accessibility tree
❌ No self-healing on failure
❌ Single-attempt actions (no retry logic)

14. File Reference Map

| Component | File Path | Key Lines |
| --- | --- | --- |
| Main Agent Class | `src/agent/index.ts` | 37-606 |
| Task Execution Loop | `src/agent/tools/agent.ts` | 105-306 |
| DOM Extraction | `src/context-providers/dom/index.ts` | 5-18 |
| Build DOM View | `src/context-providers/dom/build-dom-view.ts` | 54-130 |
| Find Elements | `src/context-providers/dom/find-interactive-elements.ts` | 4-63 |
| Canvas Overlay | `src/context-providers/dom/highlight.ts` | 105-222 |
| Click Action | `src/agent/actions/click-element.ts` | 18-57 |
| Input Text Action | `src/agent/actions/input-text.ts` | 16-37 |
| Extract Action | `src/agent/actions/extract.ts` | 16-104 |
| System Prompt | `src/agent/messages/system-prompt.ts` | - |
| Message Builder | `src/agent/messages/builder.ts` | - |

Next Steps: Performance & Accuracy Improvements

Based on an analysis of Stagehand and Skyvern, the key opportunities are:

1. Adopt Accessibility Tree (Stagehand approach)
   - 3-4x token reduction
   - Better semantic understanding
   - No screenshot required for actions
2. Implement Caching (Stagehand approach)
   - Action cache (instruction+URL → selector)
   - LLM cache (prompt → response)
   - 20-30x speed improvement for cached actions
3. Hybrid Visual Approach (Skyvern approach)
   - DOM injection for element IDs (no overlay)
   - Bounding boxes only when needed
   - Keep visual feedback but reduce occlusion
4. Self-Healing (Stagehand approach)
   - Re-observe on failure
   - Multiple selector strategies
   - Retry logic with different approaches
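
As an illustration of the action cache idea (instruction+URL → selector), here is a minimal in-memory sketch. It does not reproduce Stagehand's implementation: `ActionCache` and `resolveSelector` are hypothetical names, the key normalization is an assumption, and `askLlm` stands in for a full observe-and-plan step.

```typescript
// Hypothetical in-memory cache keyed by instruction + URL.
class ActionCache {
  private entries = new Map<string, string>();

  private key(instruction: string, url: string): string {
    // Assumed normalization: trim and lowercase the instruction.
    return JSON.stringify([instruction.trim().toLowerCase(), url]);
  }

  get(instruction: string, url: string): string | undefined {
    return this.entries.get(this.key(instruction, url));
  }

  set(instruction: string, url: string, selector: string): void {
    this.entries.set(this.key(instruction, url), selector);
  }

  // Drop an entry whose selector no longer works, so the next call re-observes.
  invalidate(instruction: string, url: string): void {
    this.entries.delete(this.key(instruction, url));
  }
}

// Use the cached selector when present; otherwise ask the LLM and remember the answer.
function resolveSelector(
  cache: ActionCache,
  instruction: string,
  url: string,
  askLlm: () => string
): { selector: string; cached: boolean } {
  const hit = cache.get(instruction, url);
  if (hit !== undefined) return { selector: hit, cached: true };
  const selector = askLlm();
  cache.set(instruction, url, selector);
  return { selector, cached: false };
}
```

Calling `invalidate` when a cached selector fails ties the cache to the self-healing item: the next attempt falls back to a fresh observation instead of repeating a stale selector.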

See improvement-plan.md for detailed implementation strategy.
