Apex2 Terminal Bench Agent - Rebuild

A reconstruction of the Apex2 Terminal Bench Agent, a multi-phase intelligence system for autonomous task execution in the terminal.

Overview

This project reimplements the Apex2 architecture, which reached #1 on Stanford's Terminal Bench Leaderboard with a 64.50% success rate. The agent works in phases: it predicts what the task requires, gathers information from several sources in parallel, synthesizes a strategy, and then executes with recovery and validation.

Architecture

Core Innovation: Multi-Phase Intelligence System

Prediction Phase
├── Task categorization
├── Key file identification  
└── Multimodal requirement assessment
    ↓
Parallel Intelligence Gathering
├── Web search (3 rounds max with Google AI Overview)
├── Deep strategy generation 
├── Heuristic environment observation
│   ├── Installed packages
│   ├── Folder structure
│   ├── Running processes
│   ├── System state
│   └── Key file contents
└── Exploration agent (explore unknowns from strategy)
    ↓
Strategy Synthesis (Combines all intelligence)
    ↓
Optimized Context Generation
    ↓
Main Execution with Recovery
    ↓
Validation
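
The gathering and synthesis phases in the diagram above can be sketched roughly as follows. This is a minimal illustration, not the code in agent.py: the four gatherer functions, `gather_intelligence`, and `synthesize` are hypothetical names, and the gatherers return placeholder strings instead of calling a search API, an LLM, or the shell. For simplicity all four sources run independently here, whereas the diagram has the exploration agent working from the generated strategy.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


# Hypothetical stand-ins for the four intelligence sources.
def web_search(task: str) -> str:
    return f"search notes for: {task}"


def generate_strategy(task: str) -> str:
    return f"strategy for: {task}"


def observe_environment(task: str) -> str:
    return "packages, folders, processes, system state"


def explore_unknowns(task: str) -> str:
    return f"exploration findings for: {task}"


GATHERERS: Dict[str, Callable[[str], str]] = {
    "web_search": web_search,
    "strategy": generate_strategy,
    "environment": observe_environment,
    "exploration": explore_unknowns,
}


def gather_intelligence(task: str) -> Dict[str, str]:
    """Run every gatherer concurrently and collect the results by name."""
    with ThreadPoolExecutor(max_workers=len(GATHERERS)) as pool:
        futures = {name: pool.submit(fn, task) for name, fn in GATHERERS.items()}
        return {name: future.result() for name, future in futures.items()}


def synthesize(task: str, intel: Dict[str, str]) -> str:
    """Merge all sources into one context string for the execution phase."""
    sections = [f"## {name}\n{text}" for name, text in sorted(intel.items())]
    return f"Task: {task}\n\n" + "\n\n".join(sections)


if __name__ == "__main__":
    intel = gather_intelligence("Create a Python script that analyzes CSV data")
    print(synthesize("Create a Python script that analyzes CSV data", intel))
```

Threads are used because the real gatherers would mostly wait on network or subprocess I/O; a slow source then delays the others only until the slowest one finishes.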

Key Features

  1. Predictive Intelligence System

    • Task categorization (ML, Security, Web Dev, etc.)
    • Risk assessment
    • Key file identification
    • Multimodal requirement prediction
  2. Advanced Web Search Pipeline

    • Multi-round search with low-frequency terms
    • Google AI Overview extraction
    • Platform bias (GitHub/StackOverflow)
    • Deep link exploration
  3. Deep Strategy Generation

    • Knowledge extraction from LLM
    • Multiple approach alternatives
    • Risk assessment
    • Common failure modes and remediation
  4. Heuristic Environment Observation

    • Package manager checks
    • Folder structure analysis
    • Running process detection
    • System state monitoring
  5. Exploration Agent

    • Tests critical unknowns
    • Docker-based safe execution
    • Validates assumptions
  6. Strategy Synthesis

    • Combines all intelligence sources
    • Creates optimized execution plan
    • Provides category-specific guidance
  7. Robust Execution Engine

    • Heredoc optimization
    • Session recovery
    • Long-running task management
    • Automatic error recovery
  8. Validation System

    • Prevents premature completion
    • Checks for execution errors
    • Validates expected outputs
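
As an illustration of the validation checks listed in item 8, the sketch below flags a run that executed nothing, a command that exited with an error, or an expected output file that is missing. The `CommandResult` and `ValidationReport` types and the `validate` function are assumptions made for this example; only the `issues` attribute mirrors the `result['validation'].issues` field used in the usage section, and the real validator.py may work differently.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import List


@dataclass
class CommandResult:
    """Hypothetical record of one executed command."""
    command: str
    exit_code: int


@dataclass
class ValidationReport:
    issues: List[str] = field(default_factory=list)

    @property
    def passed(self) -> bool:
        return not self.issues


def validate(results: List[CommandResult], expected_outputs: List[str]) -> ValidationReport:
    """Collect reasons why the task should not yet be declared complete."""
    report = ValidationReport()
    # Guard against premature completion: nothing was actually run.
    if not results:
        report.issues.append("no commands were executed")
    # Any non-zero exit code counts as an execution error.
    for result in results:
        if result.exit_code != 0:
            report.issues.append(
                f"command failed ({result.exit_code}): {result.command}"
            )
    # Every expected output path must exist on disk.
    for path in expected_outputs:
        if not Path(path).exists():
            report.issues.append(f"expected output missing: {path}")
    return report
```

An agent loop could keep executing (or trigger recovery) while `report.passed` is false, and only then mark the task as done.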

Installation

# Clone the repository (substitute the repository URL)
git clone <repository-url> Apex2Rebuild
cd Apex2Rebuild

# Install dependencies
pip install -r requirements.txt

# Set up environment variables
cp .env.example .env
# Edit .env with your API keys

Configuration

Edit the .env file:

ANTHROPIC_API_KEY=your_anthropic_key_here
SERPAPI_API_KEY=your_serpapi_key_here  # Optional, for web search
MODEL_NAME=claude-sonnet-4.5
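
To show how these three variables relate, here is a small standard-library sketch that parses .env-style text and treats the SerpAPI key as optional. `parse_env` and `load_settings` are hypothetical helpers written for this example; the project itself may load its configuration differently (for instance through a dotenv package or config/config.yaml).

```python
from typing import Dict, Mapping, Optional


def parse_env(text: str) -> Dict[str, str]:
    """Parse KEY=value lines, ignoring blank lines and '#' comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.split(" #", 1)[0].strip()  # drop a trailing inline comment
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        values[key.strip()] = value.strip()
    return values


def load_settings(env: Mapping[str, str]) -> Dict[str, object]:
    """Build agent settings; only the Anthropic key is mandatory."""
    api_key = env.get("ANTHROPIC_API_KEY")
    if not api_key:
        raise ValueError("ANTHROPIC_API_KEY is required")
    serp_key: Optional[str] = env.get("SERPAPI_API_KEY") or None
    return {
        "anthropic_api_key": api_key,
        "serpapi_key": serp_key,
        "model": env.get("MODEL_NAME", "claude-sonnet-4.5"),
        # Without a SerpAPI key, web search has to be skipped.
        "web_search_available": serp_key is not None,
    }
```

In practice `load_settings(os.environ)` would be the more common call once the variables are exported; `parse_env` is only needed when reading the file directly.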

Usage

Basic Usage

from apex2 import Apex2Agent

# Initialize agent
agent = Apex2Agent()

# Execute task
result = agent.execute_task(
    task="Create a Python script that analyzes CSV data",
    enable_web_search=True,
    enable_exploration=True
)

print(f"Success: {result['success']}")

Quick Execution (Simple Tasks)

# For simple tasks, use quick execution
success = agent.quick_execute("Create a hello world Python script")

Advanced Usage

from apex2 import Apex2Agent

agent = Apex2Agent(
    anthropic_api_key="your_key",
    serpapi_key="your_serp_key",
    model="claude-sonnet-4.5"
)

result = agent.execute_task(
    task="Train a simple neural network on MNIST",
    enable_web_search=True,
    enable_exploration=True,
    max_episodes=20
)

# Access detailed results
print(f"Category: {result['prediction'].category}")
print(f"Risk Level: {result['prediction'].risk_level}")
print(f"Executed Commands: {len(result['execution_results'])}")
print(f"Validation Issues: {result['validation'].issues}")

Project Structure

Apex2Rebuild/
├── src/apex2/
│   ├── agent.py                    # Main orchestrator
│   ├── prediction/
│   │   └── predictor.py            # Task prediction
│   ├── intelligence/
│   │   ├── web_search.py           # Web search pipeline
│   │   ├── strategy.py             # Strategy generation
│   │   ├── environment.py          # Environment observation
│   │   └── exploration.py          # Exploration agent
│   ├── synthesis/
│   │   └── synthesizer.py          # Strategy synthesis
│   └── execution/
│       ├── executor.py             # Command execution
│       ├── validator.py            # Task validation
│       └── recovery.py             # Recovery strategies
├── config/
│   └── config.yaml                 # Configuration
├── requirements.txt
└── README.md

Key Insights

Based on the original Apex2 architecture:

  1. Prediction Before Execution: Categorizing the task, assessing risk, and identifying key files before running any command improves efficiency

  2. Parallel Intelligence Gathering: Four independent sources (web search, strategy generation, environment observation, exploration) give the execution phase richer context than any single source

  3. Google AI Overview: Often contains highly relevant solutions synthesized from multiple sources

  4. Strategy Synthesis: Merging all intelligence sources into a single plan is crucial for effective execution

  5. Execution Robustness: Heredoc handling, recovery prompts, and validation prevent common failures

  6. Risk-Aware Prompting: Category-specific guidance helps avoid costly mistakes (especially for ML and security tasks)
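
One plausible reading of the heredoc handling mentioned in insight 5 is sketched below: when the agent writes a file through the shell, it quotes the heredoc delimiter so the shell does not expand `$VAR` or backticks inside the file, and it picks a delimiter that does not occur as a line of the content. The helper name `heredoc_command` and the default delimiter `APEX_EOF` are assumptions for this example, not names taken from executor.py.

```python
import shlex


def heredoc_command(path: str, content: str, base: str = "APEX_EOF") -> str:
    """Build a shell command that writes `content` to `path` via a heredoc."""
    lines = content.splitlines()
    delimiter = base
    counter = 0
    # A heredoc ends at the first line equal to the delimiter, so choose
    # one that never appears as a full line of the content.
    while delimiter in lines:
        counter += 1
        delimiter = f"{base}_{counter}"
    # The terminator must start on its own line.
    body = content if (not content or content.endswith("\n")) else content + "\n"
    # Quoting the delimiter ('EOF') disables variable and command expansion.
    return f"cat > {shlex.quote(path)} <<'{delimiter}'\n{body}{delimiter}\n"


if __name__ == "__main__":
    print(heredoc_command("hello.py", "print('cost: $5')"))
```

Sending the whole command as one unit, rather than line by line, also avoids leaving the terminal session stuck waiting for a terminator that never arrives.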

Performance Characteristics

  • Token Efficiency: Leverages Claude caching for reduced costs
  • Speed: Average 2-3 minute completion for typical tasks
  • Reliability: Low variance through robust error handling
  • Recovery Rate: Handles 90% of execution errors automatically
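
The automatic recovery behind the last bullet can be pictured as a bounded retry loop: run a command, and on failure ask a recovery callback for a revised command. The sketch below is an assumption about the general shape, not the logic in recovery.py; `run_with_recovery` and the `recover` callback are hypothetical, and in a real agent the callback would typically prompt the LLM with the error output.

```python
import subprocess
import sys
from typing import Callable, List, Optional, Sequence

Command = Sequence[str]
# Given the failed command and its stderr, return a revised command,
# or None to give up.
RecoverFn = Callable[[Command, str], Optional[Command]]


def run_with_recovery(
    command: Command,
    recover: RecoverFn,
    max_attempts: int = 3,
    timeout: float = 60.0,
) -> List[subprocess.CompletedProcess]:
    """Run a command, retrying with revised commands until one succeeds."""
    attempts: List[subprocess.CompletedProcess] = []
    current: Optional[Command] = command
    for _ in range(max_attempts):
        if current is None:
            break
        # subprocess.run raises TimeoutExpired if the command runs too long;
        # this sketch lets that propagate to the caller.
        proc = subprocess.run(
            list(current), capture_output=True, text=True, timeout=timeout
        )
        attempts.append(proc)
        if proc.returncode == 0:
            break
        current = recover(current, proc.stderr)
    return attempts


if __name__ == "__main__":
    failing = [sys.executable, "-c", "import sys; sys.exit(3)"]
    working = [sys.executable, "-c", "print('ok')"]
    history = run_with_recovery(failing, lambda cmd, err: working)
    print([p.returncode for p in history])
```

Capping the number of attempts keeps a persistent failure from consuming the whole episode budget.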

Limitations

  • Requires Anthropic API access (Claude Sonnet 4.5)
  • Web search requires SERP API (optional but recommended)
  • Docker required for safe exploration (optional)
  • Not optimized for tasks requiring human interaction

Contributing

This is a learning reconstruction project based on the original Apex2 architecture. Feel free to experiment and improve!

References

Original Apex2: https://github.com/heartyguy/Apex2-Terminal-Bench-Agent

Stanford Terminal Bench: https://github.com/stanford-oval/terminal-bench

License

MIT License - See LICENSE file for details

Acknowledgments

  • Original Apex2 architecture by heartyguy
  • Stanford Terminal Bench team for the benchmark
  • Anthropic for Claude Sonnet 4.5
