A reconstruction of the state-of-the-art Apex2 Terminal Bench Agent, implementing a sophisticated multi-phase intelligence system for autonomous task execution.

This project reimplements the Apex2 architecture, which reached #1 on Stanford's Terminal Bench leaderboard with a 64.50% success rate. The agent uses a multi-phase approach that combines predictive intelligence, parallel information gathering, strategy synthesis, and robust execution.
## Architecture

```text
Prediction Phase
├── Task categorization
├── Key file identification
└── Multimodal requirement assessment
        ↓
Parallel Intelligence Gathering
├── Web search (3 rounds max, with Google AI Overview)
├── Deep strategy generation
├── Heuristic environment observation
│   ├── Installed packages
│   ├── Folder structure
│   ├── Running processes
│   ├── System state
│   └── Key file contents
└── Exploration agent (tests unknowns surfaced by the strategy)
        ↓
Strategy Synthesis (combines all intelligence)
        ↓
Optimized Context Generation
        ↓
Main Execution with Recovery
        ↓
Validation
```
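The phase flow above can be sketched as a small orchestrator. All function and key names here are illustrative placeholders, not the real Apex2 API; the point is the shape: a cheap prediction phase, then intelligence gathered in parallel, then a single merged context.

```python
# Sketch of the multi-phase pipeline; every name here is an assumption.
from concurrent.futures import ThreadPoolExecutor

def predict(task: str) -> dict:
    # Phase 1: cheap, model-driven guesses about the task.
    return {"category": "general", "key_files": [], "multimodal": False}

def web_search(task, prediction):   return {"web": f"results for {task!r}"}
def gen_strategy(task, prediction): return {"strategy": "step-by-step plan"}
def observe_env(prediction):        return {"env": "packages, files, processes"}

def run_pipeline(task: str) -> dict:
    prediction = predict(task)
    # Phase 2: gather intelligence from independent sources in parallel.
    with ThreadPoolExecutor() as pool:
        futures = [
            pool.submit(web_search, task, prediction),
            pool.submit(gen_strategy, task, prediction),
            pool.submit(observe_env, prediction),
        ]
        intel = {}
        for future in futures:
            intel.update(future.result())
    # Phase 3: synthesize one optimized context for the executor.
    return {**prediction, **intel}
```

Parallelism matters here because the web search, strategy generation, and environment probes are independent and each can take seconds.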
## Key Features

### Predictive Intelligence System
- Task categorization (ML, Security, Web Dev, etc.)
- Risk assessment
- Key file identification
- Multimodal requirement prediction
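A minimal sketch of what the prediction result might look like, assuming a keyword-based categorizer; the field names and keyword table are invented for illustration, not taken from the original code.

```python
# Hypothetical prediction shape; field names and keywords are assumptions.
from dataclasses import dataclass, field

CATEGORY_KEYWORDS = {
    "ml": ["train", "model", "dataset", "mnist"],
    "security": ["exploit", "cve", "firewall", "decrypt"],
    "web_dev": ["flask", "http", "server", "html"],
}

@dataclass
class TaskPrediction:
    category: str = "general"
    risk_level: str = "low"
    key_files: list = field(default_factory=list)
    needs_multimodal: bool = False

def predict_task(task: str) -> TaskPrediction:
    lowered = task.lower()
    for category, words in CATEGORY_KEYWORDS.items():
        if any(w in lowered for w in words):
            # Security work is riskier than a generic shell task.
            risk = "high" if category == "security" else "medium"
            return TaskPrediction(category=category, risk_level=risk)
    return TaskPrediction()
```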
### Advanced Web Search Pipeline
- Multi-round search with low-frequency terms
- Google AI Overview extraction
- Platform bias (GitHub/StackOverflow)
- Deep link exploration
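The platform bias mentioned above can be implemented as a simple re-ranking pass over results from any search backend; the boost values below are made up for illustration.

```python
# Illustrative re-ranking with a GitHub/StackOverflow bias.
PLATFORM_BOOST = {"github.com": 2.0, "stackoverflow.com": 1.5}

def rank_results(results):
    """results: list of (url, base_score) tuples from any search backend."""
    def score(item):
        url, base = item
        boost = next((b for domain, b in PLATFORM_BOOST.items() if domain in url), 1.0)
        return base * boost
    return sorted(results, key=score, reverse=True)
```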
### Deep Strategy Generation
- Knowledge extraction from LLM
- Multiple approach alternatives
- Risk assessment
- Common failure modes and remediation
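The strategy phase is driven by a prompt that explicitly asks the LLM for alternatives, risks, and failure modes. The exact wording used by Apex2 is not public, so this template is only a plausible shape:

```python
# Illustrative strategy prompt; the real Apex2 prompt text is unknown.
STRATEGY_TEMPLATE = """Task: {task}
Category: {category}

Provide:
1. Two or three alternative approaches, ranked by likelihood of success.
2. The main risks of each approach.
3. Common failure modes and how to recover from them."""

def build_strategy_prompt(task: str, category: str) -> str:
    return STRATEGY_TEMPLATE.format(task=task, category=category)
```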
### Heuristic Environment Observation
- Package manager checks
- Folder structure analysis
- Running process detection
- System state monitoring
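A tiny, read-only probe in the spirit of the list above. The specific checks (which package managers, how much of the folder listing to keep) are assumptions, but the pattern of collecting cheap facts before execution is the point.

```python
# Minimal environment probe; the specific checks are assumptions.
import os
import platform
import shutil

def observe_environment() -> dict:
    return {
        # Which common package managers are on PATH?
        "package_managers": [
            pm for pm in ("pip", "apt-get", "npm", "cargo") if shutil.which(pm)
        ],
        # A capped sample of the working directory's contents.
        "cwd_entries": sorted(os.listdir("."))[:20],
        # One piece of system state: the interpreter version.
        "python": platform.python_version(),
    }
```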
### Exploration Agent
- Tests critical unknowns
- Docker-based safe execution
- Validates assumptions
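One way to make exploration safe is to run each probe in a throwaway container. This helper only builds the argv; the image and flags are assumptions, not the original implementation.

```python
# Sketch of wrapping a probe in an ephemeral Docker container.
# The image name and flag choices are assumptions for illustration.
def docker_probe(command: str, image: str = "ubuntu:22.04") -> list:
    """Build an argv for running `command` in a disposable container."""
    return [
        "docker", "run", "--rm",   # container is removed afterwards
        "--network=none",          # no network access for safe probes
        image, "bash", "-lc", command,
    ]
```

Passing the command to `subprocess.run(docker_probe("ls /"))` would execute it only if Docker is installed, which is why the README lists Docker as optional.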
### Strategy Synthesis
- Combines all intelligence sources
- Creates optimized execution plan
- Provides category-specific guidance
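Synthesis can be as simple as a priority merge of the intelligence sources plus a lookup of category guidance. The guidance table and merge order below are illustrative assumptions:

```python
# Illustrative synthesis step; guidance text and merge order are assumptions.
GUIDANCE = {
    "ml": "Prefer few epochs first; verify data shapes before training.",
    "security": "Never run destructive commands without a dry run.",
}

def synthesize(prediction: dict, *sources: dict) -> dict:
    plan = {}
    for source in sources:
        plan.update(source)  # later sources override earlier ones
    plan["category_guidance"] = GUIDANCE.get(prediction.get("category"), "")
    return plan
```

Ordering the sources so that locally verified evidence (environment, exploration) overrides speculative web results is one sensible design choice here.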
### Robust Execution Engine
- Heredoc optimization
- Session recovery
- Long-running task management
- Automatic error recovery
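"Heredoc optimization" means writing a whole multi-line file in one shell command instead of many fragile `echo`/append calls. A minimal sketch, assuming a delimiter of our own choosing:

```python
# Build a single shell command that writes multi-line content via a heredoc.
# The delimiter name is our choice; quoting it disables shell expansion
# inside the body, so the content is written verbatim.
def heredoc_write(path: str, content: str, delimiter: str = "APEX_EOF") -> str:
    if delimiter in content:
        raise ValueError("delimiter collides with file content")
    return f"cat > {path} <<'{delimiter}'\n{content}\n{delimiter}\n"
```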
### Validation System
- Prevents premature completion
- Checks for execution errors
- Validates expected outputs
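The checks above can be approximated by scanning terminal output for well-known error signatures and for expected markers. The pattern list is a small illustrative sample, not the real validator's rules:

```python
# Illustrative output validation; the pattern list is an assumption.
import re

ERROR_PATTERNS = [
    r"Traceback \(most recent call last\)",
    r"command not found",
    r"No such file or directory",
    r"permission denied",
]

def validate_output(output: str, expected_markers=()):
    issues = [p for p in ERROR_PATTERNS if re.search(p, output, re.IGNORECASE)]
    missing = [m for m in expected_markers if m not in output]
    return {"ok": not issues and not missing, "issues": issues, "missing": missing}
```

Requiring `expected_markers` to appear is what prevents premature completion: an execution that produced no errors but also no expected output still fails validation.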
## Installation

```bash
# Clone the repository
cd Apex2Rebuild

# Install dependencies
pip install -r requirements.txt

# Set up environment variables
cp .env.example .env
# Edit .env with your API keys
```

Edit the `.env` file:

```
ANTHROPIC_API_KEY=your_anthropic_key_here
SERPAPI_API_KEY=your_serpapi_key_here  # Optional, for web search
MODEL_NAME=claude-sonnet-4.5
```
## Quick Start

```python
from apex2 import Apex2Agent

# Initialize agent
agent = Apex2Agent()

# Execute task
result = agent.execute_task(
    task="Create a Python script that analyzes CSV data",
    enable_web_search=True,
    enable_exploration=True,
)

print(f"Success: {result['success']}")

# For simple tasks, use quick execution
success = agent.quick_execute("Create a hello world Python script")
```

## Advanced Usage

```python
from apex2 import Apex2Agent

agent = Apex2Agent(
    anthropic_api_key="your_key",
    serpapi_key="your_serp_key",
    model="claude-sonnet-4.5",
)

result = agent.execute_task(
    task="Train a simple neural network on MNIST",
    enable_web_search=True,
    enable_exploration=True,
    max_episodes=20,
)

# Access detailed results
print(f"Category: {result['prediction'].category}")
print(f"Risk Level: {result['prediction'].risk_level}")
print(f"Executed Commands: {len(result['execution_results'])}")
print(f"Validation Issues: {result['validation'].issues}")
```

## Project Structure

```text
Apex2Rebuild/
├── src/apex2/
│   ├── agent.py              # Main orchestrator
│   ├── prediction/
│   │   └── predictor.py      # Task prediction
│   ├── intelligence/
│   │   ├── web_search.py     # Web search pipeline
│   │   ├── strategy.py       # Strategy generation
│   │   ├── environment.py    # Environment observation
│   │   └── exploration.py    # Exploration agent
│   ├── synthesis/
│   │   └── synthesizer.py    # Strategy synthesis
│   └── execution/
│       ├── executor.py       # Command execution
│       ├── validator.py      # Task validation
│       └── recovery.py       # Recovery strategies
├── config/
│   └── config.yaml           # Configuration
├── requirements.txt
└── README.md
```
## Key Insights

Based on the original Apex2 architecture:

1. **Prediction Before Execution**: Understanding the task deeply before any execution dramatically improves efficiency.
2. **Parallel Intelligence Gathering**: Multiple diverse perspectives (web search, strategy, environment, exploration) create superior context.
3. **Google AI Overview**: Often contains highly relevant solutions synthesized from multiple sources.
4. **Strategy Synthesis**: Combining all intelligence sources is crucial for optimal execution.
5. **Execution Robustness**: Heredoc handling, recovery prompts, and validation prevent common failures.
6. **Risk-Aware Prompting**: Category-specific guidance helps avoid costly mistakes (especially for ML and security tasks).
## Performance

- **Token Efficiency**: Leverages Claude prompt caching for reduced costs
- **Speed**: Average 2-3 minute completion for typical tasks
- **Reliability**: Low variance through robust error handling
- **Recovery Rate**: Handles 90% of execution errors automatically
## Requirements & Limitations

- Requires Anthropic API access (Claude Sonnet 4.5)
- Web search requires a SerpAPI key (optional but recommended)
- Docker required for safe exploration (optional)
- Not optimized for tasks requiring human interaction
## Contributing

This is a learning reconstruction project based on the original Apex2 architecture. Feel free to experiment and improve!

## References

- Original Apex2: https://github.com/heartyguy/Apex2-Terminal-Bench-Agent
- Stanford Terminal Bench: https://github.com/stanford-oval/terminal-bench
## License

MIT License - see the LICENSE file for details.
## Acknowledgments

- Original Apex2 architecture by heartyguy
- Stanford Terminal Bench team for the benchmark
- Anthropic for Claude Sonnet 4.5