Open
Conversation
Implement the easi Python library for orchestrating simulator-based embodied reasoning evaluation. Includes subprocess isolation via filesystem IPC, versioned simulator management (conda+uv), agent interface, task/benchmark framework, and CLI. Components: core abstractions, dummy/AI2-THOR simulators, dummy task, dummy agent, LLM client + dummy server, full test suite (44 tests).
Add embodied agent evaluation pipeline with real simulator support: - ReAct agent with multi-action buffering and PromptBuilder protocol - EB-Alfred task support (6 splits via multi-split YAML discovery) - AI2-THOR v2.1.0 bridge with skill-based actions and state tracking - EvaluationRunner with structured output: <output_dir>/<task>/<run_id>/ - Per-episode artifacts: result.json, trajectory.jsonl, rgb_*.png - Centralized logging (print -> logger), --verbosity CLI option - Subprocess observability: bridge output streaming, Ctrl+C cleanup - LLM API client (OpenAI-compatible) and dummy LLM server - 106 tests passing
Split the monolithic bridge into a generic AI2ThorBridge base class (simulator layer) and an EBAlfredBridge subclass (task layer), so that future benchmarks using ai2thor==2.1.0 can reuse the simulator bridge without rewriting controller management, IPC, or navigation helpers. - Extract EB-Alfred goal evaluation and task loading into easi/tasks/ebalfred/thor_utils.py - Trim easi/simulators/ai2thor/v2_1_0/thor_utils.py to generic-only constants and object query utilities - Refactor bridge.py from EBAlfredBridge (1062 lines) to generic AI2ThorBridge (~314 lines) with configurable simulator_kwargs - Create easi/tasks/ebalfred/bridge.py with EBAlfredBridge subclass containing all skill execution, state tracking, and goal evaluation - Add get_bridge_script_path() and simulator_kwargs to BaseTask and TaskProtocol; override in EBAlfredTask - Update EvaluationRunner to prefer task-specific bridge paths and forward simulator_kwargs - Add simulator_kwargs to all EB-Alfred YAML configs - Add 29 tests covering imports, inheritance, method separation, bridge path resolution, and simulator_kwargs
Unified client with generate() and generate_structured() methods, lazy imports, and cumulative usage tracking (tokens + cost).
Manages start/stop, port checking, health polling with timeout, and context manager support. Extensible for future backends.
New arguments for `easi run` to select LLM backend and configure inference server. Backward compatible with existing --llm-url.
Runner now resolves backend, auto-starts vLLM when needed, creates LLMClient for non-legacy backends, wraps structured output, and tracks LLM usage per-episode and per-run.
- Track usage in generate_structured() via instructor's _raw_response - Fix log file handle leak in ServerManager (store and close in stop()) - Remove duplicate agent_config computation in runner._create_agent()
…port Replace the tightly-coupled agent/prompt design with a memory-based architecture where AgentMemory holds shared state, PromptBuilder reads from memory to construct prompts and parse responses, and the agent is a thin orchestrator. Key changes: - AgentMemory + StepRecord dataclasses as shared agent state - New PromptBuilderProtocol: build_messages(memory) + parse_response(response, memory) - Simplified BaseAgent (removed _chat_history, abstract stubs, default act()) - ReActAgent rewritten as thin orchestrator delegating to builder - EBAlfredPromptBuilder gains chat_history=True mode with VLMPlanner parity - json_repair moved to easi/utils/ (old location re-exports) - Removed stateless flag from agent config (builder controls mode)
Builder-owned schema enforcement: prompt builders can now optionally implement get_response_format() to provide a JSON schema dict that gets passed through to litellm. ReActAgent handles fallback automatically when the backend doesn't support response_format. - LLMClient.generate() accepts optional response_format param - ReActAgent._generate_with_fallback() tries schema, caches on failure - EBAlfredPromptBuilder.get_response_format() returns vlm_generation_guide - Remove dead code: instructor dep, Pydantic schemas, monkey-patching
Add BaseTask.on_episode_reset() hook for task-specific post-reset setup. EBAlfredTask overrides it to update agent action space from bridge metadata, removing EB-Alfred-specific logic from the general EvaluationRunner.
…onfig - trajectory.jsonl: add llm_response field to each step entry - result.json: add instruction field for each episode - config.json: include all CLI options and full task YAML config
Retry: LLMClient passes num_retries to litellm.completion() for automatic exponential backoff on transient errors (timeouts, rate limits). Configurable via --max-retries (default 3). Resume: --resume <run_dir> loads config.json from a previous run, skips completed episodes, clears and re-runs the last episode (which may have been interrupted), then continues the remaining episodes. All CLI options are restored from config.json so only --resume is needed.
…ps bug Fix max_steps mismatch where YAML configured 50 but vendor EBAlfEnv hardcoded 30. Now max_steps flows from YAML through simulator_kwargs to the bridge and vendor env. Add per-episode retry in EvaluationRunner: on crash (e.g. AI2-THOR Unity segfault), the episode dir is cleared, the simulator is re-launched, and the episode is retried up to max_retries times. If all retries are exhausted the episode is recorded as failed and the runner continues to the next episode.
Integrate EmbodiedBench EB-Navigation into EASI with vendored env, task bridge, prompt builder, and 5 split configs (ai2thor v5.0.0).
Remove action_space field from all YAML configs and TaskEntry. Tasks now define their action space via _build_action_space() override with caching, eliminating the confusing pattern of empty YAML fields.
…atform Replace stub bridge with working AI2ThorV5Bridge class that starts a real controller, handles scene reset and discrete navigation actions. Switch platform from CloudRendering to Linux64. Increase sim test timeout default from 30s to 200s for THOR startup.
Fix OUTPUT_TEMPLATE trailing spaces on 3 lines and regenerate navigation_examples.json from source to fix line continuation artifact and curly quote mismatch. Verified character-level parity.
- Add habitat_sim:v0_3_0 simulator registration (conda env + manifest) - Vendor EBHabEnv from EmbodiedBench with fixed imports - Add EBHabitatTask with dynamic action space via on_episode_reset hook - Add EBHabitatPromptBuilder matching VLMPlanner prompt construction - Add 6 per-split YAML configs (base, common_sense, complex_instruction, spatial_relationship, visual_appearance, long_horizon) - Add 26 offline tests for actions, task, prompts, and registry - Move EB-Habitat-specific deps (gym, hydra-core, omegaconf, imageio, habitat-lab) from simulator requirements.txt to task YAML additional_deps
Add --redownload to 'easi task download' and 'easi run' to force re-download of cached HuggingFace datasets. Useful when a previous download was interrupted or incomplete.
Track whether each Xorg was launched with sudo and use `sudo -n kill` for those processes, preventing orphaned root-owned Xorg servers.
Start/stop Xorg servers in run(), override render platform per worker when xorg is active, warn on GPU contention with local LLM backends. Fix tests that use __new__ to include _xorg_instances attribute.
Added to: ai2thor (v2.1, v3.3.5, v5.0), habitat_sim (v0.1.7, v0.3), coppeliasim (v4.1), tdw (v1.11.23), omnigibson (v3.7.2).
…rgManager RenderPlatform gains setup()/teardown()/for_worker() hooks so platforms that manage external services (like Xorg) are self-contained. No more if/else xorg handling in callers. - XorgPlatform.setup() starts XorgManager, for_worker() returns a per-worker _XorgWorkerPlatform with fixed display/GPU - Runners call _setup_render_platform() once, then teardown() in finally - cli sim_test calls setup()/for_worker()/teardown() directly - Removed _xorg_instances attribute and scattered xorg_mgr variables
- Replace asyncio.get_event_loop() with get_running_loop() (deprecated in 3.10, error in 3.12) - Use project get_logger() convention in progress.py instead of logging.getLogger() - Move litellm imports to local scope in test_react_agent.py to avoid import errors
Adopt WorkerBinding plus SimulatorRenderAdapter across registry, runners, subprocess launch, and simulator integrations so render backends own resource assignment while simulators contribute render-specific quirks through one adapter path.
Keep the conda-backed smoke-test path aligned with the new WorkerBinding handoff so render adapters and per-worker launch data reach SubprocessRunner consistently.
Add resolved_name property to RenderPlatform so auto-detection shows the actual backend (e.g. "native (via auto-detection)") instead of just "auto".
- cmd_sim_test: move setup()/for_worker() inside the try block that owns finally:teardown() so Xorg servers are always cleaned up - _create_simulator: call setup() on per-simulator resolved platforms and register for teardown via self._render_platform
Integrate REVERIE-CE as a new task reusing VLN-CE R2R infrastructure. Same simulator (habitat_sim:v0_1_7), action space, and metrics. New prompt builder adapted for REVERIE's high-level instruction style. Dataset: oscarqjh/REVERIE-CE_easi (repackaged from Dynam3D)
Adds easi/analysis/ package with trajectory_video.py that generates per-episode MP4 videos showing robot paths on top-down maps alongside camera views — pure post-processing, no simulator dependencies.
…ridge Add trajectory visualization hooks to the EB-Habitat bridge using habitat-sim 0.3.0 API (articulated_agent.base_pos). Includes topdown map rendering via pathfinder, start position persistence, and per-step agent position tracking in trajectory info. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This PR adds the entire easi-er cli — a unified evaluation framework for embodied AI agents. It introduces subprocess-isolated simulators, multi-split task definitions, LLM-powered agents, and a CLI for running evaluations across multiple benchmarks.
Core Framework
Simulators (6)
Benchmarks (10)
LLM Infrastructure
Standardized Prompt Format
CLI
easi task list / info / download / scaffold
easi env list / install / check
easi sim test
easi start --agent react --backend openai --model gpt-4o
easi start --resume ./logs//<run_id>
Testing