
[Feat] EASI-ER CLI #31

Open
oscarqjh wants to merge 219 commits into EvolvingLMMs-Lab:main from oscarqjh:dev

Conversation


oscarqjh (Collaborator) commented on Mar 9, 2026

Summary

This PR adds the easi-er CLI, a unified evaluation framework for embodied AI agents. It introduces subprocess-isolated simulators, multi-split task definitions, LLM-powered agents, and a CLI for running evaluations across multiple benchmarks.

Core Framework

  • Subprocess isolation: Each simulator runs in its own conda environment (potentially different Python version), communicating via filesystem IPC
  • Multi-split tasks: YAML-based task configs with template inheritance (extends), auto-discovered by registry
  • Pluggable agents: Dummy agent for testing, ReAct agent with multi-action buffering for real evaluation
  • AgentMemory + PromptBuilder: Shared state architecture with task-specific prompt builders following a standardized format
  • Parallel evaluation: Thread-pool parallelism with multi-instance vLLM support (--num-parallel, --vllm-instances)
  • Resume support: Interrupted runs can be resumed from logs/ output directory
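The subprocess isolation above can be sketched as a small filesystem IPC loop: the runner writes a command file into a shared directory, and the bridge process (which in EASI runs inside its own conda environment) picks it up and writes a response file back. This is a hypothetical sketch; the file names, atomic-rename convention, and polling interval are assumptions, not EASI's actual protocol.

```python
import json
import tempfile
import time
from pathlib import Path

def send_command(ipc_dir: Path, command: dict) -> None:
    """Write a command atomically: write to a temp name, then rename."""
    tmp = ipc_dir / "command.json.tmp"
    tmp.write_text(json.dumps(command))
    tmp.rename(ipc_dir / "command.json")

def bridge_poll_once(ipc_dir: Path) -> None:
    """One iteration of the bridge loop: consume a command, emit a response."""
    cmd_file = ipc_dir / "command.json"
    if cmd_file.exists():
        command = json.loads(cmd_file.read_text())
        response = {"status": "ok", "echo": command["action"]}
        (ipc_dir / "response.json").write_text(json.dumps(response))
        cmd_file.unlink()

def wait_for_response(ipc_dir: Path, timeout: float = 5.0) -> dict:
    """Poll for the bridge's response file until it appears or we time out."""
    deadline = time.monotonic() + timeout
    resp_file = ipc_dir / "response.json"
    while time.monotonic() < deadline:
        if resp_file.exists():
            response = json.loads(resp_file.read_text())
            resp_file.unlink()
            return response
        time.sleep(0.01)
    raise TimeoutError("bridge did not respond")

ipc_dir = Path(tempfile.mkdtemp())
send_command(ipc_dir, {"action": "MoveAhead"})
bridge_poll_once(ipc_dir)  # in EASI this runs in the simulator subprocess
result = wait_for_response(ipc_dir)
print(result["echo"])  # → MoveAhead
```

The point of filesystem IPC (rather than sockets or pipes) is that it works across conda environments and Python versions with no shared dependencies beyond the JSON files themselves.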

Simulators (7)

  • AI2-THOR v2.1.0 (Python 3.8) — EB-Alfred
  • AI2-THOR v5.0.0 (Python 3.10) — EB-Navigation, AI2-THOR Rearrangement
  • AI2-THOR v3.3.5 (Python 3.10) — ManipulaTHOR
  • Habitat-Sim v0.1.7 (Python 3.8) — VLN-CE R2R, VLN-CE RxR
  • Habitat-Sim v0.3.0 (Python 3.9) — EB-Habitat, LHPR-VLN
  • CoppeliaSim v4.1.0 (Python 3.10) — EB-Manipulation
  • TDW v1.11.23 (Python 3.10) — HAZARD

Benchmarks (10)

  • EmbodiedBench: EB-Alfred (6 splits), EB-Navigation (5 splits), EB-Habitat (4 splits), EB-Manipulation (4 splits)
  • HAZARD: Fire, Flood, Wind scenarios
  • ManipulaTHOR: Arm point navigation (seen/unseen)
  • AI2-THOR Rearrangement 2023: 5 evaluation splits
  • VLN-CE R2R: Vision-and-language navigation (val seen/unseen)
  • VLN-CE RxR: Multilingual VLN (val seen/unseen, en/hi/te)
  • LHPR-VLN: Multi-subtask navigation (val/test splits)

LLM Infrastructure

  • LLMClient: LiteLLM wrapper supporting any backend (OpenAI, Anthropic, Gemini, vLLM)
  • ServerManager / MultiServerManager: vLLM subprocess lifecycle with GPU allocation, tensor parallelism, and port auto-probing
  • GPU isolation: --vllm-gpus and --sim-gpus for separating LLM inference from simulator rendering

Standardized Prompt Format

  • EASI Standard Prompt Format Reference (docs/easi-prompt-format-reference.md)
  • All non-EmbodiedBench prompt builders aligned: standard section headers, 4-field JSON response format, consistent action history format
  • EmbodiedBench benchmarks retain their original published formats

CLI

easi task list / info / download / scaffold
easi env list / install / check
easi sim test
easi start --agent react --backend openai --model gpt-4o
easi start --resume ./logs/<run_id>

Testing

  • 824 tests, all passing
  • All tests run offline without simulators or LLMs (mocked subprocess bridges)

Implement the easi Python library for orchestrating simulator-based
embodied reasoning evaluation. Includes subprocess isolation via
filesystem IPC, versioned simulator management (conda+uv), agent
interface, task/benchmark framework, and CLI.

Components: core abstractions, dummy/AI2-THOR simulators, dummy task,
dummy agent, LLM client + dummy server, full test suite (44 tests).
Add embodied agent evaluation pipeline with real simulator support:
- ReAct agent with multi-action buffering and PromptBuilder protocol
- EB-Alfred task support (6 splits via multi-split YAML discovery)
- AI2-THOR v2.1.0 bridge with skill-based actions and state tracking
- EvaluationRunner with structured output: <output_dir>/<task>/<run_id>/
- Per-episode artifacts: result.json, trajectory.jsonl, rgb_*.png
- Centralized logging (print -> logger), --verbosity CLI option
- Subprocess observability: bridge output streaming, Ctrl+C cleanup
- LLM API client (OpenAI-compatible) and dummy LLM server
- 106 tests passing
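The multi-split YAML discovery mentioned above (template inheritance via `extends`, plus `simulator_kwargs` forwarded to the bridge) might look roughly like this. This is a hypothetical sketch: only `extends` and `simulator_kwargs` are named in this PR, and the other field names are illustrative.

```yaml
# eb_alfred/base.yaml — shared template for all EB-Alfred splits
name: eb_alfred_base
simulator: ai2thor:v2_1_0
max_steps: 50
simulator_kwargs:
  max_steps: 50       # flows through to the bridge and vendor env

# eb_alfred/common_sense.yaml — one of the 6 auto-discovered splits
extends: base.yaml
name: eb_alfred_common_sense
split: common_sense
```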
Split the monolithic bridge into a generic AI2ThorBridge base class
(simulator layer) and an EBAlfredBridge subclass (task layer), so that
future benchmarks using ai2thor==2.1.0 can reuse the simulator bridge
without rewriting controller management, IPC, or navigation helpers.

- Extract EB-Alfred goal evaluation and task loading into
  easi/tasks/ebalfred/thor_utils.py
- Trim easi/simulators/ai2thor/v2_1_0/thor_utils.py to generic-only
  constants and object query utilities
- Refactor bridge.py from EBAlfredBridge (1062 lines) to generic
  AI2ThorBridge (~314 lines) with configurable simulator_kwargs
- Create easi/tasks/ebalfred/bridge.py with EBAlfredBridge subclass
  containing all skill execution, state tracking, and goal evaluation
- Add get_bridge_script_path() and simulator_kwargs to BaseTask and
  TaskProtocol; override in EBAlfredTask
- Update EvaluationRunner to prefer task-specific bridge paths and
  forward simulator_kwargs
- Add simulator_kwargs to all EB-Alfred YAML configs
- Add 29 tests covering imports, inheritance, method separation,
  bridge path resolution, and simulator_kwargs
Unified client with generate() and generate_structured() methods,
lazy imports, and cumulative usage tracking (tokens + cost).

Manages server start/stop, port checking, health polling with timeout,
and context manager support. Extensible for future backends.
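The cumulative usage tracking can be sketched as follows. This is an illustrative stub, not the real `LLMClient`: the attribute names are assumed, and the backend call (which in EASI goes through litellm) is stubbed out with fixed token counts.

```python
class LLMClient:
    """Toy wrapper showing cumulative usage accounting across calls."""

    def __init__(self, model: str):
        self.model = model
        self.total_prompt_tokens = 0
        self.total_completion_tokens = 0
        self.total_cost = 0.0

    def _record_usage(self, prompt_tokens: int, completion_tokens: int,
                      cost: float) -> None:
        # Accumulate across the whole run so per-run totals are free.
        self.total_prompt_tokens += prompt_tokens
        self.total_completion_tokens += completion_tokens
        self.total_cost += cost

    def generate(self, messages: list[dict]) -> str:
        # Real code calls litellm.completion(...) and reads response.usage;
        # stubbed with fixed numbers so the sketch runs offline.
        self._record_usage(prompt_tokens=12, completion_tokens=5, cost=0.0001)
        return "stubbed response"

client = LLMClient("gpt-4o")
client.generate([{"role": "user", "content": "hi"}])
client.generate([{"role": "user", "content": "hi again"}])
print(client.total_prompt_tokens)  # → 24
```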
New arguments for `easi run` to select the LLM backend and configure
the inference server. Backward compatible with the existing --llm-url.

The runner now resolves the backend, auto-starts vLLM when needed,
creates an LLMClient for non-legacy backends, wraps structured
output, and tracks LLM usage per-episode and per-run.
- Track usage in generate_structured() via instructor's _raw_response
- Fix log file handle leak in ServerManager (store and close in stop())
- Remove duplicate agent_config computation in runner._create_agent()
…port

Replace the tightly-coupled agent/prompt design with a memory-based
architecture where AgentMemory holds shared state, PromptBuilder reads
from memory to construct prompts and parse responses, and the agent is
a thin orchestrator.

Key changes:
- AgentMemory + StepRecord dataclasses as shared agent state
- New PromptBuilderProtocol: build_messages(memory) + parse_response(response, memory)
- Simplified BaseAgent (removed _chat_history, abstract stubs, default act())
- ReActAgent rewritten as thin orchestrator delegating to builder
- EBAlfredPromptBuilder gains chat_history=True mode with VLMPlanner parity
- json_repair moved to easi/utils/ (old location re-exports)
- Removed stateless flag from agent config (builder controls mode)
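The memory-based architecture above can be sketched in a few lines. The `AgentMemory`, `StepRecord`, `build_messages`, and `parse_response` names follow the PR text; the fields and the toy builder are assumptions.

```python
from dataclasses import dataclass, field
from typing import Protocol

@dataclass
class StepRecord:
    """One step of shared agent state (fields are illustrative)."""
    action: str
    observation: str

@dataclass
class AgentMemory:
    """Shared state the builder reads from and writes back to."""
    instruction: str = ""
    steps: list[StepRecord] = field(default_factory=list)

class PromptBuilderProtocol(Protocol):
    def build_messages(self, memory: AgentMemory) -> list[dict]: ...
    def parse_response(self, response: str, memory: AgentMemory) -> str: ...

class EchoPromptBuilder:
    """Toy builder: prompt built from memory, response text used as action."""

    def build_messages(self, memory: AgentMemory) -> list[dict]:
        history = "; ".join(s.action for s in memory.steps)
        return [{"role": "user",
                 "content": f"{memory.instruction}\nHistory: {history}"}]

    def parse_response(self, response: str, memory: AgentMemory) -> str:
        memory.steps.append(StepRecord(action=response, observation=""))
        return response

memory = AgentMemory(instruction="find the mug")
builder: PromptBuilderProtocol = EchoPromptBuilder()
messages = builder.build_messages(memory)
action = builder.parse_response("MoveAhead", memory)
print(action, len(memory.steps))  # → MoveAhead 1
```

With this shape the agent itself stays a thin loop: build messages, call the LLM, hand the raw response back to the builder.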
Builder-owned schema enforcement: prompt builders can now optionally
implement get_response_format() to provide a JSON schema dict that gets
passed through to litellm. ReActAgent handles fallback automatically
when the backend doesn't support response_format.

- LLMClient.generate() accepts optional response_format param
- ReActAgent._generate_with_fallback() tries schema, caches on failure
- EBAlfredPromptBuilder.get_response_format() returns vlm_generation_guide
- Remove dead code: instructor dep, Pydantic schemas, monkey-patching
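The schema-with-fallback pattern can be sketched like this: try passing a JSON schema as `response_format`, and if the backend rejects it, cache that fact so later calls skip the schema entirely. The class and the `TypeError`-based failure signal are illustrative stand-ins, not the real `_generate_with_fallback()`.

```python
class SchemaFallbackCaller:
    """Try response_format once; on failure, remember and never retry it."""

    def __init__(self, generate_fn):
        self._generate = generate_fn
        self._schema_unsupported = False

    def call(self, messages, response_format=None):
        if response_format is not None and not self._schema_unsupported:
            try:
                return self._generate(messages, response_format=response_format)
            except TypeError:  # real code would catch the backend's error
                self._schema_unsupported = True  # cache: skip schema next time
        return self._generate(messages)

calls = []
def fake_backend(messages, response_format=None):
    calls.append(response_format)
    if response_format is not None:
        raise TypeError("response_format not supported")
    return '{"reasoning": "...", "action": "MoveAhead"}'

caller = SchemaFallbackCaller(fake_backend)
schema = {"type": "object", "properties": {"action": {"type": "string"}}}
caller.call([], response_format=schema)  # first try fails, falls back
caller.call([], response_format=schema)  # schema skipped via cached flag
print(len(calls))  # → 3
```

Caching the failure matters in a long evaluation run: without it, every episode would pay one doomed schema attempt per LLM call.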
Add BaseTask.on_episode_reset() hook for task-specific post-reset setup.
EBAlfredTask overrides it to update agent action space from bridge metadata,
removing EB-Alfred-specific logic from the general EvaluationRunner.
…onfig

- trajectory.jsonl: add llm_response field to each step entry
- result.json: add instruction field for each episode
- config.json: include all CLI options and full task YAML config
Retry: LLMClient passes num_retries to litellm.completion() for
automatic exponential backoff on transient errors (timeouts, rate
limits). Configurable via --max-retries (default 3).

Resume: --resume <run_dir> loads config.json from a previous run,
skips completed episodes, clears and re-runs the last episode (which
may have been interrupted), then continues the remaining episodes.
All CLI options are restored from config.json so only --resume is needed.
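The resume behavior can be sketched as a pure planning function over the run directory: keep episodes whose `result.json` exists, but clear and re-run the last one since it may have been interrupted mid-write. The `result.json`-per-episode layout follows the PR text; the helper name and episode-id scheme are assumptions.

```python
import json
import tempfile
from pathlib import Path

def plan_resume(run_dir: Path, all_episodes: list[str]) -> list[str]:
    """Return the episode ids that still need to run."""
    done = sorted(p.name for p in run_dir.iterdir()
                  if p.is_dir() and (p / "result.json").exists())
    if not done:
        return all_episodes
    last = done[-1]  # may have been interrupted mid-write; re-run it
    (run_dir / last / "result.json").unlink()
    finished = set(done) - {last}
    return [ep for ep in all_episodes if ep not in finished]

run_dir = Path(tempfile.mkdtemp())
for ep in ("ep_000", "ep_001"):
    (run_dir / ep).mkdir()
    (run_dir / ep / "result.json").write_text(json.dumps({"success": True}))
remaining = plan_resume(run_dir, ["ep_000", "ep_001", "ep_002", "ep_003"])
print(remaining)  # → ['ep_001', 'ep_002', 'ep_003']
```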
…ps bug

Fix max_steps mismatch where YAML configured 50 but vendor EBAlfEnv
hardcoded 30. Now max_steps flows from YAML through simulator_kwargs
to the bridge and vendor env.

Add per-episode retry in EvaluationRunner: on crash (e.g. AI2-THOR
Unity segfault), the episode dir is cleared, the simulator is
re-launched, and the episode is retried up to max_retries times.
If all retries are exhausted the episode is recorded as failed and
the runner continues to the next episode.
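The per-episode retry loop above reduces to a small pattern; `run_episode` and `relaunch_simulator` here are hypothetical stand-ins for the runner's internals, and the `RuntimeError` stands in for whatever exception a simulator crash surfaces as.

```python
def run_with_retries(run_episode, relaunch_simulator, max_retries: int = 3):
    """Run one episode, retrying after crashes; record failure if exhausted."""
    for _attempt in range(max_retries):
        try:
            return run_episode()
        except RuntimeError:  # e.g. AI2-THOR Unity segfault
            relaunch_simulator()  # real code also clears the episode dir
    return {"success": False, "error": "max retries exhausted"}

attempts = []
def flaky_episode():
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("Unity crashed")
    return {"success": True}

result = run_with_retries(flaky_episode, relaunch_simulator=lambda: None)
print(result["success"], len(attempts))  # → True 3
```

Recording a failed result instead of raising keeps one bad episode from aborting a multi-hour evaluation run.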
Integrate EmbodiedBench EB-Navigation into EASI with vendored env,
task bridge, prompt builder, and 5 split configs (ai2thor v5.0.0).
Remove action_space field from all YAML configs and TaskEntry. Tasks
now define their action space via _build_action_space() override with
caching, eliminating the confusing pattern of empty YAML fields.
…atform

Replace stub bridge with working AI2ThorV5Bridge class that starts a
real controller, handles scene reset and discrete navigation actions.
Switch platform from CloudRendering to Linux64. Increase sim test
timeout default from 30s to 200s for THOR startup.
Fix OUTPUT_TEMPLATE trailing spaces on 3 lines and regenerate
navigation_examples.json from source to fix line continuation artifact
and curly quote mismatch. Verified character-level parity.
- Add habitat_sim:v0_3_0 simulator registration (conda env + manifest)
- Vendor EBHabEnv from EmbodiedBench with fixed imports
- Add EBHabitatTask with dynamic action space via on_episode_reset hook
- Add EBHabitatPromptBuilder matching VLMPlanner prompt construction
- Add 6 per-split YAML configs (base, common_sense, complex_instruction,
  spatial_relationship, visual_appearance, long_horizon)
- Add 26 offline tests for actions, task, prompts, and registry
- Move EB-Habitat-specific deps (gym, hydra-core, omegaconf, imageio,
  habitat-lab) from simulator requirements.txt to task YAML additional_deps
Add --redownload to 'easi task download' and 'easi run' to force
re-download of cached HuggingFace datasets. Useful when a previous
download was interrupted or incomplete.
oscarqjh and others added 30 commits March 10, 2026 14:54
Track whether each Xorg was launched with sudo and use `sudo -n kill`
for those processes, preventing orphaned root-owned Xorg servers.
Start/stop Xorg servers in run(), override render platform per worker
when xorg is active, warn on GPU contention with local LLM backends.
Fix tests that use __new__ to include _xorg_instances attribute.
Added to: ai2thor (v2.1, v3.3.5, v5.0), habitat_sim (v0.1.7, v0.3),
coppeliasim (v4.1), tdw (v1.11.23), omnigibson (v3.7.2).
…rgManager

RenderPlatform gains setup()/teardown()/for_worker() hooks so platforms
that manage external services (like Xorg) are self-contained. No more
if/else xorg handling in callers.

- XorgPlatform.setup() starts XorgManager, for_worker() returns a
  per-worker _XorgWorkerPlatform with fixed display/GPU
- Runners call _setup_render_platform() once, then teardown() in finally
- cli sim_test calls setup()/for_worker()/teardown() directly
- Removed _xorg_instances attribute and scattered xorg_mgr variables
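The setup()/teardown()/for_worker() lifecycle can be sketched as below. The hook names follow the PR text; the `XorgPlatform` internals here are illustrative stubs (the real class starts an XorgManager and returns a per-worker `_XorgWorkerPlatform`).

```python
class RenderPlatform:
    """Base: hooks are no-ops so native backends stay trivial."""

    def setup(self) -> None: ...
    def teardown(self) -> None: ...
    def for_worker(self, worker_id: int) -> "RenderPlatform":
        return self  # native backends need no per-worker state

class XorgPlatform(RenderPlatform):
    def __init__(self):
        self.displays: dict[int, str] = {}
        self.running = False

    def setup(self) -> None:
        self.running = True  # real code starts an XorgManager here

    def teardown(self) -> None:
        self.running = False  # real code kills Xorg servers (sudo-aware)

    def for_worker(self, worker_id: int) -> RenderPlatform:
        # Real code returns a _XorgWorkerPlatform bound to one display/GPU.
        self.displays[worker_id] = f":{worker_id}"
        return self

platform = XorgPlatform()
platform.setup()
try:
    for worker_id in range(2):
        platform.for_worker(worker_id)
finally:
    platform.teardown()  # always cleaned up, even on error
print(sorted(platform.displays.values()))  # → [':0', ':1']
```

Because the hooks live on the platform, callers like the runners and `easi sim test` no longer need any if/else Xorg special-casing.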
- Replace asyncio.get_event_loop() with get_running_loop() (deprecated in 3.10, error in 3.12)
- Use project get_logger() convention in progress.py instead of logging.getLogger()
- Move litellm imports to local scope in test_react_agent.py to avoid import errors
Adopt WorkerBinding plus SimulatorRenderAdapter across registry, runners, subprocess launch, and simulator integrations so render backends own resource assignment while simulators contribute render-specific quirks through one adapter path.
Keep the conda-backed smoke-test path aligned with the new WorkerBinding handoff so render adapters and per-worker launch data reach SubprocessRunner consistently.
Add resolved_name property to RenderPlatform so auto-detection
shows the actual backend (e.g. "native (via auto-detection)")
instead of just "auto".
- cmd_sim_test: move setup()/for_worker() inside the try block that
  owns finally:teardown() so Xorg servers are always cleaned up
- _create_simulator: call setup() on per-simulator resolved platforms
  and register for teardown via self._render_platform
Integrate REVERIE-CE as a new task reusing VLN-CE R2R infrastructure.
Same simulator (habitat_sim:v0_1_7), action space, and metrics. New
prompt builder adapted for REVERIE's high-level instruction style.

Dataset: oscarqjh/REVERIE-CE_easi (repackaged from Dynam3D)
Adds easi/analysis/ package with trajectory_video.py that generates
per-episode MP4 videos showing robot paths on top-down maps alongside
camera views — pure post-processing, no simulator dependencies.
…ridge

Add trajectory visualization hooks to the EB-Habitat bridge using
habitat-sim 0.3.0 API (articulated_agent.base_pos). Includes topdown
map rendering via pathfinder, start position persistence, and per-step
agent position tracking in trajectory info.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
