
[Feat] EASI-ER CLI #31

Open
oscarqjh wants to merge 219 commits into EvolvingLMMs-Lab:main from oscarqjh:dev

Conversation


oscarqjh (Collaborator) commented on Mar 9, 2026

Summary

This PR adds the easi-er CLI, a unified evaluation framework for embodied AI agents. It introduces subprocess-isolated simulators, multi-split task definitions, LLM-powered agents, and a CLI for running evaluations across multiple benchmarks.

Core Framework

  • Subprocess isolation: Each simulator runs in its own conda environment (potentially different Python version), communicating via filesystem IPC
  • Multi-split tasks: YAML-based task configs with template inheritance (extends), auto-discovered by registry
  • Pluggable agents: Dummy agent for testing, ReAct agent with multi-action buffering for real evaluation
  • AgentMemory + PromptBuilder: Shared state architecture with task-specific prompt builders following a standardized format
  • Parallel evaluation: Thread-pool parallelism with multi-instance vLLM support (--num-parallel, --vllm-instances)
  • Resume support: Interrupted runs can be resumed from logs/ output directory
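The subprocess isolation above can be sketched as a small filesystem IPC loop: the runner writes a command file into a shared directory, and the bridge process (which in EASI runs inside its own conda environment) picks it up and writes a response file back. This is a hypothetical sketch; the file names, atomic-rename convention, and polling interval are assumptions, not EASI's actual protocol.

```python
import json
import tempfile
import time
from pathlib import Path

def send_command(ipc_dir: Path, command: dict) -> None:
    """Write a command atomically: write to a temp name, then rename."""
    tmp = ipc_dir / "command.json.tmp"
    tmp.write_text(json.dumps(command))
    tmp.rename(ipc_dir / "command.json")

def bridge_poll_once(ipc_dir: Path) -> None:
    """One iteration of the bridge loop: consume a command, emit a response."""
    cmd_file = ipc_dir / "command.json"
    if cmd_file.exists():
        command = json.loads(cmd_file.read_text())
        response = {"status": "ok", "echo": command["action"]}
        (ipc_dir / "response.json").write_text(json.dumps(response))
        cmd_file.unlink()

def wait_for_response(ipc_dir: Path, timeout: float = 5.0) -> dict:
    """Poll for the bridge's response file until it appears or we time out."""
    deadline = time.monotonic() + timeout
    resp_file = ipc_dir / "response.json"
    while time.monotonic() < deadline:
        if resp_file.exists():
            response = json.loads(resp_file.read_text())
            resp_file.unlink()
            return response
        time.sleep(0.01)
    raise TimeoutError("bridge did not respond")

ipc_dir = Path(tempfile.mkdtemp())
send_command(ipc_dir, {"action": "MoveAhead"})
bridge_poll_once(ipc_dir)  # in EASI this runs in the simulator subprocess
result = wait_for_response(ipc_dir)
print(result["echo"])  # → MoveAhead
```

The point of filesystem IPC (rather than sockets or pipes) is that it works across conda environments and Python versions with no shared dependencies beyond the JSON files themselves.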

Simulators (7)

  • AI2-THOR v2.1.0 (Python 3.8) — EB-Alfred
  • AI2-THOR v5.0.0 (Python 3.10) — EB-Navigation, AI2-THOR Rearrangement
  • AI2-THOR v3.3.5 (Python 3.10) — ManipulaTHOR
  • Habitat-Sim v0.1.7 (Python 3.8) — VLN-CE R2R, VLN-CE RxR
  • Habitat-Sim v0.3.0 (Python 3.9) — EB-Habitat, LHPR-VLN
  • CoppeliaSim v4.1.0 (Python 3.10) — EB-Manipulation
  • TDW v1.11.23 (Python 3.10) — HAZARD

Benchmarks (10)

  • EmbodiedBench: EB-Alfred (6 splits), EB-Navigation (5 splits), EB-Habitat (4 splits), EB-Manipulation (4 splits)
  • HAZARD: Fire, Flood, Wind scenarios
  • ManipulaTHOR: Arm point navigation (seen/unseen)
  • AI2-THOR Rearrangement 2023: 5 evaluation splits
  • VLN-CE R2R: Vision-and-language navigation (val seen/unseen)
  • VLN-CE RxR: Multilingual VLN (val seen/unseen, en/hi/te)
  • LHPR-VLN: Multi-subtask navigation (val/test splits)

LLM Infrastructure

  • LLMClient: LiteLLM wrapper supporting any backend (OpenAI, Anthropic, Gemini, vLLM)
  • ServerManager / MultiServerManager: vLLM subprocess lifecycle with GPU allocation, tensor parallelism, and port auto-probing
  • GPU isolation: --vllm-gpus and --sim-gpus for separating LLM inference from simulator rendering

Standardized Prompt Format

  • EASI Standard Prompt Format Reference (docs/easi-prompt-format-reference.md)
  • All non-EmbodiedBench prompt builders aligned: standard section headers, 4-field JSON response format, consistent action history format
  • EmbodiedBench benchmarks retain their original published formats

CLI

easi task list / info / download / scaffold
easi env list / install / check
easi sim test
easi start --agent react --backend openai --model gpt-4o
easi start --resume ./logs/<run_id>

Testing

  • 824 tests, all passing
  • All tests run offline without simulators or LLMs (mocked subprocess bridges)

Implement the easi Python library for orchestrating simulator-based
embodied reasoning evaluation. Includes subprocess isolation via
filesystem IPC, versioned simulator management (conda+uv), agent
interface, task/benchmark framework, and CLI.

Components: core abstractions, dummy/AI2-THOR simulators, dummy task,
dummy agent, LLM client + dummy server, full test suite (44 tests).
Add embodied agent evaluation pipeline with real simulator support:
- ReAct agent with multi-action buffering and PromptBuilder protocol
- EB-Alfred task support (6 splits via multi-split YAML discovery)
- AI2-THOR v2.1.0 bridge with skill-based actions and state tracking
- EvaluationRunner with structured output: <output_dir>/<task>/<run_id>/
- Per-episode artifacts: result.json, trajectory.jsonl, rgb_*.png
- Centralized logging (print -> logger), --verbosity CLI option
- Subprocess observability: bridge output streaming, Ctrl+C cleanup
- LLM API client (OpenAI-compatible) and dummy LLM server
- 106 tests passing
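The multi-split YAML discovery mentioned above (template inheritance via `extends`, plus `simulator_kwargs` forwarded to the bridge) might look roughly like this. This is a hypothetical sketch: only `extends` and `simulator_kwargs` are named in this PR, and the other field names are illustrative.

```yaml
# eb_alfred/base.yaml — shared template for all EB-Alfred splits
name: eb_alfred_base
simulator: ai2thor:v2_1_0
max_steps: 50
simulator_kwargs:
  max_steps: 50       # flows through to the bridge and vendor env

# eb_alfred/common_sense.yaml — one of the 6 auto-discovered splits
extends: base.yaml
name: eb_alfred_common_sense
split: common_sense
```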
Split the monolithic bridge into a generic AI2ThorBridge base class
(simulator layer) and an EBAlfredBridge subclass (task layer), so that
future benchmarks using ai2thor==2.1.0 can reuse the simulator bridge
without rewriting controller management, IPC, or navigation helpers.

- Extract EB-Alfred goal evaluation and task loading into
  easi/tasks/ebalfred/thor_utils.py
- Trim easi/simulators/ai2thor/v2_1_0/thor_utils.py to generic-only
  constants and object query utilities
- Refactor bridge.py from EBAlfredBridge (1062 lines) to generic
  AI2ThorBridge (~314 lines) with configurable simulator_kwargs
- Create easi/tasks/ebalfred/bridge.py with EBAlfredBridge subclass
  containing all skill execution, state tracking, and goal evaluation
- Add get_bridge_script_path() and simulator_kwargs to BaseTask and
  TaskProtocol; override in EBAlfredTask
- Update EvaluationRunner to prefer task-specific bridge paths and
  forward simulator_kwargs
- Add simulator_kwargs to all EB-Alfred YAML configs
- Add 29 tests covering imports, inheritance, method separation,
  bridge path resolution, and simulator_kwargs
Unified client with generate() and generate_structured() methods,
lazy imports, and cumulative usage tracking (tokens + cost).

Manages server start/stop, port checking, health polling with timeout,
and context manager support. Extensible for future backends.
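The cumulative usage tracking can be sketched as follows. This is an illustrative stub, not the real `LLMClient`: the attribute names are assumed, and the backend call (which in EASI goes through litellm) is stubbed out with fixed token counts.

```python
class LLMClient:
    """Toy wrapper showing cumulative usage accounting across calls."""

    def __init__(self, model: str):
        self.model = model
        self.total_prompt_tokens = 0
        self.total_completion_tokens = 0
        self.total_cost = 0.0

    def _record_usage(self, prompt_tokens: int, completion_tokens: int,
                      cost: float) -> None:
        # Accumulate across the whole run so per-run totals are free.
        self.total_prompt_tokens += prompt_tokens
        self.total_completion_tokens += completion_tokens
        self.total_cost += cost

    def generate(self, messages: list[dict]) -> str:
        # Real code calls litellm.completion(...) and reads response.usage;
        # stubbed with fixed numbers so the sketch runs offline.
        self._record_usage(prompt_tokens=12, completion_tokens=5, cost=0.0001)
        return "stubbed response"

client = LLMClient("gpt-4o")
client.generate([{"role": "user", "content": "hi"}])
client.generate([{"role": "user", "content": "hi again"}])
print(client.total_prompt_tokens)  # → 24
```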
New arguments for `easi run` to select the LLM backend and configure
the inference server. Backward compatible with the existing --llm-url.

The runner now resolves the backend, auto-starts vLLM when needed,
creates an LLMClient for non-legacy backends, wraps structured
output, and tracks LLM usage per-episode and per-run.
- Track usage in generate_structured() via instructor's _raw_response
- Fix log file handle leak in ServerManager (store and close in stop())
- Remove duplicate agent_config computation in runner._create_agent()
…port

Replace the tightly-coupled agent/prompt design with a memory-based
architecture where AgentMemory holds shared state, PromptBuilder reads
from memory to construct prompts and parse responses, and the agent is
a thin orchestrator.

Key changes:
- AgentMemory + StepRecord dataclasses as shared agent state
- New PromptBuilderProtocol: build_messages(memory) + parse_response(response, memory)
- Simplified BaseAgent (removed _chat_history, abstract stubs, default act())
- ReActAgent rewritten as thin orchestrator delegating to builder
- EBAlfredPromptBuilder gains chat_history=True mode with VLMPlanner parity
- json_repair moved to easi/utils/ (old location re-exports)
- Removed stateless flag from agent config (builder controls mode)
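The memory-based architecture above can be sketched in a few lines. The `AgentMemory`, `StepRecord`, `build_messages`, and `parse_response` names follow the PR text; the fields and the toy builder are assumptions.

```python
from dataclasses import dataclass, field
from typing import Protocol

@dataclass
class StepRecord:
    """One step of shared agent state (fields are illustrative)."""
    action: str
    observation: str

@dataclass
class AgentMemory:
    """Shared state the builder reads from and writes back to."""
    instruction: str = ""
    steps: list[StepRecord] = field(default_factory=list)

class PromptBuilderProtocol(Protocol):
    def build_messages(self, memory: AgentMemory) -> list[dict]: ...
    def parse_response(self, response: str, memory: AgentMemory) -> str: ...

class EchoPromptBuilder:
    """Toy builder: prompt built from memory, response text used as action."""

    def build_messages(self, memory: AgentMemory) -> list[dict]:
        history = "; ".join(s.action for s in memory.steps)
        return [{"role": "user",
                 "content": f"{memory.instruction}\nHistory: {history}"}]

    def parse_response(self, response: str, memory: AgentMemory) -> str:
        memory.steps.append(StepRecord(action=response, observation=""))
        return response

memory = AgentMemory(instruction="find the mug")
builder: PromptBuilderProtocol = EchoPromptBuilder()
messages = builder.build_messages(memory)
action = builder.parse_response("MoveAhead", memory)
print(action, len(memory.steps))  # → MoveAhead 1
```

With this shape the agent itself stays a thin loop: build messages, call the LLM, hand the raw response back to the builder.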
Builder-owned schema enforcement: prompt builders can now optionally
implement get_response_format() to provide a JSON schema dict that gets
passed through to litellm. ReActAgent handles fallback automatically
when the backend doesn't support response_format.

- LLMClient.generate() accepts optional response_format param
- ReActAgent._generate_with_fallback() tries schema, caches on failure
- EBAlfredPromptBuilder.get_response_format() returns vlm_generation_guide
- Remove dead code: instructor dep, Pydantic schemas, monkey-patching
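The schema-with-fallback pattern can be sketched like this: try passing a JSON schema as `response_format`, and if the backend rejects it, cache that fact so later calls skip the schema entirely. The class and the `TypeError`-based failure signal are illustrative stand-ins, not the real `_generate_with_fallback()`.

```python
class SchemaFallbackCaller:
    """Try response_format once; on failure, remember and never retry it."""

    def __init__(self, generate_fn):
        self._generate = generate_fn
        self._schema_unsupported = False

    def call(self, messages, response_format=None):
        if response_format is not None and not self._schema_unsupported:
            try:
                return self._generate(messages, response_format=response_format)
            except TypeError:  # real code would catch the backend's error
                self._schema_unsupported = True  # cache: skip schema next time
        return self._generate(messages)

calls = []
def fake_backend(messages, response_format=None):
    calls.append(response_format)
    if response_format is not None:
        raise TypeError("response_format not supported")
    return '{"reasoning": "...", "action": "MoveAhead"}'

caller = SchemaFallbackCaller(fake_backend)
schema = {"type": "object", "properties": {"action": {"type": "string"}}}
caller.call([], response_format=schema)  # first try fails, falls back
caller.call([], response_format=schema)  # schema skipped via cached flag
print(len(calls))  # → 3
```

Caching the failure matters in a long evaluation run: without it, every episode would pay one doomed schema attempt per LLM call.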
Add BaseTask.on_episode_reset() hook for task-specific post-reset setup.
EBAlfredTask overrides it to update agent action space from bridge metadata,
removing EB-Alfred-specific logic from the general EvaluationRunner.
…onfig

- trajectory.jsonl: add llm_response field to each step entry
- result.json: add instruction field for each episode
- config.json: include all CLI options and full task YAML config
Retry: LLMClient passes num_retries to litellm.completion() for
automatic exponential backoff on transient errors (timeouts, rate
limits). Configurable via --max-retries (default 3).

Resume: --resume <run_dir> loads config.json from a previous run,
skips completed episodes, clears and re-runs the last episode (which
may have been interrupted), then continues the remaining episodes.
All CLI options are restored from config.json so only --resume is needed.
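The resume behavior can be sketched as a pure planning function over the run directory: keep episodes whose `result.json` exists, but clear and re-run the last one since it may have been interrupted mid-write. The `result.json`-per-episode layout follows the PR text; the helper name and episode-id scheme are assumptions.

```python
import json
import tempfile
from pathlib import Path

def plan_resume(run_dir: Path, all_episodes: list[str]) -> list[str]:
    """Return the episode ids that still need to run."""
    done = sorted(p.name for p in run_dir.iterdir()
                  if p.is_dir() and (p / "result.json").exists())
    if not done:
        return all_episodes
    last = done[-1]  # may have been interrupted mid-write; re-run it
    (run_dir / last / "result.json").unlink()
    finished = set(done) - {last}
    return [ep for ep in all_episodes if ep not in finished]

run_dir = Path(tempfile.mkdtemp())
for ep in ("ep_000", "ep_001"):
    (run_dir / ep).mkdir()
    (run_dir / ep / "result.json").write_text(json.dumps({"success": True}))
remaining = plan_resume(run_dir, ["ep_000", "ep_001", "ep_002", "ep_003"])
print(remaining)  # → ['ep_001', 'ep_002', 'ep_003']
```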
…ps bug

Fix max_steps mismatch where YAML configured 50 but vendor EBAlfEnv
hardcoded 30. Now max_steps flows from YAML through simulator_kwargs
to the bridge and vendor env.

Add per-episode retry in EvaluationRunner: on crash (e.g. AI2-THOR
Unity segfault), the episode dir is cleared, the simulator is
re-launched, and the episode is retried up to max_retries times.
If all retries are exhausted the episode is recorded as failed and
the runner continues to the next episode.
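The per-episode retry loop above reduces to a small pattern; `run_episode` and `relaunch_simulator` here are hypothetical stand-ins for the runner's internals, and the `RuntimeError` stands in for whatever exception a simulator crash surfaces as.

```python
def run_with_retries(run_episode, relaunch_simulator, max_retries: int = 3):
    """Run one episode, retrying after crashes; record failure if exhausted."""
    for _attempt in range(max_retries):
        try:
            return run_episode()
        except RuntimeError:  # e.g. AI2-THOR Unity segfault
            relaunch_simulator()  # real code also clears the episode dir
    return {"success": False, "error": "max retries exhausted"}

attempts = []
def flaky_episode():
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("Unity crashed")
    return {"success": True}

result = run_with_retries(flaky_episode, relaunch_simulator=lambda: None)
print(result["success"], len(attempts))  # → True 3
```

Recording a failed result instead of raising keeps one bad episode from aborting a multi-hour evaluation run.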
Integrate EmbodiedBench EB-Navigation into EASI with vendored env,
task bridge, prompt builder, and 5 split configs (ai2thor v5.0.0).
Remove action_space field from all YAML configs and TaskEntry. Tasks
now define their action space via _build_action_space() override with
caching, eliminating the confusing pattern of empty YAML fields.
…atform

Replace stub bridge with working AI2ThorV5Bridge class that starts a
real controller, handles scene reset and discrete navigation actions.
Switch platform from CloudRendering to Linux64. Increase sim test
timeout default from 30s to 200s for THOR startup.
Fix OUTPUT_TEMPLATE trailing spaces on 3 lines and regenerate
navigation_examples.json from source to fix line continuation artifact
and curly quote mismatch. Verified character-level parity.
- Add habitat_sim:v0_3_0 simulator registration (conda env + manifest)
- Vendor EBHabEnv from EmbodiedBench with fixed imports
- Add EBHabitatTask with dynamic action space via on_episode_reset hook
- Add EBHabitatPromptBuilder matching VLMPlanner prompt construction
- Add 6 per-split YAML configs (base, common_sense, complex_instruction,
  spatial_relationship, visual_appearance, long_horizon)
- Add 26 offline tests for actions, task, prompts, and registry
- Move EB-Habitat-specific deps (gym, hydra-core, omegaconf, imageio,
  habitat-lab) from simulator requirements.txt to task YAML additional_deps
Add --redownload to 'easi task download' and 'easi run' to force
re-download of cached HuggingFace datasets. Useful when a previous
download was interrupted or incomplete.
oscarqjh and others added 30 commits March 10, 2026 14:54
Track whether each Xorg was launched with sudo and use `sudo -n kill`
for those processes, preventing orphaned root-owned Xorg servers.
Start/stop Xorg servers in run(), override render platform per worker
when xorg is active, warn on GPU contention with local LLM backends.
Fix tests that use __new__ to include _xorg_instances attribute.
Added to: ai2thor (v2.1, v3.3.5, v5.0), habitat_sim (v0.1.7, v0.3),
coppeliasim (v4.1), tdw (v1.11.23), omnigibson (v3.7.2).
…rgManager

RenderPlatform gains setup()/teardown()/for_worker() hooks so platforms
that manage external services (like Xorg) are self-contained. No more
if/else xorg handling in callers.

- XorgPlatform.setup() starts XorgManager, for_worker() returns a
  per-worker _XorgWorkerPlatform with fixed display/GPU
- Runners call _setup_render_platform() once, then teardown() in finally
- cli sim_test calls setup()/for_worker()/teardown() directly
- Removed _xorg_instances attribute and scattered xorg_mgr variables
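The setup()/teardown()/for_worker() lifecycle can be sketched as below. The hook names follow the PR text; the `XorgPlatform` internals here are illustrative stubs (the real class starts an XorgManager and returns a per-worker `_XorgWorkerPlatform`).

```python
class RenderPlatform:
    """Base: hooks are no-ops so native backends stay trivial."""

    def setup(self) -> None: ...
    def teardown(self) -> None: ...
    def for_worker(self, worker_id: int) -> "RenderPlatform":
        return self  # native backends need no per-worker state

class XorgPlatform(RenderPlatform):
    def __init__(self):
        self.displays: dict[int, str] = {}
        self.running = False

    def setup(self) -> None:
        self.running = True  # real code starts an XorgManager here

    def teardown(self) -> None:
        self.running = False  # real code kills Xorg servers (sudo-aware)

    def for_worker(self, worker_id: int) -> RenderPlatform:
        # Real code returns a _XorgWorkerPlatform bound to one display/GPU.
        self.displays[worker_id] = f":{worker_id}"
        return self

platform = XorgPlatform()
platform.setup()
try:
    for worker_id in range(2):
        platform.for_worker(worker_id)
finally:
    platform.teardown()  # always cleaned up, even on error
print(sorted(platform.displays.values()))  # → [':0', ':1']
```

Because the hooks live on the platform, callers like the runners and `easi sim test` no longer need any if/else Xorg special-casing.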
- Replace asyncio.get_event_loop() with get_running_loop() (deprecated in 3.10, error in 3.12)
- Use project get_logger() convention in progress.py instead of logging.getLogger()
- Move litellm imports to local scope in test_react_agent.py to avoid import errors
Adopt WorkerBinding plus SimulatorRenderAdapter across registry, runners, subprocess launch, and simulator integrations so render backends own resource assignment while simulators contribute render-specific quirks through one adapter path.
Keep the conda-backed smoke-test path aligned with the new WorkerBinding handoff so render adapters and per-worker launch data reach SubprocessRunner consistently.
Add resolved_name property to RenderPlatform so auto-detection
shows the actual backend (e.g. "native (via auto-detection)")
instead of just "auto".
- cmd_sim_test: move setup()/for_worker() inside the try block that
  owns finally:teardown() so Xorg servers are always cleaned up
- _create_simulator: call setup() on per-simulator resolved platforms
  and register for teardown via self._render_platform
Integrate REVERIE-CE as a new task reusing VLN-CE R2R infrastructure.
Same simulator (habitat_sim:v0_1_7), action space, and metrics. New
prompt builder adapted for REVERIE's high-level instruction style.

Dataset: oscarqjh/REVERIE-CE_easi (repackaged from Dynam3D)
Adds easi/analysis/ package with trajectory_video.py that generates
per-episode MP4 videos showing robot paths on top-down maps alongside
camera views — pure post-processing, no simulator dependencies.
…ridge

Add trajectory visualization hooks to the EB-Habitat bridge using
habitat-sim 0.3.0 API (articulated_agent.base_pos). Includes topdown
map rendering via pathfinder, start position persistence, and per-step
agent position tracking in trajectory info.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
