Add Agent World Model environment (AWM-1K) #421

Open

Raibows wants to merge 2 commits into meta-pytorch:main from Raibows:add-agent-world-model

Conversation


@Raibows Raibows commented Mar 7, 2026

Summary

Hi team! We would love to contribute the Agent World Model environment to OpenEnv, if you're open to it.

AgentWorldModel-1K is a synthetic agentic environment suite we built for large-scale tool-use RL training. It provides:

  • 1,000 environments with 10,000 tasks spanning diverse domains (e-commerce, booking, banking, healthcare, etc.)
  • Each environment is a fully functional MCP server with tools, SQLite database state, and verification logic
  • Built-in support for parallel rollout collection — a single server handles concurrent isolated sessions

More details are in our paper: Agent World Model: Infinity Synthetic Environments for Agentic Reinforcement Learning.

Happy to address any feedback or make adjustments to better fit OpenEnv!

I also want to loop in @sfc-gh-kganesan, who helped with this request.

Type of Change

  • Bug fix
  • New feature
  • Breaking change
  • Documentation
  • New environment
  • Refactoring

Alignment Checklist

Before submitting, verify:

  • I have read .claude/docs/PRINCIPLES.md and this PR aligns with our principles
  • I have checked .claude/docs/INVARIANTS.md and no invariants are violated
  • I have run /pre-submit-pr (or bash .claude/hooks/lint.sh and tests) and addressed all issues

RFC Status

  • Not required (new environment addition, follows existing env patterns)
  • RFC exists: #___
  • RFC needed (will create before merge)

Test Plan

  • Unit tests in tests/envs/test_awm_environment.py (smoke data) covering environment lifecycle, scenario management, tool discovery/calling, both verification modes, reward configuration, and session management
  • Working example agent in envs/agent_world_model_env/example_usage.py to exercise the real data in a full run
  • Lint passing

Claude Code Review

N/A


meta-cla bot commented Mar 7, 2026

Hi @Raibows!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!


greptile-apps bot commented Mar 7, 2026

Greptile Summary

This PR adds the Agent World Model (AWM-1K) environment to OpenEnv — a suite of 1,000 synthetic tool-use environments backed by HuggingFace-hosted data, SQLite state, and MCP-over-subprocess tool routing. The environment follows the existing env pattern (server + client split, Pydantic models, WebSocket-only delivery) and comes with solid unit tests. However, there are 2 Tier 1 issues that need fixing and 3 Tier 2 alignment flags that need human review before merging.

Tier 1: Fix Required

  • envs/agent_world_model_env/server/verifier.py:88,164 — exec() is called with "__builtins__": __builtins__, giving auto-generated verifier code from HuggingFace full access to open(), __import__(), subprocess, etc. This must be restricted to a minimal allowlist before merge.
  • envs/agent_world_model_env/server/scenario_manager.py:32 — _get_random_port() releases the OS-assigned port before the subprocess claims it, creating a TOCTOU race condition that can cause startup failures under the concurrent-rollout use-case the environment is specifically designed for.

Tier 2: Alignment Discussion (tag @darktex)

  • Agent-callable done tool: Agents can explicitly end their own episode — a potential violation of the INVARIANTS.md invariant that agents must not control episode lifecycle.
  • verify returns done=False: Agents can call verify repeatedly within one episode and accumulate reward=1.0 on each success. This undermines the "one env = one trajectory" principle.
  • Agent selects verifier_mode: The verifier_mode argument in the verify tool call is agent-supplied. A trained policy could learn to always pick the easier verifier, breaking the principle that reward computation is fully environment-controlled.

Confidence Score: 2/5

  • Not safe to merge: an unrestricted exec() security issue and a concurrent-use race condition need to be resolved, and three alignment questions need explicit sign-off from @Darktex.
  • Score of 2 reflects: the codebase structure is solid and tests are comprehensive, but the exec(builtins) vulnerability is a genuine security risk in a server process, the port race condition is a correctness bug under the primary concurrent use-case, and three alignment questions touch INVARIANTS.md invariants that require explicit human sign-off per the project's review model.
  • Pay close attention to envs/agent_world_model_env/server/verifier.py (exec security) and envs/agent_world_model_env/server/scenario_manager.py (port race). Also review awm_environment.py for the three alignment concerns around episode/reward control.

Important Files Changed

  • envs/agent_world_model_env/server/verifier.py: Contains the core security issue: exec() with builtins fully exposed allows verifier code from HuggingFace to execute arbitrary Python. Also handles LLM-as-judge integration for SQL mode.
  • envs/agent_world_model_env/server/scenario_manager.py: Manages subprocess lifecycle for sub-environments. Contains a TOCTOU race condition in _get_random_port() that can cause port conflicts under concurrent rollouts — the primary AWM use-case.
  • envs/agent_world_model_env/server/awm_environment.py: Core environment logic. Has two alignment flags: agent-callable done tool violates the no-reset invariant, and verify returning done=False with agent-chosen verifier_mode undermines reward integrity.
  • envs/agent_world_model_env/client.py: Thin client extending MCPToolClient. Correctly overrides _parse_result to deserialize AWM-specific observation fields. No server imports; client-server separation is maintained.
  • envs/agent_world_model_env/models.py: Pydantic models for AWMAction and AWMObservation. Well-structured with discriminated union, model_dump exclude_none, and has_verifier backwards-compat validator. No issues found.
  • envs/agent_world_model_env/server/data_loader.py: HuggingFace dataset download and lazy-loading. Straightforward JSONL parsing with normalized key lookup. AWM_DATA_DIR env var override works correctly. No critical issues.
  • envs/agent_world_model_env/server/db_manager.py: SQLite database creation and snapshot management. Handles schema DDL and sample data insertion with per-statement error handling. save_snapshot via shutil.copy2 is correct.
  • envs/agent_world_model_env/server/app.py: FastAPI app setup. Correctly disables HTTP /reset and /step (replacing them with 400 handlers) since AWM requires stateful WebSocket sessions. Shared data loader pattern is appropriate.
  • tests/envs/test_awm_environment.py: Comprehensive unit tests using smoke data (no real HuggingFace download needed). Covers environment lifecycle, verifier modes, reward config, and session management. Tests do not cover concurrent session behavior.

Sequence Diagram

```mermaid
sequenceDiagram
    participant Trainer as RL Trainer
    participant Client as AWMEnv (client.py)
    participant Server as FastAPI /ws (app.py)
    participant Env as AWMEnvironment
    participant HF as HuggingFace Hub
    participant Sub as Sub-env Subprocess
    participant DB as SQLite DB

    Trainer->>Client: reset(scenario, task_idx)
    Client->>Server: WebSocket reset message
    Server->>Env: reset(scenario, task_idx)
    Env->>HF: hf_hub_download (first time only)
    HF-->>Env: JSONL data files cached
    Env->>DB: create_database() + save_snapshot()
    Env->>Sub: subprocess.Popen(patched server.py)
    Sub-->>Env: MCP server ready on random port
    Env->>Sub: list_mcp_tools() via mcp-agent
    Sub-->>Env: tool list
    Env-->>Server: AWMObservation(reset_ok)
    Server-->>Client: StepResult
    Client-->>Trainer: observation

    loop Agent steps
        Trainer->>Client: step(CallToolAction(tool_name, args))
        Client->>Server: WebSocket step message
        Server->>Env: step(action)
        Env->>Sub: call_mcp_tool() via mcp-agent
        Sub->>DB: SQL read/write
        DB-->>Sub: result
        Sub-->>Env: tool result
        Env-->>Client: AWMObservation(reward, tool_result)
    end

    Trainer->>Client: step(CallToolAction("verify", {verifier_mode}))
    Client->>Server: WebSocket step message
    Server->>Env: step(verify action)
    Env->>DB: run_verifier(initial_db, final_db)
    DB-->>Env: verification result
    Env-->>Client: AWMObservation(reward, verify_result, done=False)

    Trainer->>Client: step(CallToolAction("done"))
    Client->>Server: WebSocket step message
    Server->>Env: step(done action)
    Env->>Sub: process.stop()
    Env-->>Client: AWMObservation(done=True)
```

Last reviewed commit: fde6176

Comment on lines +88 to +94
```python
namespace = {
    "sqlite3": sqlite3,
    "json": json,
    "os": os,
    "__builtins__": __builtins__,
}
exec(verifier_code, namespace)
```

exec() with unrestricted __builtins__ exposes full Python runtime

Both execute_sql_verifier (line 88–94) and execute_code_verifier (line 158–164) build an exec namespace that passes "__builtins__": __builtins__. This gives the verifier code full access to every Python built-in — open(), __import__(), subprocess, os.system(), etc. Although the AWM dataset is served from HuggingFace and is assumed to be trustworthy today, running auto-generated code with an unrestricted namespace inside the server process is a supply-chain vulnerability: a compromised dataset entry could exfiltrate secrets, spawn processes, or corrupt the server state.

The fix is to restrict the namespace to an explicit, minimal allowlist:

```python
# execute_sql_verifier  (same pattern applies to execute_code_verifier)
namespace = {
    "sqlite3": sqlite3,
    "json": json,
    "os": os,
    "__builtins__": {            # restrict to a safe subset
        "len": len, "range": range, "int": int, "str": str,
        "float": float, "bool": bool, "list": list, "dict": dict,
        "tuple": tuple, "set": set, "print": print,
        "isinstance": isinstance, "hasattr": hasattr,
        "getattr": getattr, "enumerate": enumerate, "zip": zip,
        "sorted": sorted, "sum": sum, "min": min, "max": max,
        "True": True, "False": False, "None": None,
    },
}
exec(verifier_code, namespace)
```

The same change should be applied at line 158 in execute_code_verifier.
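To illustrate the effect, here is a minimal sketch of restricted exec (SAFE_BUILTINS is a truncated stand-in for the full allowlist above, not code from this PR): names in exec'd code resolve only through the supplied __builtins__ dict, so open() and even import statements fail.

```python
# Truncated stand-in for the allowlist suggested above (hypothetical).
SAFE_BUILTINS = {"len": len, "range": range, "sum": sum, "sorted": sorted}

def run_restricted(code: str) -> dict:
    """Execute code with only the allowlisted built-ins visible."""
    namespace = {"__builtins__": SAFE_BUILTINS}
    exec(code, namespace)
    return namespace

# Allowlisted built-ins still work:
ns = run_restricted("total = sum(range(10))")
assert ns["total"] == 45

# open() (and __import__, hence `import`) are no longer reachable:
try:
    run_restricted("open('/etc/passwd')")
except NameError as e:
    print("blocked:", e)
```

Removing `__import__` from the allowlist also disables `import` statements inside the exec'd code, since the import machinery resolves through `__builtins__`.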


Comment on lines +31 to +35
```python
def _get_random_port() -> int:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        s.listen(1)
        return s.getsockname()[1]
```

TOCTOU race condition: port released before subprocess claims it

_get_random_port() binds to port 0 to let the OS assign a free port, records that port number, then immediately closes and releases the socket. Between the with block exiting and the subprocess calling uvicorn.run(app, port=…), another process on the system can bind to that same port. Under concurrent rollout collection (the primary AWM use-case), this is likely: many AWMEnvironment instances start subprocesses in parallel, all calling _get_random_port() at nearly the same time.

The canonical fix is to hold the socket open and pass the file descriptor to the child process so it can call socket.fromfd() and bind without racing. A simpler practical alternative is to keep the socket bound and let the OS double-bind (SO_REUSEPORT), but the cleanest approach for uvicorn is to pass an already-bound FD:

```python
def _get_random_port() -> int:
    """Return a free port. NOTE: still has a small race window — prefer fd passing for production."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]
```

At minimum, the SO_REUSEADDR flag makes re-binding by the child much less likely to fail. A full fix would pass the socket FD to uvicorn via --fd.
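As a sketch of that fuller fix: the parent keeps the socket bound and passes its file descriptor to the child, which uvicorn adopts via its --fd flag, so no other process can claim the port between selection and bind. The helper names (bind_inheritable, launch_subenv) are hypothetical, not part of this PR.

```python
import socket
import subprocess
import sys

def bind_inheritable(host: str = "127.0.0.1") -> tuple[socket.socket, int]:
    """Bind a listening socket that a child process can inherit."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))          # OS picks a free port
    sock.listen(128)
    sock.set_inheritable(True)    # FD survives exec in the child
    return sock, sock.getsockname()[1]

def launch_subenv(app_path: str) -> subprocess.Popen:
    """Hypothetical launcher: uvicorn binds to the inherited FD,
    eliminating the TOCTOU window entirely."""
    sock, _port = bind_inheritable()
    return subprocess.Popen(
        [sys.executable, "-m", "uvicorn", app_path, "--fd", str(sock.fileno())],
        pass_fds=(sock.fileno(),),  # keep the FD open across fork/exec
    )
```

With `pass_fds`, the child sees the same descriptor number, so passing `sock.fileno()` on the command line is safe.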


Comment on lines +311 to +314
```python
elif isinstance(action, CallToolAction):
    if action.tool_name == "done":
        return self._handle_done(action)
    elif action.tool_name == "verify":
```

ALIGNMENT FLAG: Agent terminates its own episode via done tool

  • Principle at stake: INVARIANTS.md, "Agents cannot reset"; the invariant reads: "Agents MUST NOT have the ability to reset the environment or advance simulation time."
  • The concern: The done tool is callable by the agent and causes done=True, ending the trajectory. This gives the agent direct control over episode termination — a form of reset control. An RL-trained agent could learn to call done early (e.g., after a lucky high-reward verify call) to avoid penalty steps, inflating its trajectory statistics. The invariant exists precisely to prevent agents from short-circuiting episode mechanics.
  • Suggested reviewer: @darktex

The concern is distinct from "agents cannot reset": the agent cannot re-start, but it can stop early. Whether this tradeoff is acceptable (e.g., "done signals the agent gave up") should be explicitly discussed and documented.
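If the team decides termination must stay with the environment, one option is a server-side step budget that ends the episode regardless of agent tool calls. A minimal sketch (StepBudget and MAX_STEPS are hypothetical, not in this PR):

```python
MAX_STEPS = 50  # assumed budget, chosen by the orchestrator

class StepBudget:
    """Environment-owned termination: the agent cannot end or extend
    the episode; the budget decides when done=True."""

    def __init__(self, max_steps: int = MAX_STEPS):
        self.max_steps = max_steps
        self.steps = 0

    def tick(self) -> bool:
        """Call once per step; returns True when the episode is over."""
        self.steps += 1
        return self.steps >= self.max_steps
```

The `done` tool could then be kept purely as a signal ("agent believes it finished") without actually terminating the trajectory.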


Comment on lines +470 to +489
```python
def _handle_verify(self, action: CallToolAction) -> AWMObservation:
    """Handle the `verify` tool — run verifier with specified mode."""
    if not self._reset_ok or self._scenario is None:
        return AWMObservation(
            done=False,
            reward=self._get_reward("server_error"),
            reward_type="server_error",
            error="Cannot verify: environment not initialized "
            "(reset failed or not called)",
        )

    if self._task is None or self._task_idx is None:
        return AWMObservation(
            done=False,
            reward=self._get_reward("no_verifier"),
            reward_type="no_verifier",
            error="Cannot verify: no task specified at reset",
        )

    # Get verifier_mode from arguments
```

ALIGNMENT FLAG: verify returns done=False, allowing unbounded reward accumulation

  • Principle at stake: PRINCIPLES.md, "One env = one trajectory"; rewards must reflect a single well-defined completion signal.
  • The concern: _handle_verify always returns done=False, so an agent can call verify arbitrarily many times within a single episode. Each call returns a reward (e.g., 1.0 for complete). A policy could learn to spam verify after a successful action and accumulate rewards indefinitely without calling done. The README even advertises "can be called multiple times with different modes" as a feature.
  • Suggested reviewer: @darktex

This diverges from the Gymnasium convention where a terminal reward fires once. If multi-call verify is intentional for curriculum / partial-credit use-cases, the reward config should probably emit non-zero reward only on the first complete classification per episode, or verify should set done=True on success.
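A first-success latch is one way to keep verify callable multiple times (for the multi-mode use-case) while bounding total reward per episode. A minimal sketch (RewardLatch is a hypothetical name, not in the PR):

```python
class RewardLatch:
    """Emit the completion reward at most once per episode,
    no matter how many times verify succeeds."""

    def __init__(self):
        self._rewarded = False

    def reward_for(self, verify_passed: bool, reward: float = 1.0) -> float:
        if verify_passed and not self._rewarded:
            self._rewarded = True   # latch: later successes pay nothing
            return reward
        return 0.0
```

The latch would be reset alongside the environment at `reset()`, so each trajectory still gets exactly one shot at the terminal reward.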


Comment on lines +489 to +501
```python
# Get verifier_mode from arguments
args = action.arguments or {}
verifier_mode = args.get("verifier_mode", "code")
final_answer = args.get("final_answer")

if verifier_mode not in VALID_VERIFIER_MODES:
    return AWMObservation(
        done=False,
        reward=self._get_reward("invalid_args"),
        reward_type="invalid_args",
        error=f"Invalid verifier_mode '{verifier_mode}'. "
        f"Must be one of: {', '.join(sorted(VALID_VERIFIER_MODES))}",
    )
```

ALIGNMENT FLAG: Agent selects verifier_mode — environment should control reward signal

  • Principle at stake: PRINCIPLES.md, "Rewards inside environment"; reward computation must be fully determined by the environment.
  • The concern: verifier_mode is read from action.arguments — meaning the agent chooses whether to be evaluated by the SQL verifier (requires LLM judge, harder) or the code verifier (deterministic, potentially easier). An RL policy could exploit this by learning to always call verify with the lenient mode, obtaining inflated rewards without actually completing the task. The reward signal would no longer be fully in the environment's control.
  • Suggested reviewer: @darktex

One approach: fix verifier_mode at reset() time (the orchestrator chooses it, not the agent) and store it as self._verifier_mode. The verify tool call would then ignore any verifier_mode argument from the agent.



Raibows commented Mar 7, 2026

Clarifications on the issues Greptile found.

  1. exec() in verifier.py

The verifier code comes exclusively from our own Snowflake/AgentWorldModel-1K dataset, which does not raise security concerns. The os and sqlite3 modules are required for verifiers to compare initial vs. final database states. We also chmod 0o444 both DB files before execution to prevent writes.

  2. _run_async() event loop bug

AWMEnvironment implements the sync reset() / step() methods from the Environment base class. The OpenEnv framework calls sync environments from a thread pool context (via the reset_async/step_async default implementations), so there is no running event loop when _run_async() executes.

  3. TOCTOU port race

This is a real but very low-probability issue. We can add retry logic if the team considers it necessary.

  4. Debug print() statements

All print() statements have been replaced with logger.info().

  5. Mid-episode verify tool

This is intentional. AWM supports two independent verification modes (code and SQL), and users often want to run both on the same episode. The verify tool returns a reward but does not end the episode.

  6. SUPPORTS_CONCURRENT_SESSIONS = True

This does not enable trajectory multiplexing. Each WebSocket connection creates a completely independent AWMEnvironment instance with its own session directory, SQLite database, and subprocess. It is "one connection = one trajectory".

  7. No Docker container isolation

Each environment in Agent World Model is a lightweight FastAPI process that does not require Docker container isolation. Involving Docker would slow down environment setup and teardown, a disadvantage for large-scale RL training (e.g., 1,024 parallel instances launched per step).

  8. External LLM API for SQL verifier

The SQL verifier mode (code-augmented LLM-as-a-Judge) is optional, though recommended based on our experimental results. We also provide a code mode that executes the verifier code directly to determine task completion and reward, without needing an external LLM API.


meta-cla bot commented Mar 7, 2026

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Mar 7, 2026