**Add Agent World Model environment (AWM-1K) #421**

Raibows wants to merge 2 commits into `meta-pytorch:main` from
Conversation
Hi @Raibows! Thank you for your pull request and welcome to our community.

**Action Required**

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

**Process**

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with `CLA Signed`.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!
**Greptile Summary**

This PR adds the Agent World Model (AWM-1K) environment to OpenEnv — a suite of 1,000 synthetic tool-use environments backed by HuggingFace-hosted data, SQLite state, and MCP-over-subprocess tool routing. The environment follows the existing env pattern (server + client split, Pydantic models, WebSocket-only delivery) and comes with solid unit tests. However, there are 2 Tier 1 issues that need fixing and 3 Tier 2 alignment flags that need human review before merging.

**Tier 1: Fix Required**

**Tier 2: Alignment Discussion** (tag `@darktex`)
Confidence Score: 2/5
Important Files Changed
**Sequence Diagram**

```mermaid
sequenceDiagram
    participant Trainer as RL Trainer
    participant Client as AWMEnv (client.py)
    participant Server as FastAPI /ws (app.py)
    participant Env as AWMEnvironment
    participant HF as HuggingFace Hub
    participant Sub as Sub-env Subprocess
    participant DB as SQLite DB
    Trainer->>Client: reset(scenario, task_idx)
    Client->>Server: WebSocket reset message
    Server->>Env: reset(scenario, task_idx)
    Env->>HF: hf_hub_download (first time only)
    HF-->>Env: JSONL data files cached
    Env->>DB: create_database() + save_snapshot()
    Env->>Sub: subprocess.Popen(patched server.py)
    Sub-->>Env: MCP server ready on random port
    Env->>Sub: list_mcp_tools() via mcp-agent
    Sub-->>Env: tool list
    Env-->>Server: AWMObservation(reset_ok)
    Server-->>Client: StepResult
    Client-->>Trainer: observation
    loop Agent steps
        Trainer->>Client: step(CallToolAction(tool_name, args))
        Client->>Server: WebSocket step message
        Server->>Env: step(action)
        Env->>Sub: call_mcp_tool() via mcp-agent
        Sub->>DB: SQL read/write
        DB-->>Sub: result
        Sub-->>Env: tool result
        Env-->>Client: AWMObservation(reward, tool_result)
    end
    Trainer->>Client: step(CallToolAction("verify", {verifier_mode}))
    Client->>Server: WebSocket step message
    Server->>Env: step(verify action)
    Env->>DB: run_verifier(initial_db, final_db)
    DB-->>Env: verification result
    Env-->>Client: AWMObservation(reward, verify_result, done=False)
    Trainer->>Client: step(CallToolAction("done"))
    Client->>Server: WebSocket step message
    Server->>Env: step(done action)
    Env->>Sub: process.stop()
    Env-->>Client: AWMObservation(done=True)
```

Last reviewed commit: fde6176
```python
namespace = {
    "sqlite3": sqlite3,
    "json": json,
    "os": os,
    "__builtins__": __builtins__,
}
exec(verifier_code, namespace)
```
**`exec()` with unrestricted `__builtins__` exposes full Python runtime**

Both `execute_sql_verifier` (lines 88–94) and `execute_code_verifier` (lines 158–164) build an `exec` namespace that passes `"__builtins__": __builtins__`. This gives the verifier code full access to every Python built-in — `open()`, `__import__()`, `subprocess`, `os.system()`, etc. Although the AWM dataset is served from HuggingFace and is assumed to be trustworthy today, running auto-generated code with an unrestricted namespace inside the server process is a supply-chain vulnerability: a compromised dataset entry could exfiltrate secrets, spawn processes, or corrupt the server state.

The fix is to restrict the namespace to an explicit, minimal allowlist:
```python
# execute_sql_verifier (same pattern applies to execute_code_verifier)
namespace = {
    "sqlite3": sqlite3,
    "json": json,
    "os": os,
    "__builtins__": {  # restrict to a safe subset
        "len": len, "range": range, "int": int, "str": str,
        "float": float, "bool": bool, "list": list, "dict": dict,
        "tuple": tuple, "set": set, "print": print,
        "isinstance": isinstance, "hasattr": hasattr,
        "getattr": getattr, "enumerate": enumerate, "zip": zip,
        "sorted": sorted, "sum": sum, "min": min, "max": max,
        "True": True, "False": False, "None": None,
    },
}
exec(verifier_code, namespace)
```

The same change should be applied at line 158 in `execute_code_verifier`.
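As a quick sanity check (a minimal sketch, not code from the PR), an allowlisted `__builtins__` dict makes unlisted names like `open` fail name resolution inside `exec`:

```python
# Hypothetical check: with a restricted __builtins__ dict, names outside the
# allowlist (open, __import__, ...) are simply not resolvable by exec'd code.
safe_builtins = {"len": len, "print": print}
namespace = {"__builtins__": safe_builtins}

try:
    exec("open('/etc/passwd')", namespace)
    blocked = False
except NameError:  # 'open' is not in the allowlist
    blocked = True

exec("x = len('abc')", namespace)  # allowlisted names still work
```

Note this only blocks name lookup; genuinely untrusted code would still need process-level sandboxing.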
```python
def _get_random_port() -> int:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        s.listen(1)
        return s.getsockname()[1]
```
**TOCTOU race condition: port released before subprocess claims it**

`_get_random_port()` binds to port 0 to let the OS assign a free port, records that port number, then immediately closes and releases the socket. Between the `with` block exiting and the subprocess calling `uvicorn.run(app, port=…)`, another process on the system can bind to that same port. Under concurrent rollout collection (the primary AWM use-case), this is likely: many `AWMEnvironment` instances start subprocesses in parallel, all calling `_get_random_port()` at nearly the same time.

The canonical fix is to hold the socket open and pass the file descriptor to the child process so it can call `socket.fromfd()` and bind without racing. A simpler practical alternative is to keep the socket bound and let the OS double-bind (`SO_REUSEPORT`), but the cleanest approach for uvicorn is to pass an already-bound FD:
```python
def _get_random_port() -> int:
    """Return a free port. NOTE: still has a small race window — prefer fd passing for production."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]
```

At minimum, the `SO_REUSEADDR` flag makes re-binding by the child much less likely to fail. A full fix would pass the socket FD to uvicorn via `--fd`.
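The FD-passing approach could be sketched roughly as follows; this is not code from the PR, the `server:app` path is a placeholder, and availability of uvicorn's `--fd` option is assumed:

```python
import socket
import subprocess
import sys

def bind_free_port() -> socket.socket:
    """Bind and listen on an OS-assigned port, keeping the socket open so
    another process cannot claim the port before the child inherits it."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind(("127.0.0.1", 0))
    sock.listen(128)
    return sock

def spawn_uvicorn(sock: socket.socket) -> subprocess.Popen:
    """Hand the still-open listening FD to uvicorn via --fd (sketch)."""
    return subprocess.Popen(
        [sys.executable, "-m", "uvicorn", "server:app", "--fd", str(sock.fileno())],
        pass_fds=(sock.fileno(),),  # keep this FD open in the child
    )
```

Because the parent never closes the socket before the child binds it, the race window disappears entirely.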
```python
elif isinstance(action, CallToolAction):
    if action.tool_name == "done":
        return self._handle_done(action)
    elif action.tool_name == "verify":
```
**ALIGNMENT FLAG: Agent terminates its own episode via `done` tool**

- **Principle at stake**: `INVARIANTS.md` — "Agents cannot reset"; the invariant reads: "Agents MUST NOT have the ability to reset the environment or advance simulation time."
- **The concern**: The `done` tool is callable by the agent and causes `done=True`, ending the trajectory. This gives the agent direct control over episode termination — a form of reset control. An RL-trained agent could learn to call `done` early (e.g., after a lucky high-reward `verify` call) to avoid penalty steps, inflating its trajectory statistics. The invariant exists precisely to prevent agents from short-circuiting episode mechanics.
- **Suggested reviewer**: @darktex

The concern is distinct from "agents cannot *reset*": the agent cannot re-start, but it can *stop early*. Whether this tradeoff is acceptable (e.g., "done signals the agent gave up") should be explicitly discussed and documented.
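One server-side alternative, sketched here as a hypothetical helper (not code from the PR): let the environment terminate episodes via a step budget, so the `done` decision never rests with the agent:

```python
class StepBudget:
    """Hypothetical server-side episode terminator: the environment, not the
    agent, decides when done=True by enforcing a maximum step count."""

    def __init__(self, max_steps: int = 50) -> None:
        self.max_steps = max_steps
        self.steps = 0

    def tick(self) -> bool:
        """Advance one step; return True when the episode must terminate."""
        self.steps += 1
        return self.steps >= self.max_steps
```

The environment would call `tick()` inside `step()` and set `done=True` when it returns `True`, with `reset()` constructing a fresh budget.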
```python
def _handle_verify(self, action: CallToolAction) -> AWMObservation:
    """Handle the `verify` tool — run verifier with specified mode."""
    if not self._reset_ok or self._scenario is None:
        return AWMObservation(
            done=False,
            reward=self._get_reward("server_error"),
            reward_type="server_error",
            error="Cannot verify: environment not initialized "
            "(reset failed or not called)",
        )

    if self._task is None or self._task_idx is None:
        return AWMObservation(
            done=False,
            reward=self._get_reward("no_verifier"),
            reward_type="no_verifier",
            error="Cannot verify: no task specified at reset",
        )

    # Get verifier_mode from arguments
```
**ALIGNMENT FLAG: `verify` returns `done=False`, allowing unbounded reward accumulation**

- **Principle at stake**: `PRINCIPLES.md` — "One env = one trajectory"; rewards must reflect a single well-defined completion signal.
- **The concern**: `_handle_verify` always returns `done=False`, so an agent can call `verify` arbitrarily many times within a single episode. Each call returns a `reward` (e.g., `1.0` for `complete`). A policy could learn to spam `verify` after a successful action and accumulate rewards indefinitely without calling `done`. The README even advertises "can be called multiple times with different modes" as a feature.
- **Suggested reviewer**: @darktex

This diverges from the Gymnasium convention where a terminal reward fires once. If multi-call verify is intentional for curriculum / partial-credit use-cases, the reward config should probably emit non-zero reward only on the *first* `complete` classification per episode, or `verify` should set `done=True` on success.
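The first-success-only option could look like this hypothetical gate (class and method names are illustrative, not from the PR):

```python
class VerifyRewardGate:
    """Emit the verify reward only for the first 'complete' classification
    per episode; subsequent successes earn zero."""

    def __init__(self) -> None:
        self._rewarded = False  # reset() would flip this back to False

    def reward_for(self, classification: str, full_reward: float) -> float:
        if classification == "complete" and not self._rewarded:
            self._rewarded = True
            return full_reward
        return 0.0
```

This keeps multi-call verify usable for inspection while capping the total episode reward from `verify` at one success.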
```python
# Get verifier_mode from arguments
args = action.arguments or {}
verifier_mode = args.get("verifier_mode", "code")
final_answer = args.get("final_answer")

if verifier_mode not in VALID_VERIFIER_MODES:
    return AWMObservation(
        done=False,
        reward=self._get_reward("invalid_args"),
        reward_type="invalid_args",
        error=f"Invalid verifier_mode '{verifier_mode}'. "
        f"Must be one of: {', '.join(sorted(VALID_VERIFIER_MODES))}",
    )
```
**ALIGNMENT FLAG: Agent selects `verifier_mode` — environment should control reward signal**

- **Principle at stake**: `PRINCIPLES.md` — "Rewards inside environment"; reward computation must be fully determined by the environment.
- **The concern**: `verifier_mode` is read from `action.arguments` — meaning the *agent* chooses whether to be evaluated by the SQL verifier (requires LLM judge, harder) or the code verifier (deterministic, potentially easier). An RL policy could exploit this by learning to always call `verify` with the lenient mode, obtaining inflated rewards without actually completing the task. The reward signal would no longer be fully in the environment's control.
- **Suggested reviewer**: @darktex

One approach: fix `verifier_mode` at `reset()` time (the orchestrator chooses it, not the agent) and store it as `self._verifier_mode`. The `verify` tool call would then ignore any `verifier_mode` argument from the agent.
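A sketch of the reset-time pinning idea (the class and method names here are illustrative only, and the mode set is assumed):

```python
VALID_VERIFIER_MODES = {"code", "sql"}  # assumed set of modes

class PinnedVerifier:
    """Pin verifier_mode when the orchestrator calls reset(); ignore any
    verifier_mode the agent passes in the verify tool arguments."""

    def reset(self, verifier_mode: str = "code") -> None:
        if verifier_mode not in VALID_VERIFIER_MODES:
            raise ValueError(f"invalid verifier_mode: {verifier_mode}")
        self._verifier_mode = verifier_mode

    def mode_for_verify(self, arguments: dict) -> str:
        # Agent-supplied arguments cannot override the pinned mode.
        return self._verifier_mode
```

With this shape, a policy that sends `{"verifier_mode": "code"}` in the tool call still gets judged by whatever mode the orchestrator chose at reset.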
**Clarification on the issues Greptile found**

1. The verifier code comes exclusively from our own Snowflake/AgentWorldModel-1K dataset, which does not involve security concerns.
2. This is a real but very low-probability issue. We can add retry logic if the team considers it necessary.
3. This is intentional. AWM supports two independent verification modes (code and SQL), and users often want to run both on the same episode.
4. This does not enable trajectory multiplexing. Each WebSocket connection creates a completely independent
5. Each environment in Agent World Model is a FastAPI process, a lightweight application server, which does not require docker container isolation. Involving docker containers would slow down environment setup and teardown, a disadvantage for large-scale RL training (e.g., 1024 parallel instances launched per step).
6. The SQL verifier mode (code-augmented LLM-as-a-Judge) is optional, though recommended based on our experiment results. We also provide
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!
**Summary**
Hi team! We would love to contribute the Agent World Model environment to OpenEnv, if you're open to it.
AgentWorldModel-1K is a synthetic agentic environment suite we built for large-scale tool-use RL training. It provides:
More details are in our paper: Agent World Model: Infinity Synthetic Environments for Agentic Reinforcement Learning.
Happy to address any feedback or make adjustments to better fit OpenEnv!
I also want to loop in @sfc-gh-kganesan to help with this request.
**Type of Change**

**Alignment Checklist**

Before submitting, verify:

- Read `.claude/docs/PRINCIPLES.md` and this PR aligns with our principles
- Read `.claude/docs/INVARIANTS.md` and no invariants are violated
- Ran `/pre-submit-pr` (or `bash .claude/hooks/lint.sh` and tests) and addressed all issues

**RFC Status**
**Test Plan**

- `tests/envs/test_awm_environment.py` (smoke data) covering environment lifecycle, scenario management, tool discovery/calling, both verification modes, reward configuration, and session management
- `envs/agent_world_model_env/example_usage.py` to test the actual data & running

**Claude Code Review**
N/A