OpenEnv hackathon expert-in-the-loop track w/ AWM#428
sfc-gh-mhidayetoglu wants to merge 18 commits into meta-pytorch:main
Conversation
- Fix SQLite seed data loading: handle backslash-escaped quotes in AWM dataset SQL statements (`_fix_escaped_quotes` in `db_manager.py`)
- Add exception handler to dynamically generated FastAPI sub-environments for better 500 error debugging (`scenario_manager.py`)
- Make `example_usage.py` model/endpoint configurable via env vars
- Add `run_awm_task.py`: multi-task runner with improved system prompt
- Add `run_awm_task_with_expert.py`: two-model architecture where a cheaper agent (o4-mini) is guided by an expert model (gpt-5.1) that analyzes baseline logs to provide task-specific advice
- Include benchmark results for `workflow_automation_1` (10 tasks): gpt-5.1 baseline 9/10, o4-mini solo 8/10, o4-mini + expert 9/10

Made-with: Cursor
Expert Enhancement
Thank you for your pull request and welcome to our community.

Action Required: In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process: In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged. If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!
Greptile Summary

This PR adds a complete new environment (…). Key findings:
Confidence Score: 1/5
Last reviewed commit: 16de06b
```python
namespace = {
    "sqlite3": sqlite3,
    "json": json,
    "os": os,
    "__builtins__": __builtins__,
}
exec(verifier_code, namespace)
```
Both execute_sql_verifier (lines 88-94) and execute_code_verifier (lines 158-165) pass "__builtins__": __builtins__ into the exec() namespace. This grants the executed verifier code full access to all Python built-ins including open(), __import__(), subprocess, os.system(), network sockets, etc.
While the DB files are temporarily chmod'd to 0o444, the verifier code is free to write to arbitrary paths, launch subprocesses, exfiltrate data, or import any installed module. Since verifier_code comes from the dataset (gen_verifiers.jsonl), this creates a significant attack surface if the dataset is ever compromised or if malformed entries are present.
At minimum, __builtins__ should be restricted to a safe allowlist (e.g., only open on read-only paths, json, sqlite3, re), or the verifier should be run in a subprocess with OS-level isolation (like the sub-env approach used elsewhere in this PR).
The same pattern appears in both functions.
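One way to act on this is to hand `exec()` a restricted builtins dict instead of the full `__builtins__`. The sketch below is illustrative (the `SAFE_BUILTINS` and `run_verifier` names are not from the PR), and a builtins allowlist is only defense-in-depth, not a real sandbox — subprocess/OS-level isolation is still the stronger fix:

```python
# Sketch: run verifier code under a builtins allowlist instead of full __builtins__.
# SAFE_BUILTINS and run_verifier are illustrative names, not from the PR.
import builtins
import json
import re
import sqlite3

# Only the callables the verifiers are known to need.
SAFE_BUILTINS = {
    name: getattr(builtins, name)
    for name in (
        "len", "range", "min", "max", "sum", "sorted", "isinstance",
        "str", "int", "float", "dict", "list", "set", "tuple",
        "enumerate", "zip", "abs", "print",
    )
}

def run_verifier(verifier_code: str) -> dict:
    namespace = {
        "sqlite3": sqlite3,
        "json": json,
        "re": re,
        "__builtins__": SAFE_BUILTINS,  # no open, __import__, eval, exec
    }
    exec(verifier_code, namespace)
    return namespace

# Ordinary verifier logic still works:
ns = run_verifier("result = sum(range(5))")
assert ns["result"] == 10

# Arbitrary file access now fails inside verifier code:
try:
    run_verifier("open('/etc/passwd')")
except NameError:
    pass  # open is not in the allowlist
```

Note that determined code can still escape a builtins allowlist via object introspection, so this should complement, not replace, running the verifier in an isolated subprocess.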
Path: envs/agent_world_model_env/server/verifier.py, lines 88-94
```python
logger = logging.getLogger(__name__)

HIDDEN_TOOLS = frozenset(["done", "verify", "__list_scenarios__"])
```
HIDDEN_TOOLS = frozenset(["done", "verify", "__list_scenarios__"]) is defined here but never referenced anywhere in this file or elsewhere in the codebase. This constant appears to be intended as a guard — but the step() method dispatches these tool names purely through explicit if/elif chains (lines 312-317) without consulting HIDDEN_TOOLS.
The practical consequence is that the names "done", "verify", and "__list_scenarios__" are invisible to list_tools() (because they aren't in the MCP subprocess's tool list), but an agent that knows or discovers these names can call them via CallToolAction and the server will process them. For example, calling verify mid-episode returns a live reward value to the caller.
If this constant is meant to document which tools are hidden, it should at minimum be wired into an access check in the step() method. If it's dead code, it should be removed to avoid confusion.
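If the constant is kept, the wiring could look roughly like the sketch below. The `dispatch` function and `episode_done` flag are simplified stand-ins for the PR's `step()` dispatch, shown only to illustrate the guard:

```python
# Sketch: consult HIDDEN_TOOLS in the dispatch path instead of leaving it unused.
# dispatch() and episode_done are simplified stand-ins for the PR's step() logic.
HIDDEN_TOOLS = frozenset(["done", "verify", "__list_scenarios__"])

def dispatch(tool_name: str, episode_done: bool) -> str:
    if tool_name in HIDDEN_TOOLS:
        # Hidden control tools are only honored when the orchestration layer
        # is allowed to use them — e.g. verify only after the episode ends.
        if tool_name == "verify" and not episode_done:
            raise PermissionError("verify is only available after done")
    return tool_name

assert dispatch("query_db", episode_done=False) == "query_db"
try:
    dispatch("verify", episode_done=False)
except PermissionError:
    pass  # mid-episode verify is rejected
```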
Path: envs/agent_world_model_env/server/awm_environment.py, line 27

```python
def _run_async(coro: Any) -> Any:
    """Run an async coroutine from sync context, handling event loop edge cases."""
    try:
        loop = asyncio.get_event_loop()
        if loop.is_closed():
            raise RuntimeError("closed")
        return loop.run_until_complete(coro)
    except RuntimeError:
        loop = asyncio.new_event_loop()
        try:
            return loop.run_until_complete(coro)
        finally:
            loop.close()
```
_run_async calls loop.run_until_complete(coro) (line 71). When AWMEnvironment methods are invoked by the FastAPI/uvicorn server (which runs on its own async event loop), calling run_until_complete() on the already-running loop raises RuntimeError: This event loop is already running. The except RuntimeError branch (line 72) then creates a new event loop, but loop.run_until_complete() on a brand-new loop still won't have access to the running uvicorn event loop's resources (open connections, etc.).
In practice, all _run_async call-sites (list_mcp_tools at line 261/366, call_mcp_tool at line 415, run_llm_judge at line 550) are async coroutines — they should be await-ed rather than bridged through a sync wrapper. Since reset() and step() are currently synchronous, the fix would be to make them async def and propagate await upward, or use asyncio.run_coroutine_threadsafe() with a dedicated background event loop.
This pattern is likely to cause RuntimeError drops in any deployment behind uvicorn.
Path: envs/agent_world_model_env/server/awm_environment.py, lines 64-77

```python
print(f"# Scenario: {scenario}, Task: {task_idx}")
print(f"{'#'*80}")

async with AWMEnv(base_url="http://localhost:8899") as env:
```
The server URL http://localhost:8899 is hardcoded (line 147) and not exposed as a CLI argument or environment variable. Any deployment outside of this exact local configuration will silently connect to the wrong host/port or fail entirely.
The existing argparse block below (lines 254-259) already handles --baseline-log, --agent-model, and --expert-model. A --base-url argument (defaulting to http://localhost:8899) should be added for consistency.
The same hardcoded URL pattern appears in run_awm_task.py (line 89) and should be fixed there too.
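A minimal version of the suggested flag, with an env-var fallback (the `AWM_BASE_URL` variable name is hypothetical, not defined by the PR):

```python
# Sketch: expose the server URL as --base-url with an env-var fallback.
# AWM_BASE_URL is a hypothetical variable name, not defined by the PR.
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument(
    "--base-url",
    default=os.environ.get("AWM_BASE_URL", "http://localhost:8899"),
    help="AWM environment server URL (default: %(default)s)",
)

# Explicit override wins over env var and default:
args = parser.parse_args(["--base-url", "http://envhost:9000"])
assert args.base_url == "http://envhost:9000"
```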
Path: envs/agent_world_model_env/run_awm_task_with_expert.py, line 147

```python
llm = AsyncAzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version=os.environ.get("OPENAI_API_VERSION", "2025-04-01-preview"),
)
```
llm is constructed once (lines 261-265) and passed as both the expert_llm (model: args.expert_model, defaults to "gpt-5.1") and the agent LLM (model: args.agent_model, defaults to "gpt-4.1-mini"). AsyncAzureOpenAI clients use a single azure_endpoint, meaning both models must be deployed on the same Azure OpenAI resource. If the expert model is hosted on a different resource or endpoint than the agent model, all expert calls will silently hit the wrong endpoint or fail with a 404/model-not-found error.
Consider allowing separate --expert-endpoint / --agent-endpoint arguments, or at minimum adding a comment that both models must be on the same Azure resource.
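One lightweight shape for this: resolve a per-model endpoint with fallback to the shared one, then construct one client per model. The `EXPERT_`/`AGENT_` prefixed variable names below are illustrative, not defined by the PR:

```python
# Sketch: per-model endpoint resolution so expert and agent can live on
# different Azure resources. Prefixed env-var names are illustrative.
def resolve_endpoint(prefix: str, env: dict) -> str:
    # Fall back to the shared endpoint when no per-model one is configured.
    return env.get(f"{prefix}_AZURE_OPENAI_ENDPOINT", env["AZURE_OPENAI_ENDPOINT"])

env = {
    "AZURE_OPENAI_ENDPOINT": "https://shared.openai.azure.com",
    "EXPERT_AZURE_OPENAI_ENDPOINT": "https://expert.openai.azure.com",
}

# Agent falls back to the shared resource; expert gets its own.
assert resolve_endpoint("AGENT", env) == "https://shared.openai.azure.com"
assert resolve_endpoint("EXPERT", env) == "https://expert.openai.azure.com"
```

Each resolved endpoint would then feed its own `AsyncAzureOpenAI(azure_endpoint=..., ...)` construction, so the two models no longer share a single client.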
Path: envs/agent_world_model_env/run_awm_task_with_expert.py, lines 261-265

```python
return AWMObservation(
    done=False,
    reward=self._get_reward(reward_type),
    reward_type=reward_type,
    verify_result=verify_result,
    scenario=self._scenario,
    task=self._task,
    task_idx=self._task_idx,
    steps_taken=self._state.step_count,
)
```
**ALIGNMENT FLAG**: `verify` exposes live reward signal to the agent mid-episode.
- **Principle at stake**: "Rewards inside environment" (PRINCIPLES.md) — domain reward knowledge is encapsulated inside the environment, not leaked to the agent during training.
- **The concern**: `_handle_verify()` (lines 470-589) returns an `AWMObservation` with `reward=self._get_reward(reward_type)` populated at line 582. Because `verify` is dispatched by the same `step()` path used for regular tool calls, any agent that learns or guesses the tool name `"verify"` can call it mid-episode and receive a graded reward signal. Combined with the fact that `HIDDEN_TOOLS` is not enforced (see related finding), this creates a potential reward-hacking vector: an agent could call `verify` iteratively to measure progress and hill-climb its own score within a single episode, bypassing the intended one-shot evaluation design.

The simplest fix is for `_handle_verify()` to not populate `reward` in the returned observation (returning `reward=None`) and have only the orchestration layer (e.g., `run_awm_task_with_expert.py`) inspect `reward_type` to compute the training signal externally. Alternatively, the `verify` call should only be allowed post-`done`.
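The reward-masking variant of that fix could be sketched like this — the dataclass is a simplified stand-in for the PR's `AWMObservation`, shown only to illustrate the gating logic:

```python
# Sketch: mask the live reward in mid-episode verify observations.
# Obs and handle_verify are simplified stand-ins for the PR's classes.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Obs:
    reward: Optional[float]
    reward_type: str

def handle_verify(raw_reward: float, reward_type: str, episode_done: bool) -> Obs:
    # The agent-facing observation carries no reward until the episode ends;
    # the orchestration layer recomputes the signal from reward_type externally.
    return Obs(
        reward=raw_reward if episode_done else None,
        reward_type=reward_type,
    )

assert handle_verify(1.0, "exact_match", episode_done=False).reward is None
assert handle_verify(1.0, "exact_match", episode_done=True).reward == 1.0
```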
Path: envs/agent_world_model_env/server/awm_environment.py, lines 580-589

```python
class ExpertAdvisor:
    def __init__(self, baseline_log_path: str, llm: AsyncAzureOpenAI, model: str):
        with open(baseline_log_path, "r", encoding="utf-8") as f:
            self._baseline_log = f.read()
        self._llm = llm
        self._model = model
        self._system_prompt = EXPERT_SYSTEM_PROMPT.format(
            baseline_log=self._baseline_log
        )

    async def advise(self, scenario: str, task_idx: int, task_description: str) -> str:
        messages = [
            {"role": "system", "content": self._system_prompt},
            {
                "role": "user",
                "content": (
                    f"I am about to run Scenario: {scenario}, Task: {task_idx}.\n"
                    f"Task description: {task_description}\n\n"
                    "What happened in the baseline run for this task? "
                    "Give me concise, actionable advice."
                ),
            },
        ]
        response = await self._llm.chat.completions.create(
            model=self._model,
            messages=messages,
            temperature=0.3,
            max_completion_tokens=512,
        )
        return response.choices[0].message.content or ""
```
**ALIGNMENT FLAG**: Expert-in-the-loop pattern introduces a new architectural paradigm without an RFC.
- **Principle at stake**: "Key Decisions Made" (PRINCIPLES.md) — architectural decisions should be documented in RFCs and not changed without RFC discussion.
- **The concern**: The `ExpertAdvisor` class (lines 104-133) and `run_task()` integration (lines 136-250) add a second LLM (the "expert") that consults a baseline run log and injects advice into the agent's context prior to each task. This is a fundamentally new training paradigm (expert-guided imitation / in-context learning from prior runs) not covered by existing RFCs. It raises questions about: (1) whether this pattern should be part of the environment spec, the client, or a separate orchestration layer; (2) how the baseline log file should be distributed/versioned; (3) whether the expert's advice counts as part of the agent's "observation" for purposes of the Gymnasium API; (4) how reward signals should be attributed when both expert and agent are in the loop.

The PR description checklist is also entirely unchecked (principles, invariants, pre-submit). Even for a hackathon submission, flagging this as needing RFC discussion before merging to `main` is advisable.
Path: envs/agent_world_model_env/run_awm_task_with_expert.py, lines 104-133

Documents the two-model expert-advised architecture with mermaid diagrams, benchmark comparison table (gpt-5.1 baseline 9/10, o4-mini solo 8/10, o4-mini + expert 9/10), bug fixes applied, and key learnings from evaluating workflow_automation_1. Made-with: Cursor
The existing static expert (`run_awm_task_with_expert.py`) gives one-shot advice from a baseline log before the agent starts. This adds a complementary dynamic approach where the expert is exposed as a callable tool (`ask_expert`) that the agent invokes on-demand during the task.

Key features of the dynamic expert:
- Verifier-informed: pre-analyzes Python verification code to extract the exact DB state requirements the task must satisfy
- Schema-aware: provides full MCP tool schemas to the expert so it can generate precise tool names and argument values
- Error recovery: system nudges the agent to consult the expert after errors/stalls
- Agent-initiated: the agent decides when to call the expert (0 to N times per task)
- No baseline log required: works from the first run

Benchmark on workflow_automation_1 (10 tasks):
- Baseline (no expert): 5/10 complete, avg reward 0.500
- Dynamic expert: 8/10 complete, avg reward 0.800 (+60%)

Also updates EXPERT_ENHANCEMENT.md with an architecture comparison between the static and dynamic approaches, including a side-by-side feature table. Made-with: Cursor
…loop
Add dynamic expert-in-the-loop runner for AWM environment
- Remove `run_awm_task_with_expert.py` (static expert that required baseline logs)
- Remove `results/` directory with static expert output logs and JSON results
- Rewrite EXPERT_ENHANCEMENT.md to focus solely on the dynamic expert-in-the-loop approach: verifier-informed, schema-aware, on-demand consultation

The dynamic expert is the better general-purpose approach — it works from the first run without prior data, recovers from errors in real time, and uses deeper intelligence (verifier code analysis + full tool schemas). Made-with: Cursor
The docstring still referenced run_awm_task_with_expert.py and "differences from the static approach" even though the static expert was removed. Updated to describe the dynamic expert on its own terms. Made-with: Cursor
…loop
Dynamic expert in the loop
…EXPERT_ENHANCEMENT.md
- Add 6 matplotlib-generated charts: reward curves, completion comparison, format error divergence, adaptive ratio, training wheels analogy, and expert behavior evolution
- Document 4 experiments: baseline, expert-assisted, mixed 50/50, and adaptive ratio training with detailed step-by-step metrics
- Add engaging narrative with "training wheels" analogy for adaptive expert scaffolding
- Add key findings section with 5 evidence-backed insights
- Add 6 future research directions for teaching agents to ask for help
- Preserve all original content (architecture, bug fixes, learnings, usage)

Made-with: Cursor
- Replace placeholder "step 2 as of writing" with full 10-step results table
- Add head-to-head comparison table (Adaptive vs Mixed vs Baseline)
- Adaptive 6E/2S is the only approach to cross zero reward (step 9: +0.068)
- Remove AgentFly/Snowflake attribution line

Made-with: Cursor
- reward_curves: 4-way comparison with adaptive as the hero line (gold)
- completion_comparison: 3-way grouped bar chart (steps 1-10)
- format_error_divergence: adaptive FE trajectory with phase transition threshold
- adaptive_ratio: dual panel — phase diagram with training path + FE over time
- training_wheels_analogy: actual metrics per phase, current phase highlighted
- expert_behavior_evolution: cumulative completions race + milestone scoreboard

Made-with: Cursor
…ion rate
- Added reward shaping: solo bonus (+1.0), recovery bonus (+0.3), blind penalty (-0.2)
- "Try first, ask if stuck" system prompt
- 16-step results showing positive reward from step 11 onward
- Adaptive phase transition from Scaffold to Balanced at step 10
- Updated all 6 charts with latest training data

Made-with: Cursor
… completion, all 3 phase transitions
- Extended Experiment 4 table from 16 to 29 steps showing the full Scaffold → Balanced → Independence journey
- Updated hero stat, key findings, benchmark table, training wheels analogy, expert behavior analysis
- Regenerated all 6 charts with 29-step data (reward curves, completions, FE divergence, phase diagram, training wheels, expert behavior)
- Format errors: 88% → 31%, completions: 7 → 57, reward: -0.528 → +0.824 peak

Made-with: Cursor
- Updated Mermaid sequence diagram: agent now tries call_tool first, only calls the expert on error/stuck
- Updated "How It Works" steps: added "Try first" and "Ask if stuck" as distinct steps
- Old flow showed the agent calling the expert immediately; new flow matches the actual training strategy

Made-with: Cursor
…4 new diagrams
- Rewritten as slide-style sections with visual-first layout
- Added AWM environment overview: 1,005 environments, domain categories, architecture
- Added task walkthrough diagram showing step-by-step agent interaction
- Added reward hierarchy visualization explaining graduated reward shaping
- Added final scoreboard comparing all 4 strategies
- Kept all training data (29 steps, all phase transitions, behavior analysis)
- Reduced wall-of-text sections, added tables and diagrams throughout

Made-with: Cursor
- `train_grpo_awm.py`: standalone TRL + OpenEnv GRPO training script
  - Uses the rollout_func pattern from the TRL OpenEnv integration
  - Supports colocate and server vLLM modes
  - Optional GPT expert tool with reward shaping
  - Environment reward + format bonus reward functions
- `train_grpo_awm.ipynb`: Colab notebook with step-by-step walkthrough
  - Setup, environment server, data prep, training, evaluation
  - Works on a single T4/A100 GPU with Qwen3-0.6B
  - Links to full results in EXPERT_ENHANCEMENT.md
- Updated README with a Quick Start section referencing the new files

Made-with: Cursor
Summary
Type of Change
Alignment Checklist
Before submitting, verify:
- `.claude/docs/PRINCIPLES.md` and this PR aligns with our principles
- `.claude/docs/INVARIANTS.md` and no invariants are violated
- `/pre-submit-pr` (or `bash .claude/hooks/lint.sh` and tests) and addressed all issues

RFC Status
Test Plan
Claude Code Review