OpenEnv hackathon expert-in-the-loop track w/ AWM#428

Open
sfc-gh-mhidayetoglu wants to merge 18 commits into meta-pytorch:main from sfc-gh-mhidayetoglu:add-agent-world-model
Conversation

@sfc-gh-mhidayetoglu

Summary

Type of Change

  • Bug fix
  • New feature
  • Breaking change
  • Documentation
  • New environment
  • Refactoring

Alignment Checklist

Before submitting, verify:

  • I have read .claude/docs/PRINCIPLES.md and this PR aligns with our principles
  • I have checked .claude/docs/INVARIANTS.md and no invariants are violated
  • I have run /pre-submit-pr (or bash .claude/hooks/lint.sh and tests) and addressed all issues

RFC Status

  • Not required (bug fix, docs, minor refactoring)
  • RFC exists: #___
  • RFC needed (will create before merge)

Test Plan

Claude Code Review

Raibows and others added 4 commits March 7, 2026 10:00
- Fix SQLite seed data loading: handle backslash-escaped quotes in AWM
  dataset SQL statements (_fix_escaped_quotes in db_manager.py)
- Add exception handler to dynamically generated FastAPI sub-environments
  for better 500 error debugging (scenario_manager.py)
- Make example_usage.py model/endpoint configurable via env vars
- Add run_awm_task.py: multi-task runner with improved system prompt
- Add run_awm_task_with_expert.py: two-model architecture where a cheaper
  agent (o4-mini) is guided by an expert model (gpt-5.1) that analyzes
  baseline logs to provide task-specific advice
- Include benchmark results for workflow_automation_1 (10 tasks):
  gpt-5.1 baseline: 9/10, o4-mini solo: 8/10, o4-mini+expert: 9/10

Made-with: Cursor
@meta-cla

meta-cla bot commented Mar 8, 2026

Hi @sfc-gh-mhidayetoglu!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

@greptile-apps

greptile-apps bot commented Mar 8, 2026

Greptile Summary

This PR adds a complete new environment (agent_world_model_env) that wraps 1,000 Agent World Model sub-environments into a single OpenEnv-compatible environment, along with an expert-in-the-loop orchestration script. The core server architecture (subprocess-per-session, SQLite snapshot for verification, WebSocket-only transport) is well-designed and aligns with OpenEnv's stateful environment pattern.

Key findings:

  • Security (Tier 1): verifier.py calls exec(verifier_code, namespace) with unrestricted __builtins__, granting dataset-sourced verifier code full access to Python builtins (file I/O, __import__, subprocesses, network). This appears in both execute_sql_verifier and execute_code_verifier.
  • Logic (Tier 1): HIDDEN_TOOLS is defined but never enforced — there is no access check preventing an agent from calling done, verify, or __list_scenarios__ by name via CallToolAction. The verify tool returns a live reward value to the caller mid-episode, creating a potential reward-hacking vector.
  • Runtime (Tier 1): _run_async uses loop.run_until_complete(), which will raise RuntimeError: This event loop is already running when called from within uvicorn's async event loop (the normal deployment path).
  • Configuration (Tier 1): http://localhost:8899 is hardcoded in both run_awm_task_with_expert.py and run_awm_task.py, and both agent and expert LLM share a single AsyncAzureOpenAI client, which will break if the two models are deployed on different Azure resources.
  • Alignment (Tier 2): The expert-in-the-loop pattern introduces a new architectural paradigm not covered by existing RFCs. The PR description checklist is entirely unchecked, and the interaction between expert advice and agent reward signals needs clarification.

Confidence Score: 1/5

  • Not safe to merge — critical security vulnerability in verifier exec, broken async pattern that will fail under uvicorn, unreachable reward enforcement, and unreviewed architectural paradigm.
  • Three independent critical issues must be addressed: (1) unrestricted __builtins__ in exec() allows dataset-sourced code unrestricted Python access (security); (2) _run_async will raise RuntimeError when called from uvicorn's async context, causing deployment failures (runtime); (3) HIDDEN_TOOLS constant is defined but never enforced, and verify exposes reward signals mid-episode, violating the "rewards inside environment" principle (alignment). Additionally, the expert-in-the-loop architecture lacks RFC coverage. These are independent enough that all must be resolved before merging.
  • envs/agent_world_model_env/server/verifier.py (exec security), envs/agent_world_model_env/server/awm_environment.py (HIDDEN_TOOLS enforcement + async bridge + reward exposure), envs/agent_world_model_env/run_awm_task_with_expert.py (hardcoded URL + shared Azure client + RFC gap).

Last reviewed commit: 16de06b

Comment on lines +88 to +94
namespace = {
    "sqlite3": sqlite3,
    "json": json,
    "os": os,
    "__builtins__": __builtins__,
}
exec(verifier_code, namespace)

Both `execute_sql_verifier` (lines 88-94) and `execute_code_verifier` (lines 158-165) pass `"__builtins__": __builtins__` into the `exec()` namespace. This grants the executed verifier code full access to all Python built-ins including `open()`, `__import__()`, `subprocess`, `os.system()`, network sockets, etc.

While the DB files are temporarily chmod'd to `0o444`, the verifier code is free to write to arbitrary paths, launch subprocesses, exfiltrate data, or import any installed module. Since `verifier_code` comes from the dataset (`gen_verifiers.jsonl`), this creates a significant attack surface if the dataset is ever compromised or if malformed entries are present.

At minimum, `__builtins__` should be restricted to a safe allowlist (e.g., only `open` on read-only paths, `json`, `sqlite3`, `re`), or the verifier should be run in a subprocess with OS-level isolation (like the sub-env approach used elsewhere in this PR).

The same pattern appears in both functions.
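A minimal sketch of the allowlist approach the review suggests — not the repo's actual fix, and the helper name `run_verifier` plus the exact set of allowed builtins are assumptions for illustration:

```python
import json
import re
import sqlite3

# Hypothetical sketch: exec dataset-sourced verifier code with a small
# builtins allowlist instead of handing it the full __builtins__ module.
SAFE_BUILTINS = {
    "len": len, "range": range, "str": str, "int": int, "float": float,
    "dict": dict, "list": list, "set": set, "tuple": tuple,
    "isinstance": isinstance, "enumerate": enumerate, "zip": zip,
    "sorted": sorted, "min": min, "max": max, "sum": sum, "abs": abs,
    "Exception": Exception, "ValueError": ValueError,
}

def run_verifier(verifier_code: str) -> dict:
    """Exec verifier code in a namespace with restricted builtins.

    Without __import__ in the builtins dict, `import` statements inside
    the verifier raise ImportError; open(), eval(), exec() are absent too.
    """
    namespace = {
        "json": json,
        "sqlite3": sqlite3,
        "re": re,
        "__builtins__": SAFE_BUILTINS,
    }
    exec(verifier_code, namespace)
    return namespace
```

Running the verifier in a subprocess with OS-level sandboxing would be stronger still; the allowlist only removes the most direct escape hatches.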


logger = logging.getLogger(__name__)

HIDDEN_TOOLS = frozenset(["done", "verify", "__list_scenarios__"])

`HIDDEN_TOOLS = frozenset(["done", "verify", "__list_scenarios__"])` is defined here but never referenced anywhere in this file or elsewhere in the codebase. This constant appears to be intended as a guard — but the `step()` method dispatches these tool names purely through explicit `if/elif` chains (lines 312-317) without consulting `HIDDEN_TOOLS`.

The practical consequence is that the names `"done"`, `"verify"`, and `"__list_scenarios__"` are invisible to `list_tools()` (because they aren't in the MCP subprocess's tool list), but an agent that knows or discovers these names can call them via `CallToolAction` and the server will process them. For example, calling `verify` mid-episode returns a live reward value to the caller.

If this constant is meant to document which tools are hidden, it should at minimum be wired into an access check in the `step()` method. If it's dead code, it should be removed to avoid confusion.
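One way to wire the constant into dispatch — a sketch only, with the helper name `dispatch_tool` and the `allow_hidden` flag assumed rather than taken from the PR:

```python
HIDDEN_TOOLS = frozenset(["done", "verify", "__list_scenarios__"])

def dispatch_tool(tool_name: str, allow_hidden: bool = False) -> str:
    """Reject hidden tool names unless the call originates server-side.

    Agent-originated CallToolAction calls go through with allow_hidden=False,
    so guessing "verify" or "done" yields the same error as an unknown tool.
    """
    if tool_name in HIDDEN_TOOLS and not allow_hidden:
        return f"error: unknown tool '{tool_name}'"
    return f"dispatched: {tool_name}"
```

Returning the same error as for a genuinely unknown tool avoids leaking the existence of the hidden names.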


Comment on lines +64 to +77
def _run_async(coro: Any) -> Any:
    """Run an async coroutine from sync context, handling event loop edge cases."""
    try:
        loop = asyncio.get_event_loop()
        if loop.is_closed():
            raise RuntimeError("closed")
        return loop.run_until_complete(coro)
    except RuntimeError:
        loop = asyncio.new_event_loop()
        try:
            return loop.run_until_complete(coro)
        finally:
            loop.close()

`_run_async` calls `loop.run_until_complete(coro)` (line 71). When `AWMEnvironment` methods are invoked by the FastAPI/uvicorn server (which runs on its own async event loop), calling `run_until_complete()` on the already-running loop raises `RuntimeError: This event loop is already running`. The `except RuntimeError` branch (line 72) then creates a new event loop, but `loop.run_until_complete()` on a brand-new loop still won't have access to the running uvicorn event loop's resources (open connections, etc.).

In practice, all `_run_async` call-sites (`list_mcp_tools` at line 261/366, `call_mcp_tool` at line 415, `run_llm_judge` at line 550) are async coroutines — they should be `await`-ed rather than bridged through a sync wrapper. Since `reset()` and `step()` are currently synchronous, the fix would be to make them `async def` and propagate `await` upward, or use `asyncio.run_coroutine_threadsafe()` with a dedicated background event loop.

This pattern is likely to cause `RuntimeError` drops in any deployment behind uvicorn.
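A sketch of the second option — a dedicated background event loop bridged via `asyncio.run_coroutine_threadsafe()`. The `AsyncBridge` class name and 30-second timeout are illustrative assumptions, not code from the PR:

```python
import asyncio
import threading

# Hypothetical sketch: a dedicated background event loop so sync methods
# (reset/step) can run coroutines without touching uvicorn's running loop.
class AsyncBridge:
    def __init__(self) -> None:
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def run(self, coro):
        """Submit a coroutine to the background loop and block for its result."""
        future = asyncio.run_coroutine_threadsafe(coro, self._loop)
        return future.result(timeout=30)

    def close(self) -> None:
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
```

Making `reset()`/`step()` `async def` end-to-end is the cleaner fix; the bridge is the lower-churn one when the sync signatures must stay.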


print(f"# Scenario: {scenario}, Task: {task_idx}")
print(f"{'#'*80}")

async with AWMEnv(base_url="http://localhost:8899") as env:

The server URL `http://localhost:8899` is hardcoded (line 147) and not exposed as a CLI argument or environment variable. Any deployment outside of this exact local configuration will silently connect to the wrong host/port or fail entirely.

The existing `argparse` block below (lines 254-259) already handles `--baseline-log`, `--agent-model`, and `--expert-model`. A `--base-url` argument (defaulting to `http://localhost:8899`) should be added for consistency.

The same hardcoded URL pattern appears in `run_awm_task.py` (line 89) and should be fixed there too.
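A minimal sketch of the suggested flag, with an env-var fallback. The `AWM_BASE_URL` variable name and the `build_parser` helper are assumptions for illustration:

```python
import argparse
import os

# Hypothetical sketch: expose the server URL as --base-url with an
# environment-variable fallback instead of hardcoding localhost:8899.
def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--base-url",
        default=os.environ.get("AWM_BASE_URL", "http://localhost:8899"),
        help="AWM environment server URL",
    )
    return parser
```

The parsed value would then replace the literal in `AWMEnv(base_url=...)` in both runner scripts.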


Comment on lines +261 to +265
llm = AsyncAzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version=os.environ.get("OPENAI_API_VERSION", "2025-04-01-preview"),
)

`llm` is constructed once (lines 261-265) and passed as both the `expert_llm` (model: `args.expert_model`, defaults to `"gpt-5.1"`) and the agent LLM (model: `args.agent_model`, defaults to `"gpt-4.1-mini"`). `AsyncAzureOpenAI` clients use a single `azure_endpoint`, meaning both models must be deployed on the same Azure OpenAI resource. If the expert model is hosted on a different resource or endpoint than the agent model, all expert calls will silently hit the wrong endpoint or fail with a 404/model-not-found error.

Consider allowing separate `--expert-endpoint` / `--agent-endpoint` arguments, or at minimum adding a comment that both models must be on the same Azure resource.
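One light-touch way to support split resources is per-role endpoint resolution with a shared fallback. A sketch only — the `AZURE_OPENAI_ENDPOINT_<ROLE>` variable names and `resolve_endpoint` helper are assumed, not from the PR:

```python
import os

# Hypothetical sketch: let each role (agent/expert) override the shared
# endpoint, so the two models can live on different Azure resources.
def resolve_endpoint(role: str) -> str:
    """Prefer AZURE_OPENAI_ENDPOINT_<ROLE>, fall back to the shared endpoint."""
    specific = os.environ.get(f"AZURE_OPENAI_ENDPOINT_{role.upper()}")
    return specific or os.environ["AZURE_OPENAI_ENDPOINT"]
```

Each resolved endpoint would then get its own `AsyncAzureOpenAI` instance (and matching API key), rather than sharing one client across both models.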


Comment on lines +580 to +589
return AWMObservation(
    done=False,
    reward=self._get_reward(reward_type),
    reward_type=reward_type,
    verify_result=verify_result,
    scenario=self._scenario,
    task=self._task,
    task_idx=self._task_idx,
    steps_taken=self._state.step_count,
)

ALIGNMENT FLAG: `verify` exposes live reward signal to the agent mid-episode.

  • Principle at stake: "Rewards inside environment" (PRINCIPLES.md) — domain reward knowledge is encapsulated inside the environment, not leaked to the agent during training.

  • The concern: `_handle_verify()` (lines 470-589) returns an `AWMObservation` with `reward=self._get_reward(reward_type)` populated at line 582. Because `verify` is dispatched by the same `step()` path used for regular tool calls, any agent that learns or guesses the tool name `"verify"` can call it mid-episode and receive a graded reward signal. Combined with the fact that `HIDDEN_TOOLS` is not enforced (see related finding), this creates a potential reward-hacking vector: an agent could call `verify` iteratively to measure progress and hill-climb its own score within a single episode, bypassing the intended one-shot evaluation design.

The simplest fix is for `_handle_verify()` to not populate `reward` in the returned observation (returning `reward=None`) and have only the orchestration layer (e.g., `run_awm_task_with_expert.py`) inspect `reward_type` to compute the training signal externally. Alternatively, the `verify` call should only be allowed post-`done`.
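The gating idea can be sketched in a few lines — the `build_verify_observation` helper and dict-shaped observation are simplifications of the PR's `AWMObservation`, assumed for illustration:

```python
# Hypothetical sketch: withhold the live reward unless the episode has
# already ended, so mid-episode verify calls cannot be used to hill-climb.
def build_verify_observation(reward: float, episode_done: bool) -> dict:
    """Return the verify observation; reward is only populated post-done."""
    return {
        "verify_result": "ok",
        "reward": reward if episode_done else None,
    }
```

With this shape, the orchestration layer still sees the graded reward after `done`, while the agent's in-episode observations carry `reward=None`.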


Comment on lines +104 to +133
class ExpertAdvisor:
    def __init__(self, baseline_log_path: str, llm: AsyncAzureOpenAI, model: str):
        with open(baseline_log_path, "r", encoding="utf-8") as f:
            self._baseline_log = f.read()
        self._llm = llm
        self._model = model
        self._system_prompt = EXPERT_SYSTEM_PROMPT.format(
            baseline_log=self._baseline_log
        )

    async def advise(self, scenario: str, task_idx: int, task_description: str) -> str:
        messages = [
            {"role": "system", "content": self._system_prompt},
            {
                "role": "user",
                "content": (
                    f"I am about to run Scenario: {scenario}, Task: {task_idx}.\n"
                    f"Task description: {task_description}\n\n"
                    "What happened in the baseline run for this task? "
                    "Give me concise, actionable advice."
                ),
            },
        ]
        response = await self._llm.chat.completions.create(
            model=self._model,
            messages=messages,
            temperature=0.3,
            max_completion_tokens=512,
        )
        return response.choices[0].message.content or ""

ALIGNMENT FLAG: Expert-in-the-loop pattern introduces a new architectural paradigm without an RFC.

  • Principle at stake: "Key Decisions Made" (PRINCIPLES.md) — architectural decisions should be documented in RFCs and not changed without RFC discussion.

  • The concern: The `ExpertAdvisor` class (lines 104-133) and `run_task()` integration (lines 136-250) add a second LLM (the "expert") that consults a baseline run log and injects advice into the agent's context prior to each task. This is a fundamentally new training paradigm (expert-guided imitation / in-context learning from prior runs) not covered by existing RFCs. It raises questions about: (1) whether this pattern should be part of the environment spec, the client, or a separate orchestration layer; (2) how the baseline log file should be distributed/versioned; (3) whether the expert's advice counts as part of the agent's "observation" for purposes of the Gymnasium API; (4) how reward signals should be attributed when both expert and agent are in the loop.

The PR description checklist is also entirely unchecked (principles, invariants, pre-submit). Even for a hackathon submission, flagging this as needing RFC discussion before merging to `main` is advisable.


sfc-gh-mhidayetoglu and others added 14 commits March 7, 2026 21:56
Documents the two-model expert-advised architecture with mermaid diagrams,
benchmark comparison table (gpt-5.1 baseline 9/10, o4-mini solo 8/10,
o4-mini+expert 9/10), bug fixes applied, and key learnings from evaluating
workflow_automation_1.

Made-with: Cursor
The existing static expert (run_awm_task_with_expert.py) gives one-shot advice
from a baseline log before the agent starts. This adds a complementary
dynamic approach where the expert is exposed as a callable tool (ask_expert)
that the agent invokes on-demand during the task.

Key features of the dynamic expert:
- Verifier-informed: pre-analyzes Python verification code to extract exact
  DB state requirements the task must satisfy
- Schema-aware: provides full MCP tool schemas to the expert so it can
  generate precise tool names and argument values
- Error recovery: system nudges agent to consult expert after errors/stalls
- Agent-initiated: agent decides when to call expert (0 to N times per task)
- No baseline log required: works from first run

Benchmark on workflow_automation_1 (10 tasks):
  Baseline (no expert): 5/10 complete, avg reward 0.500
  Dynamic expert:       8/10 complete, avg reward 0.800 (+60%)

Also updates EXPERT_ENHANCEMENT.md with architecture comparison between the
static and dynamic approaches, including a side-by-side feature table.

Made-with: Cursor
…loop

Add dynamic expert-in-the-loop runner for AWM environment
- Remove run_awm_task_with_expert.py (static expert that required baseline logs)
- Remove results/ directory with static expert output logs and JSON results
- Rewrite EXPERT_ENHANCEMENT.md to focus solely on the dynamic expert-in-the-loop
  approach: verifier-informed, schema-aware, on-demand consultation

The dynamic expert is the better general-purpose approach — it works from the
first run without prior data, recovers from errors in real-time, and uses
deeper intelligence (verifier code analysis + full tool schemas).

Made-with: Cursor
The docstring still referenced run_awm_task_with_expert.py and "differences
from the static approach" even though the static expert was removed. Updated
to describe the dynamic expert on its own terms.

Made-with: Cursor
…EXPERT_ENHANCEMENT.md

- Add 6 matplotlib-generated charts: reward curves, completion comparison,
  format error divergence, adaptive ratio, training wheels analogy, and
  expert behavior evolution
- Document 4 experiments: baseline, expert-assisted, mixed 50/50, and
  adaptive ratio training with detailed step-by-step metrics
- Add engaging narrative with "training wheels" analogy for adaptive
  expert scaffolding
- Add key findings section with 5 evidence-backed insights
- Add 6 future research directions for teaching agents to ask for help
- Preserve all original content (architecture, bug fixes, learnings, usage)

Made-with: Cursor
- Replace placeholder "step 2 as of writing" with full 10-step results table
- Add head-to-head comparison table (Adaptive vs Mixed vs Baseline)
- Adaptive 6E/2S is the only approach to cross zero reward (step 9: +0.068)
- Remove AgentFly/Snowflake attribution line

Made-with: Cursor
- reward_curves: 4-way comparison with adaptive as the hero line (gold)
- completion_comparison: 3-way grouped bar chart (steps 1-10)
- format_error_divergence: adaptive FE trajectory with phase transition threshold
- adaptive_ratio: dual panel — phase diagram with training path + FE over time
- training_wheels_analogy: actual metrics per phase, current phase highlighted
- expert_behavior_evolution: cumulative completions race + milestone scoreboard

Made-with: Cursor
…ion rate

- Added reward shaping: solo bonus (+1.0), recovery bonus (+0.3), blind penalty (-0.2)
- "Try first, ask if stuck" system prompt
- 16-step results showing positive reward from step 11 onward
- Adaptive phase transition from Scaffold to Balanced at step 10
- Updated all 6 charts with latest training data

Made-with: Cursor
… completion, all 3 phase transitions

- Extended Experiment 4 table from 16 to 29 steps showing full Scaffold->Balanced->Independence journey
- Updated hero stat, key findings, benchmark table, training wheels analogy, expert behavior analysis
- Regenerated all 6 charts with 29-step data (reward curves, completions, FE divergence, phase diagram, training wheels, expert behavior)
- Format errors: 88% -> 31%, completions: 7 -> 57, reward: -0.528 -> +0.824 peak

Made-with: Cursor
- Updated Mermaid sequence diagram: agent now tries call_tool first, only calls expert on error/stuck
- Updated "How It Works" steps: added "Try first" and "Ask if stuck" as distinct steps
- Old flow showed agent calling expert immediately; new flow matches actual training strategy

Made-with: Cursor
…4 new diagrams

- Rewritten as slide-style sections with visual-first layout
- Added AWM environment overview: 1,005 environments, domain categories, architecture
- Added task walkthrough diagram showing step-by-step agent interaction
- Added reward hierarchy visualization explaining graduated reward shaping
- Added final scoreboard comparing all 4 strategies
- Kept all training data (29 steps, all phase transitions, behavior analysis)
- Reduced wall-of-text sections, added tables and diagrams throughout

Made-with: Cursor
- train_grpo_awm.py: standalone TRL + OpenEnv GRPO training script
  - Uses rollout_func pattern from TRL OpenEnv integration
  - Supports colocate and server vLLM modes
  - Optional GPT expert tool with reward shaping
  - Environment reward + format bonus reward functions
- train_grpo_awm.ipynb: Colab notebook with step-by-step walkthrough
  - Setup, environment server, data prep, training, evaluation
  - Works on single T4/A100 GPU with Qwen3-0.6B
  - Links to full results in EXPERT_ENHANCEMENT.md
- Updated README with Quick Start section referencing new files

Made-with: Cursor