Problem
The current MCP server (agent_orchestrator/alas_mcp_server.py) has four architectural flaws that block reliable LLM-driven operation:
1. Blocking tool calls lock the agent
alas_login_ensure_main calls LoginHandler.handle_app_login(), which is a while 1: loop that can spin for up to 300 seconds before timing out (LOGIN_MAX_TOTAL_SECONDS = 300). When called via MCP over stdio, the entire JSON-RPC transport is blocked. The agent cannot take screenshots, check health, or do anything else — it's stuck waiting for a Python function that may never return.
This is the root cause of the agent getting "stuck" when calling login.
2. No ADB health probe
When ADB dies or the emulator isn't rendering, adb_screenshot either hangs or returns a black frame. There's no way to ask "is the connection alive?" before committing to a call. The agent can't distinguish between:
- Game loading screen (wait)
- ADB transport broken (reconnect)
- Emulator process dead (restart)
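A minimal sketch of how the three cases above could be separated once a probe exists. It assumes a status dict with the adb_connected, emulator_running, and screenshot_ok fields proposed later in this issue. The function name next_action and the returned action strings are hypothetical, and treating a bad frame on a live transport as "wait" is an assumption, not confirmed behavior.

```python
from typing import Any, Dict


def next_action(health: Dict[str, Any]) -> str:
    """Map an adb_health-style status dict to a recovery action (hypothetical names)."""
    if not health.get("emulator_running", False):
        return "restart_emulator"   # emulator process dead
    if not health.get("adb_connected", False):
        return "reconnect_adb"      # ADB transport broken
    if not health.get("screenshot_ok", False):
        return "wait"               # assumed: transport is up, game is still loading
    return "proceed"
```

Checking the emulator first means a dead process is never misread as a transport problem.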
3. No CLI mode
alas_mcp_server.py only runs as mcp.run(transport="stdio"). There's no way to run python alas_mcp_server.py screenshot from a terminal. When something goes wrong, you can't poke at it without an MCP client.
4. Tool proliferation anti-pattern
alas_login_ensure_main is a dedicated MCP tool wrapping one ALAS workflow. Following this pattern would require alas_commission_run, alas_dorm_collect, etc. — dozens of tools that are all thin wrappers.
alas_call_tool(name) already exists and can invoke any registered tool. Dedicated workflow wrappers should not exist as separate MCP tools.
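As a rough illustration of why one generic entry point is enough, the sketch below dispatches by name through a registry. TOOL_REGISTRY, register, and run_tool are hypothetical names for illustration only; the real alas_call_tool internals are not shown in this issue and may differ.

```python
from typing import Any, Callable, Dict

# Hypothetical registry: tool name -> callable.
TOOL_REGISTRY: Dict[str, Callable[..., Any]] = {}


def register(name: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Decorator that adds a function to the registry under `name`."""
    def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
        TOOL_REGISTRY[name] = fn
        return fn
    return decorator


@register("commission.run")
def _commission_run() -> str:
    return "commission done"  # placeholder body, not the real workflow


def run_tool(name: str, **kwargs: Any) -> Dict[str, Any]:
    """Single generic entry point; adding a workflow adds no new MCP tool."""
    fn = TOOL_REGISTRY.get(name)
    if fn is None:
        return {"ok": False, "error": f"unknown tool: {name}",
                "available": sorted(TOOL_REGISTRY)}
    return {"ok": True, "result": fn(**kwargs)}
```

With this shape, a new workflow is one registry entry, and the MCP tool surface stays the same size.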
Proposed Design
Tool surface (all non-blocking or bounded)
| Tool | What it does | Max time |
|---|---|---|
| adb_health | NEW: check ADB transport and emulator process, return structured status | < 2s |
| adb_screenshot | Take screenshot (existing) | < 3s |
| adb_tap | Tap coordinate (existing) | < 1s |
| adb_swipe | Swipe (existing) | < 1s |
| alas_state | Current page name (rename from alas_get_current_state) | < 3s |
| alas_goto | Navigate with timeout (existing, add bound) | < 30s |
| alas_tools | List available tools (rename from alas_list_tools) | < 1s |
| alas_run | Run a named tool with timeout (rename from alas_call_tool) | < 60s |
Remove
alas_login_ensure_main: login becomes the agent calling screenshot/tap/state in a loop, not a single blocking function
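The shape of that loop can be sketched as below. This is only an illustration: in practice the LLM agent decides each step from the screenshot, whereas here a fixed fallback tap stands in for that decision. The target page name "page_main", the tap coordinate, and the function name ensure_main are all assumptions.

```python
from typing import Callable, Tuple


def ensure_main(
    get_state: Callable[[], str],
    tap: Callable[[int, int], None],
    max_steps: int = 20,
    target: str = "page_main",               # assumed page name
    fallback_tap: Tuple[int, int] = (640, 360),  # assumed "tap to continue" spot
) -> bool:
    """Drive login with short, bounded calls instead of one blocking call."""
    for _ in range(max_steps):
        if get_state() == target:
            return True
        tap(*fallback_tap)  # each call returns quickly; transport stays free
    return get_state() == target
```

Because every step is a separate short call, the agent can interleave a health check or a screenshot between steps and give up after max_steps instead of waiting up to 300 seconds.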
Add: adb_health tool

    @mcp.tool()
    def adb_health() -> Dict[str, Any]:
        """Check ADB and emulator connectivity.

        Returns structured status:
        {
            "adb_connected": bool,
            "emulator_running": bool,
            "screenshot_ok": bool,
            "serial": str,
            "error": str | None
        }
        """

Add: CLI mode

    # Debug from terminal without MCP client
    python alas_mcp_server.py cli --health
    python alas_mcp_server.py cli --screenshot out.png
    python alas_mcp_server.py cli --state
    python alas_mcp_server.py cli --tap 640 360
    python alas_mcp_server.py cli --run commission.run
    python alas_mcp_server.py cli --list-tools

    # MCP mode (existing, default)
    python alas_mcp_server.py  # stdio MCP server

Add: Timeout wrapper on all tools
Every MCP tool call should be wrapped with a configurable timeout so that a hung ADB call doesn't block the transport indefinitely.
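One possible shape for such a wrapper, using only the standard library. The decorator name bounded and the returned dict layout are assumptions. The executor is module-level on purpose: a `with ThreadPoolExecutor(...)` block would wait for the hung worker on exit and defeat the timeout.

```python
import functools
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Any, Callable, Dict

# Shared pool; deliberately not used as a context manager (see note above).
_EXECUTOR = ThreadPoolExecutor(max_workers=4)


def bounded(seconds: float) -> Callable[[Callable[..., Any]], Callable[..., Dict[str, Any]]]:
    """Return a decorator that caps how long the caller waits for `fn`."""
    def decorator(fn: Callable[..., Any]) -> Callable[..., Dict[str, Any]]:
        @functools.wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Dict[str, Any]:
            future = _EXECUTOR.submit(fn, *args, **kwargs)
            try:
                return {"ok": True, "result": future.result(timeout=seconds)}
            except FutureTimeout:
                # The worker thread keeps running; only the caller is released.
                return {"ok": False,
                        "error": f"{fn.__name__} timed out after {seconds}s"}
        return wrapper
    return decorator
```

A caveat with this approach: Python threads cannot be killed, so a truly hung call still occupies a worker. For calls that shell out to adb, passing a timeout to the subprocess call itself is likely the more robust bound, with this wrapper as a backstop.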
Acceptance Criteria
- alas_login_ensure_main removed from MCP tool surface
- adb_health tool added and returning structured connectivity status
- All MCP tools have bounded execution time (timeout wrapper)
- CLI mode added (python alas_mcp_server.py cli --<command>)
- Tool names simplified (alas_get_current_state → alas_state, etc.)
- Black screenshot detection: adb_screenshot reports if image is all-black
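For the black-screenshot criterion, a minimal sketch of the check on already-decoded pixel data. It assumes raw 8-bit RGB or grayscale bytes with no alpha channel; decoding the PNG is left to whatever image library the server already uses. The name is_black_frame and both thresholds are hypothetical defaults.

```python
def is_black_frame(
    pixels: bytes,
    threshold: int = 8,                  # channel values above this count as "lit"
    max_bright_fraction: float = 0.001,  # tolerate a few stray bright samples
) -> bool:
    """Return True if nearly every channel value is at or below `threshold`."""
    if not pixels:
        return True  # an empty frame is treated as unusable
    bright = sum(1 for value in pixels if value > threshold)
    return bright / len(pixels) <= max_bright_fraction
```

A small tolerance rather than an exact all-zero test is meant to survive compression noise or a lone cursor pixel. A pure-Python scan of a full frame is slow-ish, so sampling every Nth byte may be worth considering to stay inside the screenshot time bound.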
Context
- Current MCP server: agent_orchestrator/alas_mcp_server.py (300 lines)
- Blocking login handler: alas_wrapped/module/handler/login.py (300s while 1: loop)
- Related: Issue #21, "arch: Tool Node complexity and feedback loop design" (Tool Node design), describes verification patterns that depend on this foundation
- CLAUDE.md mandates: "MCP-only control path" and "the LLM agent IS the scheduler"