project_items.json
{"items":[{"content":{"body":"## Summary\nTest and debug the three benchmark scenes that were recently created:\n- `scenes/foraging.tscn`\n- `scenes/crafting_chain.tscn`\n- `scenes/team_capture.tscn`\n\n## Tasks\n- [ ] Load each scene in Godot editor and verify they open without errors\n- [ ] Test foraging scene mechanics (resource collection, agent movement)\n- [ ] Test crafting_chain scene mechanics (item crafting, dependencies)\n- [ ] Test team_capture scene mechanics (team coordination, capture points)\n- [ ] Debug any collision detection issues\n- [ ] Verify IPC communication works with each scene\n- [ ] Add visual improvements if needed (markers, labels, UI)\n- [ ] Document any issues or limitations found\n- [ ] Update scene documentation with testing results\n\n## Context\nThese scenes were recently added as empty placeholders in the project. They need to be populated with actual game world elements and tested to ensure they work properly with the agent runtime.\n\n## Priority\nHigh - These are core evaluation benchmarks for the Agent Arena project\n\n## Component\nScenes, Godot/C++, Testing","number":13,"repository":"JustInternetAI/AgentArena","title":"Test and debug benchmark scenes (foraging, crafting_chain, team_capture)","type":"Issue","url":"https://github.com/JustInternetAI/AgentArena/issues/13"},"id":"PVTI_lADODG39W84BHw8kzghD4LU","labels":["enhancement","evals","high-priority"],"repository":"https://github.com/JustInternetAI/AgentArena","status":"Done","title":"Test and debug benchmark scenes (foraging, crafting_chain, team_capture)"},{"content":{"body":"## Description\nTool execution in Godot currently returns stub responses. 
Need to fully implement the tool execution pipeline so agents can actually perform actions in the simulation.\n\n## Background\n- ToolRegistry class exists in C++ module\n- Tools are defined in Python (`python/tools/`)\n- IPC communication is working between Godot and Python\n- Currently returns placeholder responses\n\n## Tasks\n- [ ] Review current tool execution flow\n- [ ] Connect ToolRegistry to IPC system\n- [ ] Implement actual tool execution (not stubs)\n- [ ] Add error handling for tool failures\n- [ ] Test tool execution with movement tools\n- [ ] Test tool execution with inventory tools\n- [ ] Add logging for tool execution events\n- [ ] Document tool execution pipeline\n\n## Acceptance Criteria\n- Agents can execute tools and see real results\n- Tool execution flows from Agent → ToolRegistry → IPC → Python → back\n- Error handling works correctly\n- All existing tools are tested and working\n\n## Assigned To\nJustin Madison","number":16,"repository":"JustInternetAI/AgentArena","title":"Connect tool execution system in Godot","type":"Issue","url":"https://github.com/JustInternetAI/AgentArena/issues/16"},"id":"PVTI_lADODG39W84BHw8kzghVXzc","labels":["enhancement","tools","high-priority"],"repository":"https://github.com/JustInternetAI/AgentArena","status":"Done","title":"Connect tool execution system in Godot"},{"content":{"body":"## Description\nThe three benchmark scenes (foraging, crafting_chain, team_capture) currently exist as empty placeholders. They need to be populated with actual game worlds, objects, and mechanics.\n\n## Scenes to Populate\n1. **Foraging scene** - Resource gathering environment\n2. **Crafting chain scene** - Multi-step crafting mechanics\n3. 
**Team capture scene** - Team-based competitive scenario\n\n## Tasks\n- [ ] Design and implement foraging scene world\n- [ ] Add collectible resources and spawn points\n- [ ] Design and implement crafting chain scene\n- [ ] Create crafting stations and item dependencies\n- [ ] Design and implement team capture scene\n- [ ] Add team mechanics and capture points\n- [ ] Test each scene with agent interactions\n- [ ] Document scene mechanics and objectives\n\n## Acceptance Criteria\n- All three benchmark scenes have functional game worlds\n- Scenes are playable and testable\n- Agent interactions work correctly in each scene\n- Scenes are documented\n\n## Assigned To\nJustin Madison","number":15,"repository":"JustInternetAI/AgentArena","title":"Build out benchmark scenes with game content","type":"Issue","url":"https://github.com/JustInternetAI/AgentArena/issues/15"},"id":"PVTI_lADODG39W84BHw8kzghVXzU","labels":["enhancement","evals","high-priority"],"repository":"https://github.com/JustInternetAI/AgentArena","status":"Done","title":"Build out benchmark scenes with game content"},{"content":{"body":"## Problem\n\nAll benchmark scenes (foraging, crafting_chain, team_capture) duplicate the same agent observation gathering and action execution logic. This violates DRY principles and makes it harder to maintain consistent behavior across scenes.\n\nCurrently each scene manually:\n1. Discovers and tracks SimpleAgent nodes\n2. Connects to agent signals (tool_completed)\n3. Iterates through agents each tick to send observations\n4. 
Handles tool completion callbacks\n\nReference: [scripts/foraging.gd:217-255](https://github.com/JustInternetAI/AgentArena/blob/main/scripts/foraging.gd#L217-L255)\n\n## Proposed Solution\n\nCreate a `SceneController` base class that handles the common agent perception-action loop, allowing scenes to focus only on their domain-specific observation logic.\n\n### Architecture\n\n**New Base Class: `scripts/base_scene_controller.gd`**\n- Auto-discovers SimpleAgent nodes in scene\n- Manages agent lifecycle (signals, tracking)\n- Provides agent iteration on each simulation tick\n- Defines virtual methods for scene-specific logic:\n - `_build_observations_for_agent(agent)` - Override to provide scene observations\n - `_on_agent_tool_completed(tool_name, response, agent)` - Override for scene-specific tool handling\n\n**Updated Scene Scripts**\n- `scripts/foraging.gd` - Extend SceneController instead of Node3D\n- `scripts/crafting_chain.gd` - Extend SceneController\n- `scripts/team_capture.gd` - Extend SceneController\n\n### Benefits\n\n1. **DRY Principle** - Agent loop logic written once, reused everywhere\n2. **Multi-agent Support** - Automatically handles any number of agents\n3. **Scene Focus** - Each scene only implements observation logic specific to its domain\n4. **Easier Testing** - Can test agent loop separately from scene logic\n5. **Consistent Patterns** - All scenes work the same way\n6. 
**Maintainability** - Changes to agent handling only need to be made once\n\n### Example Usage\n\n```gdscript\n# scripts/foraging.gd\nextends SceneController\n\nfunc _build_observations_for_agent(agent: Node) -> Dictionary:\n \"\"\"Build foraging-specific observations\"\"\"\n var agent_pos = agent.global_position\n \n var nearby_resources = []\n for resource in active_resources:\n if not resource.collected:\n nearby_resources.append({\n \"name\": resource.name,\n \"type\": resource.type,\n \"position\": resource.position,\n \"distance\": agent_pos.distance_to(resource.position)\n })\n \n return {\n \"position\": agent_pos,\n \"resources_collected\": resources_collected,\n \"nearby_resources\": nearby_resources,\n \"nearby_hazards\": _get_nearby_hazards(agent_pos),\n \"tick\": simulation_manager.current_tick\n }\n\nfunc _on_agent_tool_completed(tool_name: String, response: Dictionary, agent: Node):\n \"\"\"Handle foraging-specific tool completion\"\"\"\n _check_resource_collection()\n _check_hazard_damage()\n```\n\n## Implementation Steps\n\n1. Create `scripts/base_scene_controller.gd` with agent loop logic\n2. Refactor `scripts/foraging.gd` to extend SceneController\n3. Refactor `scripts/crafting_chain.gd` to extend SceneController\n4. Refactor `scripts/team_capture.gd` to extend SceneController\n5. Test all three benchmark scenes to ensure functionality is preserved\n6. 
Update documentation if needed\n\n## Files to Modify\n\n- **New**: `scripts/base_scene_controller.gd`\n- **Modified**: `scripts/foraging.gd`\n- **Modified**: `scripts/crafting_chain.gd`\n- **Modified**: `scripts/team_capture.gd`\n\n## Acceptance Criteria\n\n- [ ] SceneController base class created with agent discovery and signal management\n- [ ] All three benchmark scenes extend SceneController\n- [ ] Code duplication eliminated (~50+ lines per scene)\n- [ ] All scenes still function correctly (test with existing scenes)\n- [ ] Multi-agent scenarios work properly (team_capture with 4 agents)","number":24,"repository":"JustInternetAI/AgentArena","title":"Create SceneController base class for agent perception-action loop","type":"Issue","url":"https://github.com/JustInternetAI/AgentArena/issues/24"},"id":"PVTI_lADODG39W84BHw8kzghuWhM","repository":"https://github.com/JustInternetAI/AgentArena","status":"Done","title":"Create SceneController base class for agent perception-action loop"},{"content":{"body":"# Agent Backend Integration (Updated)\r\n\r\n## Summary\r\n\r\n**Status: ✅ COMPLETE for Windows Development with llama.cpp backend**\r\n\r\nOriginally focused on vLLM integration, this issue pivoted to prioritize **llama.cpp with GPU acceleration** as the primary backend for Windows development. Both backends are now implemented, tested, and documented.\r\n\r\n---\r\n\r\n## ✅ Completed Work\r\n\r\n### 1. 
llama.cpp Backend (Primary Windows Solution)\r\n\r\n**Performance Verified:**\r\n- ✅ GPU acceleration with CUDA (RTX 3090: **113 tokens/second**)\r\n- ✅ CPU fallback mode (9 tokens/second)\r\n- ✅ Configurable GPU layer offloading (0=CPU, -1=Full GPU)\r\n- ✅ Q4_K_M quantization support\r\n\r\n**Files Implemented:**\r\n- `python/backends/llama_cpp_backend.py` - Backend implementation\r\n- `python/backends/base.py` - Backend interface\r\n- `python/test_quick_gpu.py` - GPU verification (113 tok/s verified)\r\n- `python/test_agent_gpu.py` - Full agent decision-making test\r\n- `docs/llama_cpp_gpu_setup.md` - GPU setup guide\r\n- `docs/llama_cpp_windows_setup.md` - Windows-specific guide\r\n\r\n### 2. vLLM Backend (Linux/Production)\r\n\r\n**Status:** ✅ Implemented, available for Linux environments\r\n\r\n**Files:**\r\n- `python/backends/vllm_backend.py` - vLLM implementation\r\n- Backend interface compatible with llama.cpp\r\n\r\n### 3. Agent Runtime System\r\n\r\n**Features:**\r\n- ✅ LLM-driven decision making\r\n- ✅ Tool execution via structured output parsing\r\n- ✅ Memory management (short-term FIFO queue)\r\n- ✅ Multi-agent support\r\n\r\n**Files:**\r\n- `python/agent_runtime/agent.py`\r\n- `python/agent_runtime/tool_dispatcher.py`\r\n- `python/agent_runtime/runtime.py`\r\n\r\n### 4. IPC Communication (Godot ↔ Python)\r\n\r\n**Endpoints:**\r\n- ✅ `GET /health` - Server health check\r\n- ✅ `POST /tools/execute` - Direct tool execution (tested & working)\r\n- ✅ `POST /tick` - Agent observation/action loop\r\n- ✅ `POST /agents/register` - Agent registration\r\n\r\n**Files:**\r\n- `python/ipc/server.py` - FastAPI IPC server\r\n- `godot/src/agent_arena.cpp` - C++ IPCClient\r\n- `godot/include/agent_arena.h` - IPCClient interface\r\n\r\n### 5. 
Testing Infrastructure\r\n\r\n**Test Scenes:**\r\n- ✅ `scenes/tests/test_tool_execution_simple.tscn` - HTTP tool testing (5/5 passing)\r\n- ✅ `scenes/tests/test_tool_execution.tscn` - C++ integration testing\r\n\r\n**Python Tests:**\r\n- ✅ `python/test_quick_gpu.py` - GPU verification\r\n- ✅ `python/test_agent_gpu.py` - Agent decision-making (3 scenarios)\r\n\r\n**Batch Files:**\r\n- ✅ `START_IPC_SERVER.bat` - Quick IPC server\r\n- ✅ `START_GPU_IPC_SERVER.bat` - GPU-accelerated agent server\r\n- ✅ `python/run_ipc_server_with_gpu.py` - Full agent runtime\r\n\r\n### 6. Documentation\r\n\r\n- ✅ `TESTING_AGENT_WITH_GPU.md` - Complete testing workflow\r\n- ✅ `docs/llama_cpp_gpu_setup.md` - GPU acceleration guide\r\n- ✅ `docs/llama_cpp_windows_setup.md` - Windows setup\r\n\r\n---\r\n\r\n## 📊 Performance Metrics\r\n\r\n| Configuration | Speed | Decision Time | Use Case |\r\n|--------------|-------|---------------|----------|\r\n| CPU Only (Q4_K_M) | ~9 tok/s | ~15-20s | Development/testing |\r\n| **GPU Full (RTX 3090)** | **~113 tok/s** | **1-2s** | **Production** |\r\n| Recommended Tick Rate | 0.5-1 Hz | - | Real-time simulation |\r\n\r\n---\r\n\r\n## 🎯 Why llama.cpp for Windows?\r\n\r\n1. **Native Windows Support** - Pre-built wheels, minimal setup\r\n2. **CUDA Integration** - Seamless NVIDIA GPU support on Windows\r\n3. **GGUF Models** - Efficient quantization for consumer GPUs\r\n4. **Development Friendly** - Faster iteration, easier debugging\r\n5. 
**Lower Dependencies** - Fewer system requirements vs vLLM\r\n\r\n**Note:** vLLM remains available for Linux/production deployments with maximum throughput requirements.\r\n\r\n---\r\n\r\n## 📋 Remaining Integration Tasks\r\n\r\n### Scene Integration\r\n- [ ] Connect `/tick` endpoint to `foraging.tscn`\r\n- [ ] Connect `/tick` endpoint to `crafting_chain.tscn`\r\n- [ ] Connect `/tick` endpoint to `team_capture.tscn`\r\n- [ ] Add agent registration in scene `_ready()` functions\r\n\r\n### Advanced Features\r\n- [ ] Implement visual perception (beyond text observations)\r\n- [ ] FAISS-based long-term memory integration\r\n- [ ] Multi-agent coordination primitives\r\n- [ ] Response caching for performance optimization\r\n\r\n---\r\n\r\n## 🚀 Quick Start\r\n\r\n### Test IPC + Tools (No LLM)\r\n```bash\r\nSTART_IPC_SERVER.bat\r\n```\r\nRun `scenes/tests/test_tool_execution_simple.tscn` in Godot (F6)\r\n\r\n### Test GPU Backend (Python Only)\r\n```bash\r\ncd python && venv\\Scripts\\activate\r\npython test_agent_gpu.py\r\n```\r\n\r\n### Test Full Integration (Godot + GPU)\r\n```bash\r\nSTART_GPU_IPC_SERVER.bat\r\n```\r\n\r\n---\r\n\r\n## Git Commits\r\n\r\n- `eef338f` - GPU acceleration for llama.cpp backend\r\n- `e6f1890` - Quick GPU test (113 tok/s verified)\r\n- `5069a73` - llama.cpp Windows setup\r\n- `72a7e62` - vLLM backend integration\r\n\r\n---\r\n\r\n## Related Issues\r\n\r\n- #18 - MCP Integration & Architecture (requirements addressed)\r\n- #16 - Tool execution system (CLOSED ✓)\r\n\r\n---\r\n\r\n## Component\r\n\r\nBackends\r\n\r\n## Size\r\n\r\n~~L (3-5 days)~~ → **COMPLETED**\r\n\r\n## Priority\r\n\r\nHigh\r\n\r\n## Status\r\n\r\n**✅ COMPLETE for Windows Development**\r\n**🔨 Integration with benchmark scenes in progress**\r\n\r\n---\r\n\r\n**Last Updated:** 2025-11-19\r\n**Branch:** `AndrewDevelopment`\r\n**Next Phase:** Scene integration for perception-decision-action loop\r\n","number":6,"repository":"JustInternetAI/AgentArena","title":"Agent Backend 
Integration","type":"Issue","url":"https://github.com/JustInternetAI/AgentArena/issues/6"},"id":"PVTI_lADODG39W84BHw8kzghADbk","labels":["enhancement","backend","high-priority"],"repository":"https://github.com/JustInternetAI/AgentArena","status":"Done","title":"Agent Backend Integration"},{"content":{"body":"## Overview\n\nImplement a test to validate the observation-decision loop, which is the missing piece in our end-to-end pipeline testing. This test will verify that game observations can be sent to the Python backend, processed into decisions, and returned to Godot - without executing actual agent movement.\n\n## Current Status\n\n### Already Tested ✅\n- Backend connectivity (test_autoload_services.gd)\n- Tool execution (test_tool_execution.gd)\n- HTTP communication (test_tool_execution_simple.tscn)\n\n### Missing ❌\n- Observation-based decision loop\n- Perception → Decision → Action cycle\n- Continuous tick loop with backend\n\n## Architecture Flow\n\n```\n┌─────────────────────────────┐\n│ Simplified Test Scene │\n│ - Mock agent position │\n│ - Mock resources/hazards │\n└────────────┬────────────────┘\n │ Build observations\n ▼\n┌─────────────────────────────┐\n│ Observation Dictionary │\n│ - position: [x,y,z] │\n│ - nearby_resources: [] │\n│ - nearby_hazards: [] │\n└────────────┬────────────────┘\n │ HTTP POST /observe\n ▼\n┌─────────────────────────────┐\n│ Python Backend │\n│ - Mock decision logic │\n│ - Returns tool + params │\n└────────────┬────────────────┘\n │ JSON response\n ▼\n┌─────────────────────────────┐\n│ Test Scene │\n│ - Log decision │\n│ - Don't execute │\n│ - Continue loop │\n└─────────────────────────────┘\n```\n\n## Implementation Tasks\n\n### Phase 1: Backend Endpoint (30 min)\n\n**File:** `python/ipc/server.py`\n\n- [ ] Add `/observe` POST endpoint\n- [ ] Implement `make_mock_decision()` function with rule-based logic:\n - Priority 1: Avoid nearby hazards (distance < 3.0)\n - Priority 2: Move to nearest resource (distance < 5.0)\n - 
Default: Idle\n- [ ] Return decision with tool name, params, and reasoning\n- [ ] Add logging for debugging\n\n### Phase 2: Test Scene (1 hour)\n\n**File:** `scripts/tests/test_observation_loop.gd`\n\n- [ ] Create test script extending Node\n- [ ] Add mock foraging data (agent position, resources, hazards)\n- [ ] Implement `build_observation()` to create observation dict\n- [ ] Implement `send_observation()` using HTTPRequest\n- [ ] Process 10 ticks with 0.5s delay between each\n- [ ] Log observations sent and decisions received\n- [ ] Add keyboard controls (Q to quit)\n\n**File:** `scenes/tests/test_observation_loop.tscn`\n\n- [ ] Create simple scene with test script node\n- [ ] No 3D environment needed\n\n### Phase 3: Documentation\n\n**File:** `scenes/tests/README.md`\n\n- [ ] Add section for test_observation_loop.tscn\n- [ ] Document purpose, how to run, expected output\n- [ ] Add to test suite list\n\n## Mock Decision Logic\n\n```python\ndef make_mock_decision(obs: dict) -> dict:\n nearby_resources = obs.get(\"nearby_resources\", [])\n nearby_hazards = obs.get(\"nearby_hazards\", [])\n \n # Priority 1: Avoid hazards\n for hazard in nearby_hazards:\n if hazard[\"distance\"] < 3.0:\n return {\n \"tool\": \"move_away\",\n \"params\": {\"from_position\": hazard[\"position\"]},\n \"reasoning\": f\"Avoiding {hazard['type']} hazard\"\n }\n \n # Priority 2: Collect resources\n if nearby_resources:\n closest = min(nearby_resources, key=lambda r: r[\"distance\"])\n if closest[\"distance\"] < 5.0:\n return {\n \"tool\": \"move_to\",\n \"params\": {\"target_position\": closest[\"position\"]},\n \"reasoning\": f\"Moving to collect {closest['type']}\"\n }\n \n # Default: idle\n return {\n \"tool\": \"idle\",\n \"params\": {},\n \"reasoning\": \"No immediate actions needed\"\n }\n```\n\n## Success Criteria\n\n- [ ] `/observe` endpoint responds to POST requests\n- [ ] Mock decision logic returns valid actions\n- [ ] Test scene runs for 10 ticks without errors\n- [ ] Each tick 
sends observation and receives decision\n- [ ] Decisions logged to console with clear formatting\n- [ ] Decisions make sense based on mock game state\n- [ ] No crashes, memory leaks, or hangs\n\n## Testing Steps\n\n1. Start Python IPC server: `START_IPC_SERVER.bat`\n2. Open `scenes/tests/test_observation_loop.tscn` in Godot\n3. Press F6 to run\n4. Watch console for 10 ticks of observations and decisions\n5. Verify decisions match expected behavior\n6. Press Q to quit\n\n## Expected Console Output\n\n```\n=== Observation-Decision Loop Test ===\nWaiting for backend connection...\n✓ Connected to backend!\n\n=== Starting Observation Loop ===\nRunning 10 ticks...\n\n--- Tick 0 ---\nObservation:\n Position: (0, 0, 0)\n Nearby resources: 2\n Nearby hazards: 1\n✓ Decision received:\n Tool: move_away\n Reasoning: Avoiding fire hazard\n\n--- Tick 1 ---\n...\n\n=== Test Complete ===\nAll 10 ticks processed successfully!\nPress Q to quit\n```\n\n## Out of Scope (for this issue)\n\n- ❌ Actual agent movement execution\n- ❌ Real LLM integration (using mock logic)\n- ❌ Integration with foraging.gd (separate task)\n- ❌ Multiple agents\n- ❌ Physics/collision\n\n## Follow-Up Tasks\n\nAfter this test passes:\n1. Integrate observation loop into foraging scene\n2. Replace mock decisions with real LLM backend\n3. Implement actual movement execution\n4. 
Add multi-agent support\n\n## Estimated Time\n\n- Phase 1: 30 minutes\n- Phase 2: 1 hour\n- Phase 3: 30 minutes\n- **Total: ~2 hours**\n\n## Related Files\n\n- `scripts/tests/test_tool_execution.gd` - Reference for test structure\n- `scripts/foraging.gd` - Will eventually integrate this pattern\n- `python/ipc/server.py` - Backend server to modify\n- `python/tools/movement.py` - Example tool implementations","number":28,"repository":"JustInternetAI/AgentArena","title":"Implement Observation-Decision Loop Test (End-to-End Pipeline)","type":"Issue","url":"https://github.com/JustInternetAI/AgentArena/issues/28"},"id":"PVTI_lADODG39W84BHw8kzgh1pgU","repository":"https://github.com/JustInternetAI/AgentArena","status":"Done","title":"Implement Observation-Decision Loop Test (End-to-End Pipeline)"},{"content":{"body":"## Description\nThe foraging benchmark scene needs to be populated with actual game world content, resources, and hazards to create a functional resource gathering environment.\n\n## Parent Issue\nSplit from #15 - Build out benchmark scenes with game content\n\n## Scene Overview\n**Goal**: Agent must collect resources (berries, wood, stone) while avoiding hazards (fire, pits)\n\n**Current State**: Scene structure exists with SceneController base class\n\n## Tasks\n- [ ] Design world layout (terrain, obstacles, paths)\n- [ ] Add collectible resource nodes (berries, wood, stone)\n - [ ] Place 3+ berry bushes in accessible locations\n - [ ] Place 2+ wood piles\n - [ ] Place 2+ stone deposits\n- [ ] Add hazard nodes (fire, pits)\n - [ ] Place 2+ hazard zones that block optimal paths\n - [ ] Configure hazard damage values\n- [ ] Set up resource spawn points and respawn logic (if needed)\n- [ ] Add visual indicators for resources and hazards\n- [ ] Test agent interaction with resources\n - [ ] Verify collection radius works correctly\n - [ ] Verify hazard damage is applied\n - [ ] Verify metrics tracking (resources collected, damage taken)\n- [ ] Balance difficulty 
(resource placement, hazard positioning)\n- [ ] Document scene layout and mechanics in comments\n\n## Acceptance Criteria\n- [ ] Foraging scene has at least 7 collectible resources total\n- [ ] Scene includes hazards that create risk/reward decisions\n- [ ] Agent can successfully collect all resources\n- [ ] Metrics correctly track resources collected and damage taken\n- [ ] Scene is playable and demonstrates foraging behavior\n- [ ] Scene layout is documented\n\n## Technical Notes\n- Scene script: `scripts/foraging.gd` (extends SceneController)\n- Scene file: `scenes/foraging.tscn`\n- Resource collection radius: 2.0 units\n- Hazard damage radius: 1.5 units\n- MAX_RESOURCES = 7\n\n## Related\n- Parent: #15\n- Related: SceneController implementation (#24)","number":25,"repository":"JustInternetAI/AgentArena","title":"Populate Foraging Scene with game content","type":"Issue","url":"https://github.com/JustInternetAI/AgentArena/issues/25"},"id":"PVTI_lADODG39W84BHw8kzgh0sik","labels":["enhancement","evals"],"repository":"https://github.com/JustInternetAI/AgentArena","status":"Done","title":"Populate Foraging Scene with game content"},{"content":{"body":"# Model Download and Management Tool\r\n\r\n## Description\r\n\r\nAutomated model downloading and version management for LLM models used by Agent Arena backends (llama.cpp, vLLM, etc.).\r\n\r\nThis tool simplifies the process of obtaining and managing different LLM models for development and testing. 
It will:\r\n- Download models from Hugging Face Hub in GGUF format (for llama.cpp)\r\n- Download models in PyTorch/safetensors format (for vLLM)\r\n- Verify model integrity with checksums\r\n- Cache models locally to avoid re-downloading\r\n- Provide a CLI for easy model management\r\n\r\n## Tasks\r\n\r\n### Core Implementation\r\n- [ ] Create `python/tools/model_manager.py`\r\n- [ ] Implement `ModelManager` class with methods:\r\n - [ ] `download_model(model_id: str, format: str = \"gguf\")` - Download from HF Hub\r\n - [ ] `list_models()` - Show cached models\r\n - [ ] `verify_model(model_path: str)` - Check integrity\r\n - [ ] `remove_model(model_id: str)` - Delete cached model\r\n - [ ] `get_model_path(model_id: str)` - Get local path\r\n\r\n### Hugging Face Integration\r\n- [ ] Support Hugging Face Hub downloads using `huggingface_hub` library\r\n- [ ] Handle authentication for gated models (optional)\r\n- [ ] Support model variants (Q4, Q5, Q8 quantizations)\r\n- [ ] Parse model metadata from HF model cards\r\n\r\n### Model Format Support\r\n- [ ] Support GGUF models for llama.cpp backend (Issue #6 dependency)\r\n- [ ] Support PyTorch/safetensors for vLLM backend (Issue #6 dependency)\r\n- [ ] Detect model format automatically from file extension\r\n- [ ] Convert between formats if needed (optional, future)\r\n\r\n### Verification & Caching\r\n- [ ] Verify model checksums (SHA256) after download\r\n- [ ] Cache models in `models/` directory with organized structure:\r\n ```\r\n models/\r\n ├── llama-2-7b-chat/\r\n │ ├── gguf/\r\n │ │ ├── q4_k_m/\r\n │ │ └── q5_k_m/\r\n │ └── pytorch/\r\n └── mistral-7b-instruct/\r\n └── gguf/\r\n ```\r\n- [ ] Resume interrupted downloads\r\n- [ ] Skip re-downloading if model already exists and is valid\r\n\r\n### CLI Tool\r\n- [ ] Create command-line interface:\r\n ```bash\r\n # Download a model\r\n python -m tools.model_manager download llama-2-7b-chat --format gguf --quant q4_k_m\r\n\r\n # List cached models\r\n python -m 
tools.model_manager list\r\n\r\n # Verify a model\r\n python -m tools.model_manager verify llama-2-7b-chat\r\n\r\n # Remove a model\r\n python -m tools.model_manager remove llama-2-7b-chat\r\n ```\r\n- [ ] Add progress bars for downloads (using `tqdm`)\r\n- [ ] Support verbose/quiet modes\r\n\r\n### Configuration\r\n- [ ] Add model registry in `configs/models.yaml`:\r\n ```yaml\r\n models:\r\n llama-2-7b-chat:\r\n huggingface_id: \"TheBloke/Llama-2-7B-Chat-GGUF\"\r\n formats:\r\n gguf:\r\n q4_k_m:\r\n file: \"llama-2-7b-chat.Q4_K_M.gguf\"\r\n sha256: \"abc123...\"\r\n q5_k_m:\r\n file: \"llama-2-7b-chat.Q5_K_M.gguf\"\r\n sha256: \"def456...\"\r\n\r\n mistral-7b-instruct:\r\n huggingface_id: \"TheBloke/Mistral-7B-Instruct-v0.2-GGUF\"\r\n formats:\r\n gguf:\r\n q4_k_m:\r\n file: \"mistral-7b-instruct-v0.2.Q4_K_M.gguf\"\r\n sha256: \"ghi789...\"\r\n ```\r\n\r\n### Documentation\r\n- [ ] List compatible models in `README.md`\r\n- [ ] Create `docs/model_management.md` with:\r\n - [ ] Supported models and formats\r\n - [ ] How to add custom models\r\n - [ ] Storage requirements per model\r\n - [ ] Performance characteristics (speed, quality tradeoffs)\r\n- [ ] Add troubleshooting guide for common download issues\r\n\r\n### Testing\r\n- [ ] Create unit tests in `tests/test_model_manager.py`:\r\n - [ ] Test download functionality (with mock HF Hub)\r\n - [ ] Test checksum verification\r\n - [ ] Test caching logic\r\n - [ ] Test model listing and removal\r\n- [ ] Integration test: Download a small model end-to-end\r\n\r\n## Component\r\n\r\nTools\r\n\r\n## Size\r\n\r\nM (1-3 days)\r\n\r\n## Priority\r\n\r\nHigh\r\n\r\n## Dependencies\r\n\r\n- Requires `huggingface_hub` library (add to `requirements.txt`)\r\n- Requires `tqdm` for progress bars\r\n- Should work with both llama.cpp and vLLM backends (Issues #6, existing llama.cpp)\r\n\r\n## Relationship to MCP (Issue #18)\r\n\r\nThis tool is **independent of MCP** and does not need changes. 
It's a utility for managing models that backends use. MCP integration doesn't affect model downloading/management.\r\n\r\nHowever, in the future, you could expose model management as MCP tools:\r\n- `mcp_tool:download_model` - Download a model\r\n- `mcp_tool:list_models` - List available models\r\n- `mcp_resource:models` - Access model metadata\r\n\r\n**For now**: Keep this as a standalone CLI tool. MCP exposure is optional future work.\r\n\r\n## Success Criteria\r\n\r\n- [ ] Can download GGUF models from Hugging Face Hub\r\n- [ ] Can download models for vLLM (safetensors/PyTorch)\r\n- [ ] Checksums verify correctly\r\n- [ ] Models are cached and not re-downloaded\r\n- [ ] CLI is user-friendly with clear progress indicators\r\n- [ ] Documentation lists at least 5 recommended models with specs\r\n- [ ] Tool works on Windows, Linux, and macOS\r\n\r\n## Recommended Models to Support\r\n\r\n### Small Models (Development/Testing)\r\n- **Phi-2** (2.7B) - Fast, good for testing\r\n- **TinyLlama** (1.1B) - Extremely fast, basic capabilities\r\n\r\n### Production Models\r\n- **Llama-2-7B-Chat** - Good balance of speed/quality\r\n- **Mistral-7B-Instruct** - High quality, fast inference\r\n- **Llama-3-8B-Instruct** - Latest, best quality in class\r\n\r\n### Large Models (High Quality)\r\n- **Llama-2-13B-Chat** - Better reasoning\r\n- **Mixtral-8x7B-Instruct** - MoE architecture, excellent quality\r\n\r\n## Example Usage\r\n\r\n```bash\r\n# Initial setup - download a model\r\n$ python -m tools.model_manager download llama-2-7b-chat --format gguf --quant q4_k_m\r\n\r\nDownloading llama-2-7b-chat (Q4_K_M quantization)...\r\nSource: TheBloke/Llama-2-7B-Chat-GGUF\r\nFile: llama-2-7b-chat.Q4_K_M.gguf (3.8 GB)\r\n\r\n[████████████████████████████████████] 100% - 3.8 GB/3.8 GB - 15.2 MB/s\r\n\r\n✓ Download complete\r\n✓ Checksum verified (SHA256)\r\n✓ Model cached at: models/llama-2-7b-chat/gguf/q4_k_m/\r\n\r\n# List downloaded models\r\n$ python -m tools.model_manager 
list\r\n\r\nCached Models:\r\n┌────────────────────┬────────┬──────────┬────────┐\r\n│ Model │ Format │ Quant │ Size │\r\n├────────────────────┼────────┼──────────┼────────┤\r\n│ llama-2-7b-chat │ gguf │ q4_k_m │ 3.8 GB │\r\n│ mistral-7b-instruct│ gguf │ q5_k_m │ 5.1 GB │\r\n└────────────────────┴────────┴──────────┴────────┘\r\n\r\nTotal storage: 8.9 GB\r\n\r\n# Use in config\r\n$ cat configs/backend/llama_cpp.yaml\r\nbackend:\r\n type: llama_cpp\r\n model_path: \"models/llama-2-7b-chat/gguf/q4_k_m/llama-2-7b-chat.Q4_K_M.gguf\"\r\n n_ctx: 4096\r\n```\r\n\r\n## Notes\r\n\r\n- Start with GGUF support for llama.cpp (most immediate need)\r\n- Add vLLM model support when Issue #6 is implemented\r\n- Consider rate limiting to respect Hugging Face Hub API limits\r\n- Large models (13B+) may take significant time and storage\r\n- Provide clear disk space requirements in documentation\r\n","number":7,"repository":"JustInternetAI/AgentArena","title":"Model Download and Management Tool","type":"Issue","url":"https://github.com/JustInternetAI/AgentArena/issues/7"},"id":"PVTI_lADODG39W84BHw8kzghADcA","labels":["enhancement","tools","high-priority"],"linked pull requests":["https://github.com/JustInternetAI/AgentArena/pull/21"],"repository":"https://github.com/JustInternetAI/AgentArena","status":"Done","title":"Model Download and Management Tool"},{"content":{"body":"## Summary\nDefine the AgentMemory abstract interface and provide built-in implementations that users can choose from or extend.\n\n## Why Third?\nDepends on Phase 1 (schemas) for Observation type. Built-in behaviors (Phase 5) will use these memory implementations.\n\n## Deliverables\n\n### 1. Create directory structure\n```\npython/agent_runtime/memory/\n├── __init__.py\n├── base.py # AgentMemory ABC\n├── sliding_window.py # Simple FIFO memory\n├── summarizing.py # LLM-compressed memory\n└── rag.py # Vector store retrieval (stub)\n```\n\n### 2. 
AgentMemory ABC (base.py)\nAbstract interface with methods:\n- `store(observation)` - Store an observation\n- `retrieve(query=None)` - Get relevant observations\n- `summarize()` - Return string for LLM context\n- `clear()` - Reset memory\n\n### 3. SlidingWindowMemory (sliding_window.py)\nSimple FIFO memory keeping N most recent observations.\n- Configurable capacity\n- Good for beginners and simple scenarios\n\n### 4. SummarizingMemory (summarizing.py)\nUses LLM to compress observations into running summary.\n- Keeps summary + small window of recent raw observations\n- Compresses periodically when buffer fills\n- Good for long episodes\n\n### 5. RAGMemory (rag.py) - STUB\nVector store memory with semantic retrieval.\n- Raises NotImplementedError for now\n- Planned for future phase with FAISS integration\n\n### 6. Tests\n- Capacity enforcement\n- Summarize output format\n- Compression trigger\n- Clear resets state\n\n## Acceptance Criteria\n- [ ] AgentMemory ABC defined with clear contract\n- [ ] SlidingWindowMemory fully functional\n- [ ] SummarizingMemory functional (with backend)\n- [ ] RAGMemory stubbed with clear error message\n- [ ] Good docstrings with examples\n- [ ] Unit tests pass\n\n## Dependencies\n- Phase 1: Core Schemas (#37)\n\n## Blocks\n- Phase 5 (Built-in behaviors)","number":39,"repository":"JustInternetAI/AgentArena","title":"[Refactor] Phase 3: AgentMemory Interface and Implementations","type":"Issue","url":"https://github.com/JustInternetAI/AgentArena/issues/39"},"id":"PVTI_lADODG39W84BHw8kzgkG3t8","labels":["enhancement","memory","architecture"],"repository":"https://github.com/JustInternetAI/AgentArena","status":"Done","title":"[Refactor] Phase 3: AgentMemory Interface and Implementations"},{"assignees":["justinmadison"],"content":{"body":"## Overview\n\nIntegrate the validated observation-decision loop (from #28) into the real foraging scene. 
This will connect the production game state to backend AI decision-making, without executing actual movement yet.\n\n## Background\n\nIssue #28 successfully validated the end-to-end pipeline with a test scene. Now we need to integrate this into the production foraging benchmark scene to test with real game state.\n\n## Current State\n\n**Foraging scene already has:**\n- ✅ `_build_observations_for_agent()` - Builds observations (position, resources, hazards)\n- ✅ `_on_scene_tick()` - Called every simulation tick\n- ✅ Resource/hazard tracking with distances\n- ✅ Agent discovery and management via SceneController\n\n**What's missing:**\n- ❌ Sending observations to backend\n- ❌ Receiving decisions from backend\n- ❌ Logging decisions (without executing movement)\n\n## Implementation Tasks\n\n### Phase 1: Add Backend Communication (30 min)\n\n**File:** `scripts/foraging.gd`\n\n- [ ] Add decision tracking variables:\n - `backend_decisions: Array[Dictionary]` - Track all decisions for analysis\n - `waiting_for_decision: bool` - Prevent concurrent requests\n\n- [ ] Modify `_on_scene_tick()` to request backend decisions:\n ```gdscript\n if agents.size() > 0 and not waiting_for_decision:\n _request_backend_decision(agents[0])\n ```\n\n- [ ] Implement `_request_backend_decision()` method:\n - Create HTTPRequest node\n - Build observation using existing `_build_observations_for_agent()`\n - Send POST to `http://127.0.0.1:5000/observe`\n - Connect to `_on_decision_received()` callback\n\n- [ ] Implement `_on_decision_received()` method:\n - Parse JSON response\n - Log decision to console\n - Store in `backend_decisions` array\n - Emit event via EventBus\n\n- [ ] Implement `_log_backend_decision()` method:\n - Print formatted decision to console\n - Show tool, params, reasoning\n - Emit tracking event\n\n### Phase 2: Add Metrics Display (15 min)\n\n- [ ] Update `_update_metrics_ui()` to show last backend decision:\n - Display tool name\n - Display reasoning\n - Update every 
tick\n\n### Phase 3: Add Reset Handler (5 min)\n\n- [ ] Update `_reset_scene()` to clear decision state:\n - Clear `backend_decisions` array\n - Reset `waiting_for_decision` flag\n\n## Expected Console Output\n\n```\nForaging Benchmark Scene Ready!\nResources available: 7\nHazards: 2\n\n[Backend Decision - Tick 0]\n Agent: agent_0\n Tool: move_to\n Params: {target_position:[-1.5, 0, -1.5], speed:2.0}\n Reasoning: Avoiding nearby fire hazard at distance 2.8\n\n✓ Collected Berry1 (berry) (1/7)\n\n[Backend Decision - Tick 5]\n Agent: agent_0\n Tool: move_to\n Params: {target_position:[5.2, 0, 3.1], speed:1.5}\n Reasoning: Moving to collect berry (Berry2) at distance 4.2\n\n[Backend Decision - Tick 20]\n Agent: agent_0\n Tool: idle\n Params: {}\n Reasoning: No immediate actions needed - exploring environment\n```\n\n## Testing Strategy\n\n### Test 1: Manual Step-Through\n1. Start Python IPC server\n2. Open foraging scene in Godot\n3. Press **S** to step one tick\n4. Check console for \"Backend Decision\" log\n5. Verify decision makes sense based on game state\n\n### Test 2: Full Simulation Run\n1. Press **SPACE** to start simulation\n2. Let it run for 30 ticks or until complete\n3. Check that decisions are logged every tick\n4. 
Verify decisions change as game state changes\n\n### Test 3: Decision Quality\nExpected behavior:\n- **Early ticks:** Avoid hazards if too close\n- **Mid ticks:** Move toward nearest resource\n- **Late ticks:** Idle when no resources in range or all collected\n\n## Success Criteria\n\n- [ ] Observations sent to backend every tick\n- [ ] Decisions received and logged (not executed)\n- [ ] Decisions visible in metrics UI\n- [ ] Decisions make logical sense based on game state\n- [ ] No crashes or performance issues\n- [ ] Backend decisions tracked in `backend_decisions` array\n- [ ] Scene can be reset cleanly\n- [ ] Manual resource collection still works (backward compatible)\n\n## Advantages\n\n✅ **Low Risk:** No movement execution, just observation \n✅ **Validates Pipeline:** Tests real game data → backend → decisions \n✅ **Easy to Debug:** All decisions logged to console \n✅ **Keeps Existing Tests:** Doesn't break manual resource collection \n✅ **Metrics Tracking:** Decision history stored for analysis \n\n## Files to Modify\n\n- `scripts/foraging.gd` - Add backend communication (~80 lines, 4 new methods)\n\n## Estimated Time\n\n- **Phase 1:** 30 minutes (backend communication)\n- **Phase 2:** 15 minutes (metrics display)\n- **Phase 3:** 5 minutes (reset handling)\n- **Testing:** 15 minutes\n\n**Total:** ~1 hour\n\n## Follow-Up Tasks\n\nAfter this is complete:\n1. Add movement execution (execute backend decisions)\n2. Replace mock decisions with real LLM\n3. 
Test with multiple agents\n\n## Dependencies\n\n- Requires #28 (observation-decision loop test) to be complete\n- Requires Python IPC server running\n- Requires `/observe` endpoint from #28\n\n## Related Issues\n\n- #28 - Observation-decision loop test (prerequisite)","number":30,"repository":"JustInternetAI/AgentArena","title":"Integrate observation-decision loop into foraging scene","type":"Issue","url":"https://github.com/JustInternetAI/AgentArena/issues/30"},"id":"PVTI_lADODG39W84BHw8kzgh4Jus","repository":"https://github.com/JustInternetAI/AgentArena","status":"Done","title":"Integrate observation-decision loop into foraging scene"},{"content":{"body":"## Summary\nDefine the core `AgentBehavior` abstract base class and the simplified `SimpleAgentBehavior` for beginners. This establishes the contract that all user agents must fulfill.\n\n## Why Second?\nDepends on Phase 1 (schemas). All built-in behaviors and user agents will implement this interface.\n\n## Deliverables\n\n### 1. Create `python/agent_runtime/behavior.py`\n\n```python\nfrom abc import ABC, abstractmethod\nfrom typing import TYPE_CHECKING\n\nif TYPE_CHECKING:\n from .schemas import Observation, AgentDecision, SimpleContext, ToolSchema\n\nclass AgentBehavior(ABC):\n \"\"\"\n Base class for agent decision-making logic.\n \n Users implement this to create custom agents. 
The framework calls\n `decide()` each tick with the current observation and available tools.\n \n Example:\n class MyAgent(AgentBehavior):\n def __init__(self, backend):\n self.backend = backend\n self.memory = SlidingWindowMemory(capacity=10)\n \n def decide(self, observation, tools):\n self.memory.store(observation)\n prompt = self._build_prompt(observation)\n response = self.backend.generate(prompt)\n return AgentDecision.from_llm_response(response)\n \"\"\"\n \n @abstractmethod\n def decide(self, observation: \"Observation\", tools: list[\"ToolSchema\"]) -> \"AgentDecision\":\n \"\"\"\n Decide what action to take given the current observation.\n \n Args:\n observation: Current tick's observation from Godot\n tools: List of available tools with their schemas\n \n Returns:\n AgentDecision specifying which tool to call and with what parameters\n \"\"\"\n pass\n \n def on_tool_result(self, tool: str, result: dict) -> None:\n \"\"\"\n Called after a tool execution completes.\n \n Override to react to tool results (e.g., update memory, adjust strategy).\n \n Args:\n tool: Name of the tool that was executed\n result: Result dictionary from the tool\n \"\"\"\n pass\n \n def on_episode_start(self) -> None:\n \"\"\"\n Called when a new episode begins.\n \n Override to reset state, clear memory, etc.\n \"\"\"\n pass\n \n def on_episode_end(self, success: bool, metrics: dict | None = None) -> None:\n \"\"\"\n Called when an episode ends.\n \n Override to perform cleanup, learning, or logging.\n \n Args:\n success: Whether the episode goal was achieved\n metrics: Optional metrics from the scenario\n \"\"\"\n pass\n\n\nclass SimpleAgentBehavior(AgentBehavior):\n \"\"\"\n Simplified interface for beginners.\n \n Users only need to implement `decide()` returning a tool name.\n The framework handles memory, prompts, and parameter inference.\n \n Example:\n class MyFirstAgent(SimpleAgentBehavior):\n system_prompt = \"You are a foraging agent. 
Collect apples.\"\n \n def decide(self, context):\n if context.nearby_resources:\n return \"move_to\" # Framework infers target\n return \"idle\"\n \"\"\"\n \n # User can override these class attributes\n system_prompt: str = \"You are an autonomous agent.\"\n memory_capacity: int = 10\n \n def __init__(self):\n self._observations: list = []\n self._goal: str | None = None\n \n @abstractmethod\n def decide(self, context: \"SimpleContext\") -> str:\n \"\"\"\n Decide which tool to use.\n \n Args:\n context: Simplified context with key information\n \n Returns:\n Name of the tool to execute (e.g., \"move_to\", \"pickup\", \"idle\")\n \"\"\"\n pass\n \n def set_goal(self, goal: str) -> None:\n \"\"\"Set the current goal for the agent.\"\"\"\n self._goal = goal\n \n # Internal: converts full interface to simple interface\n def _internal_decide(self, observation: \"Observation\", tools: list[\"ToolSchema\"]) -> \"AgentDecision\":\n \"\"\"Framework calls this; converts to SimpleContext and calls user's decide().\"\"\"\n from .schemas import SimpleContext, AgentDecision\n \n # Store observation for memory\n self._observations.append(observation)\n if len(self._observations) > self.memory_capacity:\n self._observations = self._observations[-self.memory_capacity:]\n \n # Build simple context\n context = SimpleContext.from_observation(observation, self._goal)\n \n # Get tool name from user\n tool_name = self.decide(context)\n \n # Infer parameters based on context\n params = self._infer_parameters(tool_name, context, tools)\n \n return AgentDecision(tool=tool_name, params=params)\n \n def _infer_parameters(self, tool_name: str, context: \"SimpleContext\", tools: list[\"ToolSchema\"]) -> dict:\n \"\"\"Infer tool parameters from context.\"\"\"\n if tool_name == \"move_to\" and context.nearby_resources:\n # Move to nearest resource\n nearest = min(context.nearby_resources, key=lambda r: r.get(\"distance\", float(\"inf\")))\n return {\"target_position\": nearest.get(\"position\", 
context.position)}\n elif tool_name == \"pickup\" and context.nearby_resources:\n nearest = min(context.nearby_resources, key=lambda r: r.get(\"distance\", float(\"inf\")))\n return {\"item_id\": nearest.get(\"name\", \"\")}\n return {}\n```\n\n### 2. Create tests in `tests/test_behavior.py`\n- Test that abstract methods are enforced\n- Test SimpleAgentBehavior parameter inference\n- Test lifecycle methods (on_episode_start, on_episode_end)\n- Test memory capacity enforcement\n\n### 3. Update `python/agent_runtime/__init__.py`\n```python\nfrom .behavior import AgentBehavior, SimpleAgentBehavior\n```\n\n## Acceptance Criteria\n- [ ] `AgentBehavior` ABC defined with all methods\n- [ ] `SimpleAgentBehavior` provides working parameter inference\n- [ ] Clear docstrings with examples\n- [ ] Unit tests pass\n- [ ] Type hints throughout\n\n## Dependencies\n- Phase 1: Core Schemas (#37)\n\n## Blocked By\n- #37\n\n## Blocks\n- Phase 5 (Built-in behaviors)\n- Phase 6 (User agents directory)","number":38,"repository":"JustInternetAI/AgentArena","title":"[Refactor] Phase 2: AgentBehavior Interface and Base Classes","type":"Issue","url":"https://github.com/JustInternetAI/AgentArena/issues/38"},"id":"PVTI_lADODG39W84BHw8kzgkG3OE","labels":["enhancement","architecture"],"repository":"https://github.com/JustInternetAI/AgentArena","status":"Done","title":"[Refactor] Phase 2: AgentBehavior Interface and Base Classes"},{"content":{"body":"## Summary\nCreate the AgentArena class that hides IPC complexity and provides a simple entry point for users to run their agents.\n\n## Why Fourth?\nDepends on Phases 1-3 (schemas, behavior, memory). This connects user code to the Godot simulation.\n\n## Deliverables\n\n### 1. 
Create python/agent_runtime/arena.py\n\n```python\nclass AgentArena:\n \"\"\"\n Main orchestrator connecting user agents to Godot simulation.\n \n Handles:\n - IPC connection management\n - Agent registration and lifecycle\n - Tick loop execution\n - Tool schema distribution\n \n Example:\n arena = AgentArena.connect()\n arena.register('agent_001', MyAgent())\n arena.run() # Blocks, handles tick loop\n \"\"\"\n \n @classmethod\n def connect(cls, host='127.0.0.1', port=5000) -> 'AgentArena':\n \"\"\"Connect to running Godot simulation.\"\"\"\n ...\n \n def register(self, agent_id: str, behavior: AgentBehavior) -> None:\n \"\"\"Register an agent behavior for the given ID.\"\"\"\n ...\n \n def run(self) -> None:\n \"\"\"Run the main tick loop (blocking).\"\"\"\n ...\n \n def run_async(self) -> None:\n \"\"\"Run tick loop in background thread.\"\"\"\n ...\n \n def stop(self) -> None:\n \"\"\"Stop the tick loop.\"\"\"\n ...\n```\n\n### 2. Update IPC Server\nModify existing ipc/server.py to:\n- Accept registered AgentBehavior instances\n- Convert IPC messages to Observation objects\n- Call behavior.decide() and convert AgentDecision back to IPC format\n- Handle lifecycle events (episode start/end)\n\n### 3. Create python/run_agent.py entry point\n```python\n# Simple entry point for users\nfrom agent_runtime import AgentArena\nfrom user_agents import MyAgent # User imports their agent\n\narena = AgentArena.connect()\narena.register('agent_001', MyAgent())\narena.run()\n```\n\n### 4. 
Update existing code\n- Refactor existing Agent class to use new schemas\n- Keep backward compatibility where possible\n- Add deprecation warnings for old patterns\n\n## Acceptance Criteria\n- [ ] AgentArena.connect() establishes IPC connection\n- [ ] Registered behaviors receive Observation objects\n- [ ] AgentDecision is correctly sent back to Godot\n- [ ] Lifecycle methods called at appropriate times\n- [ ] Clean error messages when connection fails\n- [ ] Backward compatible with existing IPC protocol\n\n## Dependencies\n- Phase 1: Core Schemas (#37)\n- Phase 2: AgentBehavior Interface (#38)\n- Phase 3: AgentMemory Interface (#39)\n\n## Blocks\n- Phase 5 (Built-in behaviors need arena for testing)\n- Phase 6 (User agents need arena to run)","number":40,"repository":"JustInternetAI/AgentArena","title":"[Refactor] Phase 4: AgentArena Orchestrator and IPC Integration","type":"Issue","url":"https://github.com/JustInternetAI/AgentArena/issues/40"},"id":"PVTI_lADODG39W84BHw8kzgkG3wE","labels":["enhancement","architecture"],"repository":"https://github.com/JustInternetAI/AgentArena","status":"Done","title":"[Refactor] Phase 4: AgentArena Orchestrator and IPC Integration"},{"content":{"body":"## Summary\nDefine the core data contracts that both the framework and user code will use. This is the foundation that everything else builds on.\n\n## Why First?\nAll other phases depend on having stable, well-defined schemas. Getting this right first prevents breaking changes later.\n\n## Deliverables\n\n### 1. 
Create `python/agent_runtime/schemas.py`\n\n```python\nfrom dataclasses import dataclass, field\nfrom typing import Any\n\n@dataclass\nclass EntityInfo:\n \"\"\"Information about a visible entity.\"\"\"\n id: str\n type: str\n position: tuple[float, float, float]\n distance: float\n metadata: dict = field(default_factory=dict)\n\n@dataclass\nclass ResourceInfo:\n \"\"\"Information about a nearby resource.\"\"\"\n name: str\n type: str\n position: tuple[float, float, float]\n distance: float\n\n@dataclass\nclass HazardInfo:\n \"\"\"Information about a nearby hazard.\"\"\"\n name: str\n type: str\n position: tuple[float, float, float]\n distance: float\n damage: float = 0.0\n\n@dataclass\nclass ItemInfo:\n \"\"\"Information about an inventory item.\"\"\"\n id: str\n name: str\n quantity: int = 1\n\n@dataclass\nclass ToolSchema:\n \"\"\"Schema for an available tool.\"\"\"\n name: str\n description: str\n parameters: dict # JSON Schema format\n \n def to_openai_format(self) -> dict:\n \"\"\"Convert to OpenAI function calling format.\"\"\"\n ...\n\n@dataclass\nclass Observation:\n \"\"\"What the agent receives from Godot each tick.\"\"\"\n agent_id: str\n tick: int\n position: tuple[float, float, float]\n rotation: tuple[float, float, float] | None = None\n velocity: tuple[float, float, float] | None = None\n visible_entities: list[EntityInfo] = field(default_factory=list)\n nearby_resources: list[ResourceInfo] = field(default_factory=list)\n nearby_hazards: list[HazardInfo] = field(default_factory=list)\n inventory: list[ItemInfo] = field(default_factory=list)\n health: float = 100.0\n energy: float = 100.0\n custom: dict = field(default_factory=dict)\n \n @classmethod\n def from_dict(cls, data: dict) -> \"Observation\":\n \"\"\"Create from IPC dictionary.\"\"\"\n ...\n \n def to_dict(self) -> dict:\n \"\"\"Convert to dictionary for serialization.\"\"\"\n ...\n\n@dataclass\nclass AgentDecision:\n \"\"\"What the agent returns to the framework.\"\"\"\n tool: str\n 
params: dict = field(default_factory=dict)\n reasoning: str | None = None\n \n @classmethod\n def from_llm_response(cls, response: str) -> \"AgentDecision\":\n \"\"\"Parse LLM JSON response into decision.\"\"\"\n ...\n \n @classmethod\n def idle(cls, reasoning: str = None) -> \"AgentDecision\":\n \"\"\"Create an idle decision.\"\"\"\n return cls(tool=\"idle\", params={}, reasoning=reasoning)\n \n def to_dict(self) -> dict:\n \"\"\"Convert to dictionary for IPC.\"\"\"\n ...\n\n@dataclass\nclass SimpleContext:\n \"\"\"Simplified context for beginners (Layer 1).\"\"\"\n position: tuple[float, float, float]\n nearby_resources: list[dict]\n nearby_hazards: list[dict]\n inventory: list[str]\n goal: str | None = None\n tick: int = 0\n \n @classmethod\n def from_observation(cls, obs: Observation, goal: str = None) -> \"SimpleContext\":\n \"\"\"Create simplified context from full observation.\"\"\"\n ...\n```\n\n### 2. Create tests in `tests/test_schemas.py`\n- Test serialization/deserialization round-trips\n- Test `from_dict` / `to_dict` methods\n- Test `from_llm_response` parsing with various formats\n- Test `SimpleContext.from_observation` conversion\n\n### 3. 
Update `python/agent_runtime/__init__.py`\nExport all public schemas:\n```python\nfrom .schemas import (\n Observation,\n AgentDecision,\n SimpleContext,\n ToolSchema,\n EntityInfo,\n ResourceInfo,\n HazardInfo,\n ItemInfo,\n)\n```\n\n## Acceptance Criteria\n- [ ] All dataclasses defined with type hints\n- [ ] Serialization methods work correctly\n- [ ] LLM response parsing handles edge cases (malformed JSON, missing fields)\n- [ ] Unit tests pass\n- [ ] No changes to existing IPC code yet (that's Phase 4)\n\n## Dependencies\nNone - this is the first phase.\n\n## Blocked By\nNothing\n\n## Blocks\n- Phase 2 (AgentBehavior interface)\n- Phase 3 (Memory interface)\n- Phase 4 (IPC integration)","number":37,"repository":"JustInternetAI/AgentArena","title":"[Refactor] Phase 1: Core Schemas and Data Contracts","type":"Issue","url":"https://github.com/JustInternetAI/AgentArena/issues/37"},"id":"PVTI_lADODG39W84BHw8kzgkG3E0","labels":["enhancement","architecture"],"repository":"https://github.com/JustInternetAI/AgentArena","status":"Done","title":"[Refactor] Phase 1: Core Schemas and Data Contracts"},{"content":{"body":"## Summary\r\nCreate a simple example agent to validate that the entire agent runtime architecture (Phases 1-4) works correctly end-to-end.\r\n\r\n## Why This Issue?\r\nWe've completed the core infrastructure:\r\n- Phase 1: Schemas and data contracts ✅\r\n- Phase 2: Behavior interfaces ✅\r\n- Phase 3: Memory systems ✅\r\n- Phase 4: Arena orchestrator and IPC integration ✅\r\n\r\nBut we haven't validated that everything works together. This issue creates a simple test agent to verify the full pipeline works before building more complex features.\r\n\r\n## Deliverables\r\n\r\n### 1. Create `python/user_agents/examples/simple_forager.py`\r\n\r\nA basic rule-based foraging agent demonstrating the full AgentBehavior interface.\r\n\r\n### 2. 
Create `python/user_agents/examples/simple_forager_simple.py`\r\n\r\nA SimpleAgentBehavior example for beginners showing the simplified interface.\r\n\r\n### 3. Update `python/run_agent.py`\r\n\r\nAdd example imports and registration (commented out by default).\r\n\r\n### 4. Create `python/user_agents/examples/__init__.py`\r\n\r\nPackage initialization to export example agents.\r\n\r\n### 5. Create test script `python/test_simple_agent.py`\r\n\r\nStandalone script to test agent with mock observations (no Godot required).\r\n\r\n### 6. Create `docs/quickstart_agent.md`\r\n\r\nQuick guide showing users how to create and run their first agent.\r\n\r\n## Testing Plan\r\n\r\n1. Run standalone test: `python python/test_simple_agent.py`\r\n2. Test with IPC server (if available): Uncomment agent in `run_agent.py` and start server\r\n3. Verify agent receives observations and makes decisions\r\n4. Check that decisions are converted to actions correctly\r\n\r\n## Acceptance Criteria\r\n\r\n- [ ] SimpleForager agent implemented with full control\r\n- [ ] SimpleForagerSimple agent implemented for beginners\r\n- [ ] Both agents make sensible decisions based on observations\r\n- [ ] Standalone test script passes\r\n- [ ] Example usage documented in run_agent.py\r\n- [ ] Quickstart guide created\r\n- [ ] No errors when importing agents\r\n\r\n## Dependencies\r\n\r\n- Phase 1: Core Schemas (#37) ✅\r\n- Phase 2: AgentBehavior Interface (#38) ✅\r\n- Phase 3: AgentMemory Interface (#39) ✅\r\n- Phase 4: AgentArena and IPC (#40) ✅\r\n\r\n## Value\r\n\r\nThis validates our entire architecture works and provides:\r\n- Working examples for users to learn from\r\n- Proof that the framework can support real agents\r\n- Foundation for building more complex agents\r\n- Early detection of integration issues\r\n","number":41,"repository":"JustInternetAI/AgentArena","title":"[Testing] Create Simple Test Agent to Validate Phase 1-4 
Implementation","type":"Issue","url":"https://github.com/JustInternetAI/AgentArena/issues/41"},"id":"PVTI_lADODG39W84BHw8kzgkKh8A","labels":["enhancement","testing"],"repository":"https://github.com/JustInternetAI/AgentArena","status":"Done","title":"[Testing] Create Simple Test Agent to Validate Phase 1-4 Implementation"},{"assignees":["AndrewBMadison"],"content":{"body":"# Long-Term Memory with Vector Store (FAISS)\r\n\r\n## Description\r\n\r\nImplement RAG-based long-term memory system with FAISS vector store for episodic memory and experience retrieval.\r\n\r\nThis implements the **core memory infrastructure** that will be exposed to agents via MCP (Issue #18). The memory system provides:\r\n- Vector-based similarity search for retrieving relevant past experiences\r\n- Episodic memory storage with embeddings\r\n- Persistence for saving/loading memory across sessions\r\n\r\n## Architecture\r\n\r\n```\r\n┌─────────────────────────────────────────────────┐\r\n│ Agent Decision Loop │\r\n└────────────────┬────────────────────────────────┘\r\n │\r\n │ (Future: via MCP - Issue #18)\r\n ▼\r\n┌─────────────────────────────────────────────────┐\r\n│ python/memory/long_term_memory.py │\r\n│ ┌───────────────────────────────────────────┐ │\r\n│ │ LongTermMemory │ │\r\n│ │ - store_memory(text, metadata) │ │\r\n│ │ - query_memory(query, k=5) │ │\r\n│ │ - recall_by_id(memory_id) │ │\r\n│ │ - save() / load() │ │\r\n│ └───────────────┬───────────────────────────┘ │\r\n│ │ │\r\n│ ┌────────┴────────┐ │\r\n│ │ │ │\r\n│ ┌────▼─────┐ ┌─────▼──────┐ │\r\n│ │ FAISS │ │ Sentence │ │\r\n│ │ Index │ │Transformers│ │\r\n│ └──────────┘ └────────────┘ │\r\n└─────────────────────────────────────────────────┘\r\n```\r\n\r\n## Tasks\r\n\r\n### Core Implementation\r\n- [ ] Create `python/memory/` module structure\r\n- [ ] Create `python/memory/long_term_memory.py`\r\n- [ ] Implement `LongTermMemory` class with methods:\r\n - [ ] `store_memory(text: str, metadata: dict) -> str` - Store experience 
with ID\r\n - [ ] `query_memory(query: str, k: int = 5) -> list[dict]` - Similarity search\r\n - [ ] `recall_by_id(memory_id: str) -> dict` - Direct retrieval\r\n - [ ] `get_all_memories() -> list[dict]` - Full memory dump\r\n - [ ] `clear_memories()` - Reset memory store\r\n\r\n### Vector Store Integration\r\n- [ ] Integrate FAISS for vector storage (`faiss-cpu` or `faiss-gpu`)\r\n- [ ] Use sentence-transformers for embeddings (recommend: `all-MiniLM-L6-v2`)\r\n- [ ] Implement efficient similarity search with cosine distance\r\n- [ ] Support incremental index updates (add memories without full rebuild)\r\n\r\n### Persistence\r\n- [ ] Implement `save(filepath: str)` - Save FAISS index + metadata\r\n- [ ] Implement `load(filepath: str)` - Load from disk\r\n- [ ] Use pickle or JSON for metadata storage\r\n- [ ] Handle versioning for future compatibility\r\n\r\n### Testing & Validation\r\n- [ ] Create unit tests in `tests/test_long_term_memory.py`:\r\n - [ ] Test memory storage and retrieval\r\n - [ ] Test similarity search accuracy\r\n - [ ] Test save/load persistence\r\n - [ ] Test with varying embedding dimensions\r\n- [ ] Benchmark performance:\r\n - [ ] Query latency with 1K, 10K, 100K memories\r\n - [ ] Storage size per memory\r\n - [ ] Index build time\r\n\r\n### Configuration\r\n- [ ] Add memory config in `configs/memory/long_term.yaml`:\r\n ```yaml\r\n memory:\r\n type: faiss\r\n embedding_model: \"all-MiniLM-L6-v2\"\r\n embedding_dim: 384\r\n index_type: \"Flat\" # or \"IVF\" for larger datasets\r\n persist_path: \"./data/memory/long_term.faiss\"\r\n ```\r\n\r\n### Documentation\r\n- [ ] Document API in docstrings\r\n- [ ] Add usage examples in `docs/memory_system.md`\r\n- [ ] Document embedding model choices and tradeoffs\r\n\r\n## Component\r\n\r\nMemory\r\n\r\n## Size\r\n\r\nL (3-5 days)\r\n\r\n## Priority\r\n\r\nHigh\r\n\r\n## Dependencies\r\n\r\n- **Prerequisite for Issue #18**: MCP integration will wrap this memory system\r\n- Uses 
`sentence-transformers` and `faiss-cpu` (add to requirements.txt)\r\n\r\n## Relationship to Issue #18 (MCP Integration)\r\n\r\nThis issue implements the **underlying memory infrastructure**. Issue #18 will:\r\n1. Create `python/mcp_servers/memory_server.py` that **uses** `LongTermMemory`\r\n2. Expose MCP tools that call these methods:\r\n - `query_memory` tool → calls `LongTermMemory.query_memory()`\r\n - `store_memory` tool → calls `LongTermMemory.store_memory()`\r\n - `recall_episode` tool → calls `LongTermMemory.recall_by_id()`\r\n3. Expose MCP resources (e.g., `memory://episodes`)\r\n\r\n**Development Order**: Complete Issue #8 first, then Issue #18 can wrap it with MCP.\r\n\r\n## Success Criteria\r\n\r\n- [ ] Can store 10,000+ memories without performance degradation\r\n- [ ] Query latency <50ms for similarity search (k=5) on 1K memories\r\n- [ ] Memories persist across sessions (save/load works)\r\n- [ ] Retrieves semantically similar memories (not just keyword match)\r\n- [ ] All unit tests pass\r\n- [ ] Memory system works independently (can be tested without MCP)\r\n\r\n## Example Usage\r\n\r\n```python\r\nfrom memory.long_term_memory import LongTermMemory\r\n\r\n# Initialize\r\nmemory = LongTermMemory(\r\n embedding_model=\"all-MiniLM-L6-v2\",\r\n persist_path=\"./data/memory.faiss\"\r\n)\r\n\r\n# Store experience\r\nmemory_id = memory.store_memory(\r\n text=\"I collected 5 berries near the forest edge and avoided the fire hazard.\",\r\n metadata={\r\n \"episode\": 42,\r\n \"outcome\": \"success\",\r\n \"reward\": 25.0,\r\n \"timestamp\": \"2025-01-15T10:30:00Z\"\r\n }\r\n)\r\n\r\n# Query similar experiences\r\nsimilar = memory.query_memory(\r\n query=\"How do I avoid hazards while collecting resources?\",\r\n k=3\r\n)\r\n\r\nfor mem in similar:\r\n print(f\"Memory: {mem['text']}\")\r\n print(f\"Similarity: {mem['score']}\")\r\n print(f\"Metadata: {mem['metadata']}\")\r\n\r\n# Save to disk\r\nmemory.save(\"./data/agent_001_memory.faiss\")\r\n\r\n# Load 
later\r\nmemory.load(\"./data/agent_001_memory.faiss\")\r\n```\r\n\r\n## Notes\r\n\r\n- Start with `faiss.IndexFlatL2` for simplicity, can upgrade to IVF/HNSW later for scale\r\n- Embedding model choice affects:\r\n - **all-MiniLM-L6-v2**: Fast, 384D, good for most use cases\r\n - **all-mpnet-base-v2**: Slower, 768D, higher quality\r\n- Consider async/await for embedding generation if it becomes a bottleneck\r\n- Future: Support multiple embedding models per agent (multi-modal)\r\n","number":8,"repository":"JustInternetAI/AgentArena","title":"Long-Term Memory with Vector Store (FAISS)","type":"Issue","url":"https://github.com/JustInternetAI/AgentArena/issues/8"},"id":"PVTI_lADODG39W84BHw8kzghADcg","labels":["enhancement","memory","high-priority"],"linked pull requests":["https://github.com/JustInternetAI/AgentArena/pull/42"],"repository":"https://github.com/JustInternetAI/AgentArena","status":"Done","title":"Long-Term Memory with Vector Store (FAISS)"},{"assignees":["justinmadison","AndrewBMadison"],"content":{"body":"## Priority: High | Size: M | Component: Agent Runtime / Backends\n**Blocking**: LLM agent integration with foraging scene\n\n## Problem Statement\n\nThe codebase has two separate agent systems that aren't connected:\n\n1. **AgentBehavior system** (`python/agent_runtime/behavior.py`)\n - Classes: `AgentBehavior`, `SimpleAgentBehavior`, `LLMAgentBehavior`\n - Used by: IPC server's `/observe` endpoint via `server.behaviors` dict\n - Learner-facing API with three tiers (beginner/intermediate/advanced)\n\n2. 
**Local LLM Backend system** (`python/backends/`)\n - Classes: `LlamaCppBackend`, `VLLMBackend`\n - Used by: `run_ipc_server_with_gpu.py` via `Agent` class\n - GPU-accelerated local inference\n\n### Current Flow (Broken)\n```\nGodot foraging scene\n ↓ POST /observe (sends observation)\npython/ipc/server.py line 424-475\n ↓ checks server.behaviors dict → EMPTY\n ↓ falls back to mock rule-based logic\n (LlamaCppBackend never gets called)\n```\n\n## Goal\n\nCreate `LocalLLMBehavior` class that wraps local backends (LlamaCppBackend, VLLMBackend) and implements the `AgentBehavior` interface, so local LLMs can power agents via the `/observe` endpoint.\n\n---\n\n## Architecture Context\n\nThe IPC server (`python/ipc/server.py`) has a `behaviors` dict that maps `agent_id → AgentBehavior`. When `/observe` receives an observation:\n- Line 424-426: Gets agent_id from observation\n- Line 427: Checks `behavior = self.behaviors.get(agent_id)`\n- Line 428-471: If behavior exists, calls `behavior.decide(observation, tools)`\n- Line 472-475: If no behavior, falls back to `_make_mock_decision()`\n\nThe `LLMAgentBehavior` class (line 281-422 in behavior.py) already supports cloud LLMs (Anthropic, OpenAI, Ollama) but NOT the local backends (LlamaCppBackend, VLLMBackend).\n\n---\n\n## Files to Understand First\n\n1. `python/agent_runtime/behavior.py` - Base classes, especially `LLMAgentBehavior`\n2. `python/backends/base.py` - `BaseBackend` interface with `generate()` and `generate_with_tools()`\n3. `python/backends/llama_cpp_backend.py` - Local GPU inference implementation\n4. `python/backends/vllm_backend.py` - vLLM server client\n5. `python/ipc/server.py` - See `/observe` endpoint (line 397-493) and `create_server()` function\n6. `python/user_agents/examples/llm_forager.py` - Example of LLMAgentBehavior subclass\n7. 
`python/scenarios/foraging.py` - Scenario definition with `to_system_prompt()` method\n\n---\n\n## Implementation Tasks\n\n- [ ] **Create `python/agent_runtime/local_llm_behavior.py`** with `LocalLLMBehavior` class\n- [ ] **Add factory function** `create_local_llm_behavior()` for easy creation\n- [ ] **Update `python/ipc/server.py`** `create_server()` with `default_behavior` parameter\n- [ ] **Create `python/run_local_llm_forager.py`** startup script\n- [ ] **Integrate scenario system prompts** from `python/scenarios/foraging.py`\n- [ ] **Add to `__init__.py` exports**\n- [ ] **Write tests** in `tests/test_local_llm_behavior.py`\n\n---\n\n## Reference: Backend API\n\n```python\nclass BaseBackend(ABC):\n @abstractmethod\n def generate(self, prompt: str, temperature: float | None, max_tokens: int | None) -> GenerationResult:\n pass\n\n @abstractmethod\n def generate_with_tools(self, prompt: str, tools: list[dict], temperature: float | None) -> GenerationResult:\n pass\n\n@dataclass\nclass GenerationResult:\n text: str\n tokens_used: int\n finish_reason: str\n metadata: dict[str, Any]\n```\n\n---\n\n## Agent ID Matching\n\nThe Godot foraging scene uses `SimpleAgent` which has an `agent_id` property:\n- If set in scene: uses that value\n- If empty: auto-generates `\"agent_\" + timestamp`\n\n**Recommendation**: Set `agent_id = \"forager_001\"` in the Godot scene's SimpleAgent node for explicit control.\n\n---\n\n## Acceptance Criteria\n\n- [ ] `LocalLLMBehavior` implements full `AgentBehavior` interface\n- [ ] Works with both `LlamaCppBackend` and `VLLMBackend`\n- [ ] Integrates with scenario system prompts (`to_system_prompt()`)\n- [ ] `run_local_llm_forager.py` successfully runs foraging scene with local LLM\n- [ ] Agent makes reasonable decisions (moves to resources, avoids hazards)\n- [ ] Graceful error handling (returns idle on LLM failures)\n- [ ] All tests pass\n- [ ] Pre-commit hooks pass (black, ruff, mypy)\n\n---\n\n## Testing Instructions\n\n1. 
Download a GGUF model (e.g., `llama-2-7b-chat.Q4_K_M.gguf`)\n2. Run: `python run_local_llm_forager.py --model path/to/model.gguf`\n3. Open Godot and run the foraging scene\n4. Observe agent behavior in console logs\n5. Verify decisions are LLM-generated (not mock logic)\n\n---\n\n## Related Files\n\n| File | Purpose |\n|------|---------|\n| `python/agent_runtime/behavior.py` | Base classes to extend |\n| `python/backends/llama_cpp_backend.py` | Local GPU backend |\n| `python/backends/vllm_backend.py` | vLLM server backend |\n| `python/backends/base.py` | Backend interface |\n| `python/ipc/server.py` | IPC server with /observe endpoint |\n| `python/scenarios/foraging.py` | System prompt source |\n| `python/user_agents/examples/llm_forager.py` | Reference implementation |\n| `scripts/simple_agent.gd` | Godot agent (sends agent_id) |\n| `scripts/base_scene_controller.gd` | Sends observations to /observe |\n\n---\n\n📄 **Full details also in**: `docs/backlog_items.md` (B-31)","number":43,"repository":"JustInternetAI/AgentArena","title":"B-31: LocalLLMBehavior - Bridge Local Backends to AgentBehavior API","type":"Issue","url":"https://github.com/JustInternetAI/AgentArena/issues/43"},"id":"PVTI_lADODG39W84BHw8kzgkbJ4E","labels":["enhancement"],"repository":"https://github.com/JustInternetAI/AgentArena","status":"Done","title":"B-31: LocalLLMBehavior - Bridge Local Backends to AgentBehavior API"},{"content":{"body":"# Add MCP Layer for External Context and Memory Systems\r\n\r\n## Overview\r\n\r\nIntegrate the Model Context Protocol (MCP) to provide Agent Arena agents with access to external context sources (memory, knowledge bases, external APIs) while maintaining our existing Godot ↔ Python IPC for simulation tools.\r\n\r\n## Motivation\r\n\r\nCurrently, Agent Arena uses a custom `ToolDispatcher` for all tools, including both:\r\n1. **Simulation tools** (move_to, collect, craft) - tightly coupled to Godot\r\n2. 
**Context tools** (memory queries, knowledge retrieval) - could benefit from standardization\r\n\r\nBy adopting MCP for external context sources, we gain:\r\n- **Standardization**: Compatible with Claude, GPT, and other LLM tooling\r\n- **Modularity**: Easy to add new context sources without modifying core agent code\r\n- **Separation of Concerns**: Simulation tools vs. context/memory tools\r\n- **Ecosystem Access**: Leverage existing MCP servers (databases, APIs, etc.)\r\n\r\n## Architecture\r\n\r\n### Current System\r\n```\r\n┌─────────────────┐\r\n│ Godot Engine │\r\n└────────┬────────┘\r\n │ HTTP/JSON (Custom IPC)\r\n │\r\n┌────────▼────────┐\r\n│ FastAPI Server │\r\n└────────┬────────┘\r\n │\r\n┌────────▼────────┐\r\n│ Agent Runtime │\r\n│ ToolDispatcher │ ← All tools (simulation + context)\r\n└─────────────────┘\r\n```\r\n\r\n### Proposed System\r\n```\r\n┌─────────────────┐\r\n│ Godot Engine │\r\n└────────┬────────┘\r\n │ HTTP/JSON (Custom IPC - UNCHANGED)\r\n │\r\n┌────────▼────────────────────────────────────┐\r\n│ Python Agent Runtime │\r\n│ ┌──────────────┐ ┌─────────────────┐ │\r\n│ │ToolDispatcher│ │ MCP Client │ │\r\n│ │(Godot Tools) │ │ │ │\r\n│ └──────────────┘ └────────┬────────┘ │\r\n│ │ │ │\r\n│ Simulation ┌──────┴─────────┐ │\r\n│ Tools Only │ │ │\r\n│ ┌────▼───┐ ┌────▼───┐\r\n│ │ MCP │ │ MCP │\r\n│ │ Server │ │ Server │\r\n│ │ (RAG) │ │ (Docs) │\r\n│ └────────┘ └────────┘\r\n│ │ │\r\n│ Vector DB Knowledge\r\n│ (FAISS) Base\r\n└─────────────────────────────────────────────┘\r\n```\r\n\r\n## Implementation Plan\r\n\r\n### Phase 1: MCP Infrastructure Setup\r\n- [ ] Add MCP SDK to Python dependencies (`pip install mcp anthropic-mcp`)\r\n- [ ] Create `python/mcp_servers/` directory structure\r\n- [ ] Create base MCP server template/utilities\r\n- [ ] Update documentation with MCP architecture diagrams\r\n\r\n### Phase 2: Memory MCP Server\r\n- [ ] Implement `python/mcp_servers/memory_server.py`\r\n - [ ] `query_memory` tool - RAG retrieval from 
vector store\r\n - [ ] `store_memory` tool - Save episode experiences\r\n - [ ] `recall_episode` tool - Retrieve specific past episodes\r\n - [ ] Resource: `memory://episodes` - Access to episode history\r\n- [ ] Integrate with existing FAISS/Milvus infrastructure\r\n- [ ] Add embedding generation (sentence-transformers)\r\n- [ ] Create unit tests for memory server\r\n\r\n### Phase 3: Knowledge Base MCP Server\r\n- [ ] Implement `python/mcp_servers/knowledge_server.py`\r\n - [ ] `get_recipe` tool - Retrieve crafting recipes\r\n - [ ] `query_rules` tool - Get game rules and mechanics\r\n - [ ] `search_strategy` tool - Find strategy guides\r\n - [ ] Resource: `knowledge://recipes` - All crafting recipes\r\n - [ ] Resource: `knowledge://rules` - Game rules\r\n- [ ] Populate initial knowledge base with:\r\n - [ ] Crafting recipes from benchmark scenes\r\n - [ ] Game mechanics documentation\r\n - [ ] Basic strategy guides\r\n- [ ] Create unit tests for knowledge server\r\n\r\n### Phase 4: Agent Integration\r\n- [ ] Modify `python/agent_runtime/agent.py`:\r\n - [ ] Add MCP client session management\r\n - [ ] Connect to MCP servers on agent initialization\r\n - [ ] Implement `_get_all_tools()` to combine Godot + MCP tools\r\n - [ ] Add routing logic to distinguish tool types:\r\n - Godot tools → Execute via ToolDispatcher → Send to Godot via IPC\r\n - MCP tools → Execute via MCP client session\r\n - [ ] Update `_build_context()` to include MCP resources\r\n- [ ] Add configuration for MCP servers in `configs/`\r\n- [ ] Update LLM prompts to distinguish tool categories\r\n\r\n### Phase 5: Testing & Validation\r\n- [ ] Create integration tests:\r\n - [ ] Agent with both Godot and MCP tools\r\n - [ ] Memory storage and retrieval flow\r\n - [ ] Knowledge base queries during decision-making\r\n- [ ] Add example scenarios:\r\n - [ ] Agent recalls past foraging strategy\r\n - [ ] Agent queries crafting recipe before attempting craft\r\n - [ ] Agent stores episode summary after 
completion\r\n- [ ] Performance benchmarking:\r\n - [ ] Measure MCP tool call latency\r\n - [ ] Compare with current ToolDispatcher performance\r\n - [ ] Ensure <50ms overhead for context queries\r\n\r\n### Phase 6: Documentation & Examples\r\n- [ ] Create `docs/mcp_integration.md`:\r\n - [ ] Architecture overview\r\n - [ ] How to create new MCP servers\r\n - [ ] Tool routing logic\r\n - [ ] Configuration guide\r\n- [ ] Add code examples:\r\n - [ ] Creating a custom MCP server\r\n - [ ] Registering MCP server with agent\r\n - [ ] Querying MCP resources from agent context\r\n- [ ] Update `docs/architecture.md` with MCP layer\r\n- [ ] Create tutorial: \"Building Your First MCP Server for Agent Arena\"\r\n\r\n## Technical Details\r\n\r\n### MCP Servers to Implement\r\n\r\n#### 1. Memory Server (`mcp_servers/memory_server.py`)\r\n**Purpose**: RAG-based episodic memory and experience retrieval\r\n\r\n**Tools**:\r\n- `query_memory(query: str, k: int = 5)` → Similar past experiences\r\n- `store_memory(episode: str, outcome: str, metadata: dict)` → Save experience\r\n- `recall_episode(episode_id: str)` → Retrieve specific episode\r\n\r\n**Resources**:\r\n- `memory://episodes` - All stored episodes\r\n- `memory://summaries` - Episode summaries\r\n\r\n**Backend**: FAISS vector store + sentence-transformers\r\n\r\n#### 2. 
Knowledge Base Server (`mcp_servers/knowledge_server.py`)\r\n**Purpose**: Static game knowledge (recipes, rules, strategies)\r\n\r\n**Tools**:\r\n- `get_recipe(item_name: str)` → Crafting recipe\r\n- `query_rules(topic: str)` → Game mechanics\r\n- `search_strategy(scenario: str)` → Strategy recommendations\r\n\r\n**Resources**:\r\n- `knowledge://recipes` - All crafting recipes\r\n- `knowledge://rules` - Game rules and mechanics\r\n- `knowledge://strategies` - Strategy guides\r\n\r\n**Backend**: JSON files or lightweight database\r\n\r\n### Tool Routing Logic\r\n\r\n```python\r\nasync def execute_tool(self, tool_name: str, params: dict):\r\n    # Godot simulation tools\r\n    if tool_name in [\"move_to\", \"collect\", \"craft\", \"rotate_to\", \"capture_point\"]:\r\n        return self.godot_tools.execute_tool(tool_name, params)\r\n\r\n    # MCP tools (format: \"server:tool_name\")\r\n    elif \":\" in tool_name:\r\n        server_name, mcp_tool_name = tool_name.split(\":\", 1)\r\n        if server_name in self.mcp_sessions:\r\n            return await self.mcp_sessions[server_name].call_tool(mcp_tool_name, params)\r\n\r\n    raise ValueError(f\"Unknown tool: {tool_name}\")\r\n```\r\n\r\n### Configuration Example\r\n\r\n```yaml\r\n# configs/mcp_servers.yaml\r\nmcp_servers:\r\n  memory:\r\n    command: python\r\n    args:\r\n      - \"-m\"\r\n      - \"mcp_servers.memory_server\"\r\n    env:\r\n      VECTOR_DB_PATH: \"./data/memory.faiss\"\r\n      EMBEDDING_MODEL: \"all-MiniLM-L6-v2\"\r\n\r\n  knowledge:\r\n    command: python\r\n    args:\r\n      - \"-m\"\r\n      - \"mcp_servers.knowledge_server\"\r\n    env:\r\n      KNOWLEDGE_BASE_PATH: \"./data/knowledge/\"\r\n```\r\n\r\n## Non-Goals (Out of Scope)\r\n\r\n- **NOT replacing Godot ↔ Python IPC**: Our custom HTTP/JSON protocol stays\r\n- **NOT migrating simulation tools to MCP**: Tools like `move_to`, `collect` remain in ToolDispatcher\r\n- **NOT using MCP for real-time simulation control**: Only for context/memory queries\r\n\r\n## Success Criteria\r\n\r\n- [ ] Agents can query memory using MCP `query_memory` 
tool\r\n- [ ] Agents can retrieve recipes using MCP `get_recipe` tool\r\n- [ ] MCP tool calls add <50ms latency overhead\r\n- [ ] All existing Godot tools continue to work unchanged\r\n- [ ] New MCP servers can be added without modifying agent code\r\n- [ ] Documentation explains when to use MCP vs ToolDispatcher\r\n\r\n## Open Questions\r\n\r\n1. Should we use stdio or SSE for MCP transport? (Recommend: stdio for simplicity)\r\n2. How many recent memories should agents auto-retrieve per tick?\r\n3. Should MCP servers run as separate processes or in-process?\r\n4. Do we need an MCP server for external APIs (weather, player stats)?\r\n\r\n## References\r\n\r\n- [MCP Documentation](https://modelcontextprotocol.io/)\r\n- [MCP Python SDK](https://github.com/modelcontextprotocol/python-sdk)\r\n- Agent Arena Architecture: `docs/architecture.md`\r\n- Current IPC Protocol: `docs/ipc_protocol.md`\r\n\r\n## Related Issues\r\n\r\n- Phase 3 Roadmap: Memory system (short-term + vector store)\r\n- Phase 4 Roadmap: First benchmark scene (foraging)\r\n\r\n---\r\n\r\n**Labels**: enhancement, architecture, memory-system, phase-3\r\n**Milestone**: Phase 3 - Memory System\r\n","number":18,"repository":"JustInternetAI/AgentArena","title":"Add MCP Layer for External Context and Memory Systems","type":"Issue","url":"https://github.com/JustInternetAI/AgentArena/issues/18"},"id":"PVTI_lADODG39W84BHw8kzghdJbI","labels":["enhancement","architecture"],"repository":"https://github.com/JustInternetAI/AgentArena","status":"Backlog","title":"Add MCP Layer for External Context and Memory Systems"},{"content":{"body":"## Problem\nCurrently, each agent makes an individual LLM call during `AgentRuntime.process_tick()`. 
While agents are processed concurrently via ThreadPoolExecutor, their LLM requests are not batched together, which underutilizes vLLM's continuous batching capabilities.\n\n## Current Flow\n```python\n# python/agent_runtime/runtime.py:87\nfor agent_id, agent in self.agents.items():\n    task = asyncio.create_task(self._agent_decide(agent))\n    # Each agent makes individual LLM call\n```\n\n## Proposed Solution\nImplement batch LLM generation in `AgentRuntime.process_tick()`:\n\n1. **Collect all agent contexts first:**\n   ```python\n   contexts = {}\n   for agent_id, agent in self.agents.items():\n       contexts[agent_id] = agent._build_context()\n   ```\n\n2. **Send all prompts to vLLM together:**\n   ```python\n   if isinstance(self.backend, VLLMBackend):\n       results = await self.backend.generate_batch(contexts)\n   else:\n       # Fallback to concurrent individual calls\n       results = await self._concurrent_llm_calls(contexts)\n   ```\n\n3. **Parse results into actions (look up each agent by id):**\n   ```python\n   actions = {}\n   for agent_id, result in results.items():\n       actions[agent_id] = self.agents[agent_id]._parse_action(result)\n   ```\n\n## Expected Impact\n- **50-70% faster LLM inference** with 4+ agents\n- Better utilization of vLLM's continuous batching (PagedAttention)\n- With 4 agents: ~1.5x time of single agent instead of 4x\n\n## Files to Modify\n- `python/agent_runtime/runtime.py` - Implement batch processing in `process_tick()`\n- `python/backends/vllm_backend.py` - Add `generate_batch()` method\n- `python/backends/base.py` - Add abstract `generate_batch()` interface\n- `python/backends/llama_cpp_backend.py` - Implement fallback (no native batching)\n\n## References\n- vLLM Continuous Batching: https://docs.vllm.ai/en/latest/serving/engine_args.html\n- Current implementation: `python/agent_runtime/runtime.py:66-100`\n- Backend interface: `python/backends/base.py:68-86`\n\n## Priority\n**HIGH** - This is the primary bottleneck in multi-agent scenarios","number":22,"repository":"JustInternetAI/AgentArena","title":"Implement LLM 
request batching for concurrent agent decisions","type":"Issue","url":"https://github.com/JustInternetAI/AgentArena/issues/22"},"id":"PVTI_lADODG39W84BHw8kzghqIvs","repository":"https://github.com/JustInternetAI/AgentArena","status":"Backlog","title":"Implement LLM request batching for concurrent agent decisions"},{"content":{"body":"## Problem\nTool execution in `ToolDispatcher` is fully synchronous and blocking. This means:\n- Multiple concurrent tool requests block each other\n- FastAPI cannot handle multiple tool requests in parallel\n- Long-running tools block the entire IPC server thread\n\n## Current Implementation\n```python\n# python/agent_runtime/tool_dispatcher.py:103\ndef execute_tool(self, name: str, parameters: dict[str, Any]) -> dict[str, Any]:\n    # Synchronous blocking call\n    result = self.tools[name](**parameters)\n    return {\"success\": True, \"result\": result}\n```\n\n## Proposed Solution\nMake the tool dispatcher async-compatible:\n\n```python\nasync def execute_tool(self, name: str, parameters: dict[str, Any]) -> dict[str, Any]:\n    \"\"\"Execute tool asynchronously if possible.\"\"\"\n    tool_func = self.tools[name]\n\n    # Check if tool function is async\n    if asyncio.iscoroutinefunction(tool_func):\n        result = await tool_func(**parameters)\n    else:\n        # Run sync function in an executor to avoid blocking; run_in_executor\n        # only forwards positional args, so bind kwargs via functools.partial\n        loop = asyncio.get_running_loop()\n        result = await loop.run_in_executor(None, functools.partial(tool_func, **parameters))\n\n    return {\"success\": True, \"result\": result}\n```\n\n## Benefits\n1. **Concurrent tool execution** - Multiple agents can execute tools simultaneously\n2. **Non-blocking I/O** - Tools with I/O operations don't block other requests\n3. **Better FastAPI utilization** - Server can handle multiple tool requests concurrently\n4. 
**Backwards compatible** - Existing sync tools continue to work\n\n## Files to Modify\n- `python/agent_runtime/tool_dispatcher.py` - Convert `execute_tool()` to async\n- `python/ipc/server.py:234` - Already async endpoint, just needs await\n- Tool implementations - Can optionally be converted to async for better performance\n\n## Migration Path\n1. Make `ToolDispatcher.execute_tool()` async with fallback to executor\n2. Update IPC server endpoint to await the async call\n3. Gradually convert individual tools to async where beneficial (file I/O, network calls, etc.)\n\n## Expected Impact\n- **MEDIUM-HIGH** - Allows FastAPI to handle multiple tool requests concurrently\n- Prevents long-running tools from blocking other operations\n- Enables future optimizations for I/O-bound tools\n\n## Current Code References\n- `python/agent_runtime/tool_dispatcher.py:76-116` - Current sync implementation\n- `python/ipc/server.py:213-256` - Tool execution endpoint\n- `godot/src/agent_arena.cpp:619-649` - C++ client request queue\n\n## Priority\n**MEDIUM** - Quality of life improvement, becomes important with concurrent tool execution","number":23,"repository":"JustInternetAI/AgentArena","title":"Convert tool execution to async for concurrent tool handling","type":"Issue","url":"https://github.com/JustInternetAI/AgentArena/issues/23"},"id":"PVTI_lADODG39W84BHw8kzghqIyI","repository":"https://github.com/JustInternetAI/AgentArena","status":"Backlog","title":"Convert tool execution to async for concurrent tool handling"},{"content":{"body":"## Description\nThe crafting chain benchmark scene needs to be populated with base resources, crafting stations, and a functional multi-step crafting system.\n\n## Parent Issue\nSplit from #15 - Build out benchmark scenes with game content\n\n## Scene Overview\n**Goal**: Agent must gather base resources and use crafting stations to create complex items through multi-step recipes\n\n**Target Item**: iron_sword (requires iron_rod + 
wooden_handle)\n\n**Current State**: Scene structure exists with SceneController base class and recipe system defined\n\n## Crafting Recipe Tree\n```\niron_sword (target)\n├── iron_rod (requires: iron_ingot)\n│ └── iron_ingot (requires: iron_ore + coal)\n└── wooden_handle (requires: wood x2)\n```\n\n## Tasks\n- [ ] Design world layout with distinct zones\n - [ ] Resource gathering area\n - [ ] Crafting station area\n- [ ] Add base resource nodes\n - [ ] Place 3+ iron_ore deposits\n - [ ] Place 3+ wood piles\n - [ ] Place 2+ coal deposits\n- [ ] Add crafting station nodes\n - [ ] Place Furnace (for smelting iron_ingot)\n - [ ] Place Anvil (for forging iron_rod and iron_sword)\n - [ ] Place Workbench (for crafting wooden_handle)\n- [ ] Add visual indicators\n - [ ] Resource type identification\n - [ ] Crafting station labels/markers\n - [ ] Crafting progress indicators\n- [ ] Test crafting workflow\n - [ ] Verify resource collection works\n - [ ] Verify station proximity detection (CRAFTING_RADIUS = 2.5)\n - [ ] Verify recipe requirements are enforced\n - [ ] Verify crafting time delays work\n - [ ] Test full crafting chain to iron_sword\n- [ ] Balance resource placement\n - [ ] Ensure enough resources for full crafting chain\n - [ ] Add some excess resources to test efficiency\n- [ ] Document crafting mechanics and recipes\n\n## Acceptance Criteria\n- [ ] Scene has all required base resources (iron_ore, wood, coal)\n- [ ] Scene has all three crafting stations (Furnace, Anvil, Workbench)\n- [ ] Agent can gather resources and navigate to stations\n- [ ] Full crafting chain works: resources → iron_sword\n- [ ] Metrics track items crafted, recipe efficiency, resource waste\n- [ ] Scene demonstrates multi-step planning and execution\n- [ ] Scene layout and recipes are documented\n\n## Technical Notes\n- Scene script: `scripts/crafting_chain.gd` (extends SceneController)\n- Scene file: `scenes/crafting_chain.tscn`\n- Collection radius: 2.0 units\n- Crafting radius: 2.5 
units\n- Recipes defined in RECIPES constant\n- Crafting is time-based (2-4 seconds per item)\n\n## Related\n- Parent: #15\n- Related: SceneController implementation (#24)","number":26,"repository":"JustInternetAI/AgentArena","title":"Populate Crafting Chain Scene with game content","type":"Issue","url":"https://github.com/JustInternetAI/AgentArena/issues/26"},"id":"PVTI_lADODG39W84BHw8kzgh0slY","labels":["enhancement","evals"],"repository":"https://github.com/JustInternetAI/AgentArena","status":"Backlog","title":"Populate Crafting Chain Scene with game content"},{"content":{"body":"## Description\nThe team capture benchmark scene needs to be populated with capture points, team spawn zones, and a functional team-based competitive environment.\n\n## Parent Issue\nSplit from #15 - Build out benchmark scenes with game content\n\n## Scene Overview\n**Goal**: Two teams compete to capture and hold strategic objectives to reach a target score\n\n**Teams**: Blue Team vs Red Team (2 agents per team)\n\n**Current State**: Scene structure exists with SceneController base class and multi-agent support\n\n## Scoring System\n- Capturing a point: +10 points\n- Holding a point per tick: +1 point per tick\n- Win condition: First team to 100 points\n\n## Tasks\n- [ ] Design arena layout\n - [ ] Create balanced battlefield with cover and pathways\n - [ ] Define team spawn zones (opposite corners)\n - [ ] Identify strategic capture point locations\n- [ ] Add capture point nodes (3-5 points recommended)\n - [ ] Place capture points in strategic locations\n - [ ] Configure capture radius (CAPTURE_RADIUS = 3.0)\n - [ ] Configure capture time (CAPTURE_TIME = 5.0 seconds)\n - [ ] Add visual indicators (neutral/blue/red ownership)\n - [ ] Add capture progress visualization\n- [ ] Set up team spawn zones\n - [ ] Blue team spawn area (with spawn positions for 2+ agents)\n - [ ] Red team spawn area (with spawn positions for 2+ agents)\n - [ ] Ensure balanced starting positions\n- [ ] Add 
environmental elements\n - [ ] Cover/obstacles for tactical gameplay\n - [ ] Pathways between capture points\n - [ ] Visual landmarks for navigation\n- [ ] Test multi-agent mechanics\n - [ ] Verify capture point detection works\n - [ ] Verify team-based capture (majority control)\n - [ ] Verify contested capture point behavior\n - [ ] Test full match to 100 points\n - [ ] Verify team coordination metrics\n- [ ] Balance gameplay\n - [ ] Capture point placement creates strategic decisions\n - [ ] Teams have equal advantages\n - [ ] Match duration is reasonable (target: 2-5 minutes)\n- [ ] Document team mechanics and strategies\n\n## Acceptance Criteria\n- [ ] Scene has 3-5 capture points in strategic locations\n- [ ] Scene has defined spawn zones for both teams\n- [ ] Multiple agents per team can coordinate\n- [ ] Capture mechanics work correctly (contested, capturing, held)\n- [ ] Scoring system tracks captures and holding points\n- [ ] Team coordination metrics calculate properly\n- [ ] Scene is balanced (neither team has unfair advantage)\n- [ ] Scene demonstrates team strategy and coordination\n- [ ] Scene layout and mechanics are documented\n\n## Technical Notes\n- Scene script: `scripts/team_capture.gd` (extends SceneController)\n- Scene file: `scenes/team_capture.tscn`\n- Capture radius: 3.0 units\n- Capture time: 5.0 seconds\n- Communication radius: 15.0 units (for agent coordination)\n- Uses `get_agents_by_team()` helper for team queries\n- Tracks individual contributions and team coordination score\n\n## Multi-Agent Complexity\nThis scene tests:\n- Team-based agent coordination\n- Concurrent multi-agent decision making\n- Strategic objective prioritization\n- Defensive vs offensive behaviors\n\n## Related\n- Parent: #15\n- Related: SceneController implementation (#24)","number":27,"repository":"JustInternetAI/AgentArena","title":"Populate Team Capture Scene with game 
content","type":"Issue","url":"https://github.com/JustInternetAI/AgentArena/issues/27"},"id":"PVTI_lADODG39W84BHw8kzgh0snY","labels":["enhancement","evals"],"repository":"https://github.com/JustInternetAI/AgentArena","status":"Backlog","title":"Populate Team Capture Scene with game content"},{"content":{"body":"## Summary\nBuild a debugging tool that shows exactly what prompts are sent to the LLM and what responses are received, enabling developers to understand and debug agent decision-making.\n\n## Educational Value\n- **Core Skill**: Understanding how observations become prompts and how LLM responses become actions\n- **Debugging**: See why an agent made a particular decision\n- **Learning**: Study effective prompt patterns by examining working agents\n\n## Features\n\n### Minimum Viable\n- [ ] Display the full prompt sent to the LLM for each decision\n- [ ] Display the raw LLM response\n- [ ] Show tool calls extracted from the response\n- [ ] Timestamp and tick number for each exchange\n\n### Enhanced\n- [ ] Syntax highlighting for JSON tool schemas and responses\n- [ ] Collapsible sections (system prompt, observations, memory context, etc.)\n- [ ] Search/filter by tick range or keywords\n- [ ] Export conversation log to file\n\n### Advanced\n- [ ] Side-by-side comparison of prompts from different runs\n- [ ] Token count display (input/output)\n- [ ] Latency metrics per request\n- [ ] Highlight which observations changed between ticks\n\n## Implementation Notes\n- Could be a Godot UI panel or a separate web-based viewer\n- Needs hooks into the IPC layer to capture request/response pairs\n- Consider log-based approach for replay compatibility\n\n## Acceptance Criteria\n- [ ] Can view the exact prompt sent for any agent decision\n- [ ] Can view the exact LLM response received\n- [ ] Works during live simulation and replay mode","number":31,"repository":"JustInternetAI/AgentArena","title":"Prompt Inspector - View LLM Input/Output in 
Real-Time","type":"Issue","url":"https://github.com/JustInternetAI/AgentArena/issues/31"},"id":"PVTI_lADODG39W84BHw8kzgkGzgQ","labels":["enhancement"],"repository":"https://github.com/JustInternetAI/AgentArena","status":"Backlog","title":"Prompt Inspector - View LLM Input/Output in Real-Time"},{"content":{"body":"## Summary\nImplement a debugging mode that allows pausing the simulation and advancing tick-by-tick, enabling detailed inspection of agent behavior at each decision point.\n\n## Educational Value\n- **Core Skill**: Understanding the perception-reasoning-action loop\n- **Debugging**: Isolate exactly when and why behavior diverges from expectation\n- **Learning**: Study agent behavior in slow motion\n\n## Features\n\n### Minimum Viable\n- [ ] Pause/Resume simulation button\n- [ ] Step forward one tick\n- [ ] Display current tick number prominently\n- [ ] Show agent state at current tick (position, inventory, health, etc.)\n\n### Enhanced\n- [ ] Step backward (requires event log replay)\n- [ ] Jump to specific tick number\n- [ ] Breakpoints: pause when specific conditions occur (e.g., agent picks up item)\n- [ ] Speed controls (0.5x, 1x, 2x, 5x)\n\n### Advanced\n- [ ] Conditional breakpoints (pause when agent.health < 50)\n- [ ] Watch expressions (track specific values across ticks)\n- [ ] Timeline scrubber with event markers\n- [ ] Fork simulation from any tick (what-if exploration)\n\n## Implementation Notes\n- Leverage existing deterministic replay system\n- SimulationManager already has step_simulation() method\n- Need UI controls in Godot\n- Backward stepping requires replaying from start to target tick\n\n## Acceptance Criteria\n- [ ] Can pause simulation at any point\n- [ ] Can advance exactly one tick at a time\n- [ ] Can see agent state after each tick\n- [ ] Works with multiple agents","number":32,"repository":"JustInternetAI/AgentArena","title":"Step-Through Debug Mode - Tick-by-Tick Simulation 
Control","type":"Issue","url":"https://github.com/JustInternetAI/AgentArena/issues/32"},"id":"PVTI_lADODG39W84BHw8kzgkGzhE","labels":["enhancement"],"repository":"https://github.com/JustInternetAI/AgentArena","status":"Backlog","title":"Step-Through Debug Mode - Tick-by-Tick Simulation Control"},{"content":{"body":"## Summary\nBuild a tool that runs the same scenario with different agent implementations and presents a side-by-side comparison of their performance, decisions, and outcomes.\n\n## Educational Value\n- **Core Skill**: Understanding how different approaches affect agent performance\n- **Experimentation**: Test hypotheses about agent design\n- **Benchmarking**: Objectively compare implementations\n\n## Features\n\n### Minimum Viable\n- [ ] Run same scenario with two different agent configs\n- [ ] Display final metrics side-by-side (score, time, resources collected, etc.)\n- [ ] Same random seed for fair comparison\n- [ ] Summary report of key differences\n\n### Enhanced\n- [ ] Compare more than two agents simultaneously\n- [ ] Timeline view showing when agents diverged in behavior\n- [ ] Highlight decision points where agents chose differently\n- [ ] Statistical comparison over multiple runs (mean, stddev)\n\n### Advanced\n- [ ] Automated regression testing (did my change make it worse?)\n- [ ] Heatmaps showing where agents spent time\n- [ ] Decision tree visualization of agent choices\n- [ ] Export comparison reports (markdown, HTML)\n\n## Implementation Notes\n- Reuse deterministic replay for consistent comparisons\n- Store metrics in structured format for analysis\n- Consider integration with existing evals harness\n\n## Acceptance Criteria\n- [ ] Can run two agents on identical scenario conditions\n- [ ] Can see side-by-side performance metrics\n- [ ] Results are reproducible with same seed","number":33,"repository":"JustInternetAI/AgentArena","title":"Agent Comparison Tool - A/B Testing for Agent 
Implementations","type":"Issue","url":"https://github.com/JustInternetAI/AgentArena/issues/33"},"id":"PVTI_lADODG39W84BHw8kzgkGzhU","labels":["enhancement","evals"],"repository":"https://github.com/JustInternetAI/AgentArena","status":"Backlog","title":"Agent Comparison Tool - A/B Testing for Agent Implementations"},{"content":{"body":"## Summary\nCreate a series of tutorial scenarios that guide new users through building their first agent, with in-context hints, explanations, and progressive challenges.\n\n## Educational Value\n- **Onboarding**: Get new users productive quickly\n- **Structured Learning**: Build skills in logical order\n- **Immediate Success**: Early wins build confidence\n\n## Proposed Tutorials\n\n### Tutorial 1: Hello Agent\n- Minimal agent that responds to observations\n- Learn: Basic agent structure, receiving observations, returning actions\n- Goal: Move to a marked location\n\n### Tutorial 2: Tool Time\n- Agent with multiple tools available\n- Learn: Tool schemas, choosing appropriate tools, parameter passing\n- Goal: Pick up specific items using the right tools\n\n### Tutorial 3: Remember This\n- Agent that needs to remember past observations\n- Learn: Short-term memory, context window management\n- Goal: Find items seen earlier but now out of view\n\n### Tutorial 4: Plan Ahead\n- Multi-step task requiring planning\n- Learn: Goal decomposition, sequential actions\n- Goal: Craft an item requiring multiple gathered resources\n\n### Tutorial 5: Team Up\n- Two agents that must coordinate\n- Learn: Multi-agent communication, role assignment\n- Goal: Complete task requiring cooperation\n\n## Features\n\n### Minimum Viable\n- [ ] Tutorial 1 & 2 implemented\n- [ ] In-game hint system (text overlays)\n- [ ] Clear success/failure feedback\n- [ ] Link to documentation for each concept\n\n### Enhanced\n- [ ] All 5 tutorials implemented\n- [ ] Progress tracking (which tutorials completed)\n- [ ] Code scaffolding provided (fill in the blanks)\n- [ ] 
Common mistake detection with helpful hints\n\n### Advanced\n- [ ] Interactive code editor in-game\n- [ ] Video walkthroughs embedded\n- [ ] Achievement/badge system\n- [ ] Community solutions gallery\n\n## Acceptance Criteria\n- [ ] New user can complete Tutorial 1 within 15 minutes\n- [ ] Each tutorial teaches one clear concept\n- [ ] Hints available when user is stuck","number":34,"repository":"JustInternetAI/AgentArena","title":"Tutorial Scenarios - Guided Learning Experiences","type":"Issue","url":"https://github.com/JustInternetAI/AgentArena/issues/34"},"id":"PVTI_lADODG39W84BHw8kzgkGzhg","labels":["enhancement","evals"],"repository":"https://github.com/JustInternetAI/AgentArena","status":"Backlog","title":"Tutorial Scenarios - Guided Learning Experiences"},{"content":{"body":"## Summary\nBuild a visual editor that allows users to create custom scenarios without writing code, then share them with the community.\n\n## Educational Value\n- **Creativity**: Design unique challenges to test specific agent capabilities\n- **Community**: Learn from scenarios others have created\n- **Deep Understanding**: Building scenarios teaches how agents interact with environments\n\n## Features\n\n### Minimum Viable\n- [ ] Place/remove objects in 3D space (resources, obstacles, goals)\n- [ ] Set agent spawn points\n- [ ] Define win/lose conditions (collect X items, reach location, survive time)\n- [ ] Save/load scenarios to files\n- [ ] Basic metrics configuration (what to measure)\n\n### Enhanced\n- [ ] Object property editor (health, value, behavior)\n- [ ] Trigger system (when X happens, do Y)\n- [ ] Multiple agent spawn support\n- [ ] Terrain painting/modification\n- [ ] Scenario metadata (name, description, difficulty, tags)\n\n### Advanced\n- [ ] Scenario validation (is it completable?)\n- [ ] Upload to community repository\n- [ ] Browse/download community scenarios\n- [ ] Rating and review system\n- [ ] Scenario templates (start from existing scenario)\n\n## 
Implementation Notes\n- Godot has built-in editor capabilities that could be leveraged\n- Scenarios saved as .tscn + config.yaml\n- Community sharing could use GitHub releases or dedicated server\n\n## Acceptance Criteria\n- [ ] Can create a simple foraging scenario without code\n- [ ] Can define clear win condition\n- [ ] Scenario can be saved and loaded\n- [ ] Created scenario works with existing agent implementations","number":35,"repository":"JustInternetAI/AgentArena","title":"Scenario Editor - Create and Share Custom Scenarios","type":"Issue","url":"https://github.com/JustInternetAI/AgentArena/issues/35"},"id":"PVTI_lADODG39W84BHw8kzgkGzig","labels":["enhancement"],"repository":"https://github.com/JustInternetAI/AgentArena","status":"Backlog","title":"Scenario Editor - Create and Share Custom Scenarios"},{"content":{"body":"## Summary\nProvide a library of ready-to-use agent templates implementing common agentic patterns, giving users a starting point for their own implementations.\n\n## Educational Value\n- **Quick Start**: Get a working agent immediately, then customize\n- **Best Practices**: Learn proven patterns from working examples\n- **Comparison**: See how different architectures handle the same task\n\n## Proposed Templates\n\n### Basic Templates\n1. **Reactive Agent** - Simple stimulus-response, no memory\n2. **ReAct Agent** - Reasoning + Acting pattern with scratchpad\n3. **Memory Agent** - Uses short-term memory for context\n\n### Intermediate Templates\n4. **RAG Agent** - Retrieves from long-term vector memory\n5. **Planning Agent** - Generates and executes multi-step plans\n6. **Reflective Agent** - Self-critiques and revises decisions\n\n### Advanced Templates\n7. **Hierarchical Agent** - High-level planner + low-level executor\n8. **Multi-Agent Coordinator** - Manages team of sub-agents\n9. 
**Learning Agent** - Updates behavior based on outcomes\n\n## Features\n\n### Minimum Viable\n- [ ] 3 basic templates (Reactive, ReAct, Memory)\n- [ ] Each template has README explaining the pattern\n- [ ] Templates work out-of-box with foraging scenario\n- [ ] Clear extension points marked in code\n\n### Enhanced\n- [ ] All 9 templates implemented\n- [ ] Comparison documentation (when to use which)\n- [ ] Configuration options for each template\n- [ ] Unit tests for each template\n\n### Advanced\n- [ ] Interactive template selector (quiz: what do you need?)\n- [ ] Template composition (combine patterns)\n- [ ] Performance benchmarks for each template\n- [ ] Video explanations of each pattern\n\n## File Structure\n```\npython/\n├── templates/\n│   ├── reactive/\n│   │   ├── agent.py\n│   │   ├── README.md\n│   │   └── config.yaml\n│   ├── react/\n│   ├── memory/\n│   └── ...\n```\n\n## Acceptance Criteria\n- [ ] User can copy a template and have working agent immediately\n- [ ] Each template demonstrates one clear pattern\n- [ ] Templates are well-documented with inline comments","number":36,"repository":"JustInternetAI/AgentArena","title":"Agent Templates Library - Starter Code for Common Patterns","type":"Issue","url":"https://github.com/JustInternetAI/AgentArena/issues/36"},"id":"PVTI_lADODG39W84BHw8kzgkGzi0","labels":["enhancement","tools"],"repository":"https://github.com/JustInternetAI/AgentArena","status":"Backlog","title":"Agent Templates Library - Starter Code for Common Patterns"},{"content":{"body":"## Problem Statement\n\nAdvanced learners need to inspect and debug agent memory contents. 
Currently there's no standardized way to view what's stored in memory, query it externally, or export it for analysis.\n\n## Goal\n\nProvide memory inspection tools for Tier 3 learners.\n\n**Depends On**: #43 (B-31: LocalLLMBehavior)\n\n## Implementation Tasks\n\n- [ ] Add `memory.dump()` method to `AgentMemory` interface\n- [ ] Add `memory.query(query: str)` for semantic retrieval\n- [ ] Create CLI tool: `python -m tools.inspect_agent --memory`\n- [ ] Add memory export to JSON/CSV formats\n- [ ] Document memory inspection in learner_tiers.md\n\n## Acceptance Criteria\n\n- [ ] Learners can view full memory contents\n- [ ] Learners can search memory by query\n- [ ] Export works for analysis in external tools\n\n## Context\n\nThis is part of the Tier 3 Advanced learner features. See [docs/learner_tiers.md](docs/learner_tiers.md) for the full progression guide.\n\n| Capability | What You Control | Inspection Tool |\n|------------|------------------|-----------------|\n| **Memory** | `self.memory.add()`, `memory.query()` | `memory.dump()`, memory viewer |\n\n---\n**Priority**: Medium\n**Component**: Agent Runtime\n**Size**: S","number":44,"repository":"JustInternetAI/AgentArena","title":"B-32: Tier 3 Memory Inspection API","type":"Issue","url":"https://github.com/JustInternetAI/AgentArena/issues/44"},"id":"PVTI_lADODG39W84BHw8kzgkcLCk","labels":["enhancement"],"repository":"https://github.com/JustInternetAI/AgentArena","status":"Backlog","title":"B-32: Tier 3 Memory Inspection API"},{"content":{"body":"## Problem Statement\n\nAdvanced learners need to understand step-by-step how their agent made decisions. 
They need to see:\n- What was retrieved from memory\n- What prompt was sent to the LLM\n- What the LLM responded\n- How that was parsed into a decision\n\n## Goal\n\nProvide a reasoning trace system that logs every step of the decision process.\n\n**Depends On**: #43 (B-31: LocalLLMBehavior)\n\n## Implementation Tasks\n\n- [ ] Add `ReasoningTrace` class to track decision steps\n- [ ] Add `log_step(name, data)` method to AgentBehavior\n- [ ] Store traces per-episode with timestamps\n- [ ] Create CLI tool: `python -m tools.inspect_agent --last-decision`\n- [ ] Create CLI tool: `python -m tools.inspect_agent --watch` (live mode)\n- [ ] Add trace visualization (text-based tree view)\n- [ ] Document in learner_tiers.md\n\n## Acceptance Criteria\n\n- [ ] Each decision step is logged with timestamp and data\n- [ ] Learners can replay full decision traces\n- [ ] Live watching mode shows decisions as they happen\n\n## Example Usage\n\n```python\nclass MyAgent(LLMAgentBehavior):\n    def decide(self, observation, tools):\n        # Each step is logged\n        relevant = self.memory.query(observation, k=5)\n        self.log_step(\"retrieved\", relevant)\n\n        prompt = self.build_prompt(observation, tools, relevant)\n        self.log_step(\"prompt\", prompt)\n\n        response = self.complete(prompt)\n        self.log_step(\"response\", response)\n\n        decision = self.parse_response(response, tools)\n        self.log_step(\"decision\", decision)\n\n        return decision\n```\n\n## Context\n\nThis is part of the Tier 3 Advanced learner features. 
See [docs/learner_tiers.md](docs/learner_tiers.md) for the full progression guide.\n\n---\n**Priority**: Medium\n**Component**: Agent Runtime\n**Size**: M","number":45,"repository":"JustInternetAI/AgentArena","title":"B-33: Tier 3 Reasoning Trace System","type":"Issue","url":"https://github.com/JustInternetAI/AgentArena/issues/45"},"id":"PVTI_lADODG39W84BHw8kzgkcLFA","labels":["enhancement"],"repository":"https://github.com/JustInternetAI/AgentArena","status":"Backlog","title":"B-33: Tier 3 Reasoning Trace System"},{"content":{"body":"## Problem Statement\n\nAdvanced learners want agents that learn from experience. They need hooks to:\n- Reflect on episode outcomes\n- Store insights from past episodes\n- Use those insights in future decisions\n\n## Goal\n\nProvide reflection hooks that enable learning from past episodes.\n\n**Depends On**: #43 (B-31), #44 (B-32), #45 (B-33)\n\n## Implementation Tasks\n\n- [ ] Add `on_episode_end(outcome: dict)` hook to AgentBehavior\n- [ ] Add `reflect(outcome) -> str` method to LLMAgentBehavior\n- [ ] Add `self.reflections` storage for past insights\n- [ ] Integrate reflections into prompt building\n- [ ] Create example: `ReflectiveForager` in user_agents/examples/\n- [ ] Add reflection viewer to CLI tools\n- [ ] Document reflection patterns in learner_tiers.md\n\n## Acceptance Criteria\n\n- [ ] Agents can reflect on episode outcomes using LLM\n- [ ] Reflections are stored and can be retrieved\n- [ ] Reflections improve future decision-making\n- [ ] Clear documentation with examples\n\n## Example Usage\n\n```python\nimport time\n\nclass ReflectiveForager(LLMAgentBehavior):\n    def __init__(self, **kwargs):\n        super().__init__(**kwargs)\n        self.reflections = []\n\n    def reflect(self, outcome: dict) -> None:\n        \"\"\"Called after episode ends - learn from experience.\"\"\"\n        prompt = f\"\"\"\n        Episode summary:\n        - Resources collected: {outcome.get('resources_collected', 0)}\n        - Damage taken: {outcome.get('damage_taken', 0)}\n\n        What could be improved next time?\n        \"\"\"\n        insight = self.complete(prompt)\n        self.reflections.append({\n            \"timestamp\": time.time(),\n            \"outcome\": outcome,\n            \"insight\": insight\n        })\n\n    def on_episode_end(self, success: bool) -> None:\n        self.reflect({\"success\": success})\n```\n\n## Context\n\nThis is part of the Tier 3 Advanced learner features. See [docs/learner_tiers.md](docs/learner_tiers.md) for the full progression guide.\n\n---\n**Priority**: Medium\n**Component**: Agent Runtime\n**Size**: M","number":46,"repository":"JustInternetAI/AgentArena","title":"B-34: Tier 3 Reflection Hooks","type":"Issue","url":"https://github.com/JustInternetAI/AgentArena/issues/46"},"id":"PVTI_lADODG39W84BHw8kzgkcLK0","labels":["enhancement"],"repository":"https://github.com/JustInternetAI/AgentArena","status":"Backlog","title":"B-34: Tier 3 Reflection Hooks"}],"totalCount":30}