
Introduce a unified config/composition layer for all agent components #224

@ahmad-ajmal


Summary

The agent today wires components together ad hoc. Each component reaches into globals (asyncio, env vars, module-level singletons, nest_asyncio.apply(), hard-coded paths, hard-coded timeouts) and makes its own assumptions about the OS, Python version, and installed packages. That coupling makes the whole system fragile: a single environment mismatch (for example, Python 3.14 with nest_asyncio) can break unrelated subsystems with no clear signal, and any per-deployment change requires code edits across many files.

I want to move the agent to a composition-based architecture with a unified config layer: every component is a pure, version-agnostic, OS-agnostic unit that receives a config object at construction time. All environmental decisions - Python version checks, platform detection, package availability, feature flags, timeouts, paths, credentials, model selection, logging - are resolved by the config layer at a single entry point, not scattered inside the components.

The asyncio.wait_for / nest_asyncio / Python 3.14 bug (see compat shim at the top of agent_core/core/impl/action/manager.py) is the concrete example that forced this issue, but it's a symptom, not the problem. The same class of fragility applies to MCP setup, LLM provider switching, sandboxed action execution, scheduler wiring, interface mode (browser/cli/tui), and more.

Context: how we got here

Frankie hit a blocker on Python 3.14.x where every asyncio.wait_for(...) call raised RuntimeError: Timeout should be used inside a task, breaking MCP stdio startup and action execution. Root cause: nest_asyncio.apply() doesn't propagate Python 3.14's task context variable, so asyncio.timeout() can't find the current task.

Debugging was painful because:

  • No Python version is recorded in logs.
  • The trigger consumer swallowed the failure silently (except Exception: pass with no log), so the agent looked dead with zero signal.
  • The traceback surfaces inside stdlib with no hint that nest_asyncio is involved.
  • nest_asyncio itself is only needed because ~10 places in the codebase call asyncio.run() / loop.run_until_complete() from inside an already-running event loop.
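
The first two points could be addressed with something like the sketch below. It is only an illustration: the logger name and the helper names log_runtime_environment and run_consumer_logged are hypothetical, not existing functions in the codebase.

```python
import asyncio
import logging
import platform
import sys

# Hypothetical logger name, chosen for this sketch only.
logger = logging.getLogger("agent.diagnostics")


def log_runtime_environment() -> str:
    """Log and return a one-line summary of the interpreter and platform."""
    summary = (
        f"python={platform.python_version()} "
        f"implementation={platform.python_implementation()} "
        f"platform={sys.platform}"
    )
    logger.info("runtime environment: %s", summary)
    return summary


async def run_consumer_logged(consume_once) -> bool:
    """Run one consumer step and log any failure instead of swallowing it.

    Returns True if the step succeeded, False if it raised.
    """
    try:
        await consume_once()
        return True
    except Exception:
        # logger.exception records the message together with the traceback,
        # so a failure is visible in the logs rather than silently dropped.
        logger.exception("trigger consumer step failed")
        return False
```

With this shape, a failing step still keeps the consumer alive, but the traceback and the Python version both end up in the logs.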

I've shipped a band-aid shim that monkey-patches asyncio.wait_for. It works today but silently rewrites a stdlib function, swallows BaseException during cleanup, and hides the real architectural problem. It needs to go away as part of this refactor.

The proposal

1. Components become pure + version-agnostic

Every component - trigger consumer, action manager, action executor, MCP client, LLM interface, memory manager, scheduler, external comms, UI adapters, state manager - is rewritten to:

  • Take all of its dependencies via constructor / DI.
  • Hold no module-level globals, no nest_asyncio.apply(), no direct env-var reads.
  • Be testable in isolation without spinning up the whole agent.
  • Not care about OS, Python version, or package availability.
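
A minimal sketch of what such a component might look like, assuming a hypothetical ActionExecutor and a hypothetical ActionExecutorConfig slice (neither name is taken from the current code):

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable


@dataclass(frozen=True)
class ActionExecutorConfig:
    """Hypothetical slice of the config that an executor would need."""
    action_timeout_s: float


class ActionExecutor:
    """Hypothetical component: every dependency arrives via the constructor."""

    def __init__(
        self,
        config: ActionExecutorConfig,
        run_action: Callable[[str], Awaitable[str]],
    ) -> None:
        self._config = config
        self._run_action = run_action

    async def execute(self, name: str) -> str:
        # The timeout comes from the injected config, not from a module-level
        # constant or an environment variable read inside the component.
        return await asyncio.wait_for(
            self._run_action(name), timeout=self._config.action_timeout_s
        )
```

Because nothing is read from globals, a test can construct the executor with a fake run_action coroutine and a small config object, without starting the rest of the agent.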

2. Unified config layer

A single AgentConfig (or similar) object owns everything environmental:

  • Python version + runtime capability checks (does asyncio.timeout work? do we need a shim? is nest_asyncio needed?).
  • Platform detection (win32/darwin/linux branching).
  • Package availability probes (Node/npm for MCP, tesseract, playwright, etc.).
  • Paths (data dir, chroma dir, agent FS, workspace root).
  • Timeouts, retry budgets, rate limits.
  • LLM providers, models, API keys, base URLs.
  • Feature flags (gui_mode, slow_mode, experimental toggles).
  • Interface mode and adapter selection.
  • Logging setup (level, sinks, format).

Config is built once at startup from settings.json + CLI args + env detection, then handed to the composition root.
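
One possible shape for this, as a rough sketch only: the field names, the AGENT_LOG_LEVEL env var, the default values, and the precedence order (CLI over env over settings.json over defaults) are all assumptions made for illustration, not decisions this issue has settled.

```python
import json
import shutil
import sys
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class AgentConfig:
    """Illustrative subset of fields; the real schema is still to be defined."""
    python_version: tuple
    platform: str
    has_node: bool
    action_timeout_s: float
    llm_provider: str
    log_level: str


# Placeholder defaults for the sketch.
DEFAULTS = {
    "action_timeout_s": 30.0,
    "llm_provider": "example-provider",
    "log_level": "INFO",
}


def build_config(
    settings_json: str, env: Mapping[str, str], cli: Mapping[str, object]
) -> AgentConfig:
    """Merge all sources once; assumed precedence: CLI > env > settings > defaults."""
    merged = dict(DEFAULTS)
    merged.update(json.loads(settings_json))
    if "AGENT_LOG_LEVEL" in env:  # hypothetical env var name
        merged["log_level"] = env["AGENT_LOG_LEVEL"]
    # CLI values that were not supplied (None) do not override anything.
    merged.update({k: v for k, v in cli.items() if v is not None})
    return AgentConfig(
        python_version=tuple(sys.version_info[:3]),
        platform=sys.platform,
        has_node=shutil.which("node") is not None,  # package availability probe
        action_timeout_s=float(merged["action_timeout_s"]),
        llm_provider=str(merged["llm_provider"]),
        log_level=str(merged["log_level"]),
    )
```

Passing env and cli in as plain mappings keeps build_config itself free of global reads, so it can be exercised in tests with literal dictionaries.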

3. Single composition entry point

One place - likely a replacement/extension of app/main.py::main_async - builds the config, instantiates every component with it, wires them together, and hands control to the interface. No component constructs its dependencies itself; they all come from the composition root.

This is where version-specific workarounds live - exactly once - gated by the config's capability flags. The asyncio.wait_for shim, for example, becomes if config.needs_wait_for_shim: install_shim() at the composition root and nowhere else.
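
A sketch of how the composition root could look under these assumptions. The class names below are simplified stand-ins for the real components, and install_shim is passed in as a parameter so the example stays self-contained; only needs_wait_for_shim comes from the proposal itself.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass(frozen=True)
class Config:
    """Stand-in for AgentConfig with just the fields this sketch uses."""
    needs_wait_for_shim: bool
    action_timeout_s: float


class ActionManager:
    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s


class TriggerConsumer:
    def __init__(self, action_manager: ActionManager) -> None:
        self.action_manager = action_manager


@dataclass
class Agent:
    consumer: TriggerConsumer
    applied_workarounds: List[str] = field(default_factory=list)


def compose(config: Config, install_shim: Callable[[], None]) -> Agent:
    """Build every component in one place and wire them together."""
    applied: List[str] = []
    # Version-specific workarounds are applied here and nowhere else.
    if config.needs_wait_for_shim:
        install_shim()
        applied.append("wait_for_shim")
    manager = ActionManager(timeout_s=config.action_timeout_s)
    consumer = TriggerConsumer(action_manager=manager)
    return Agent(consumer=consumer, applied_workarounds=applied)
```

Recording which workarounds were applied makes it straightforward to include them in a startup log or config dump.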

4. Eliminate nest_asyncio

As part of the component rewrite, every asyncio.run() / loop.run_until_complete() call inside a running loop (app/data/action/task_end.py, app/data/action/send_message_with_attachment.py, app/data/action/integration_management.py, agent_core/core/impl/llm/interface.py:319, agent_core/core/impl/config/watcher.py:240, agent_core/core/impl/skill/manager.py:90, and others) is converted to proper await / asyncio.create_task / asyncio.to_thread patterns. Once those are gone, nest_asyncio can be dropped from requirements.txt / environment.yml - and with it, the shim.
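
The conversion pattern might look roughly like this. The functions fetch_status and blocking_io are hypothetical placeholders, and task_end here only borrows the file name from the list above; it is not the actual implementation.

```python
import asyncio
import time


async def fetch_status() -> str:
    """Placeholder for an async operation."""
    await asyncio.sleep(0)
    return "ok"


def blocking_io() -> str:
    """Placeholder for synchronous, blocking work."""
    time.sleep(0.01)
    return "done"


# Before: calling asyncio.run() from code that is already running inside
# an event loop, which only works when nest_asyncio has patched the loop:
#
#     def task_end():
#         return asyncio.run(fetch_status())


async def task_end() -> tuple:
    # Async work is awaited directly on the running loop.
    status = await fetch_status()
    # Blocking work moves to a worker thread so the loop stays responsive.
    result = await asyncio.to_thread(blocking_io)
    # Work that can overlap is scheduled as a task and awaited when needed.
    background = asyncio.create_task(fetch_status())
    return status, result, await background
```

The cost of this change is that callers of the converted function must themselves become async, which is why the call sites are best converted one chain at a time.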

Wins

  • One place to do environment/version checks instead of scattered runtime surprises.
  • Swap-ability: changing LLM provider, interface mode, or storage backend is a config change, not a code change.
  • Testability: every component can be unit-tested with a fake config.
  • Debuggability: config object dump at startup = full picture of the runtime environment, no more guessing Python versions from tracebacks.
  • Portability: same component code runs on 3.10, 3.11, 3.12, 3.13, 3.14, and whatever comes next - only the config resolver changes.
  • The nest_asyncio / asyncio.wait_for bug disappears as a free side-effect, not as a targeted fix.

Scope / phasing suggestion

This is a multi-week refactor, not a weekend PR. Rough phasing:

  1. Phase 0 - diagnostics (already partially done): log Python version at startup, log trigger consumer exits, log component init.
  2. Phase 1 - config skeleton: define AgentConfig schema, build it from settings.json + env + CLI at a single point, pass it down. No component rewrites yet - just make sure everything flows through one config object.
  3. Phase 2 - remove asyncio.run() inside running loop: convert the ~10 offending call sites to proper async. Drop nest_asyncio + shim.
  4. Phase 3 - component extraction: one subsystem at a time (start with action executor or LLM interface), move globals → constructor args, wire via composition root.
  5. Phase 4 - docs: docs/architecture.md describing components + config + composition root.

Out of scope

  • Behavioral changes to the agent itself. This is purely structural.
  • Breaking the public config surface (settings.json schema can evolve but shouldn't break existing users in phase 1–2).
  • Possibly later: make all configs dependent on models rather than on JSON files. For example, the onboarding configs and the settings.json structure shouldn't depend on the file being present. That would allow missing files to be created instead of the agent just crashing.

Labels: Improvement, MCP, bug, documentation, question
