[FEAT]: Resource-Aware Async Orchestrator for VRAM Safety & Non-Blocking Inference #281

@sanvishukla

📝 Description

As FireForm scales toward multi-modal support, integrating voice via faster-whisper (issue #115) and extraction via Ollama, we face a high risk of resource contention and event-loop starvation.

The current implementation of src/llm.py makes synchronous requests.post() calls, which block the FastAPI event loop during inference (as identified in issue #277). This feature introduces a dedicated Orchestration layer to manage hardware access, ensure non-blocking API behaviour, and prevent memory crashes on low-end hardware.

💡 Rationale

In the resource-constrained environments typical of fire departments, heavy model inference (Whisper/Ollama) can crash low-memory terminals (8 GB/16 GB) if tasks are stacked or run concurrently. To fulfil FireForm's mission as a stable Digital Public Good, we need hardware-aware orchestration to ensure:

  • Safety: Models are swapped in/out of memory rather than fighting for the same VRAM.

  • Concurrency: The API remains responsive to other users while the LLM is thinking.

🛠️ Proposed Solution

I propose introducing a Hardware Orchestration Layer (src/core/orchestrator.py) that acts as a singleton "Traffic Cop" for all inference-heavy operations.

  • Async Mutex Locking: Implement an asyncio.Lock() to serialize access to the GPU/VRAM. Because waiting tasks await the lock instead of blocking, the event loop stays free while the transcription (Whisper) and extraction (Ollama) tasks run one at a time in a safe, serial queue.
  • Multimodal Model Swapping: The orchestrator will manage the lifecycle of models. By explicitly offloading the Whisper model from memory before triggering the Ollama text extraction, we can maintain stability on devices with as little as 8 GB of unified memory.
  • Non-Blocking Transport: Refactor the core LLM execution logic in src/llm.py to use httpx.AsyncClient. This replaces the existing blocking requests.post() calls, allowing the FastAPI server to remain responsive for other CRUD operations even during a 30-second inference window.
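
To make the locking and swapping ideas above concrete, here is a minimal sketch using only the standard library. It is not the prototype mentioned in this issue: the class name `InferenceOrchestrator`, its `run` method, the `events` log, and the `fake_transcribe` / `fake_extract` helpers are all hypothetical stand-ins, and the "load" and "unload" steps only record what a real implementation would do with faster-whisper and Ollama.

```python
import asyncio
import time
from typing import Callable, List, Optional, Tuple


class InferenceOrchestrator:
    """Hypothetical sketch: one asyncio.Lock guards all heavy inference."""

    def __init__(self) -> None:
        self._lock = asyncio.Lock()
        self._resident: Optional[str] = None  # model currently held in memory
        self.events: List[str] = []           # log of load/unload/run steps

    async def run(self, model: str, job: Callable[[], str]) -> str:
        # Awaiting the lock suspends only this coroutine, not the event loop.
        async with self._lock:
            if self._resident != model:
                if self._resident is not None:
                    # A real version would free the previous model here.
                    self.events.append(f"unload:{self._resident}")
                self.events.append(f"load:{model}")
                self._resident = model
            self.events.append(f"run:{model}")
            # Blocking work runs in a worker thread so the loop stays free.
            return await asyncio.to_thread(job)


def fake_transcribe() -> str:
    time.sleep(0.05)  # stand-in for a blocking transcription call
    return "transcript"


def fake_extract() -> str:
    time.sleep(0.05)  # stand-in for a blocking extraction call
    return "fields"


async def demo() -> Tuple[List[str], List[str]]:
    # Created inside the running loop so the lock belongs to that loop.
    orch = InferenceOrchestrator()
    results = await asyncio.gather(
        orch.run("whisper", fake_transcribe),
        orch.run("ollama", fake_extract),
    )
    return list(results), orch.events


results, events = asyncio.run(demo())
print(results)
print(events)
```

Both requests are submitted at the same time, but the event log shows Whisper being unloaded before Ollama is loaded, which is the serialized behaviour the orchestrator is meant to guarantee. In the actual application the orchestrator would presumably be a single shared instance created at startup.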

✅ Acceptance Criteria

  • Concurrency Test: The FastAPI server must remain responsive (e.g., GET /templates returns a 200) while a heavy POST /forms/fill extraction is running in the background.
  • VRAM Safety: Simultaneous requests for voice transcription and form filling must be serialized via the Orchestrator, preventing hardware memory collisions and OOM crashes.
  • Transport Reliability: Ollama requests must include configurable timeouts and robust error handling for connection failures, replacing the current silent hanging behavior.
  • Test Coverage: Provide a pytest suite that uses respx to simulate slow LLM responses and verify non-blocking behavior.
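
As a rough illustration of what the criteria above would check, the sketch below simulates them with plain asyncio instead of FastAPI, httpx, and respx. Every name in it (`slow_inference`, `list_templates`, `call_with_timeout`, `check`) is a hypothetical stand-in, and the sleep durations are arbitrary; the real suite would exercise the actual endpoints with mocked Ollama responses.

```python
import asyncio
import time


async def slow_inference(lock: asyncio.Lock, seconds: float) -> str:
    # Stand-in for POST /forms/fill: holds the hardware lock while "inferring".
    async with lock:
        await asyncio.sleep(seconds)
        return "filled"


async def list_templates() -> int:
    # Stand-in for GET /templates: needs no hardware lock.
    return 200


async def call_with_timeout(coro, timeout: float) -> str:
    # Converts an indefinite wait into an explicit, reportable failure.
    try:
        return await asyncio.wait_for(coro, timeout)
    except asyncio.TimeoutError:
        return "timeout"


async def check() -> dict:
    lock = asyncio.Lock()
    heavy = asyncio.create_task(slow_inference(lock, 0.2))
    await asyncio.sleep(0)  # let the heavy task start and take the lock

    started = time.perf_counter()
    status = await list_templates()  # must not wait for the heavy task
    light_latency = time.perf_counter() - started

    # A second heavy request queues behind the lock and hits its timeout.
    queued = await call_with_timeout(slow_inference(lock, 0.2), timeout=0.05)
    heavy_result = await heavy
    return {
        "status": status,
        "light_latency": light_latency,
        "queued": queued,
        "heavy_result": heavy_result,
    }


report = asyncio.run(check())
print(report)
```

The light request completes while the heavy one still holds the lock, and the queued request fails with a clear timeout instead of hanging silently.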

📌 Additional Context

This architectural fix directly resolves the event-loop blocking identified in #277 and serves as the necessary foundation for the local Voice Transcription feature (#115). I have a technical prototype ready and can submit a PR shortly.
