[FEAT]: Resource-Aware Async Orchestrator for VRAM Safety & Non-Blocking Inference

---
name: 🚀 Feature Request
about: Suggest an idea or a new capability for FireForm.
title: "[FEAT]: Resource-Aware Async Orchestrator for VRAM Safety & Non-Blocking Inference"
labels: enhancement, architecture, performance
assignees: ''

---

## 📝 Description
As FireForm scales toward multimodal support, integrating Voice via faster-whisper (issue #115) and Extraction via Ollama, we face a high risk of resource contention and event-loop starvation.

The current implementation of src/llm.py makes synchronous requests.post() calls, which block the FastAPI event loop for the full duration of inference (as identified in issue #277). This feature introduces a dedicated Orchestration layer to manage hardware access, ensure non-blocking API behaviour, and prevent memory crashes on low-end hardware.
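The blocking behaviour described above can be reproduced without FastAPI or Ollama. The following is a minimal sketch using only the standard library: `blocking_inference` is a hypothetical stand-in for the synchronous call in src/llm.py, and `fast_endpoint` stands in for a cheap request such as GET /templates. None of these names come from the FireForm codebase.

```python
import asyncio
import time


def blocking_inference(seconds: float) -> str:
    """Hypothetical stand-in for a synchronous requests.post() call to the LLM."""
    time.sleep(seconds)
    return "extracted"


async def fast_endpoint(log: list[str]) -> None:
    """Hypothetical stand-in for a cheap request such as GET /templates."""
    log.append("fast endpoint served")


async def serve_both(offload: bool) -> list[str]:
    """Run a slow and a fast handler on one event loop and record completion order."""
    log: list[str] = []

    async def slow_endpoint() -> None:
        if offload:
            # The blocking call runs in a worker thread; the event loop stays free.
            await asyncio.to_thread(blocking_inference, 0.2)
        else:
            # The blocking call runs on the event loop; nothing else can progress.
            blocking_inference(0.2)
        log.append("inference finished")

    await asyncio.gather(slow_endpoint(), fast_endpoint(log))
    return log


if __name__ == "__main__":
    print(asyncio.run(serve_both(offload=False)))
    print(asyncio.run(serve_both(offload=True)))
```

With `offload=False` the fast handler is served only after inference finishes; with `offload=True` it is served first. `asyncio.to_thread` requires Python 3.9 or later.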

## 💡 Rationale
In the resource-constrained environments typical of fire departments, heavy model inference (Whisper/Ollama) can crash low-memory terminals (8GB/16GB) if tasks are stacked or run concurrently. To fulfil FireForm's mission as a stable Digital Public Good, we need hardware orchestration that ensures:

- Safety: Models are swapped in/out of memory rather than fighting for the same VRAM.

- Concurrency: The API remains responsive to other users while the LLM is thinking.

## 🛠️ Proposed Solution
I propose introducing a Hardware Orchestration Layer (src/core/orchestrator.py) that acts as a singleton "Traffic Cop" for all inference-heavy operations.

- [ ] Async Mutex Locking: Use an asyncio.Lock() to serialize access to the GPU/VRAM. Callers await the lock instead of blocking, so the event loop stays free while the transcription (Whisper) and extraction (Ollama) tasks run one at a time in a safe, serial queue.
- [ ] Multimodal Model Swapping: The orchestrator will manage the lifecycle of models. By explicitly unloading the Whisper model from memory before triggering the Ollama text extraction, it should maintain stability on devices with as little as 8GB of Unified Memory.
- [ ] Non-Blocking Transport: Refactor the core LLM execution logic in src/llm.py to use httpx.AsyncClient in place of the existing blocking requests.post() calls, so the FastAPI server stays responsive to other CRUD operations even during a 30-second inference window.
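The locking and model-swapping items above could be sketched roughly as follows. This is an illustration under stated assumptions, not the prototype's code: the `Orchestrator` class, its `run` method, and the `events` log are hypothetical names, and the load/unload steps are placeholders that only record what a real implementation would do.

```python
import asyncio
from typing import Awaitable, Callable, Optional


class Orchestrator:
    """Hypothetical sketch: one lock guards the accelerator, one model is resident."""

    def __init__(self) -> None:
        self._lock = asyncio.Lock()
        self._resident: Optional[str] = None
        self.events: list[str] = []  # placeholder log instead of real load/unload

    async def run(self, model: str, job: Callable[[], Awaitable[str]]) -> str:
        # Waiting on the lock suspends only this task, not the event loop.
        async with self._lock:
            if self._resident != model:
                if self._resident is not None:
                    # Placeholder: a real version would free the model's memory here.
                    self.events.append(f"unload {self._resident}")
                self.events.append(f"load {model}")
                self._resident = model
            return await job()


async def demo() -> tuple[list[str], int, list[str]]:
    orchestrator = Orchestrator()  # created inside the running loop
    active = 0
    peak = 0

    async def fake_job(result: str) -> str:
        """Stand-in for inference; tracks how many jobs overlap."""
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)
        active -= 1
        return result

    results = await asyncio.gather(
        orchestrator.run("whisper", lambda: fake_job("transcript")),
        orchestrator.run("ollama", lambda: fake_job("fields")),
    )
    return list(results), peak, orchestrator.events


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In this sketch the two jobs never overlap (`peak` stays at 1) and the Whisper stand-in is unloaded before the Ollama stand-in is loaded. The lock only serializes access; the job passed to `run` still has to be awaitable and non-blocking itself, for example by wrapping CPU- or GPU-bound work in `asyncio.to_thread`.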

## ✅ Acceptance Criteria

- [ ] Concurrency Test: The FastAPI server must remain responsive (e.g., GET /templates returns a 200) while a heavy POST /forms/fill extraction is running in the background.
- [ ] VRAM Safety: Simultaneous requests for voice transcription and form filling must be serialized via the Orchestrator, preventing hardware memory collisions and OOM crashes.
- [ ] Transport Reliability: Ollama requests must include configurable timeouts and robust error handling for connection failures, replacing the current silent hanging behaviour.
- [ ] Test Coverage: Provide a pytest suite that uses respx to simulate slow LLM responses and verify non-blocking behaviour.

## 📌 Additional Context
This architectural fix directly resolves the event-loop blocking identified in #277 and serves as the necessary foundation for the local Voice Transcription feature (#115). I have a technical prototype ready and can submit a PR shortly.


[FEAT]: Resource-Aware Async Orchestrator for VRAM Safety & Non-Blocking Inference #281
