feat: Resource-Aware Async Orchestrator for Hardware-Safe Inference #282
Open
sanvishukla wants to merge 1 commit into fireform-core:main from
Conversation
Thanks @sanvishukla for the detailed implementation. I originally noticed the blocking behavior while testing the FireForm API locally, so it's great to see a full async refactor addressing the issue. The orchestrator approach for managing VRAM access also seems useful for supporting multiple models on lower-memory systems. I'll try testing this PR locally and share feedback if I notice anything.
Summary
This Pull Request introduces a VRAM-Aware Hardware Orchestrator and refactors the core inference pipeline to be fully asynchronous. This addresses the critical issue of server hanging during LLM extraction (identified in #277) and provides the necessary architectural foundation for local voice transcription (#115) on resource-constrained hardware.
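As a rough illustration of the gate-keeping idea described above, the sketch below shows a singleton guarding "hardware" work behind a single `asyncio.Lock`. This is not the PR's actual `src/core/orchestrator.py`; the class and method names here are assumptions for illustration only.

```python
import asyncio


class VRAMOrchestrator:
    """Sketch only: serializes GPU-bound work behind one shared asyncio.Lock."""

    _instance = None

    def __new__(cls):
        # Simple singleton: every caller shares the same instance, and
        # therefore the same lock, so only one coroutine holds the
        # "hardware" at a time.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._lock = asyncio.Lock()
        return cls._instance

    def acquire_hardware(self):
        # Hypothetical helper; used as: async with orch.acquire_hardware(): ...
        return self._lock


async def run_inference(name: str, results: list):
    async with VRAMOrchestrator().acquire_hardware():
        results.append(f"start:{name}")
        await asyncio.sleep(0)  # stand-in for a GPU-bound inference call
        results.append(f"end:{name}")


async def main():
    results: list = []
    # Two concurrent "inference" tasks never overlap inside the lock.
    await asyncio.gather(run_inference("a", results), run_inference("b", results))
    return results
```

Even though both tasks run concurrently, the lock forces the second to wait until the first finishes, so the start/end events never interleave.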
Key Changes
1. VRAM Orchestrator (`src/core/orchestrator.py`)
- Introduces a `VRAMOrchestrator` singleton that utilizes an `asyncio.Lock` (mutex) to gate-keep hardware access.

2. Full Async Pipeline Refactor
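To motivate this section: while a slow awaited call (in the PR, an `httpx.AsyncClient` request to Ollama) is in flight, the event loop is free to serve other requests. The dependency-free sketch below simulates that with `asyncio.sleep`; the coroutine names are illustrative, not the PR's actual code.

```python
import asyncio


async def slow_inference(log: list):
    # Stands in for the 30-60 second LLM extraction; the await yields the
    # event loop instead of blocking the whole server.
    log.append("inference:start")
    await asyncio.sleep(0.05)
    log.append("inference:done")


async def get_templates(log: list):
    # A quick request (think GET /templates) arriving mid-inference.
    await asyncio.sleep(0.01)
    log.append("templates:served")


async def main():
    log: list = []
    await asyncio.gather(slow_inference(log), get_templates(log))
    return log
```

The quick request completes while the slow one is still awaiting, which is exactly the responsiveness the `async def` endpoint conversion buys.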
- Refactored `llm.py` to use `httpx.AsyncClient` for non-blocking communication with local Ollama instances.
- Updated the `Filler`, `FileManipulator`, and `Controller` layers to support `async/await`.
- Converted the `/forms/fill` endpoint to `async def`, allowing the FastAPI server to remain responsive to other requests (e.g., `GET /templates`) even during heavy 30–60 second inference tasks.

3. Improved Transport Reliability
- Handles transport errors (`ConnectError`, `TimeoutException`), providing clear feedback when Ollama is unavailable.

Technical Implementation Details
- New dependencies: `httpx` (async HTTP), `pytest-asyncio` (async testing), and `respx` (transport mocking).
- Refactored `main_loop` to allow yielding the event loop between extraction steps.

Verification & Testing
Automated Tests
- `tests/test_vram_orchestrator.py`
- `tests/test_reliability.py`: uses `respx` to simulate slow LLM responses and connection failures

Closes #281
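For reviewers, here is a dependency-free sketch of the failure path those reliability tests target: a transport error is caught and converted into a clear message rather than surfacing as an unhandled exception. The `ConnectError` class and all function names below are stand-ins (the real tests use `respx` to mock `httpx`'s transport); none of this is the PR's actual code.

```python
import asyncio


class ConnectError(Exception):
    """Stand-in for httpx.ConnectError, so this sketch needs no third-party deps."""


async def call_ollama(transport):
    # Mirrors the reliability goal: surface transport failures as clear,
    # user-facing feedback instead of hanging or crashing the request.
    try:
        return await transport()
    except ConnectError:
        return {"error": "Ollama is unavailable; is the server running?"}


async def failing_transport():
    # Simulates a refused connection to the local Ollama instance.
    raise ConnectError("connection refused")


def run():
    return asyncio.run(call_ollama(failing_transport))
```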