A multi-agent coding framework that runs entirely on local hardware with 6GB GPU support. Uses quantized GGUF models via llama-cpp-python for LLM inference and ChromaDB + sentence-transformers for local RAG (Retrieval-Augmented Generation).
No API keys. No cloud. Everything runs on your machine.
```
User Request
      │
      ▼
┌──────────────────────────────────────────┐
│             AgentOrchestrator            │
│  • Detects intent (code/debug/docs)      │
│  • Fetches RAG context from codebase     │
│  • Routes to agent pipeline              │
└────────────────────┬─────────────────────┘
                     │
        ┌────────────┼────────────┐
        ▼            ▼            ▼
   ┌─────────┐ ┌───────────┐ ┌──────────┐
   │  Coder  │ │ Debugger  │ │ DocAgent │
   │  Agent  │ │  Agent    │ │          │
   └────┬────┘ └─────┬─────┘ └──────────┘
        │            │
        ▼            ▼
   ┌──────────────────────┐
   │    ReviewerAgent     │
   │  (auto quality gate) │
   └──────────────────────┘
              │
              ▼
        Final Output

RAG Stack (CPU):
  ChromaDB ──► bge-small-en ──► Your Codebase
```
| Component | VRAM | Notes |
|---|---|---|
| DeepSeek Coder 6.7B (28 layers) | ~4.2GB | Primary LLM |
| KV Cache (4096 ctx) | ~0.8GB | Context window |
| Overhead / buffers | ~0.5GB | CUDA runtime |
| Total | ~5.5GB | Leaves 0.5GB headroom |
| Embedding model | CPU only | bge-small-en, 133MB RAM |
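The budget in the table can be sanity-checked with quick arithmetic. The sketch below is illustrative only (it assumes weight VRAM scales linearly with offloaded layers and KV cache scales linearly with context length, using the table's approximate numbers, not measured values):

```python
# Rough VRAM estimate when offloading n of 28 transformer layers.
# Constants mirror the table above; real usage varies by model and backend.
WEIGHTS_GB_ALL_LAYERS = 4.2   # DeepSeek Coder 6.7B Q4_K_M, all 28 layers
TOTAL_LAYERS = 28
KV_CACHE_GB_AT_4096 = 0.8     # roughly halves at n_ctx=2048
OVERHEAD_GB = 0.5             # CUDA runtime / buffers

def estimate_vram_gb(n_gpu_layers: int, n_ctx: int = 4096) -> float:
    """Linear-scaling VRAM estimate in GB (illustrative, not measured)."""
    weights = WEIGHTS_GB_ALL_LAYERS * n_gpu_layers / TOTAL_LAYERS
    kv = KV_CACHE_GB_AT_4096 * n_ctx / 4096
    return round(weights + kv + OVERHEAD_GB, 2)

print(estimate_vram_gb(28))        # → 5.5, matches the table total
print(estimate_vram_gb(20, 2048))  # → 3.9, a safer budget for 6GB cards
```

This is why the troubleshooting advice of dropping `n_gpu_layers` to 20 and `n_ctx` to 2048 frees well over a gigabyte on a 6GB card.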
```bash
git clone <this-repo>
cd local_agent_framework
pip install -e .
```

```bash
# For NVIDIA GPU (CUDA 12.x):
CMAKE_ARGS="-DLLAMA_CUDA=on" pip install llama-cpp-python --force-reinstall

# CPU-only (slower but works anywhere):
pip install llama-cpp-python
```

```bash
# Download the default model (DeepSeek Coder 6.7B Q4_K_M, ~4GB)
lagent download-model

# Or download a specific model:
lagent download-model \
  --repo TheBloke/CodeLlama-7B-Instruct-GGUF \
  --file codellama-7b-instruct.Q4_K_M.gguf
```

```bash
lagent gpu-check
```

```bash
# Interactive chat mode
lagent chat

# Single task
lagent run "Write a Python async HTTP client with retry logic"

# Index your codebase for RAG
lagent index ./my_project

# Debug an error (paste traceback)
lagent run "Fix: AttributeError: 'NoneType' object has no attribute 'split'"

# Generate documentation
lagent run "Write docstrings for all functions in auth.py" --lang python

# Custom pipeline
lagent run "Refactor this to use async/await" --pipeline coder,reviewer

# Save output to a file
lagent run "Create a FastAPI CRUD app" --output api.py
```

```python
from local_agent_framework import AgentOrchestrator, RAGPipeline

# Basic usage
orchestrator = AgentOrchestrator()
result = orchestrator.run("Write a Redis cache decorator")
print(result.content)

# With RAG (index your project first)
rag = RAGPipeline()
rag.index_directory("./my_project")

orchestrator = AgentOrchestrator(rag=rag)
result = orchestrator.run(
    "Add rate limiting to the existing API endpoints",
    language="python"
)

# Print extracted code blocks
for code in result.code_blocks:
    print(code)
```

Routes requests to the appropriate pipeline automatically:
- Bug/error keywords → `debugger` → `reviewer`
- Code generation → `coder` → `reviewer`
- Documentation → `doc`
- Explanation → `coder` (explain mode)
- Refactor → `coder` → `reviewer`
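The routing rules above amount to keyword matching over the request text. A minimal sketch of the idea (illustrative only, not the framework's actual detection logic, and with an invented keyword list):

```python
# Illustrative keyword-based intent router (hypothetical keywords,
# not the framework's real implementation).
def route(request: str) -> list[str]:
    text = request.lower()
    if any(k in text for k in ("error", "traceback", "fix:", "bug", "exception")):
        return ["debugger", "reviewer"]
    if any(k in text for k in ("docstring", "document", "readme", "comment")):
        return ["doc"]
    if any(k in text for k in ("explain", "what does", "how does")):
        return ["coder"]  # explain mode
    # Default: code generation / refactoring
    return ["coder", "reviewer"]

print(route("Fix: AttributeError: 'NoneType' object has no attribute 'split'"))
# → ['debugger', 'reviewer']
print(route("Write docstrings for all functions in auth.py"))
# → ['doc']
```

The `--pipeline` flag exists precisely to override this automatic choice when the keyword heuristic guesses wrong.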
Writes, refactors, explains, and converts code. Follows language-specific best practices, adds type hints, docstrings, and error handling.
Reviews code for:
- 🔴 Critical: Bugs, security vulnerabilities, data loss risks
- 🟡 Warnings: Performance issues, bad practices
- 🔵 Suggestions: Style, readability improvements
Returns: `APPROVED` | `NEEDS_CHANGES` | `REJECTED`
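Callers can gate a pipeline on the reviewer's verdict. Assuming the verdict token appears verbatim in the reviewer's output text (an assumption about the format, not a documented guarantee), extraction is a simple substring scan:

```python
# Illustrative verdict extraction from reviewer output (assumed format).
def review_verdict(review_text: str) -> str:
    """Return the first verdict token found in the review, or UNKNOWN."""
    for verdict in ("NEEDS_CHANGES", "REJECTED", "APPROVED"):
        if verdict in review_text:
            return verdict
    return "UNKNOWN"

print(review_verdict("Verdict: APPROVED - no critical issues found."))
# → APPROVED
```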
Given a traceback + code, identifies root cause and provides a precise fix with explanation.
Generates: docstrings (Google/NumPy style), README files, API documentation, inline comments.
```python
from local_agent_framework import RAGPipeline

rag = RAGPipeline()

# Index a directory
stats = rag.index_directory(
    "./my_project",
    recursive=True,
    exclude_dirs=[".git", "node_modules", "venv"],
    force=False,  # Skip files already indexed
)
# → {"files_indexed": 47, "chunks_added": 312, "files_skipped": 3}

# Manual retrieval
results = rag.retrieve("how does user authentication work?", top_k=5)
for r in results:
    print(f"[{r['score']:.2f}] {r['metadata']['file_name']}")
    print(r['content'])

# Stats
print(rag.stats())
# → {"total_chunks": 312, "collection_name": "codebase", ...}
```

Supported file types: `.py` `.js` `.ts` `.jsx` `.tsx` `.java` `.go` `.rs` `.cpp` `.c` `.h` `.cs` `.rb` `.php` `.md` `.txt` `.yaml` `.yml` `.json` `.toml` `.sh` `.sql`
Generate a config file:

```bash
lagent show-config --save config.yaml
```

Edit `config.yaml`:

```yaml
model:
  model_name: deepseek-coder-6.7b-instruct.Q4_K_M.gguf
  n_gpu_layers: 28    # Reduce if OOM errors (try 20-24)
  n_ctx: 4096         # Reduce to 2048 to save VRAM
  temperature: 0.1    # Low = deterministic code

rag:
  top_k: 5
  chunk_size: 1000
  min_relevance_score: 0.3
```
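The `chunk_size` setting controls how files are split before embedding. A rough sketch of fixed-size character chunking with overlap (illustrative only; the `overlap` parameter is an assumption, and the real chunker may split on code structure instead):

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into fixed-size character chunks with overlap (illustrative)."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, max(len(text), 1), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

doc = "x" * 2500
chunks = chunk_text(doc)
print(len(chunks))     # → 3
print(len(chunks[0]))  # → 1000
```

Smaller chunks give more precise retrieval hits but less surrounding context per hit; `min_relevance_score` then discards chunks whose similarity falls below the threshold.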
Use custom config:

```bash
lagent run "..." --config ./my_config.yaml
```

| Model | Size | VRAM | Best For |
|---|---|---|---|
| DeepSeek Coder 6.7B Q4_K_M | 4.0GB | ~4.5GB | Code generation (recommended) |
| CodeLlama 7B Q4_K_M | 4.1GB | ~4.5GB | Code + instruction following |
| Mistral 7B Q4_K_M | 4.1GB | ~4.5GB | General coding tasks |
| Phi-3 Mini 3.8B Q4 | 2.3GB | ~2.8GB | Fast responses, lighter tasks |
For 8GB+ VRAM: use CodeLlama 13B or DeepSeek Coder 33B (Q4_K_S)
```python
from local_agent_framework import AgentOrchestrator
from local_agent_framework.agents.base import BaseAgent, AgentRole, AgentTask, AgentResult

class TestWriterAgent(BaseAgent):
    @property
    def role(self):
        return AgentRole.CODER  # Reuse existing role enum or extend it

    @property
    def system_prompt(self):
        return """You are a test engineer. Write comprehensive pytest test suites.
Always include: unit tests, edge cases, fixtures, and mocks where appropriate."""

    def run(self, task: AgentTask) -> AgentResult:
        prompt = self._build_task_prompt(task)
        response = self._generate(prompt)
        return AgentResult(
            agent_name=self.name,
            task_id=task.task_id,
            success=True,
            content=response,
        )

# Register and use
orchestrator = AgentOrchestrator()
orchestrator.add_agent("test_writer", TestWriterAgent(orchestrator.model_loader))
result = orchestrator.run(
    "Write tests for my UserAuth class",
    pipeline=["test_writer"]
)
```

CUDA out of memory:
```yaml
# In config.yaml, reduce GPU layers:
model:
  n_gpu_layers: 20   # or 16
  n_ctx: 2048        # Reduce context window
```

Model not found:

```bash
lagent download-model

# Check the models directory:
ls ~/.local_agent_framework/models/
```

Slow inference:

```bash
# Use a smaller/faster model:
lagent download-model --repo microsoft/Phi-3-mini-4k-instruct-gguf --file Phi-3-mini-4k-instruct-q4.gguf
```

Poor code quality:

- Increase `max_tokens` in config
- Lower `temperature` (try 0.05)
- Enable auto-review: `AgentOrchestrator(auto_review=True)`
MIT