High-Availability AI Inference Gateway
A production-ready, fault-tolerant AI assistant with distributed inference and local fallback capabilities
easyResearchAssistant is a lightweight, high-availability AI inference gateway designed for families who want reliable access to AI assistants without single points of failure. The system intelligently routes requests across multiple distributed inference providers and automatically falls back to local GPU inference when cloud services are unavailable.
| Feature | Description |
|---|---|
| Distributed Inference | Load balancing across multiple cloud inference nodes |
| Automatic Failover | Intelligent retry with exponential backoff on errors |
| Local Fallback | Seamless switch to local Ollama when cloud is exhausted |
| Research Mode (RAG) | Real-time web search with DuckDuckGo for up-to-date information |
| Monitoring Dashboard | Real-time observability with node status, metrics, and live logs |
| Streaming Responses | Real-time chat experience with SSE |
| Access Control | Token-based authentication |
┌─────────────────────────────────────────────────────────────────────┐
│                         Client Applications                         │
│                   (Streamlit UI / API Consumers)                    │
└──────────────────────────────────┬──────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────┐
│                        API Gateway (FastAPI)                        │
│  ┌─────────────┐  ┌─────────────────┐  ┌─────────────────────────┐  │
│  │    Auth     │  │ Request Router  │  │    Streaming Handler    │  │
│  │    Layer    │  │  & Retry Logic  │  │          (SSE)          │  │
│  └─────────────┘  └─────────────────┘  └─────────────────────────┘  │
│                                  │                                  │
│                                  ▼                                  │
│                    ┌───────────────────────────┐                    │
│                    │      RAG Search Tool      │ ◄── Research Mode  │
│                    │       (DuckDuckGo)        │                    │
│                    └───────────────────────────┘                    │
└──────────────────────────────────┬──────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────┐
│                          Provider Manager                           │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │   Selection Strategies: Round Robin │ Random │ Least Used    │   │
│  └──────────────────────────────────────────────────────────────┘   │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │   Health Monitoring │ Cooldown Management │ Auto-Recovery    │   │
│  └──────────────────────────────────────────────────────────────┘   │
└──────────────────────────────────┬──────────────────────────────────┘
                                   │
           ┌───────────────────────┼───────────────────────┐
           │                       │                       │
           ▼                       ▼                       ▼
 ┌───────────────────┐   ┌───────────────────┐   ┌───────────────────┐
 │     Inference     │   │     Inference     │   │     Inference     │
 │    Node Alpha     │   │     Node Beta     │   │    Node Gamma     │
 │   (Cloudflare)    │   │   (Cloudflare)    │   │   (Cloudflare)    │
 └───────────────────┘   └───────────────────┘   └───────────────────┘
                                   │
                                   │ (All cloud nodes exhausted)
                                   ▼
┌─────────────────────────────────────────────────────────────────────┐
│                       Local Fallback (Ollama)                       │
│                RTX 3050 • Llama 3.1 • Low-VRAM Mode                 │
└─────────────────────────────────────────────────────────────────────┘
- Redundancy: Multiple inference providers ensure no single point of failure
- Graceful Degradation: Automatic fallback to local inference preserves availability
- User Privacy: Local processing option keeps sensitive queries on-premises
- Resource Awareness: Lightweight local inference respects GPU memory constraints
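To make the failover flow concrete, here is a minimal Python sketch of the routing idea described above: rotate through cloud nodes, skip any node still in cooldown, and signal a fallback to local Ollama when none are available. The class and method names are illustrative only and are not taken from provider_manager.py.

```python
import time

# Illustrative sketch only: rotate through cloud nodes, skip nodes that are
# cooling down after a rate limit, and fall back to local Ollama when every
# cloud node is exhausted. Names are hypothetical, not the real API.

class Node:
    def __init__(self, name: str):
        self.name = name
        self.cooldown_until = 0.0        # epoch seconds; 0 means available

    def available(self) -> bool:
        return time.time() >= self.cooldown_until


class RoundRobinManager:
    def __init__(self, nodes: list[Node]):
        self.nodes = nodes
        self._cursor = 0

    def next_available(self) -> Node | None:
        """Return the next healthy node, or None if all are in cooldown."""
        for _ in range(len(self.nodes)):
            node = self.nodes[self._cursor % len(self.nodes)]
            self._cursor += 1
            if node.available():
                return node
        return None                      # caller falls back to local Ollama


manager = RoundRobinManager([Node("Provider-Alpha"), Node("Provider-Beta")])
choice = manager.next_available()
target = choice.name if choice else "ollama-local-fallback"
```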
- Python 3.10+
- Ollama (optional, for local fallback)
- Cloudflare Workers AI API access
# Clone the repository
git clone https://github.com/hzjanuary/easyResearchAssistant.git
cd easyResearchAssistant
# Create virtual environment
python -m venv venv
source venv/bin/activate # Linux/Mac
# or: .\venv\Scripts\activate # Windows
# Install dependencies
pip install -r requirements.txt
# Configure environment
cp .env.template .env
# Edit .env with your credentials

Generate Access Token:

python -c "import secrets; print(secrets.token_urlsafe(32))"

Configure .env:

# Security
ACCESS_TOKEN=your_generated_token_here
ADMIN_PASSWORD=your_admin_password_here

# Add your Cloudflare credentials
CLOUDFLARE_ACCOUNT_1_ID=your_account_id
CLOUDFLARE_ACCOUNT_1_TOKEN=your_api_token
CLOUDFLARE_ACCOUNT_1_NAME=Provider-Alpha
CLOUDFLARE_MODEL=@cf/meta/llama-3.1-8b-instruct

# Enable local fallback (optional)
OLLAMA_ENABLED=true
OLLAMA_MODEL=llama3.1

# For Docker: automatically configured via docker-compose.yml
# BACKEND_URL=http://backend:8000
# LOG_FILE=/app/logs/system.log

Start Local Fallback (optional):

ollama pull llama3.1
ollama serve
The system consists of three components:
| Component | Purpose | Default Port |
|---|---|---|
| API Gateway | FastAPI backend | 8000 |
| Chat UI | User-facing chat interface | 8501 |
| Admin Dashboard | System monitoring & health | 8502 |
# Build and start all services
docker-compose up --build
# Or run in background
docker-compose up -d --build

Access:

- Chat UI: http://localhost:8501
- Admin Dashboard: http://localhost:8502
- API Gateway: http://localhost:8000
Note for Windows/Mac: Ollama running on the host is accessible via host.docker.internal:11434.
The docker-compose.yml creates a shared volume app_logs for centralized logging:
- Backend writes logs to /app/logs/system.log
- Admin UI reads logs from the same volume (read-only)
To view logs from Docker:
docker-compose logs -f backend

Start the API Gateway:

python api_gateway.py

Start the Chat UI (Port 8501):

streamlit run streamlit_app.py

Start the Admin Dashboard (Port 8502):

streamlit run admin_app.py --server.port 8502

Run Both Apps Simultaneously (PowerShell):

Start-Process -NoNewWindow powershell -ArgumentList "streamlit run streamlit_app.py"
Start-Process -NoNewWindow powershell -ArgumentList "streamlit run admin_app.py --server.port 8502"

Run Both Apps Simultaneously (Bash):

streamlit run streamlit_app.py &
streamlit run admin_app.py --server.port 8502 &

Access:

- Chat UI: http://localhost:8501
- Admin Dashboard: http://localhost:8502
All endpoints (except /health) require Bearer token authentication:
curl -H "Authorization: Bearer YOUR_TOKEN" http://localhost:8000/v1/status| Method | Endpoint | Description |
|---|---|---|
GET |
/ |
Service info (public) |
GET |
/health |
Health check with provider status (public) |
POST |
/v1/inference |
Main inference endpoint |
GET |
/v1/status |
Detailed gateway status |
POST |
/v1/providers/strategy/{strategy} |
Change selection strategy |
POST |
/v1/providers/reset |
Reset all providers |
GET |
/v1/monitoring/stats |
Comprehensive monitoring statistics |
GET |
/v1/monitoring/logs |
Recent log entries |
GET |
/v1/monitoring/health |
Lightweight health for polling (public) |
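The management endpoints can also be driven from Python; below is a small sketch using the requests package. The strategy identifier round_robin is an assumed value, so check the gateway for the exact strings it accepts (Round Robin / Random / Least Used).

```python
import requests

# Example of driving the management endpoints from Python (requests package).
# "round_robin" is an assumed strategy identifier, not a confirmed value.

BASE = "http://localhost:8000"
HEADERS = {"Authorization": "Bearer YOUR_TOKEN"}

# Detailed gateway status
status = requests.get(f"{BASE}/v1/status", headers=HEADERS, timeout=10).json()
print(status)

# Switch the load-balancing strategy on the fly
requests.post(f"{BASE}/v1/providers/strategy/round_robin",
              headers=HEADERS, timeout=10)
```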
curl -X POST http://localhost:8000/v1/inference \
-H "Authorization: Bearer YOUR_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"prompt": "Explain quantum entanglement",
"research_mode": true,
"stream": true,
"max_tokens": 2048,
"temperature": 0.7
}'

Streaming (SSE):
data: {"response": "Quantum "}
data: {"response": "entanglement "}
data: {"response": "is..."}
data: [DONE]
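A minimal Python client sketch for consuming this SSE stream, assuming the requests package (field names follow the examples above; error handling omitted):

```python
import json
import requests

# Minimal streaming client: POST the prompt, then read SSE lines until the
# [DONE] sentinel is received.

payload = {"prompt": "Explain quantum entanglement", "stream": True}
headers = {"Authorization": "Bearer YOUR_TOKEN"}

with requests.post("http://localhost:8000/v1/inference", json=payload,
                   headers=headers, stream=True, timeout=300) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data: "):
            continue                     # skip blank keep-alive lines
        data = line[len("data: "):]
        if data == "[DONE]":
            break                        # end-of-stream sentinel
        print(json.loads(data)["response"], end="", flush=True)
```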
Non-Streaming:
{
"response": "Quantum entanglement is...",
"provider": "Provider-Alpha"
}

Research Mode enables Retrieval-Augmented Generation (RAG): the system performs real-time web searches before generating responses, ensuring up-to-date information.
┌─────────────┐     ┌───────────────────┐     ┌──────────────────┐     ┌──────────────┐
│ User Query  │────▶│ DuckDuckGo Search │────▶│ Augmented Prompt │────▶│ LLM Response │
└─────────────┘     └───────────────────┘     └──────────────────┘     └──────────────┘
                              │                        │
                              ▼                        ▼
                      [Search Results]  [System Prompt + Results + Query]
- Real-Time Web Search: Queries DuckDuckGo for current information
- Source Citations: Results include titles, snippets, and URLs
- Intelligent Augmentation: Search results are injected into the system prompt
- Graceful Fallback: If search fails, uses base LLM knowledge
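As an illustrative sketch (not the actual search_tool.py implementation), the augmentation step might look like the following, assuming the duckduckgo_search package:

```python
from duckduckgo_search import DDGS

# Illustrative sketch of the augmentation step: fetch a few results and
# inject them into the prompt sent to the LLM.

def build_augmented_prompt(query: str, max_results: int = 3) -> str:
    try:
        results = DDGS().text(query, max_results=max_results)
    except Exception:
        results = []                     # graceful fallback: no search context

    context = "\n".join(
        f"- {r['title']}: {r['body']} ({r['href']})" for r in results
    )
    if not context:
        return query                     # base LLM knowledge only
    return (
        "Use the following web search results to answer accurately and cite "
        "sources where relevant.\n"
        f"{context}\n\nQuestion: {query}"
    )
```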
Query: "Who is the current President of Vietnam?"
- System searches DuckDuckGo for latest information
- Search results are formatted and added to the prompt
- LLM generates response using real-time data (2026)
{"prompt": "Who won the 2026 World Cup?", "research_mode": true}Toggle "Enable Research Mode" in the Streamlit sidebar. A π badge will appear indicating RAG is active.
# .env
RAG_MAX_RESULTS=3   # Number of search results to fetch

The Admin Dashboard (admin_app.py) provides real-time observability into your inference gateway, separate from the chat interface to avoid disrupting the user experience.
- Node Status: Visual indicators showing Active (green), Cooldown (yellow), Offline (red)
- Request Metrics: Total requests, success rates, and rate limit counts
- Request Distribution: Bar chart showing load distribution across providers
- Live Logs: Real-time streaming of gateway events (provider switching, errors, recoveries)
- Live System Logs (File): Persistent log viewer reading from the system.log file
- Auto-refresh: Dashboard updates every 10 seconds (only in admin app)
- Strategy Control: Change load balancing strategy on-the-fly
The Admin Dashboard requires two-factor authentication:
- ACCESS_TOKEN: Same API token used for the chat app
- ADMIN_PASSWORD: Additional password configured in .env

ADMIN_PASSWORD=your_secure_admin_password

If ADMIN_PASSWORD is not set, only the ACCESS_TOKEN is required.
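A hypothetical sketch of that credential check (the function name and logic are illustrative, not the actual admin_app.py code):

```python
import os

# Hypothetical sketch: the access token must always match, and the admin
# password is only enforced when ADMIN_PASSWORD is configured.

def admin_login_ok(token: str, password: str) -> bool:
    if token != os.environ.get("ACCESS_TOKEN", ""):
        return False
    admin_password = os.environ.get("ADMIN_PASSWORD")
    if admin_password:                   # second credential only if configured
        return password == admin_password
    return True                          # token alone is sufficient
```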
- Start the admin app: streamlit run admin_app.py --server.port 8502
- Open http://localhost:8502
- Enter your Access Token and Admin Password
- Click Login
# Get comprehensive stats
curl -H "Authorization: Bearer TOKEN" http://localhost:8000/v1/monitoring/stats
# Get recent logs
curl -H "Authorization: Bearer TOKEN" http://localhost:8000/v1/monitoring/logs?count=10
# Lightweight health check (for polling)
curl http://localhost:8000/v1/monitoring/health

When a provider returns HTTP 429 (rate limited), it enters cooldown:

COOLDOWN_MINUTES=30   # Default: 30 minutes

Nodes automatically recover after the cooldown period expires.
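A simplified sketch of this cooldown bookkeeping (names are illustrative, not the real provider_manager internals):

```python
import os
import time

# Simplified sketch: a 429 puts the node into cooldown for COOLDOWN_MINUTES,
# and it becomes selectable again once the window has passed.

COOLDOWN_SECONDS = int(os.environ.get("COOLDOWN_MINUTES", "30")) * 60
cooldowns: dict[str, float] = {}         # node name -> epoch time of recovery

def mark_rate_limited(node: str) -> None:
    cooldowns[node] = time.time() + COOLDOWN_SECONDS

def is_available(node: str) -> bool:
    return time.time() >= cooldowns.get(node, 0.0)
```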
# Cloud providers
CLOUDFLARE_ACCOUNT_1_ID=abc123...
CLOUDFLARE_ACCOUNT_1_TOKEN=xyz789...
CLOUDFLARE_ACCOUNT_1_NAME=Provider-Alpha
CLOUDFLARE_MODEL=@cf/meta/llama-3.1-8b-instruct
# Networking (Docker sets these automatically)
BACKEND_URL=http://backend:8000 # Service name in Docker
LOG_FILE=/app/logs/system.log # Shared volume path
# Local fallback
OLLAMA_ENDPOINT=http://host.docker.internal:11434   # Docker on Windows/Mac

Create providers.json from the template:
cp providers.example.json providers.json
# Edit with your credentials

For high availability, configure Ollama as a local fallback:
# Linux
curl -fsSL https://ollama.ai/install.sh | sh
# Pull Llama 3.1 (recommended for better accuracy)
ollama pull llama3.1

The gateway automatically configures Ollama for constrained environments:

- Token Limit: 1024 (configurable via OLLAMA_MAX_TOKENS)
- Single Concurrent Request: Prevents VRAM exhaustion
- CPU Thread Limit: Preserves system responsiveness
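As a sketch, a VRAM-conscious Ollama call might look like the following. num_predict and num_thread are standard Ollama generation options, but the exact knobs and values the gateway sets may differ, and the queueing that enforces a single concurrent request is not shown.

```python
import os
import requests

# Sketch of a constrained Ollama call: cap generated tokens with num_predict
# and limit CPU threads with num_thread.

OLLAMA = os.environ.get("OLLAMA_ENDPOINT", "http://localhost:11434")
MAX_TOKENS = int(os.environ.get("OLLAMA_MAX_TOKENS", "1024"))

resp = requests.post(
    f"{OLLAMA}/api/generate",
    json={
        "model": os.environ.get("OLLAMA_MODEL", "llama3.1"),
        "prompt": "Summarize the benefits of local inference.",
        "stream": False,
        "options": {"num_predict": MAX_TOKENS, "num_thread": 4},
    },
    timeout=300,
)
print(resp.json()["response"])
```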
| Component | Minimum | Recommended |
|---|---|---|
| CPU | 4 cores | 6+ cores |
| RAM | 8 GB | 16 GB |
| GPU (for fallback) | 4GB VRAM | 8GB+ VRAM |
| Network | Stable broadband | Low-latency connection |
Tested On: Acer Aspire 7 with RTX 3050 (4GB VRAM), Pop!_OS
- Never commit credentials: .env and providers.json are gitignored
- Rotate tokens regularly: Generate new access tokens periodically
- Use HTTPS in production: Deploy behind a reverse proxy (nginx/Caddy)
- Limit network exposure: Bind to localhost unless needed externally
"No available providers"
- Check your Cloudflare API tokens are valid
- Verify account IDs are correct
- Run /v1/providers/reset to clear cooldowns
"Local fallback failed"
- Ensure Ollama is running: ollama serve
- Check the model is pulled: ollama list
- Verify endpoint: curl http://localhost:11434/api/tags
"Connection refused"
- Start the API gateway: python api_gateway.py
- Check the port isn't in use: lsof -i :8000
The system uses centralized logging with both file and console output:
Log File Location:
- Local: ./system.log
- Docker: /app/logs/system.log (shared volume between backend and admin-ui)
Log Format:
2026-03-11 14:30:00 | INFO | api_gateway | Attempt 1: Using Provider-Alpha
2026-03-11 14:30:01 | WARNING | provider_manager | Node Provider-Alpha rate limited (429), cooldown: 30min
2026-03-11 14:30:01 | INFO | api_gateway | Attempt 2: Using Provider-Beta
2026-03-11 14:30:05 | INFO | provider_manager | Falling back to local Ollama
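A minimal logging configuration that produces this format, shown as a sketch using Python's standard logging module (not necessarily the gateway's exact setup):

```python
import logging

# Sketch of a logging setup that writes "timestamp | level | module | message"
# to both the console and system.log.

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s | %(levelname)s | %(name)s | %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
    handlers=[logging.StreamHandler(), logging.FileHandler("system.log")],
)

logging.getLogger("api_gateway").info("Attempt 1: Using Provider-Alpha")
```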
View Logs:
- Admin Dashboard: Expand the "Live System Logs (File)" section
- Docker: docker-compose logs -f backend
- Local: tail -f system.log
easyResearchAssistant/
├── api_gateway.py          # FastAPI backend (port 8000)
├── provider_manager.py     # Distributed provider orchestration
├── search_tool.py          # RAG web search utility (DuckDuckGo)
├── streamlit_app.py        # Chat UI (port 8501)
├── admin_app.py            # Admin monitoring dashboard (port 8502)
├── docker-compose.yml      # Docker multi-service configuration
├── Dockerfile              # Container image definition
├── requirements.txt        # Python dependencies
├── .env.template           # Configuration template
├── providers.example.json  # Provider config template (alternative)
├── system.log              # Centralized log file (auto-generated)
└── README.md               # This file
pytest tests/ -v

- Fork the repository
- Create a feature branch
- Submit a pull request
MIT License - see LICENSE for details.