Skip to content

Privacy-first local AI model comparison platform with blind evaluation, per-model hyperparameters, and multi-configuration testing. Compare 2-6 models side-by-side through Ollama with zero cloud dependencies.

License

Notifications You must be signed in to change notification settings

sammy995/Local-LLM-Arena

Repository files navigation

Local LLM Arena β€” Local Multi-Model AI Comparison Platform

A privacy-first, local-only web application for side-by-side evaluation of multiple AI models via Ollama, with blind evaluation, per-model hyperparameters, and exportable results.

Version 3.0.0 β€” Advanced Model Configuration & Blind Testing

Note: This application uses Ollama (Β© Ollama, Inc.) as the local inference engine. Ollama is a separate product with its own MIT License. All models run locally on your machine through Ollama.


🎯 Problem Statement

Privacy and Compliance Constraints in AI Evaluation

Organizations and individuals in regulated industries (healthcare, finance, legal, government) face significant barriers when evaluating AI models:

  • Data Sensitivity: Sending proprietary or sensitive data to cloud-based comparison platforms (ChatGPT Arena, LMSys) violates privacy policies and compliance requirements (GDPR, HIPAA, SOX)
  • API Key Risk: Cloud tools require API keys, creating security exposure and audit complexity
  • No Control Over Data: Once data leaves local systems, there's no guarantee of deletion, non-training use, or compliance with data residency laws
  • Vendor Lock-in: Cloud platforms control access, pricing, and model availability

Limitations of Cloud-Based Comparison Tools

Existing solutions like ChatGPT Arena and LMSys are excellent for public benchmarking, but fail for:

  • Private datasets: Cannot evaluate models on proprietary or confidential information
  • Offline environments: Require internet connectivity and external dependencies
  • Reproducibility: No control over model versions, parameters, or state persistence
  • Cost: API usage fees accumulate quickly during evaluation campaigns
  • Auditability: No local logs or exportable artifacts for compliance reporting

Ollama Arena solves this by bringing multi-model evaluation entirely to your local machine, with zero external API calls and complete data sovereignty.


✨ Features

1. Side-by-Side Comparison

Benchmark multiple models at once. Witness real-time generation and compare how different architectures handle the same prompt across independent context windows.

Compare LLMs Demo

2. Multi-Model Blind Testing

Eliminate brand bias. In Blind Mode, model names are hidden, allowing you to vote for the best response based purely on the quality and accuracy of the outputβ€”essential for objective model evaluation.

Blind Mode Demo

3. Audit-Ready History & Export

Every session is documented. Export your chat history and model performance metrics into structured JSON files for internal audits, compliance reporting, or further RAG analysis.

Export History Demo

4. Audit-Ready History & Export

Every session is documented. Export your chat history and model performance metrics into structured JSON files for internal audits or compliance reporting.

Export History Demo


πŸ—οΈ System Overview

Local-First Architecture

Ollama Arena is a 100% local, zero-cloud Flask web application that orchestrates multiple AI models through Ollama:

  • No API Keys Required: All models run locally via Ollama's inference engine
  • No Internet Dependency: Works completely offline (after initial model downloads)
  • Full Data Control: Conversations never leave your machine
  • Browser-Based UI: Modern, responsive interface accessible at http://127.0.0.1:7860

Multi-Model Orchestration with Advanced Configuration

  • Arena Mode: Send identical prompts to 2-6 models simultaneously for blind comparison
  • Blind Evaluation: Hide model identities to eliminate bias (Model A, B, C labels)
  • Per-Model Hyperparameters: Configure 6 parameters independently per model instance
    • temperature, top_p, top_k, repeat_penalty, num_predict, seed
  • Multi-Configuration Testing: Compare same model with different parameter sets
  • Single Model Mode: Interactive chat with one model at a time
  • Dynamic Model Switching: Start in arena mode, continue conversations with individual models
  • Real-Time Streaming: See responses as they're generated, character by character

Persistent State and Exportable Artifacts

  • Conversation Export: Download full chat histories as JSON with timestamps and metadata
  • Blind Mode Export: Privacy-preserving exports with masked model names until revealed
  • Session Persistence: Conversations survive browser refreshes (in-memory state)
  • Model Metadata Tracking: Records model names, hyperparameters, response times, and token counts
  • Voting System: Like/dislike responses in blind mode for unbiased evaluation
  • Audit Trail: Complete logs available in logger.py output for compliance verification

✨ Key Features (v3.0.0)

🎭 Blind Evaluation Mode

Eliminate bias in model comparison by hiding model identities during evaluation.

  • Anonymous Labels: Models displayed as "Model A", "Model B", "Model C"
  • Randomized Order: Display order randomized to prevent position bias
  • Voting System: πŸ‘/πŸ‘Ž buttons for each response (blind to which model)
  • Model Reveal: Unlock actual identities with detailed statistics
    • Shows model names, hyperparameters, and vote counts
    • Locks voting after reveal to preserve integrity
  • Privacy-Preserving Export: Masked model names in JSON until reveal
    • Filename gets _blind suffix for clarity
    • Full mapping included after reveal

Use Cases:

  • Unbiased benchmarking without brand perception
  • Team evaluations where model choices are debated
  • A/B testing without preconceived notions
  • Educational settings to teach critical evaluation

βš™οΈ Per-Model Hyperparameters

Fine-tune each model instance independently with 6 Ollama-supported parameters:

Parameter Range Default Description
temperature 0.01-2.0 0.7 Controls randomness (low = deterministic, high = creative)
top_p 0-1 0.9 Nucleus sampling (cumulative probability cutoff)
top_k 0-100 40 Limits token choices to top K candidates
repeat_penalty 1.0-2.0 1.1 Penalizes repetitive text (higher = more diverse)
num_predict -1 to 4096 -1 Max tokens to generate (-1 = unlimited)
seed 0+ 0 For reproducible outputs (0 = random)

Visual Display: Model chips show all parameters inline:

  • Core params always visible: gemma3:1b (T=0.7 P=0.9 K=40)
  • Advanced params shown when non-default: + R=1.5 M=500 S=42

Persistence: Hyperparameters saved per model instance, survive refreshes

πŸ”„ Multi-Configuration Testing

Compare the same model with different parameter sets to optimize performance:

  • Unique Instances: gemma3:1b at T=0.1, T=0.7, T=2.0 treated as separate models
  • Deterministic IDs: Instance ID = {model}__{temp}_{top_p}_{top_k}_{repeat}_{predict}_{seed}
    • Example: gemma3_1b__0.7_0.9_40_1.1_-1_0
  • Prevents Duplicates: Cannot add identical model+param combos twice
  • Visual Distinction: Each instance shown as separate chip with parameters

Use Cases:

  • Find optimal temperature for creative vs. factual tasks
  • Test impact of top_k on response diversity
  • Discover best repeat_penalty for long-form content
  • Compare deterministic (seed set) vs. random outputs

πŸ“Š Enhanced Model Reveal

When you click "Reveal All Models" in blind mode:

Blind Label Actual Model Hyperparameters πŸ‘ Likes πŸ‘Ž Dislikes
Model A gemma3:1b T=0.7 P=0.9 K=40 R=1.1 M=-1 S=0 3 1
Model B qwen2.5:3b T=0.5 P=0.8 K=30 R=1.2 M=500 S=42 5 0
Model C llama3.2:3b T=1.0 P=1.0 K=50 R=1.0 M=-1 S=0 2 2
  • Hyperparameters Column: Shows full configuration for each instance
  • Vote Counts: Aggregate likes/dislikes across all responses
  • Makes Distinguishing Easy: See which parameter set performed best

πŸ“Š Architecture Diagram

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                     BROWSER (localhost:7860)                    β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚  Web UI (templates/index.html + static/app.js)          β”‚   β”‚
β”‚  β”‚  β€’ Arena mode (multi-model) vs Single mode              β”‚   β”‚
β”‚  β”‚  β€’ Blind evaluation with voting                         β”‚   β”‚
β”‚  β”‚  β€’ Per-model hyperparameter controls (6 params)         β”‚   β”‚
β”‚  β”‚  β€’ Real-time streaming display                          β”‚   β”‚
β”‚  β”‚  β€’ Copy, export, regenerate controls                    β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                          β”‚ HTTP/SSE (local only)
                          β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚             FLASK APP (web_chat.py + app/)                      β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚  Routes & API (app/routes.py, app/api_routes.py)        β”‚  β”‚
β”‚  β”‚  β€’ /chat (arena mode) - broadcasts to N models          β”‚  β”‚
β”‚  β”‚  β€’ Supports model_instances with hyperparameters        β”‚  β”‚
β”‚  β”‚  β€’ /single_chat - single model streaming                β”‚  β”‚
β”‚  β”‚  β€’ /export - JSON conversation download                 β”‚  β”‚
β”‚  β”‚  β€’ /models - list available Ollama models               β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”‚                       β”‚                                         β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚  Ollama Service (app/ollama_service.py)                 β”‚  β”‚
β”‚  β”‚  β€’ Model validation & health checks                     β”‚  β”‚
β”‚  β”‚  β€’ Streaming response handlers                          β”‚  β”‚
β”‚  β”‚  β€’ Hyperparameter passthrough to Ollama API             β”‚  β”‚
β”‚  β”‚  β€’ Error recovery & retry logic                         β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                          β”‚ localhost:11434 (Ollama API)
                          β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    OLLAMA (Local Process)                       β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚  Model Inference Engine                                  β”‚  β”‚
β”‚  β”‚  β€’ llama3.2, qwen2.5, gemma2, etc.                       β”‚  β”‚
β”‚  β”‚  β€’ GPU acceleration (if available)                       β”‚  β”‚
β”‚  β”‚  β€’ Model parameter control (temp, top_p, top_k, etc.)    β”‚  β”‚
β”‚  β”‚  β€’ Accepts options: repeat_penalty, num_predict, seed    β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                          β”‚ Local filesystem only
                          β–Ό
               [Models stored in ~/.ollama/models/]

═══════════════════════════════════════════════════════════════════
                    🚫 NO CLOUD BOUNDARY 🚫
         ALL PROCESSING HAPPENS ON YOUR LOCAL MACHINE
         NO DATA TRANSMITTED TO EXTERNAL SERVICES
═══════════════════════════════════════════════════════════════════

Key Data Flows:

  1. User sends prompt via browser β†’ Flask backend receives it locally
  2. User configures hyperparameters per model instance in UI
  3. Flask routes to arena (multi-model) or single-model handler
  4. Backend extracts model_instances array with params, passes options to Ollama
  5. Ollama service streams responses from local Ollama instance with configured params
  6. Results streamed back to browser in real-time via Server-Sent Events
  7. Blind mode masks model identities, voting system tracks preferences
  8. Export function serializes conversation to JSON (masked or revealed based on state)

🧠 Key Design Decisions

Why Local-First?

Privacy is non-negotiable for many use cases:

  • Evaluating models on confidential business data (contracts, financials, customer records)
  • Testing with PII (personal identifiable information) in regulated sectors
  • Research with sensitive datasets (medical records, legal documents)
  • Competitive analysis using proprietary information

Cost and control:

  • No per-token API fees (models run on your hardware)
  • No rate limits or quotas
  • Complete control over model versions and parameters
  • Works in air-gapped or restricted network environments

Why Blind Evaluation?

Human evaluation of AI quality is highly subjective and influenced by brand perception:

  • Eliminates Bias: Without knowing which model is which, evaluators judge purely on output quality
  • Surprises: Smaller models often outperform larger ones on specific tasks
  • Team Consensus: Resolves debates by letting results speak for themselves
  • Educational: Teaches critical evaluation without preconceived notions

Why Per-Model Hyperparameters?

Different tasks require different parameter sets:

  • Creative Writing: High temperature (1.5+), high top_p (0.95+)
  • Code Generation: Low temperature (0.1-0.3), low top_k (10-20)
  • Factual Q&A: Medium temperature (0.5-0.7), repeat_penalty (1.2+)
  • Reproducibility: Set seed > 0 for deterministic outputs

Flexibility: Run same model at multiple temperatures to find optimal setting for your use case.

Why Human-in-the-Loop Evaluation?

Automated metrics (perplexity, BLEU, F1) fail to capture real-world usefulness:

  • Context matters: A "wrong" answer may be more helpful than a "correct" but pedantic one
  • Tone and empathy: Critical for customer service, healthcare, education use cases
  • Domain expertise: Only humans can judge accuracy in specialized fields (law, medicine, engineering)
  • Safety and ethics: Automated tools miss subtle biases, offensive content, or dangerous advice

Ollama Arena enables rapid human evaluation:

  • See 2-6 model responses instantly
  • Compare side-by-side in real-time
  • Export conversations for team review or compliance audits

Why Persistent State Matters

Reproducibility and accountability:

  • Audit trails: Export conversations with timestamps for compliance reporting
  • Iterative evaluation: Refine prompts and re-test without losing history
  • Team collaboration: Share exported JSON for peer review
  • Long-term tracking: Monitor model performance over time as versions change

Technical robustness:

  • In-memory session state survives browser refreshes
  • Graceful error handling prevents conversation data loss
  • Structured JSON exports enable integration with analysis tools

πŸ€– AI-Assisted QA Workflow

This project was developed using AI-assisted coding and testing (GitHub Copilot, ChatGPT), demonstrating the powerβ€”and limitationsβ€”of AI-driven development.

How the Testing Agent Was Used

  1. Test Case Generation: AI generated comprehensive test scenarios for arena mode, single mode, error handling, and edge cases
  2. Bug Discovery Automation: AI systematically tested all UI features (copy, export, regenerate, model switching) and documented failures
  3. Fix Implementation: AI proposed code patches for 6 critical bugs, which were reviewed and applied
  4. Regression Testing: After fixes, AI re-validated all workflows to ensure no new breaks

What It Caught

βœ… 6 Critical Bugs Fixed:

  • Missing "Copy Response" button in single-model mode
  • "Continue with One Model" feature not working in arena mode
  • Regenerate button triggering errors on edge cases
  • Export function missing metadata fields
  • Model switching race conditions
  • UI state inconsistencies after rapid interactions

βœ… Code Quality Issues:

  • Inconsistent error handling in streaming responses
  • Missing input validation for model selection
  • Unhandled edge cases in conversation history management

What It Didn't Catch

❌ User Experience Nuances:

  • Confusing button labels (AI didn't recognize UX friction)
  • Suboptimal response rendering for long outputs (required manual CSS tuning)
  • Accessibility issues (keyboard navigation, screen reader support)

❌ Performance Edge Cases:

  • Memory leaks with 50+ message conversations (found during manual stress testing)
  • Streaming lag with slow models (AI couldn't simulate real latency)

❌ Domain-Specific Validation:

  • Model response quality assessment (AI can't judge "good" vs. "bad" answers)
  • Privacy/security review (e.g., ensuring no telemetry or logging of sensitive data)

Why Human Oversight Remains Essential

AI is a powerful co-pilot, not an autopilot:

  • AI-generated tests are narrow and literal (miss creative edge cases)
  • AI cannot evaluate subjective quality (UX, tone, usefulness)
  • AI lacks contextual judgment (business requirements, compliance needs)
  • Final accountability rests with humans (code reviews, security audits, production deployment decisions)

Hybrid workflow works best:

  1. Use AI for initial scaffolding, boilerplate, and systematic testing
  2. Apply human judgment for architecture, UX, security, and domain correctness
  3. Iterate: AI proposes, human refines, AI validates, human approves

🎯 Use Cases

1. Research

  • Academic ML Research: Compare model architectures on standardized benchmarks without cloud API costs
  • Prompt Engineering: Rapidly iterate on prompt designs and see how different models respond
  • Model Selection: Evaluate which open-source model (Llama, Qwen, Gemma) best fits your task

2. Regulated Environments

  • Healthcare: Test AI assistants on synthetic patient data locally (HIPAA compliance)
  • Finance: Evaluate models on financial reports without violating SOX/GDPR
  • Legal: Compare legal reasoning models on case files (attorney-client privilege)
  • Government: Air-gapped evaluation in secure environments

3. Enterprise Pre-Deployment Evaluation

  • Model Vetting: Test multiple models on real-world tasks before committing to vendor contracts
  • Cost-Benefit Analysis: Compare cloud API models (via local proxies) vs. self-hosted options
  • Team Alignment: Export conversation samples for stakeholder review before production deployment
  • Risk Assessment: Identify biases, hallucinations, or safety issues in candidate models

πŸš€ How to Run

Prerequisites

  • Python 3.8+
  • Ollama CLI installed (ollama.ai)
  • At least one model pulled (e.g., ollama pull llama3.2)

Installation

  1. Create & activate a virtual environment:
python -m venv .venv
.\.venv\Scripts\Activate.ps1  # Windows PowerShell
# or: source .venv/bin/activate (Linux/Mac)
  1. Install dependencies:
pip install -r requirements.txt
  1. Pull Ollama models (if not already done):
ollama pull llama3.2
ollama pull qwen2.5:latest
ollama pull gemma2:9b

Running the Application

Option 1: Web UI (Recommended)

python .\web_chat.py

Open http://127.0.0.1:7860 in your browser.

Option 2: Terminal Chat

python .\Chatbot.py

What's Included


πŸ“š How to Cite

If you use Ollama Arena in your research, please cite:

@software{ollama_arena_2026,
  author       = {Lagad, Shubham},
  title        = {Ollama Arena: Local-First Multi-Model AI Comparison Platform},
  year         = {2026},
  month        = {January},
  url          = {https://github.com/yourusername/ollama-arena},
  note         = {Privacy-first side-by-side evaluation of local AI models via Ollama},
  license      = {MIT}
}

Key Features to Highlight in Citations:

  • Local-first architecture (zero cloud dependencies)
  • Multi-model orchestration for blind comparison
  • Human-in-the-loop evaluation workflows
  • Compliance-friendly (GDPR, HIPAA, SOX)

πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.

Copyright Β© 2026 Shubham Lagad

Third-Party Software Notice

This software uses Ollama (Β© Ollama, Inc.), which is a separate product with its own license terms. Ollama is not included in this MIT License. Please refer to Ollama's license for details.


🀝 Contributing

We welcome contributions! Please see CONTRIBUTING.md for:

  • Code of Conduct: Respectful, inclusive collaboration
  • Development Setup: How to fork, clone, and set up dev environment
  • Coding Standards: Python style guide, linting rules, test requirements
  • Pull Request Process: Branch naming, commit messages, review workflow
  • Issue Guidelines: Bug reports, feature requests, documentation improvements

Quick Start for Contributors:

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/amazing-feature
  3. Make changes and test locally
  4. Run linter: pip install -r requirements-dev.txt && flake8
  5. Submit PR with clear description of changes

Priority Areas for Contribution:

  • πŸ”§ Accessibility improvements (ARIA labels, keyboard navigation)
  • 🎨 UI/UX enhancements (dark mode, mobile responsiveness)
  • πŸ“Š Export formats (CSV, Markdown, PDF)
  • πŸ§ͺ Automated testing (pytest suite expansion)
  • πŸ“– Documentation (tutorials, video guides, translations)

πŸ“ž Support


Made with ❀️ for privacy-conscious AI practitioners

About

Privacy-first local AI model comparison platform with blind evaluation, per-model hyperparameters, and multi-configuration testing. Compare 2-6 models side-by-side through Ollama with zero cloud dependencies.

Topics

Resources

License

Contributing

Security policy

Stars

Watchers

Forks

Packages