
Local Models

Mo Abualruz edited this page Dec 3, 2025 · 1 revision

Local Models Guide

Status: ✅ Complete

Last Updated: December 3, 2025


Overview

Local models allow you to run AI models directly on your machine using Ollama, enabling offline development, enhanced privacy, and reduced costs. This guide covers everything you need to know about using local models with RiceCoder.

What is Ollama?

Ollama is an open-source tool that makes it easy to run large language models locally. Instead of sending your code and conversations to cloud providers, Ollama runs models on your own hardware, giving you complete control over your data.

Benefits of Local Models

Privacy: Your code and conversations never leave your machine. No data is sent to external servers.

Offline Development: Work without internet connectivity. Perfect for airplanes, trains, or areas with poor connectivity.

Cost Savings: No API fees. Run unlimited models after the initial download.

Customization: Fine-tune models for your specific use cases and domains.

Speed: Low-latency responses with no network round-trips (actual throughput depends on your hardware).

Control: Complete control over model versions and behavior.

When to Use Local Models

Use local models when:

  • You need privacy and data security
  • You work offline frequently
  • You want to reduce costs
  • You're developing locally and don't need cloud features
  • You want to experiment with different models

Use cloud providers when:

  • You need the latest, most powerful models
  • You want to minimize hardware requirements
  • You need guaranteed uptime and support
  • You're working in a team with shared infrastructure

Installation

Prerequisites

Before installing Ollama, ensure your system meets the requirements:

Hardware Requirements:

  • Minimum: 4GB RAM (8GB recommended)
  • GPU (optional but recommended): NVIDIA, AMD, or Apple Silicon for faster inference
  • Disk Space: 10-50GB depending on models you want to run

System Requirements:

  • Windows 10/11, macOS 11+, or Linux
  • Administrator/sudo access for installation

Windows Installation

Step 1: Download Ollama

  1. Visit https://ollama.ai
  2. Click "Download for Windows"
  3. Run the installer (OllamaSetup.exe)

Step 2: Install

  1. Follow the installation wizard
  2. Accept the license agreement
  3. Choose installation location (default: C:\Users\{username}\AppData\Local\Programs\Ollama)
  4. Click "Install"

Step 3: Verify Installation

Open PowerShell and run:

ollama --version

You should see the version number.

Step 4: Start Ollama

Ollama runs as a background service on Windows. It starts automatically after installation.

To verify it's running:

curl http://localhost:11434/api/tags

You should get a JSON response.
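The exact contents depend on which models you have pulled; with none pulled, Ollama returns `{"models":[]}`. A trimmed example response (fields abbreviated, values illustrative):

```json
{
  "models": [
    {
      "name": "mistral:latest",
      "size": 4109000000,
      "modified_at": "2025-12-03T10:00:00Z"
    }
  ]
}
```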

macOS Installation

Step 1: Download Ollama

  1. Visit https://ollama.ai
  2. Click "Download for macOS"
  3. Choose the appropriate version:
    • Apple Silicon (M1/M2/M3): ollama-darwin-arm64.zip
    • Intel: ollama-darwin-amd64.zip

Step 2: Install

  1. Open the downloaded archive
  2. Drag Ollama to the Applications folder
  3. Wait for the copy process to complete

Step 3: Verify Installation

Open Terminal and run:

ollama --version

You should see the version number.

Step 4: Start Ollama

Ollama runs as a background service on macOS. Start it with:

ollama serve

Or launch Ollama from the Applications folder (for example via Spotlight).

Linux Installation

Step 1: Download and Install

Open Terminal and run:

curl https://ollama.ai/install.sh | sh

This script will:

  • Download Ollama
  • Install it to /usr/local/bin/ollama
  • Set up systemd service

Step 2: Verify Installation

ollama --version

You should see the version number.

Step 3: Start Ollama

Start the Ollama service:

sudo systemctl start ollama

Enable it to start on boot:

sudo systemctl enable ollama

Verify it's running:

curl http://localhost:11434/api/tags

You should get a JSON response.

Model Management

Pulling Models

Before you can use a model, you need to pull it from the Ollama library.

Pull a Model

ollama pull mistral

This downloads the Mistral model (about 4GB). The first pull takes time depending on your internet speed.

Available Models

Popular models available on Ollama:

Lightweight Models (fast, low memory):

  • mistral - 7B parameters, fast and efficient
  • neural-chat - 7B parameters, optimized for chat
  • orca-mini - 3B parameters, very lightweight

Balanced Models (good quality, moderate resources):

  • llama2 - 7B parameters, general purpose
  • dolphin-mixtral - 8x7B mixture of experts

Powerful Models (high quality, high resource):

  • mistral-large - 34B parameters, very capable
  • llama2-uncensored - 7B parameters, less restricted
  • neural-chat-7b-v3 - 7B parameters, latest version

Specialized Models:

  • codeup - Optimized for code generation
  • phind-codellama - Code-focused model
  • sqlcoder - SQL generation

For a complete list, visit https://ollama.ai/library

Pull Multiple Models

ollama pull mistral
ollama pull llama2
ollama pull neural-chat

Listing Models

View all models you've downloaded:

ollama list

Output example:

NAME                    ID              SIZE    MODIFIED
mistral:latest          2ae6f6dd4470    4.1 GB  2 minutes ago
llama2:latest           78e26419b446    3.8 GB  1 hour ago
neural-chat:latest      42ab0d3f4b51    4.1 GB  3 hours ago
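The same inventory is available programmatically from Ollama's `/api/tags` endpoint, which returns a JSON object containing a `models` array. A minimal sketch that extracts model names from such a response (the sample payload below is illustrative, not real output):

```python
import json

# Trimmed example of an /api/tags response (illustrative values).
SAMPLE_TAGS = json.loads("""
{
  "models": [
    {"name": "mistral:latest", "size": 4109000000},
    {"name": "llama2:latest", "size": 3800000000}
  ]
}
""")

def local_model_names(tags_response: dict) -> list:
    """Return the names of all models in the local Ollama store."""
    return [m["name"] for m in tags_response.get("models", [])]

print(local_model_names(SAMPLE_TAGS))  # ['mistral:latest', 'llama2:latest']
```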

Removing Models

Delete a model to free up disk space:

ollama rm mistral

Remove multiple models at once by listing their names:

ollama rm mistral llama2

Updating Models

Pull the latest version of a model:

ollama pull mistral

Ollama automatically downloads the latest version if available.

Configuring RiceCoder for Ollama

Step 1: Ensure Ollama is Running

First, make sure Ollama is running on your system:

# All platforms
curl http://localhost:11434/api/tags

If you get a connection error, start Ollama:

# macOS
ollama serve

# Linux
sudo systemctl start ollama

# Windows - Ollama runs as a service automatically

Step 2: Configure RiceCoder

Edit your RiceCoder configuration file:

Global Configuration (~/.ricecoder/config.yaml):

provider: ollama
ollama-url: http://localhost:11434
model: mistral

Project Configuration (.agent/config.yaml):

provider: ollama
model: mistral

Step 3: Verify Configuration

Check that RiceCoder can connect to Ollama:

rice config show

You should see:

provider: ollama
ollama-url: http://localhost:11434
model: mistral

Step 4: Test the Connection

Start a chat session:

rice chat

Type a message and press Enter. If everything is configured correctly, you should get a response from the local model.
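Under the hood, RiceCoder talks to Ollama over its local HTTP API, so you can also exercise that API directly to confirm Ollama itself is responding, independent of RiceCoder. A minimal sketch using only the Python standard library (`/api/generate` and this request shape are Ollama's documented API; `stream: false` asks for one complete JSON reply instead of streamed chunks):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default Ollama endpoint

def build_generate_payload(model: str, prompt: str) -> dict:
    """Request body for Ollama's /api/generate endpoint."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str) -> str:
    """Send a prompt to the local Ollama server and return its reply text."""
    body = json.dumps(build_generate_payload(model, prompt)).encode()
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

if __name__ == "__main__":
    print(generate("mistral", "What is Rust?"))
```

If this script returns text, Ollama is healthy and any remaining problem lies in the RiceCoder configuration.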

Configuration Options

provider

Set to ollama to use local models:

provider: ollama

ollama-url

URL where Ollama is running (default: http://localhost:11434):

ollama-url: http://localhost:11434

If Ollama is on a different machine:

ollama-url: http://192.168.1.100:11434

model

Which model to use:

model: mistral

Other options:

model: llama2
model: neural-chat
model: codeup

ollama-timeout

Timeout for Ollama requests in seconds (default: 300):

ollama-timeout: 600  # 10 minutes for complex tasks
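Putting the options together, a complete global configuration for local-only development might look like this (values illustrative):

```yaml
# ~/.ricecoder/config.yaml
provider: ollama
ollama-url: http://localhost:11434
model: mistral
ollama-timeout: 600  # generous limit for slower hardware
```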

Usage Examples

Basic Chat

Start a chat session with your local model:

rice chat

Then type your questions:

> What is Rust?

The model will respond with information about Rust.

Code Generation

Generate code using your local model:

rice chat
> Generate a Rust function that validates email addresses

The model will generate code based on your request.

Code Review

Ask the model to review code:

rice chat
> Review this function for potential issues:
> 
> fn process_data(data: &str) -> String {
>     data.to_uppercase()
> }

Spec-Driven Development

Use local models with specs:

rice spec create my-feature
rice spec design my-feature
rice gen my-feature

The generation will use your local model instead of cloud providers.

Performance Optimization

Hardware Considerations

CPU-Only Performance:

  • Mistral 7B: ~5-10 tokens/second
  • Llama2 7B: ~3-8 tokens/second
  • Smaller models: ~10-20 tokens/second

GPU Performance (NVIDIA):

  • Mistral 7B: ~50-100 tokens/second
  • Llama2 7B: ~40-80 tokens/second
  • Larger models: ~20-50 tokens/second
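These throughput figures translate directly into wait time for a reply: divide the reply length in tokens by the generation rate. A quick sketch (the rates are the rough figures above, not benchmarks):

```python
def response_seconds(reply_tokens: int, tokens_per_second: float) -> float:
    """Approximate wall-clock time to generate a reply of the given length."""
    return reply_tokens / tokens_per_second

# A ~400-token answer from a 7B model:
print(response_seconds(400, 8))   # CPU-only at ~8 tok/s: 50.0 seconds
print(response_seconds(400, 80))  # GPU at ~80 tok/s: 5.0 seconds
```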

Optimization Tips

1. Use Smaller Models for Speed

If speed is critical, use smaller models:

model: orca-mini  # 3B, very fast

2. Increase Timeout for Complex Tasks

For complex code generation, increase the timeout:

ollama-timeout: 900  # 15 minutes

3. Use GPU Acceleration

If you have an NVIDIA GPU, Ollama uses it automatically. Verify which processor a loaded model is using:

ollama ps

The PROCESSOR column shows GPU when acceleration is active.

4. Adjust Model Parameters

For faster responses, use fewer tokens:

rice chat --max-tokens 500

5. Run Ollama on a Dedicated Machine

For team environments, run Ollama on a dedicated server:

ollama-url: http://ollama-server.local:11434

Memory Management

Monitor Memory Usage:

# macOS/Linux
top  # Look for ollama process

# Windows
tasklist | findstr ollama

Reduce Memory Usage:

  1. Use smaller models
  2. Reduce context window
  3. Run only one model at a time

Free Up Memory:

ollama rm model-name

Troubleshooting

"Connection refused" Error

Problem: RiceCoder can't connect to Ollama

Solution:

  1. Check if Ollama is running:
curl http://localhost:11434/api/tags
  2. If it's not running, start Ollama:
# macOS
ollama serve

# Linux
sudo systemctl start ollama

# Windows - restart the Ollama service
  3. Check the URL in configuration:
rice config get ollama-url
  4. If using a different URL, update it:
rice config set ollama-url http://your-ollama-url:11434

"Model not found" Error

Problem: The specified model doesn't exist

Solution:

  1. Check available models:
ollama list
  2. Pull the model:
ollama pull mistral
  3. Update configuration:
rice config set model mistral

"Out of memory" Error

Problem: Ollama runs out of memory

Solution:

  1. Use a smaller model:
ollama pull orca-mini
rice config set model orca-mini
  2. Increase system memory or swap
  3. Close other applications
  4. Reduce context window size

"Timeout" Error

Problem: Ollama takes too long to respond

Solution:

  1. Increase the timeout:
ollama-timeout: 900  # 15 minutes
  2. Use a faster model:
ollama pull mistral
rice config set model mistral
  3. Check system resources (CPU, memory, disk)
  4. Reduce model size or complexity

"Slow Response" Issue

Problem: Model responses are very slow

Solution:

  1. Check if the GPU is being used:
ollama ps
The PROCESSOR column shows whether a loaded model is on the GPU.
  2. If CPU-only, consider:

    • Using a smaller model
    • Getting a GPU
    • Running Ollama on a more powerful machine
  3. Monitor system resources:

# macOS/Linux
top

# Windows
tasklist
  4. Close other applications to free resources

"Model Download Fails"

Problem: Can't download a model

Solution:

  1. Check internet connection:
ping ollama.ai
  2. Try again (it may have been a temporary network issue):
ollama pull mistral
  3. Check disk space:
# macOS/Linux
df -h

# Windows
dir C:\
  4. Try a different model:
ollama pull llama2

"Ollama Won't Start"

Problem: Ollama service won't start

Solution:

  1. Check if port 11434 is in use:
# macOS/Linux
lsof -i :11434

# Windows
netstat -ano | findstr :11434
  2. If in use, kill the process or change the port
  3. Restart Ollama:
# macOS
killall ollama
ollama serve

# Linux
sudo systemctl restart ollama

# Windows
# Restart the Ollama service from Services
  4. Check logs for errors:
# macOS/Linux
tail -f ~/.ollama/logs/server.log

# Windows
# Check Event Viewer

Best Practices

1. Start with Mistral

Mistral is a good balance of speed and quality. Start here:

model: mistral

2. Use Project Configuration

Store model preferences in .agent/config.yaml:

provider: ollama
model: mistral

3. Monitor Resource Usage

Keep an eye on CPU, memory, and disk:

# macOS/Linux
top

# Windows
tasklist

4. Keep Models Updated

Periodically pull the latest versions:

ollama pull mistral

5. Document Model Choices

Explain why you chose a specific model:

# .agent/config.yaml
# Using Mistral for fast local development
# Switch to OpenAI for production
provider: ollama
model: mistral

6. Test Before Committing

Always test generated code before committing:

rice chat
> Generate a function to validate emails
# Review the generated code
# Test it locally
# Then commit

7. Use Appropriate Models for Tasks

  • Code generation: codeup, phind-codellama
  • General chat: mistral, neural-chat
  • Lightweight: orca-mini
  • Powerful: mistral-large, llama2-uncensored
