llama-gradio-ui

A local chat UI for llama.cpp server with Gradio 6, MCP tool calling, and optional privacy processing.

What You Can Do

Start and stop llama-server from the UI and load models from a GGUF folder or custom path
Stream chat responses from the OpenAI-compatible /v1/chat/completions endpoint
Connect MCP servers over stdio, SSE, or HTTP and let the model call tools
Run a two-step privacy flow: redact PII with Presidio, then restyle text locally

Requirements

Python 3.10+
llama-server / llama-server.exe from llama.cpp
At least one GGUF model file
uv (recommended) or pip

This repository is a uv project (pyproject.toml + uv.lock).

Quick Start

Clone and enter the project

git clone https://github.com/RoyCoding8/llama-gradio-ui.git
cd llama-gradio-ui

Create .env from the example
```
cp .env.example .env
```

Update .env with your local paths and defaults

LLAMA_SERVER_DIR=C:\path\to\llama-cpp-build
GGUF_DIR=C:\path\to\models

GPU_LAYERS=-1
CTX_SIZE=4096
KV_CACHE_TYPE_K=f16
KV_CACHE_TYPE_V=f16

Install dependencies

uv sync

Or with pip:

pip install -e .
python -m spacy download en_core_web_lg

Run the app
```
uv run python app.py
```
On Windows you can also use start.bat.

By default, the UI is available at http://127.0.0.1:7860.

Configuration

Key values in .env:

LLAMA_HOST, LLAMA_PORT: target llama-server host and port
LLAMA_SERVER_DIR: directory that contains llama-server
GGUF_DIR: directory scanned for .gguf models
UI_HOST, UI_PORT, UI_SHARE: Gradio host, port, and public share mode
CTX_SIZE, GPU_LAYERS: default runtime settings for llama-server
KV_CACHE_TYPE_K, KV_CACHE_TYPE_V: KV cache quantization settings
ALLOW_REMOTE_TOOLS: when UI_SHARE=1, set this to 1 only if you explicitly want remote tool execution

MCP Server Setup

You can add MCP servers in the UI or edit mcp_servers.json directly.

{
  "servers": {
    "filesystem": {
      "name": "filesystem",
      "transport": "stdio",
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "C:/docs"],
      "enabled": true,
      "autostart": false
    }
  }
}

The MCP tab also accepts Claude/Cursor-format imports.

Tool-Calling Flow

The model receives chat history plus OpenAI-format tool schemas from connected MCP servers
If it emits tool calls, the app dispatches them to MCP and records results
Tool results are appended to conversation context
The model runs again, up to 5 rounds
Final output streams back to the chat UI

Thinking Mode Notes

The Chat tab Think toggle is forwarded to llama.cpp on each request using:
- reasoning: "on" | "off"
- reasoning_budget: -1 | 0
- chat_template_kwargs: {"enable_thinking": true | false}
Some Qwen3.5 + llama.cpp builds have known upstream issues where thinking control can be inconsistent.
If Think: OFF still hangs or emits reasoning on your build, update llama.cpp and prefer server startup with explicit reasoning flags (for example --reasoning off).

Project Structure

File	Purpose
`app.py`	Application entry point and Gradio wiring
`server_runtime.py`	`llama-server` process lifecycle and model discovery
`chat_engine.py`	Streaming chat and MCP tool-call loop
`mcp_manager.py`	Async MCP client manager and server connections
`mcp_facade.py`	UI-facing MCP actions and response formatting
`privacy_shield.py`	PII redaction and local restyling flow
`config.py`	Environment and `.env` parsing
`style.css`	UI styling

License

Apache 2.0

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

llama-gradio-ui

What You Can Do

Requirements

Quick Start

Configuration

MCP Server Setup

Tool-Calling Flow

Thinking Mode Notes

Project Structure

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
.github/workflows		.github/workflows
.env.example		.env.example
.gitignore		.gitignore
.python-version		.python-version
LICENSE		LICENSE
README.md		README.md
app.py		app.py
chat_engine.py		chat_engine.py
config.py		config.py
mcp_facade.py		mcp_facade.py
mcp_manager.py		mcp_manager.py
mcp_servers.json		mcp_servers.json
privacy_shield.py		privacy_shield.py
pyproject.toml		pyproject.toml
server_runtime.py		server_runtime.py
start.bat		start.bat
style.css		style.css
test.cpp		test.cpp

Folders and files

Latest commit

History

Repository files navigation

llama-gradio-ui

What You Can Do

Requirements

Quick Start

Configuration

MCP Server Setup

Tool-Calling Flow

Thinking Mode Notes

Project Structure

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages