A local chat UI for llama.cpp server with Gradio 6, MCP tool calling, and optional privacy processing.
- Start and stop
llama-serverfrom the UI and load models from a GGUF folder or custom path - Stream chat responses from the OpenAI-compatible
/v1/chat/completionsendpoint - Connect MCP servers over stdio, SSE, or HTTP and let the model call tools
- Run a two-step privacy flow: redact PII with Presidio, then restyle text locally
- Python 3.10+
llama-server/llama-server.exefrom llama.cpp- At least one GGUF model file
- uv (recommended) or pip
This repository is a uv project (pyproject.toml + uv.lock).
-
Clone and enter the project
git clone https://github.com/RoyCoding8/llama-gradio-ui.git cd llama-gradio-ui -
Create
.envfrom the examplecp .env.example .env
-
Update
.envwith your local paths and defaultsLLAMA_SERVER_DIR=C:\path\to\llama-cpp-build GGUF_DIR=C:\path\to\models GPU_LAYERS=-1 CTX_SIZE=4096 KV_CACHE_TYPE_K=f16 KV_CACHE_TYPE_V=f16
-
Install dependencies
uv sync
Or with pip:
pip install -e . python -m spacy download en_core_web_lg -
Run the app
uv run python app.py
On Windows you can also use
start.bat.
By default, the UI is available at http://127.0.0.1:7860.
Key values in .env:
LLAMA_HOST,LLAMA_PORT: targetllama-serverhost and portLLAMA_SERVER_DIR: directory that containsllama-serverGGUF_DIR: directory scanned for.ggufmodelsUI_HOST,UI_PORT,UI_SHARE: Gradio host, port, and public share modeCTX_SIZE,GPU_LAYERS: default runtime settings forllama-serverKV_CACHE_TYPE_K,KV_CACHE_TYPE_V: KV cache quantization settingsALLOW_REMOTE_TOOLS: whenUI_SHARE=1, set this to1only if you explicitly want remote tool execution
You can add MCP servers in the UI or edit mcp_servers.json directly.
{
"servers": {
"filesystem": {
"name": "filesystem",
"transport": "stdio",
"command": "npx",
"args": ["-y", "@modelcontextprotocol/server-filesystem", "C:/docs"],
"enabled": true,
"autostart": false
}
}
}The MCP tab also accepts Claude/Cursor-format imports.
- The model receives chat history plus OpenAI-format tool schemas from connected MCP servers
- If it emits tool calls, the app dispatches them to MCP and records results
- Tool results are appended to conversation context
- The model runs again, up to 5 rounds
- Final output streams back to the chat UI
- The Chat tab
Thinktoggle is forwarded to llama.cpp on each request using:reasoning: "on" | "off"reasoning_budget: -1 | 0chat_template_kwargs: {"enable_thinking": true | false}
- Some Qwen3.5 + llama.cpp builds have known upstream issues where thinking control can be inconsistent.
- If
Think: OFFstill hangs or emits reasoning on your build, update llama.cpp and prefer server startup with explicit reasoning flags (for example--reasoning off).
| File | Purpose |
|---|---|
app.py |
Application entry point and Gradio wiring |
server_runtime.py |
llama-server process lifecycle and model discovery |
chat_engine.py |
Streaming chat and MCP tool-call loop |
mcp_manager.py |
Async MCP client manager and server connections |
mcp_facade.py |
UI-facing MCP actions and response formatting |
privacy_shield.py |
PII redaction and local restyling flow |
config.py |
Environment and .env parsing |
style.css |
UI styling |
Apache 2.0