🤖 A ChatGPT-like web interface for Microsoft BitNet. Run a local AI assistant on your machine with an OpenAI-compatible API. 100% private, no cloud required.
⚠️ Important: This project provides a web interface for Microsoft's BitNet model. The BitNet model and inference engine are developed by Microsoft; this project adds a user-friendly web UI layer on top of that existing infrastructure.
This web interface is built on top of the official Microsoft BitNet project:
- Original Repository: microsoft/BitNet
- Research Paper: BitNet: Scaling 1-bit Transformers for Large Language Models
- Model: BitNet-b1.58-2B-4T on Hugging Face
I created this web-based ChatGPT-like interface to make BitNet more accessible and user-friendly:
- 🎨 Modern Web UI - ChatGPT-style chat interface with conversation history
- 🔌 OpenAI-Compatible API - REST endpoints that work with LangChain, LlamaIndex, etc.
- 💬 Browser-Based Chat - No command line needed, chat directly in your browser
- 📱 Responsive Design - Works on desktop, tablet, and mobile
- 🐳 Docker Support - One-command deployment
- 🔒 Privacy-Focused - All inference stays on your local machine
| Component | Created By |
|---|---|
| BitNet Model Architecture | Microsoft Research |
| Quantization Kernels (i2_s, TL1, TL2) | Microsoft Research |
| llama.cpp Integration | Microsoft + llama.cpp team |
| Command-line Inference | Microsoft |
| Web Chat Interface | Raphael Tomas Malikian ✋ |
| FastAPI Web Server | Raphael Tomas Malikian ✋ |
| OpenAI-Compatible API Layer | Raphael Tomas Malikian ✋ |
| Docker Configuration | Raphael Tomas Malikian ✋ |
The BitNet Chat welcome screen with suggestion cards for quick start.
BitNet explaining quantum computing in simple terms - all processed locally on your machine.
BitNet generating a Python script to sort numbers - with clear code examples.
```bash
# Clone the repository
git clone https://github.com/rtmalikian/bitnet-chat.git
cd bitnet-chat

# Install Python dependencies for the web server
pip install fastapi uvicorn pydantic

# Start the web server
python web_server/app.py

# Open your browser to http://localhost:8080
```

```bash
# Build and run with Docker Compose
docker-compose up --build

# Access at http://localhost:8080
```

```bash
# Automated setup and launch
./start_web_server.sh
```

- Open http://localhost:8080 in your browser
- Type a message or click a suggestion card
- Chat with BitNet locally!
```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "bitnet",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ],
    "temperature": 0.8,
    "max_tokens": 512
  }'
```

List available models:

```bash
curl http://localhost:8080/v1/models
```

Check server health:

```bash
curl http://localhost:8080/health
```

```python
from openai import OpenAI

client = OpenAI(
    api_key="not-needed",
    base_url="http://localhost:8080/v1"
)

response = client.chat.completions.create(
    model="bitnet",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a Python function to reverse a string"}
    ],
    temperature=0.7,
    max_tokens=256
)

print(response.choices[0].message.content)
```

```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    openai_api_key="not-needed",
    openai_api_base="http://localhost:8080/v1",
    model_name="bitnet",
    temperature=0.7
)

response = llm.invoke("Explain quantum computing in simple terms")
print(response.content)
```

```python
from llama_index.llms.openai import OpenAI

llm = OpenAI(
    api_key="not-needed",
    api_base="http://localhost:8080/v1",
    model="bitnet",
    temperature=0.7
)

response = llm.complete("What is the capital of France?")
print(response.text)
```

| Variable | Description | Default |
|---|---|---|
| `MODEL_PATH` | Path to the GGUF model file | `models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf` |
| `CTX_SIZE` | Context window size (tokens) | `2048` |
| `N_PREDICT` | Max tokens to generate | `512` |
| `TEMPERATURE` | Sampling temperature (0.0-2.0) | `0.8` |
| `THREADS` | Number of CPU threads | `4` |
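The table above maps onto a straightforward environment-variable lookup. The sketch below is illustrative only (not the actual `web_server/app.py` code), assuming the server reads its settings once at startup and falls back to the documented defaults:

```python
import os

# Illustrative sketch, not the actual app.py: read the settings from the
# table above, using the documented defaults when a variable is unset.
def load_config(env=os.environ):
    return {
        "model_path": env.get("MODEL_PATH", "models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf"),
        "ctx_size": int(env.get("CTX_SIZE", "2048")),
        "n_predict": int(env.get("N_PREDICT", "512")),
        "temperature": float(env.get("TEMPERATURE", "0.8")),
        "threads": int(env.get("THREADS", "4")),
    }

# With no overrides, the documented defaults apply:
print(load_config({}))
```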
```bash
# Run with custom configuration
export MODEL_PATH="./models/custom-model.gguf"
export THREADS=8
export CTX_SIZE=4096
export TEMPERATURE=0.5

python web_server/app.py
```

| Metric | Value |
|---|---|
| Model | BitNet-b1.58-2B-4T (Microsoft) |
| Quantization | i2_s (2-bit) |
| Prompt Processing | ~25 tokens/sec |
| Text Generation | ~13-15 tokens/sec |
| Memory Usage | ~1.2 GB RAM |
| Model Size | 1.19 GB |
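The throughput figures above give a rough way to estimate how long a response will take. A back-of-the-envelope sketch (rates taken from the table; actual speed varies with hardware and thread count):

```python
# Rough latency estimate using the rates from the table above;
# real numbers vary by machine.
PROMPT_RATE = 25.0   # prompt processing, tokens/sec
GEN_RATE = 14.0      # text generation, midpoint of ~13-15 tokens/sec

def estimated_seconds(prompt_tokens: int, reply_tokens: int) -> float:
    # Time to ingest the prompt plus time to generate the reply.
    return prompt_tokens / PROMPT_RATE + reply_tokens / GEN_RATE

# A 200-token prompt with a 512-token reply takes roughly 45 seconds:
print(round(estimated_seconds(200, 512), 1))
```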
```
bitnet-chat/
├── web_server/
│   ├── app.py               # FastAPI web server (my contribution)
│   ├── static/
│   │   └── index.html       # Web UI - ChatGPT-like interface (my contribution)
│   ├── assets/
│   │   └── screenshots*.png # Demo screenshots
│   ├── requirements.txt     # Python dependencies
│   └── README.md            # Detailed documentation
├── Dockerfile               # Docker image definition (my contribution)
├── docker-compose.yml       # Docker Compose configuration (my contribution)
├── start_web_server.sh      # Quick start script (my contribution)
├── models/                  # BitNet model files (Microsoft)
└── build/                   # Compiled llama.cpp binaries (Microsoft/llama.cpp)
```
All inference happens locally on your machine. No data is sent to external servers, cloud services, or third parties.
- ✅ No API keys required
- ✅ No internet connection needed (after initial setup)
- ✅ All conversation history stored locally in browser
- ✅ Model runs entirely on your hardware
- ✅ No telemetry or analytics
If port 8080 is already in use:

```bash
# Kill the process using port 8080
lsof -ti:8080 | xargs kill -9

# Restart the server
python web_server/app.py
```

If the model file is missing:

```bash
# Verify the model exists
ls -la models/BitNet-b1.58-2B-4T/

# Re-download if necessary
python -c "from huggingface_hub import snapshot_download; snapshot_download('microsoft/BitNet-b1.58-2B-4T-gguf', local_dir='models/BitNet-b1.58-2B-4T')"
```

If the UI looks stale:

- Hard refresh: `Cmd + Shift + R` (Mac) or `Ctrl + Shift + R` (Windows)
- Clear browser cache
- Try incognito/private mode
This project builds upon excellent work by others:
- Microsoft Research - For the amazing BitNet model and inference engine
- llama.cpp - For the efficient C++ inference framework
- OpenAI - For the API design inspiration
- FastAPI - For the excellent web framework
Raphael Tomas Malikian
📍 Palmdale, California, USA
📧 rtmalikian@gmail.com
🔗 GitHub
What I Built:
- Web-based ChatGPT-like interface
- FastAPI server with OpenAI-compatible endpoints
- Docker configuration for easy deployment
- Responsive UI with conversation history
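At its core, the OpenAI-compatible layer comes down to returning responses in the Chat Completions shape. The sketch below is hypothetical (not the actual `app.py` code) and shows what that response format looks like:

```python
import time
import uuid

# Hypothetical sketch, not the actual app.py: the response shape an
# OpenAI-compatible /v1/chat/completions endpoint is expected to return.
def make_chat_completion(model: str, reply_text: str) -> dict:
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex[:12]}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": model,
        "choices": [
            {
                "index": 0,
                "message": {"role": "assistant", "content": reply_text},
                "finish_reason": "stop",
            }
        ],
    }

resp = make_chat_completion("bitnet", "Hello! How can I help?")
print(resp["choices"][0]["message"]["content"])
```

Because clients like the `openai` SDK, LangChain, and LlamaIndex only care about this shape, any server that emits it can be dropped in via `base_url`.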
What Microsoft Built:
- BitNet model architecture
- Quantization kernels
- Command-line inference tools
- llama.cpp integration
This web interface project is licensed under the MIT License - see the LICENSE file for details.
The underlying BitNet model and inference engine are subject to Microsoft's licensing terms.
- Original BitNet Repository - Microsoft's official project
- BitNet Paper - Technical details on 1-bit LLMs
- Hugging Face Model - Download models
- llama.cpp Documentation - Inference engine docs
Made with ❤️ by Raphael Tomas Malikian
Run AI locally, keep your data private.


