🤖 A ChatGPT-like web interface for Microsoft BitNet. Run a local AI assistant on your machine with an OpenAI-compatible API. 100% private, no cloud required.
⚠️ Important: This project provides a web interface for Microsoft's BitNet model. The BitNet model and inference engine are developed by Microsoft; this project adds a user-friendly web UI layer on top of that existing infrastructure.
This web interface is built on top of the official Microsoft BitNet project:
- Original Repository: microsoft/BitNet
- Research Paper: BitNet: Scaling 1-bit Transformers for Large Language Models
- Model: BitNet-b1.58-2B-4T on Hugging Face
I created this web-based ChatGPT-like interface to make BitNet more accessible and user-friendly:
- 🎨 Modern Web UI - ChatGPT-style chat interface with conversation history
- 🔌 OpenAI-Compatible API - REST endpoints that work with LangChain, LlamaIndex, etc.
- 💬 Browser-Based Chat - No command line needed, chat directly in your browser
- 📱 Responsive Design - Works on desktop, tablet, and mobile
- 🐳 Docker Support - One-command deployment
- 🔒 Privacy-Focused - All inference stays on your local machine
| Component | Created By |
|---|---|
| BitNet Model Architecture | Microsoft Research |
| Quantization Kernels (i2_s, TL1, TL2) | Microsoft Research |
| llama.cpp Integration | Microsoft + llama.cpp team |
| Command-line Inference | Microsoft |
| Web Chat Interface | Raphael Tomas Malikian ✋ |
| FastAPI Web Server | Raphael Tomas Malikian ✋ |
| OpenAI-Compatible API Layer | Raphael Tomas Malikian ✋ |
| Docker Configuration | Raphael Tomas Malikian ✋ |
The BitNet Chat welcome screen with suggestion cards for quick start.
BitNet explaining quantum computing in simple terms - all processed locally on your machine.
BitNet generating a Python script to sort numbers - with clear code examples.
```bash
# Clone the repository
git clone https://github.com/rtmalikian/bitnet-chat.git
cd bitnet-chat

# Install Python dependencies for the web server
pip install fastapi uvicorn pydantic

# Start the web server
python web_server/app.py

# Open your browser to http://localhost:8080
```

```bash
# Build and run with Docker Compose
docker-compose up --build

# Access at http://localhost:8080
```

```bash
# Automated setup and launch
./start_web_server.sh
```

- Open http://localhost:8080 in your browser
- Type a message or click a suggestion card
- Chat with BitNet locally!
```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "bitnet",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ],
    "temperature": 0.8,
    "max_tokens": 512
  }'
```

List available models:

```bash
curl http://localhost:8080/v1/models
```

Check server health:

```bash
curl http://localhost:8080/health
```

```python
from openai import OpenAI

client = OpenAI(
    api_key="not-needed",
    base_url="http://localhost:8080/v1"
)

response = client.chat.completions.create(
    model="bitnet",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a Python function to reverse a string"}
    ],
    temperature=0.7,
    max_tokens=256
)

print(response.choices[0].message.content)
```

```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    openai_api_key="not-needed",
    openai_api_base="http://localhost:8080/v1",
    model_name="bitnet",
    temperature=0.7
)

response = llm.invoke("Explain quantum computing in simple terms")
print(response.content)
```

```python
from llama_index.llms.openai import OpenAI

llm = OpenAI(
    api_key="not-needed",
    api_base="http://localhost:8080/v1",
    model="bitnet",
    temperature=0.7
)

response = llm.complete("What is the capital of France?")
print(response.text)
```

| Variable | Description | Default |
|---|---|---|
| `MODEL_PATH` | Path to the GGUF model file | `models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf` |
| `CTX_SIZE` | Context window size (tokens) | `2048` |
| `N_PREDICT` | Max tokens to generate | `512` |
| `TEMPERATURE` | Sampling temperature (0.0-2.0) | `0.8` |
| `THREADS` | Number of CPU threads | `4` |
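The table above maps onto a straightforward environment-variable lookup. The sketch below is illustrative only (not the actual `web_server/app.py` code), assuming the server reads its settings once at startup and falls back to the documented defaults:

```python
import os

# Illustrative sketch, not the actual app.py: read the settings from the
# table above, using the documented defaults when a variable is unset.
def load_config(env=os.environ):
    return {
        "model_path": env.get("MODEL_PATH", "models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf"),
        "ctx_size": int(env.get("CTX_SIZE", "2048")),
        "n_predict": int(env.get("N_PREDICT", "512")),
        "temperature": float(env.get("TEMPERATURE", "0.8")),
        "threads": int(env.get("THREADS", "4")),
    }

# With no overrides, the documented defaults apply:
print(load_config({}))
```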
```bash
# Run with custom configuration
export MODEL_PATH="./models/custom-model.gguf"
export THREADS=8
export CTX_SIZE=4096
export TEMPERATURE=0.5

python web_server/app.py
```

| Metric | Value |
|---|---|
| Model | BitNet-b1.58-2B-4T (Microsoft) |
| Quantization | i2_s (2-bit) |
| Prompt Processing | ~25 tokens/sec |
| Text Generation | ~13-15 tokens/sec |
| Memory Usage | ~1.2 GB RAM |
| Model Size | 1.19 GB |
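The throughput figures above give a rough way to estimate how long a response will take. A back-of-the-envelope sketch (rates taken from the table; actual speed varies with hardware and thread count):

```python
# Rough latency estimate using the rates from the table above;
# real numbers vary by machine.
PROMPT_RATE = 25.0   # prompt processing, tokens/sec
GEN_RATE = 14.0      # text generation, midpoint of ~13-15 tokens/sec

def estimated_seconds(prompt_tokens: int, reply_tokens: int) -> float:
    # Time to ingest the prompt plus time to generate the reply.
    return prompt_tokens / PROMPT_RATE + reply_tokens / GEN_RATE

# A 200-token prompt with a 512-token reply takes roughly 45 seconds:
print(round(estimated_seconds(200, 512), 1))
```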
```
bitnet-chat/
├── web_server/
│   ├── app.py               # FastAPI web server (my contribution)
│   ├── static/
│   │   └── index.html       # Web UI - ChatGPT-like interface (my contribution)
│   ├── assets/
│   │   └── screenshots*.png # Demo screenshots
│   ├── requirements.txt     # Python dependencies
│   └── README.md            # Detailed documentation
├── Dockerfile               # Docker image definition (my contribution)
├── docker-compose.yml       # Docker Compose configuration (my contribution)
├── start_web_server.sh      # Quick start script (my contribution)
├── models/                  # BitNet model files (Microsoft)
└── build/                   # Compiled llama.cpp binaries (Microsoft/llama.cpp)
```
All inference happens locally on your machine. No data is sent to external servers, cloud services, or third parties.
- ✅ No API keys required
- ✅ No internet connection needed (after initial setup)
- ✅ All conversation history stored locally in browser
- ✅ Model runs entirely on your hardware
- ✅ No telemetry or analytics
If port 8080 is already in use:

```bash
# Kill the process using port 8080
lsof -ti:8080 | xargs kill -9

# Restart the server
python web_server/app.py
```

If the model file is missing:

```bash
# Verify the model exists
ls -la models/BitNet-b1.58-2B-4T/

# Re-download if necessary
python -c "from huggingface_hub import snapshot_download; snapshot_download('microsoft/BitNet-b1.58-2B-4T-gguf', local_dir='models/BitNet-b1.58-2B-4T')"
```

If the UI looks stale:

- Hard refresh: `Cmd + Shift + R` (Mac) or `Ctrl + Shift + R` (Windows)
- Clear browser cache
- Try incognito/private mode
This project builds upon excellent work by others:
- Microsoft Research - For the amazing BitNet model and inference engine
- llama.cpp - For the efficient C++ inference framework
- OpenAI - For the API design inspiration
- FastAPI - For the excellent web framework
Raphael Tomas Malikian
📍 Palmdale, California, USA
📧 rtmalikian@gmail.com
🔗 GitHub
What I Built:
- Web-based ChatGPT-like interface
- FastAPI server with OpenAI-compatible endpoints
- Docker configuration for easy deployment
- Responsive UI with conversation history
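At its core, the OpenAI-compatible layer comes down to returning responses in the Chat Completions shape. The sketch below is hypothetical (not the actual `app.py` code) and shows what that response format looks like:

```python
import time
import uuid

# Hypothetical sketch, not the actual app.py: the response shape an
# OpenAI-compatible /v1/chat/completions endpoint is expected to return.
def make_chat_completion(model: str, reply_text: str) -> dict:
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex[:12]}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": model,
        "choices": [
            {
                "index": 0,
                "message": {"role": "assistant", "content": reply_text},
                "finish_reason": "stop",
            }
        ],
    }

resp = make_chat_completion("bitnet", "Hello! How can I help?")
print(resp["choices"][0]["message"]["content"])
```

Because clients like the `openai` SDK, LangChain, and LlamaIndex only care about this shape, any server that emits it can be dropped in via `base_url`.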
What Microsoft Built:
- BitNet model architecture
- Quantization kernels
- Command-line inference tools
- llama.cpp integration
This web interface project is licensed under the MIT License - see the LICENSE file for details.
The underlying BitNet model and inference engine are subject to Microsoft's licensing terms.
- Original BitNet Repository - Microsoft's official project
- BitNet Paper - Technical details on 1-bit LLMs
- Hugging Face Model - Download models
- llama.cpp Documentation - Inference engine docs
Made with ❤️ by Raphael Tomas Malikian
Run AI locally, keep your data private.


