
llcuda v2.2.0 Documentation


Official documentation website for llcuda v2.2.0, a CUDA 12 inference backend for Unsloth with Graphistry network visualization on Kaggle's dual Tesla T4 GPUs.

🌐 Live Documentation: https://llcuda.github.io/

What is llcuda v2.2.0?

llcuda is a CUDA 12 inference backend specifically designed for deploying Unsloth-fine-tuned models on Kaggle's dual Tesla T4 GPUs (30 GB total VRAM).
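As a concrete illustration, the sketch below launches llama-server with its tensor-split option so that model layers are distributed across both T4s. The binary and model paths are placeholders, and the exact flags your build accepts may vary; --tensor-split, --n-gpu-layers, and --port are standard llama.cpp server options.

import subprocess

# Minimal sketch: serve a GGUF model across Kaggle's two Tesla T4s.
# "llama-server" and the model path are placeholders for your own build
# and downloaded file.
server = subprocess.Popen([
    "llama-server",
    "-m", "models/llama-70b-iq3_xs.gguf",  # placeholder model path
    "--n-gpu-layers", "99",                # offload all layers to GPU
    "--tensor-split", "1,1",               # split layers evenly across GPU 0 and GPU 1
    "--host", "127.0.0.1",
    "--port", "8080",
])
# The server now exposes an OpenAI-compatible API at http://127.0.0.1:8080/v1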

Key Features

  • 🚀 Dual T4 Support: Run on Kaggle's 2× Tesla T4 GPUs (15 GB each)
  • 🔥 Split-GPU Architecture: LLM on GPU 0, Graphistry on GPU 1
  • ⚡ Native CUDA tensor-split: llama.cpp layer distribution (not NCCL); see the launch sketch above
  • 🎯 70B Model Support: Run Llama-70B IQ3_XS in 30 GB VRAM
  • 📦 29 GGUF Quantization Formats: K-quants and I-quants
  • 🔧 OpenAI-compatible API: Drop-in replacement via llama-server; see the client sketch after this list
  • 🌐 Graphistry Integration: Extract and visualize knowledge graphs
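Because llama-server speaks the OpenAI chat-completions protocol, any HTTP client works against it. A minimal sketch with requests, assuming the server from the launch example above is listening on port 8080 (the model field is a placeholder; llama-server serves whichever model it loaded):

import requests

# Minimal sketch: query the OpenAI-compatible endpoint exposed by llama-server.
# Assumes a server is already listening on 127.0.0.1:8080.
resp = requests.post(
    "http://127.0.0.1:8080/v1/chat/completions",
    json={
        "model": "local",  # placeholder; not used for routing by llama-server
        "messages": [{"role": "user", "content": "Summarize what llcuda does."}],
        "max_tokens": 128,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])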

Performance Benchmarks

Model         | Quantization | VRAM   | Speed     | Platform
Gemma-2-2B    | Q4_K_M       | ~3 GB  | ~60 tok/s | Single T4
Llama-3.2-3B  | Q4_K_M       | ~4 GB  | ~45 tok/s | Single T4
Qwen-2.5-7B   | Q4_K_M       | ~7 GB  | ~25 tok/s | Single T4
Llama-70B     | IQ3_XS       | ~28 GB | ~12 tok/s | Dual T4
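A quick back-of-the-envelope check on the table: GGUF weight size is roughly parameters × bits-per-weight / 8. The bits-per-weight value below is an approximate community figure for IQ3_XS, and the estimate covers weights only; the VRAM column above also includes KV cache and CUDA runtime buffers.

# Rough weights-only VRAM estimate for the 70B row above.
# bpw is an approximate figure for llama.cpp's IQ3_XS quant, not an exact spec.
params = 70e9   # Llama-70B parameter count
bpw = 3.3       # approximate bits per weight for IQ3_XS
weights_gb = params * bpw / 8 / 1e9
print(f"~{weights_gb:.1f} GB of weights")  # about 28.9 GB, close to the ~28 GB in the table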

Quick Links

Documentation Structure

  • Getting Started: Installation, quick start, Kaggle setup
  • Kaggle Dual T4: Multi-GPU inference, tensor-split, large models
  • Tutorial Notebooks: 10 comprehensive Kaggle notebooks
  • Architecture: Split-GPU design, LLM + Graphistry
  • Unsloth Integration: Fine-tuning → GGUF → Deployment
  • Graphistry & Visualization: Knowledge graph extraction
  • Performance: Benchmarks, optimization, memory management
  • GGUF & Quantization: K-quants, I-quants, selection guide
  • API Reference: ServerManager, MultiGPU, GGUF tools (a hypothetical usage sketch follows this list)
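For orientation, here is how the classes named above might fit together. The names ServerManager and MultiGPU are taken from the API Reference entry, but every method and argument in this sketch is a hypothetical assumption for illustration, not llcuda's confirmed API; see the live documentation for the real signatures.

# Hypothetical sketch only: method names and arguments below are assumptions,
# not llcuda's documented API. Only the class names come from the list above.
import llcuda

manager = llcuda.ServerManager(
    model_path="models/qwen-2.5-7b-q4_k_m.gguf",  # placeholder path
    gpu_layers=99,                                # assumed option name
)
manager.start()   # assumed to launch llama-server under the hood
# ... run inference via the OpenAI-compatible endpoint, then:
manager.stop()    # assumed cleanup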

Development

Setup

# Install dependencies
pip install mkdocs-material mkdocs-minify-plugin

# Serve locally
mkdocs serve

# View at http://127.0.0.1:8000

Deployment

# Deploy to GitHub Pages
mkdocs gh-deploy

SEO & Keywords

llcuda, CUDA 12, Tesla T4, Kaggle, dual GPU, LLM inference, Unsloth, GGUF, quantization, llama.cpp, multi-GPU, tensor-split, Graphistry, knowledge graphs, FlashAttention, 70B models, split-GPU architecture, Kaggle notebooks, RAPIDS, cuGraph, PyGraphistry

Version

llcuda v2.2.0 - CUDA 12 Inference Backend for Unsloth

Released: January 2025

License

MIT License - Copyright © 2024-2026 Waqas Muhammad
