Browser-resident Gemma 4 inference in pure Rust → WebAssembly + WebGPU. Loads the same GGUF blobs Ollama already has on disk, runs the forward pass on your local GPU through hand-written WGSL, never touches a remote server.
The intent is a PWA-pluggable inference engine, not a port of Ollama-the-server. Ollama is ~275K LOC of Go wrapping llama.cpp via CGO, plus a model registry, CLI, conversion tooling, and multimodal pipelines — almost none of which applies to a browser library. What survives the scope cut is the core inference path over Ollama's storage format.
- ✅ `gemma4:e2b` text inference on the desktop loads end-to-end and generates greedy output bit-identical to Ollama. (`gemma4:e4b` is shape-compatible — pull and try it.)
- ✅ `gemma4:e2b` text inference on iPhone — the full Q4_K_M model loads onto an iPhone 16e (A18, 8 GB shared RAM) and streams tokens at ~4.65 tok/s via a Dedicated Worker + sync OPFS path. Multimodal towers stay Mac-only for now; mobile picks the text-only loader (max_context=512).
- ✅ Vision + audio multimodal on the desktop. ViT (16 blocks, 768 hidden) + Conformer (12 blocks, 1024 hidden) towers run on the same wgpu device as the text path; soft tokens splice into the prompt via `<|image>`/`<|audio>` sentinels (see the splice sketch after this list). Validated bit-identical to Ollama on a fixed image and a 30-second pangram WAV.
- ✅ Q4_K + Q6_K + F16 + F32 quants (the actual mix in `gemma4:e2b` Q4_K_M).
- ✅ Streaming load via HTTP byte-range requests or OPFS sync access handles — the 7 GB GGUF never enters wasm linear memory in bulk. The PWA writes to OPFS once via `FileSystemSyncAccessHandle.write()` in a worker, then reads tile-by-tile during inference, so the wasm peak stays in the tens of MiB regardless of model size (see the fetcher sketch after this list).
- ✅ Multi-turn chat with system prompt, mid-generation Stop, persistent KV cache.
- ✅ Encoder chaining + per-layer submits (M7 + M15) — one CommandEncoder spans each transformer layer, submitted incrementally so the GPU drains smoothly even on tight-RAM phones.
- ❌ MoE `gemma4:26b`/`gemma4:31b` — out of scope.
- ❌ Other architectures (llama, mistral, qwen, phi).
- 🛠️ Mobile multimodal — desktop multimodal works; the iPhone loader currently skips the vision/audio towers to fit in shared RAM. Lazy upload for those is a follow-up.
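
How the sentinel splice works, in miniature: the prompt is tokenized and embedded as usual, then the embedding rows under the sentinel run are overwritten with the tower's soft tokens before the text forward starts. A minimal sketch under that assumption; the function name and shapes are illustrative, not the crate's API:

```rust
/// Hypothetical sketch of soft-token splicing. Names are illustrative,
/// not the crate's real API. `prompt_embeds` holds one hidden-dim row per
/// prompt token; the rows covered by the <|image>/<|audio> sentinel run
/// are overwritten with the tower's projected soft tokens.
fn splice_soft_tokens(
    mut prompt_embeds: Vec<Vec<f32>>, // [seq_len][hidden] text embeddings
    sentinel_start: usize,            // index of the first sentinel token
    soft_tokens: &[Vec<f32>],         // [n_soft][hidden] from the ViT/Conformer
) -> Vec<Vec<f32>> {
    assert!(sentinel_start + soft_tokens.len() <= prompt_embeds.len());
    for (i, soft) in soft_tokens.iter().enumerate() {
        // copy_from_slice panics unless the hidden widths match exactly.
        prompt_embeds[sentinel_start + i].copy_from_slice(soft);
    }
    prompt_embeds
}
```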
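
And the tile-by-tile load in miniature. The real `TensorFetcher` trait lives in `src/gguf/fetcher.rs`; the trait and signatures below are assumptions for illustration. The point is that only one tile of a tensor ever occupies wasm linear memory:

```rust
use std::io;

/// Stand-in for the crate's TensorFetcher (signature is an assumption).
/// A fetcher returns an arbitrary byte range of the GGUF blob, whether it
/// comes from an OPFS sync handle or an HTTP Range request.
trait ByteRangeFetcher {
    fn read_range(&mut self, offset: u64, len: usize) -> io::Result<Vec<u8>>;
}

/// Upload a big tensor one tile at a time instead of materializing the
/// whole thing in wasm linear memory.
fn upload_in_tiles<F: ByteRangeFetcher>(
    fetcher: &mut F,
    tensor_offset: u64,                 // byte offset of the tensor in the GGUF
    tensor_len: u64,                    // total byte length of the tensor
    tile: usize,                        // tile size, e.g. a few MiB
    mut upload: impl FnMut(u64, &[u8]), // e.g. queue.write_buffer at dst offset
) -> io::Result<()> {
    let mut done = 0u64;
    while done < tensor_len {
        let n = tile.min((tensor_len - done) as usize);
        let bytes = fetcher.read_range(tensor_offset + done, n)?;
        upload(done, &bytes); // only `n` bytes are resident at once
        done += n as u64;
    }
    Ok(())
}
```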
You need:
- Rust ≥ 1.91 + `wasm-pack`
- A WebGPU-capable browser (Chrome 113+, Edge 113+, recent Firefox; iOS Safari 17.4+ for phones)
- Ollama installed locally with `gemma4:e2b` pulled (`ollama pull gemma4:e2b`)
```sh
# Build the wasm bundle (writes /pkg/rullama.js + rullama_bg.wasm)
wasm-pack build --target web --release

# Start the dev HTTPS server (scans ~/.ollama/models, serves blobs with Range
# support, sets the COOP/COEP headers that WebGPU + cross-origin isolation need).
# Self-signed cert at ~/.local/share/rullama/{cert,key}.pem.
CERT_FILE=~/.local/share/rullama/cert.pem KEY_FILE=~/.local/share/rullama/key.pem \
  ./examples/pwa/serve.sh

# Desktop: https://localhost:8088/examples/pwa/index.html
# iPhone:  https://<mac-lan-ip>:8088/examples/pwa/index.html
# Pick gemma4:e2b → Load → chat.
```

The first load streams the ~7 GB blob from the local Ollama install through a Dedicated Worker that owns a `FileSystemSyncAccessHandle` over OPFS — bytes go network → sync handle → disk without ever pinning a Blob in the JS heap. Subsequent loads (within the same Safari session) reuse the cached file.
The PWA is fully drivable from the Mac via Apple's `safaridriver`:

```sh
# One-time setup on the phone:
#   Settings → Safari → Advanced → Remote Automation = on
#   Web Inspector = on
#   Feature Flags → WebGPU = on

# Then on the Mac:
safaridriver -p 4444 &
./examples/pwa/iphone-session-keeper.sh &  # keep an OPFS scope alive
./examples/pwa/run-on-iphone.sh            # navigate → Load → chat → log perf
./examples/pwa/clean-iphone.sh             # wipe OPFS between trials
```

`/tmp/rullama-page.log` collects beacon traces from the page (`[chat]`, `[pe]`, `[tg]`, `[gen]`, `[wkr]`, `[rs]`), so any regression in a phone run leaves a server-side trail even after a WebContent crash.
The same code paths run natively against host wgpu (Metal on macOS, Vulkan on Linux). Useful for parity testing without a browser:
```sh
# Greedy parity vs Ollama (CPU oracle)
cargo run --release --features cpu-reference --example greedy_parity -- \
  ~/.ollama/models/blobs/sha256-<digest> "Hi" 5

# Full-stack chat through the public Model API
cargo run --release --features cpu-reference --example model_api -- \
  ~/.ollama/models/blobs/sha256-<digest> "Hi" --greedy --max=16

# Standalone chained forward (M7 perf path)
cargo run --release --features cpu-reference --example chained_smoke -- \
  ~/.ollama/models/blobs/sha256-<digest> "Hi" --max=8
```

`--features cpu-reference` enables the f32 oracle path for parity testing; the production path is always built.
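
What a parity run reduces to, sketched with hypothetical forward closures (the real drivers are `examples/greedy_parity.rs` and friends): both paths must pick the same argmax token at every step.

```rust
/// Greedy parity skeleton. The forward closures are placeholders for the
/// CPU f32 oracle and the production GPU path; both map a token prefix to
/// next-token logits. "Bit-identical greedy output" means the argmax
/// agrees at every step.
fn greedy_parity(
    mut tokens: Vec<u32>,
    steps: usize,
    cpu_forward: impl Fn(&[u32]) -> Vec<f32>,
    gpu_forward: impl Fn(&[u32]) -> Vec<f32>,
) -> bool {
    for _ in 0..steps {
        let cpu_next = argmax(&cpu_forward(&tokens));
        let gpu_next = argmax(&gpu_forward(&tokens));
        if cpu_next != gpu_next {
            return false; // first divergence fails the run
        }
        tokens.push(cpu_next);
    }
    true
}

fn argmax(logits: &[f32]) -> u32 {
    logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i as u32)
        .expect("non-empty logits")
}
```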
```
PWA (host page) ──┐
                  ▼ postMessage RPC
┌──────────────────────────────────────────────────────────────────┐
│ inference-worker.js (Dedicated Worker)                           │
│  ▶ owns FileSystemSyncAccessHandle for the GGUF                  │
│  ▶ owns the wasm Model handle                                    │
│  ┌──────────────────────────────────────────────────────┐        │
│  │ wasm32 (Rust, our crate)                             │        │
│  │   Model.loadFromOpfs(read_fn, total)                 │        │
│  │     │                                                │        │
│  │     ▼                                                │        │
│  │   GgufReader (header only, ~5 MB)                    │        │
│  │     │                                                │        │
│  │     │  TensorFetcher (OPFS sync read | HTTP Range)   │        │
│  │     ▼                                                │        │
│  │   WeightCache ─────────▶ Forward / VisionForward /   │        │
│  │   (lazy GPU upload,      GpuAudioForward             │        │
│  │    per-tile range fetch  (per-layer encoder          │        │
│  │    on big tensors)       submits, GPU-resident       │        │
│  │                          KV cache)                   │        │
│  │     │                                                │        │
│  │     ▼                                                │        │
│  │   wgpu (WebGPU / Metal / Vulkan)                     │        │
│  │     │                                                │        │
│  │     ▼                                                │        │
│  │   WGSL kernels: matmul Q4_K/Q6_K/F16, rmsnorm,       │        │
│  │   rmsnorm_per_row, rope_neox, attention (incl.       │        │
│  │   HPD-f16 + block-local), conv2d, geglu,             │        │
│  │   softcap, residual_add, scale, top_k, quick_gelu    │        │
│  └──────────────────────────────────────────────────────┘        │
└──────────────────────────────────────────────────────────────────┘
                  │
                  ▲ postMessage replies (tokens, errors)
```
PWA renders tokens, manages chat history, handles attachments.
The Worker move (M15) is what unblocked iPhone inference: iOS Safari only
exposes FileSystemSyncAccessHandle in Worker contexts, and the Worker
isolates inference from main-thread page-watchdog reapers.
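
That worker-only constraint shapes the wasm boundary: the worker hands the wasm side a synchronous JS read callback, and the wasm side pulls tiles through it on demand. A hedged sketch of that bridge (the real `Model.loadFromOpfs` signature may differ; `read_fn` is assumed to wrap `FileSystemSyncAccessHandle.read()` and return a `Uint8Array`):

```rust
use js_sys::{Function, Uint8Array};
use wasm_bindgen::JsValue;

/// Sketch of the worker→wasm read bridge (signatures are assumptions).
/// `read_fn(offset, len)` is a JS closure over the sync OPFS handle.
/// Because the handle is synchronous, tiles can be pulled mid-inference
/// without an async round-trip to the main thread.
fn read_tile(read_fn: &Function, offset: u64, len: u32) -> Result<Vec<u8>, JsValue> {
    let bytes = read_fn.call2(
        &JsValue::NULL,
        &JsValue::from(offset as f64), // JS numbers are f64; fine below 2^53
        &JsValue::from(len),
    )?;
    // Copy the returned Uint8Array into wasm linear memory; only one
    // tile's worth of bytes is resident at a time.
    Ok(Uint8Array::new(&bytes).to_vec())
}
```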
Reference Go implementation:
`/Users/nightness/Source/ollama/model/models/gemma4/`. Every op in
`src/reference/forward.rs` (CPU oracle), `src/reference/forward_chained.rs`
(production GPU forward), `src/multimodal/vision.rs`, and
`src/multimodal/audio.rs` corresponds 1:1.
Measurements as of M15:
| Target | Steady-state tok/s (gen) | Notes |
|---|---|---|
| iPhone 16e (A18, iOS 26) | ~4.65 tok/s | text-only, max_context=512 |
| AMD Radeon Pro 555 (Mac) | ~1 tok/s (M7 baseline) | naive kernels, tiled matmul deferred |
The architectural foundation (chained encoder, GPU-resident KV cache, per-layer
submits, per-tile range fetch from OPFS) is in place. Kernels are still naive
matvec; reaching the ≥10 tok/s target on both Mac and phone needs tiled
matmul + bind-group caching + kernel fusion. See the M8 milestones in
/Users/nightness/.claude/plans/greenfield-port-of-source-ollama-velvet-treehouse.md.
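
For reference, the per-layer submit pattern itself is small. This is a simplified shape of the idea, not the code in `src/backend/dispatch.rs`:

```rust
/// Simplified per-layer submit loop. One CommandEncoder records every
/// kernel dispatch in a transformer layer, then is submitted immediately,
/// so the GPU drains layer N while the CPU encodes layer N+1.
fn run_layers(
    device: &wgpu::Device,
    queue: &wgpu::Queue,
    n_layers: usize,
    mut encode_layer: impl FnMut(&mut wgpu::CommandEncoder, usize),
) {
    for layer in 0..n_layers {
        let mut encoder = device
            .create_command_encoder(&wgpu::CommandEncoderDescriptor { label: Some("layer") });
        encode_layer(&mut encoder, layer); // all compute passes for this layer
        queue.submit(std::iter::once(encoder.finish())); // incremental submit
    }
}
```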
The iPhone A18 advertises 1 GiB for both `max_buffer_size` and
`max_storage_buffer_binding_size` — four times the WebGPU spec floor — so
there's real headroom for fewer/larger weight buffers (currently 455 of
them resident; see the M15 follow-ups).
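
Those two numbers come straight off the adapter; a quick probe using the standard wgpu API:

```rust
/// Print the buffer limits discussed above. On the A18 under iOS Safari
/// both report 1 GiB; many adapters report far less.
fn log_buffer_limits(adapter: &wgpu::Adapter) {
    let limits = adapter.limits();
    println!("max_buffer_size                 = {}", limits.max_buffer_size);
    println!("max_storage_buffer_binding_size = {}", limits.max_storage_buffer_binding_size);
}
```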
Other capability notes captured during iPhone validation:
- `shader-f16` ✓ — packed FP16 MADs engage on A18.
- `timestamp-query` ✓ — the Mac's Pro 555 doesn't expose this; could wire up GPU-side per-pass timing.
- `subgroups` ✗ — A18 has SIMDgroup hardware, but Safari's WebGPU doesn't surface WGSL subgroup ops yet. Vision attention falls back to the no-subgroup HPD-f16 kernel automatically (see the probe sketch below).
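
These probes map to plain wgpu feature checks. The feature flags below are the real wgpu ones; the kernel names are placeholders for the WGSL variants listed in the diagram above:

```rust
/// Illustrative kernel selection from adapter features.
fn pick_attention_kernel(adapter: &wgpu::Adapter) -> &'static str {
    let f = adapter.features();
    if f.contains(wgpu::Features::SUBGROUP) {
        "attention_subgroup" // not reachable on iOS Safari yet
    } else if f.contains(wgpu::Features::SHADER_F16) {
        "attention_hpd_f16" // packed-f16 path; engages on A18
    } else {
        "attention_f32"
    }
}
```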
```
src/
├── api.rs                    # JS-facing Model: load / loadFromUrl / loadFromOpfs[TextOnly]
├── backend/
│   ├── context.rs            # WgpuCtx (device, queue, adapter limits)
│   ├── dispatch.rs           # cached + chained kernel dispatchers
│   ├── pipelines.rs          # one ComputePipeline per kernel (built once)
│   ├── weight_cache.rs       # lazy GPU upload, per-tile range fetch on big tensors
│   └── matmul.rs / elementwise.rs / spike.rs  # one-shot dispatchers (parity tests)
├── gguf/
│   ├── reader.rs             # GGUF v3 parser (header + tensor descriptors)
│   ├── fetcher.rs            # TensorFetcher trait + InMemoryFetcher + HttpRangeFetcher + OpfsFetcher
│   ├── tensor.rs             # dequant_tensor_to_f32 / dequant_row_to_f32 (sync + async)
│   └── quant.rs / dtype.rs / value.rs
├── kernels/wgsl/             # 25+ hand-written compute shaders (text + vision + audio)
├── model/config.rs           # Gemma4Config: parses gemma4.* metadata keys
├── multimodal/
│   ├── vision.rs             # ViT forward (16 blocks, 768d, ClippableLinear)
│   ├── audio.rs              # Conformer forward (12 blocks, 1024d, block-local attention)
│   └── audio_features.rs     # WAV → 128-bin log-mel (realfft)
├── reference/
│   ├── forward.rs            # CPU f32 forward (parity oracle)
│   ├── forward_gpu.rs        # M3-era GPU forward with per-kernel readbacks (oracle)
│   ├── forward_chained.rs    # M7 production GPU forward, per-layer submits (M15)
│   └── ops.rs / weights.rs
├── sampling.rs               # temperature, top-k, top-p, rep penalty
├── template/gemma4_small.rs  # chat-template renderer (matches Ollama)
└── tokenizer/                # GGUF BPE tokenizer (Ollama-bit-exact)

examples/
├── pwa/
│   ├── index.html                # the demo PWA
│   ├── inference-worker.js       # M15 Dedicated Worker, owns Model + sync OPFS handle
│   ├── opfs-store.js             # OPFS download + read API (main-thread)
│   ├── opfs-writer-worker.js     # streams GGUF → OPFS via SyncAccessHandle.write
│   ├── serve.sh                  # dev HTTPS server + /api/log /api/blob endpoints
│   ├── run-on-iphone.sh          # navigate → Load → chat over safaridriver
│   ├── iphone-session-keeper.sh  # long-lived session so OPFS persists across runs
│   ├── clean-iphone.sh           # wipe OPFS + IDB rullama-models
│   ├── bench-on-iphone.sh        # kernel-level WebGPU bench harness
│   ├── chat.js / image_preprocess.js / cache.js
│   └── bench.html
├── greedy_parity.rs   # CPU forward greedy vs Ollama
├── chained_smoke.rs   # standalone Forward driver
├── model_api.rs       # public Model API end-to-end
├── vision_parity.rs   # vision tower vs Ollama (M11)
├── audio_parity.rs    # audio tower vs Ollama (M13)
└── inspect.rs / decode_ids.rs / encode_check.rs / forward_smoke.rs / list_tensors.rs
```
Dual-licensed under either of:
- Apache License 2.0 (LICENSE-APACHE)
- MIT License (LICENSE-MIT)
at your option.
Contributions are accepted under the same dual-license terms.