Research-grade evaluation of running Google's Gemma 4 model family entirely on consumer hardware, with an optional cloud GPU burst tier for heavy workloads.
This project investigates whether a consumer-grade Apple Silicon machine (Mac Mini M4, 16 GB unified memory, $599) can serve as a viable self-hosted AI inference node for Google's Gemma 4 model family. We implement a three-tier inference architecture spanning local hardware and an on-demand cloud GPU, measure throughput and latency across model sizes, and evaluate the cost-performance tradeoff against hosted API services.
Key findings:
- The 2.3B parameter E2B model achieves 126 tokens/second on a Mac Mini M4 using MLX, competitive with hosted endpoints and sufficient for real-time conversational AI.
- The 4.5B parameter E4B model achieves 32 tokens/second, adequate for code generation and summarization workloads.
- The 26B parameter MoE model cannot run at interactive speeds on 16 GB devices (1.59 tok/s CPU-only), but achieves 50-70 tok/s on a Cloud Run NVIDIA L4 GPU ($0.23/hr, scale-to-zero).
- MLX on Apple Silicon delivers 5-8x better throughput and 55-63% less memory compared to llama.cpp/Ollama for equivalent quantized models.
- The total infrastructure cost for 24/7 availability of two local models plus on-demand cloud GPU is ~$3-5/month (electricity + minimal Cloud Run usage), compared to $20-100+/month for equivalent hosted API quotas.
- Motivation & Research Context
- System Architecture
- Hardware & Model Selection
- Inference Engine Analysis
- Network Topology & Routing
- Cloud Burst Tier
- Authentication & Access Control
- Multimodal Media Pipeline
- Performance Results
- Cost Analysis
- Architecture Pros & Cons
- Reproducibility
- Future Directions
- File Structure
Large language model inference is increasingly commoditized. Open-weight models from Google (Gemma), Meta (Llama), and others now match or exceed GPT-3.5-class performance. Simultaneously, consumer hardware — particularly Apple Silicon with unified memory — has reached the point where sub-10B parameter models run at interactive speeds without discrete GPUs.
This creates an opportunity: self-hosted AI inference at marginal cost, with full data sovereignty, no rate limits, and no vendor lock-in. But how viable is this in practice?
| # | Question | Finding |
|---|---|---|
| RQ1 | Can Gemma 4 models run at interactive speeds on consumer Apple Silicon? | Yes (E2B: 126 tok/s, E4B: 32 tok/s on M4 16 GB) |
| RQ2 | What is the minimum viable hardware for each model tier? | E2B: 4 GB free RAM; E4B: 6 GB; 26B: 24 GB VRAM (requires GPU) |
| RQ3 | How does MLX compare to other inference engines on Apple Silicon? | 5-8x faster than llama.cpp, 55-63% less RAM |
| RQ4 | Can a hybrid local+cloud architecture provide full model coverage? | Yes, with Cloud Run GPU scale-to-zero at ~$0.003/request |
| RQ5 | What is the total cost of ownership vs. hosted APIs? | $3-5/mo vs $20-100+/mo for equivalent usage |
Academic applications:
- Reproducible NLP experiments without GPU cluster access
- Private inference for sensitive research data (medical, legal, educational)
- Curriculum development: students can run state-of-the-art models on personal hardware
- Benchmark framework for evaluating model-hardware efficiency frontiers
Business applications:
- Edge AI deployment pattern for privacy-sensitive industries (healthcare, finance, legal)
- Cost-optimized inference for startups and small teams
- Hybrid cloud architecture template (local-first, cloud-burst)
- Internal LLM infrastructure without SaaS vendor dependency
PUBLIC INTERNET TAILSCALE MESH (WireGuard, encrypted)
| |
v |
GCP Cloud Run Proxy |
(Firebase Auth, .------------------------+
rate limiting, | |
circuit breaker) Mac Mini M4 16GB Cloud Run GPU (L4)
| (always-on, $0/mo) (scale-to-zero)
| :8082 E2B 126 tok/s gemma4-heavy
| :8083 E4B 32 tok/s 26B-A4B 50-70 tok/s
| :8080 Gateway (FastAPI)
| |routes|
+----------> iPhone / Browser / API clients
| Component | Location | Technology | Purpose |
|---|---|---|---|
| Gateway | Mac Mini :8080 | FastAPI + uvicorn | Request classification, tier routing, circuit breakers, health aggregation |
| Fast Tier | Mac Mini :8082 | MLX mlx_lm.server |
E2B 2.3B — greetings, simple Q&A, real-time chat |
| Primary Tier | Mac Mini :8083 | MLX mlx_lm.server |
E4B 4.5B — summarization, code assist, moderate reasoning |
| Heavy Tier | Cloud Run GPU | llama-cpp-python + CUDA | 26B-A4B MoE — deep analysis, complex code gen, long-context |
| Cloud Proxy | Cloud Run (CPU) | FastAPI + Tailscale sidecar | Public HTTPS endpoint, Firebase auth, Tailscale tunnel to Mac Mini |
| Web UI | Served by proxy | Single-file SPA (vanilla JS) | Apple HIG-inspired chat interface, multimodal upload |
| Auth | Firebase + Cloud Run | Firebase Auth + Firestore | Google Sign-In, per-user quotas, usage tracking |
1. Client sends POST /v1/chat/completions (OpenAI-compatible)
|
2. Cloud Proxy authenticates (Firebase JWT or API key)
Rate-limits (30 text/min, 10 media/min per IP)
|
3. Proxy forwards via Tailscale SOCKS5 to Mac Mini gateway
|
4. Gateway classifies request complexity:
- Simple/greeting -> Fast (E2B)
- Moderate/code -> Primary (E4B)
- Complex/analysis -> Heavy (26B, Cloud Run GPU)
|
5. Circuit breaker checks tier health:
- CLOSED: route normally
- OPEN: skip to next tier
- HALF-OPEN: probe with single request
|
6. Tier processes inference, returns OpenAI-format response
|
7. Gateway annotates response with _routing metadata:
{tier, model, latency_ms, tokens, fallback_chain}
| Mac Mini M4 | Cloud Run L4 | |
|---|---|---|
| CPU | 10-core (4P + 6E) | 8 vCPU |
| GPU | 10-core Metal | NVIDIA L4 (24 GB VRAM) |
| Memory | 16 GB unified | 24 GB system + 24 GB VRAM |
| Memory Bandwidth | 120 GB/s | 300 GB/s (GDDR6) |
| Cost | $599 one-time | $0.23/hr (scale-to-zero) |
| Always On | Yes (6W idle) | No (0 cost when idle) |
Gemma 4 E2B (2.3B active / 5.1B total)
- MoE with 2 experts, 1 active per token
- 4-bit MLX quantization: 1.15 GB working set
- Multimodal: text + vision + audio
- Use case: real-time chat, simple questions, triage
Gemma 4 E4B (4.5B active / 8.0B total)
- MoE with 4 experts, 2 active per token
- 4-bit MLX quantization: 2.1 GB working set
- Multimodal: text + vision + audio
- Use case: code generation, summarization, moderate reasoning
Gemma 4 26B-A4B (3.8B active / 26B total)
- Dense MoE with larger expert capacity
- 4-bit GGUF (Unsloth Dynamic UD-Q4_K_XL): 17.1 GB
- Multimodal: text + vision
- Use case: complex analysis, long-context (256K), heavy code generation
The 26B-A4B MoE activates only 3.8B parameters per token despite having 26B total. Compared to the 31B dense variant:
| Metric | 26B-A4B MoE | 31B Dense |
|---|---|---|
| Active params/token | 3.8B | 30.7B |
| 4-bit disk size | 17.1 GB (UD) | 18.4 GB |
| L4 throughput | 50-70 tok/s | 20-30 tok/s |
| Context window | 256K tokens | 128K |
| AIME 2026 | 88.3% | 89.2% |
The MoE delivers 97% of dense quality at 2-3x throughput — a decisive advantage for serving.
MLX is Apple's machine learning framework optimized for unified memory:
Advantages:
- Zero-copy Metal GPU dispatch (no PCIe transfer overhead)
- Lazy evaluation with automatic kernel fusion
- 4-bit quantized KV cache (
--kv-bits 4for E2B,--kv-bits 3.5for E4B) - Native
mlx_lm.serverwith OpenAI-compatible API - Integrated with macOS LaunchAgent for auto-start on boot
Limitations:
- Apple Silicon only (no x86, no Windows, no Linux GPU)
- Model ecosystem smaller than GGUF/Transformers
- Cannot run 26B on 16 GB (needs ~15 GB Metal buffer, only 12.7 GB available)
- Requires installing from GitHub main for Gemma 4 support (PyPI 0.31.1 lacks it)
llama.cpp is a C++ inference engine with broad hardware support:
Advantages:
- Runs on NVIDIA GPUs, AMD GPUs, Apple Metal, CPU
- Broad GGUF model ecosystem (Unsloth, TheBloke, bartowski)
- Mature quantization: Q4_K_M, Q5_K, UD (Unsloth Dynamic/imatrix)
- Memory-mapped model loading (mmap) for CPU fallback
- Docker + CUDA build for cloud deployment
Limitations:
- 5-8x slower than MLX on equivalent Apple Silicon hardware
- CUDA build requires stub symlink workaround in Cloud Build (no GPU driver)
- Cold start on Cloud Run: ~110s (download model + load into VRAM)
- More complex deployment (multi-stage Dockerfile, Cloud Build)
| Engine | Model | Device | tok/s | RAM | Notes |
|---|---|---|---|---|---|
| MLX | E2B 4-bit | Mac Mini M4 | 126 | 2.6 GB | Best speed |
| MLX | E4B 4-bit | Mac Mini M4 | 32 | 4.3 GB | Good quality/speed |
| llama.cpp (CPU) | 26B Q4_K_M | Mac Mini M4 | 1.59 | 11.2 GB | Unusable interactively |
| llama.cpp (GPU) | 26B Q4_K_M | Mac Mini M4 | crash | - | OOM on 16 GB |
| llama.cpp (CUDA) | 26B UD-Q4_K_XL | Cloud Run L4 | 50-70 | 17.1 GB VRAM | Production viable |
| Ollama | E4B Q4 | Mac Mini M4 | 6.1 | 7.8 GB | 5x slower than MLX |
Key insight: MLX is the clear winner for Apple Silicon. For models that exceed local memory, cloud GPU with llama.cpp is the only practical option on a budget.
All inter-device communication uses Tailscale (WireGuard-based mesh VPN):
- Mac Mini:
100.75.223.113 - Cloud Run Proxy: Tailscale sidecar container via SOCKS5
:1055 - Encryption: WireGuard (ChaCha20-Poly1305), zero-trust networking
- No port forwarding, no public IP exposure for local devices
The gateway implements intelligent request classification:
def classify_request(messages) -> str:
"""Classify request complexity for tier routing."""
# Simple heuristics based on message content/length
# greeting, fyi, simple question -> "fast"
# summarize, code, rewrite -> "primary"
# analysis, complex, long-form -> "heavy"Each tier has an independent circuit breaker to prevent cascading failures:
CLOSED (healthy)
| (3 failures in 30s window)
v
OPEN (rejecting, skip tier)
| (30s cooldown)
v
HALF-OPEN (probe with 1 request)
|-- success -> CLOSED
|-- failure -> OPEN (restart timer)
Heavy (26B) -> Primary (E4B) -> Fast (E2B) -> 503 Service Unavailable
Response metadata includes "fallback": "heavy->primary" when a fallback occurred, providing transparency to clients.
The heavy tier runs on Cloud Run with an NVIDIA L4 GPU:
| Setting | Value | Rationale |
|---|---|---|
| GPU | NVIDIA L4 (24 GB VRAM) | Fits 17.1 GB model + KV cache |
| Min instances | 0 | Scale-to-zero ($0 when idle) |
| Max instances | 1 | Budget cap |
| CPU | 4-8 vCPU | L4 minimum requirement |
| Memory | 16-24 GB | Model download buffer |
| Startup probe | 600s | Cold start: download + load |
| Scale-down delay | 15 min | Avoid rapid cold starts |
| Container | CUDA 12.4.1 (multi-stage) | Build: devel, Run: runtime |
- Primary: GCS bucket (
gs://gemma4good-models/) — 80s at ~200 MB/s - Fallback: HuggingFace Hub (
unsloth/gemma-4-26B-A4B-it-GGUF) — 2-5 min
Environment variables HF_REPO and HF_FILE allow switching quantizations without rebuilding the container image.
Cloud Build machines have CUDA toolkit but no GPU driver. The linker fails resolving libcuda.so.1 (only a stub libcuda.so exists). Fix:
RUN ln -sf /usr/local/cuda/lib64/stubs/libcuda.so \
/usr/local/cuda/lib64/stubs/libcuda.so.1 \
&& echo "/usr/local/cuda/lib64/stubs" >> /etc/ld.so.conf.d/cuda-stubs.conf \
&& ldconfigThis is a well-known Docker-build-without-GPU issue documented in NVIDIA forums.
- Google Sign-In via Firebase Auth SDK
- JWT token verification on Cloud Run proxy
- Per-user quotas stored in Firestore:
- 100 text requests/day
- 20 media requests/day
- 500 MB media storage
For programmatic access without browser auth:
- Heavy tier: Bearer token (
HEAVY_API_KEYin Secret Manager) - Gateway: Internal (localhost only, no auth needed)
| Type | Formats | Max Size | Processing |
|---|---|---|---|
| Image | JPEG, PNG, WebP, GIF, BMP, HEIC | 20 MB | Resize to 1280px, JPEG q=85 |
| Audio | WAV, MP3, M4A, AAC, OGG, OPUS, FLAC | 50 MB | Convert to 16kHz mono WAV |
| Video | MP4, MOV, WebM, MKV, AVI | 10 GB | Segment + frame extraction |
| Tier | Segment Duration | Frames/Segment | Use Case |
|---|---|---|---|
| Fast | 30s | 4 | Quick preview |
| Primary | 60s | 8 | Standard analysis |
| Heavy | 120s | 16 | Detailed analysis |
| Tier | Model | Device | Tokens/sec | P50 Latency | P99 Latency |
|---|---|---|---|---|---|
| Fast | E2B 2.3B | Mac Mini M4 | 126 | 12ms/tok | 18ms/tok |
| Primary | E4B 4.5B | Mac Mini M4 | 32 | 31ms/tok | 45ms/tok |
| Heavy | 26B-A4B | Cloud Run L4 | 50-70 | 15ms/tok | 22ms/tok |
| Component | Duration | Notes |
|---|---|---|
| Cloud Run container start | 3-5s | Image pull from Artifact Registry |
| Model download (GCS) | ~80s | 17.1 GB at ~200 MB/s |
| Model download (HF) | 2-5 min | Variable, depends on CDN |
| VRAM load | ~25s | L4 PCIe bandwidth |
| Total cold start | ~110s (GCS) | First request after scale-to-zero |
| Warm request | <1s | Model already in VRAM |
| Model | Quantization | Disk | Runtime RAM | GPU VRAM |
|---|---|---|---|---|
| E2B 4-bit | MLX | 1.15 GB | 2.6 GB | shared (unified) |
| E4B 4-bit | MLX | 2.1 GB | 4.3 GB | shared (unified) |
| 26B UD-Q4_K_XL | GGUF | 17.1 GB | N/A | 17.1 GB (L4) |
| Item | Monthly Cost |
|---|---|
| Mac Mini M4 electricity (6W idle, 15W active) | ~$1.50 |
| Cloud Run proxy (always-on, CPU only) | ~$0 (free tier) |
| Cloud Run GPU (scale-to-zero, est. 2hr/day) | ~$14/mo |
| Artifact Registry + GCS | ~$0.50 |
| Total | $3-16/mo |
| Service | Equivalent Usage | Monthly Cost |
|---|---|---|
| OpenAI GPT-4o | 1M tokens/day | ~$90/mo |
| Google Gemini API | 1M tokens/day | ~$45/mo |
| Anthropic Claude | 1M tokens/day | ~$75/mo |
The Mac Mini M4 ($599) pays for itself in 3-6 months compared to API pricing at moderate usage (500K-1M tokens/day). The E2B and E4B tiers have zero marginal cost after hardware purchase.
| Category | Advantage | Impact |
|---|---|---|
| Cost | Near-zero marginal cost for local tiers | Enables unlimited experimentation |
| Privacy | All data stays on your hardware (local tiers) | HIPAA/GDPR viable |
| Latency | Local inference: 0ms network overhead | Real-time chat experience |
| Availability | No API rate limits, no outages from upstream | 24/7 for local tiers |
| Flexibility | Swap models by changing env vars, no vendor lock-in | Future-proof |
| Scale-to-zero | Heavy tier costs $0 when idle | Budget-friendly cloud burst |
| Multimodal | Text, vision, audio, video in one stack | Unified pipeline |
| Reproducible | Fully open-source models + infra | Academic citation-ready |
| Category | Disadvantage | Mitigation |
|---|---|---|
| Hardware lock-in | MLX requires Apple Silicon | llama.cpp for cross-platform |
| 16 GB ceiling | 26B cannot run locally on Mac Mini | Cloud burst tier |
| Cold start | 110s for heavy tier from zero | 15-min scale-down delay |
| Single point of failure | Mac Mini offline = local tiers unavailable | Cloud Run proxy caches health |
| Model freshness | No fine-tuning pipeline | Future work |
| Concurrency | GPU inference is sequential | Scale via Cloud Run instances |
| Complexity | Multi-service architecture | LaunchAgents + deploy scripts |
| Network dependency | Heavy tier requires internet | Local tiers work offline |
- Gemma 4 MLX support requires
mlx-lm >= 0.31.2from GitHub (not PyPI as of April 2026) - 26B on 16 GB Apple Silicon hangs with GPU layers; CPU-only yields 1.59 tok/s (unusable)
- Tailscale on macOS requires the official
.pkginstaller (Homebrew version lacks Network Extension daemon) - Cloud Build CUDA needs
libcuda.so.1stub symlink (no GPU driver in build environment) - HuggingFace gated models require accepted license + valid
HF_TOKEN
See notebooks/gemma4_local_inference.ipynb for a guided setup notebook that:
- Detects your OS (macOS/Windows) and hardware (RAM, GPU)
- Recommends which models your system can run
- Installs the appropriate inference engine (MLX or llama.cpp)
- Downloads models and runs benchmark inference
- Optionally sets up the full gateway
# 1. Clone and enter project
git clone <this-repo> && cd gemma4-stack
# 2. Set up local tiers (macOS + Apple Silicon only)
python3 -m venv gateway-venv && source gateway-venv/bin/activate
pip install mlx-lm fastapi uvicorn
# 3. Start local inference servers
bash scripts/start_fast.sh & # E2B on :8082
bash scripts/start_primary.sh & # E4B on :8083
python scripts/gateway.py & # Gateway on :8080
# 4. (Optional) Deploy cloud proxy
export HF_TOKEN="hf_..."
bash cloud/proxy/deploy.sh gemma4good
# 5. (Optional) Deploy heavy tier
export HEAVY_API_KEY=$(openssl rand -hex 32)
bash cloud/heavy/deploy.sh gemma4good- LoRA fine-tuning pipeline: Domain-specific adaptation using MLX on Apple Silicon
- Streaming responses: Server-Sent Events for real-time token delivery
- Model warm-swap: Hot-reload models without restarting servers
- Prometheus metrics: Structured observability for all tiers
- Multi-node scaling: Additional Mac Minis via Tailscale for horizontal throughput
- Speculative decoding: E2B as draft model for 26B verification
- RAG integration: Local vector store (ChromaDB/FAISS) for domain knowledge
- Edge caching: Frequently asked queries cached at proxy layer
- Federated inference: Privacy-preserving multi-party model serving
- Quantization-aware training: Custom GGUF quants optimized for specific hardware
- Energy efficiency benchmarks: Tokens per watt across model sizes and hardware
- Academic benchmark suite: Standardized evaluation of local vs. cloud inference
gemma4-stack/
+-- README.md # This document
+-- notebooks/
| +-- gemma4_local_inference.ipynb # Guided setup notebook (Windows + Mac)
+-- scripts/
| +-- gateway.py # FastAPI gateway (tier routing, circuit breakers)
| +-- start_fast.sh # Launch E2B MLX server
| +-- start_primary.sh # Launch E4B MLX server
| +-- benchmark.py # Throughput benchmarking tool
| +-- media.py # Media processing (resize, convert, chunk)
+-- cloud/
| +-- proxy/
| | +-- main.py # Cloud Run proxy (auth, rate limit, Tailscale tunnel)
| | +-- auth.py # Firebase authentication
| | +-- db.py # Firestore user/usage storage
| | +-- storage.py # GCS media storage
| | +-- deploy.sh # Proxy deployment script
| +-- heavy/
| | +-- Dockerfile # Multi-stage CUDA build (devel -> runtime)
| | +-- main.py # llama-cpp-python FastAPI server
| | +-- service.yaml # Cloud Run GPU service definition
| | +-- deploy.sh # Heavy tier deployment script (6 steps)
| +-- web/
| +-- index.html # Single-file SPA (Apple HIG, responsive)
+-- tests/
| +-- test_e2e.py # End-to-end tests (health, media, cloud)
| +-- conftest.py # Test fixtures
+-- docs/
+-- API.md # OpenAI-compatible API reference
+-- IMPLEMENTATION_LOG.md # Detailed implementation journal
If you use this architecture or findings in academic work:
@misc{gemma4localstack2026,
title={Gemma 4 Local Inference Stack: Evaluating Consumer Hardware
for Self-Hosted Large Language Model Serving},
author={Jemma Project Contributors},
year={2026},
howpublished={\url{https://github.com/jemma/gemma4-stack}},
note={Three-tier MLX + llama.cpp inference architecture on Apple Silicon
and Cloud Run GPU}
}Apache 2.0. Model weights are subject to Google's Gemma Terms of Use.