Skip to content

AlexiosBluffMara/gemma4-stack

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

8 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Gemma 4 Local Inference Stack: Architecture & Viability Study

Research-grade evaluation of running Google's Gemma 4 model family entirely on consumer hardware, with an optional cloud GPU burst tier for heavy workloads.

License: Apache 2.0 Models: Gemma 4 MLX llama.cpp Python Cloud Run


Abstract

This project investigates whether a consumer-grade Apple Silicon machine (Mac Mini M4, 16 GB unified memory, $599) can serve as a viable self-hosted AI inference node for Google's Gemma 4 model family. We implement a three-tier inference architecture spanning local hardware and an on-demand cloud GPU, measure throughput and latency across model sizes, and evaluate the cost-performance tradeoff against hosted API services.

Key findings:

  1. The 2.3B parameter E2B model achieves 126 tokens/second on a Mac Mini M4 using MLX, competitive with hosted endpoints and sufficient for real-time conversational AI.
  2. The 4.5B parameter E4B model achieves 32 tokens/second, adequate for code generation and summarization workloads.
  3. The 26B parameter MoE model cannot run at interactive speeds on 16 GB devices (1.59 tok/s CPU-only), but achieves 50-70 tok/s on a Cloud Run NVIDIA L4 GPU ($0.23/hr, scale-to-zero).
  4. MLX on Apple Silicon delivers 5-8x better throughput and 55-63% less memory compared to llama.cpp/Ollama for equivalent quantized models.
  5. The total infrastructure cost for 24/7 availability of two local models plus on-demand cloud GPU is ~$3-5/month (electricity + minimal Cloud Run usage), compared to $20-100+/month for equivalent hosted API quotas.

Table of Contents

  1. Motivation & Research Context
  2. System Architecture
  3. Hardware & Model Selection
  4. Inference Engine Analysis
  5. Network Topology & Routing
  6. Cloud Burst Tier
  7. Authentication & Access Control
  8. Multimodal Media Pipeline
  9. Performance Results
  10. Cost Analysis
  11. Architecture Pros & Cons
  12. Reproducibility
  13. Future Directions
  14. File Structure

1. Motivation & Research Context

1.1 The Local Inference Thesis

Large language model inference is increasingly commoditized. Open-weight models from Google (Gemma), Meta (Llama), and others now match or exceed GPT-3.5-class performance. Simultaneously, consumer hardware — particularly Apple Silicon with unified memory — has reached the point where sub-10B parameter models run at interactive speeds without discrete GPUs.

This creates an opportunity: self-hosted AI inference at marginal cost, with full data sovereignty, no rate limits, and no vendor lock-in. But how viable is this in practice?

1.2 Research Questions

# Question Finding
RQ1 Can Gemma 4 models run at interactive speeds on consumer Apple Silicon? Yes (E2B: 126 tok/s, E4B: 32 tok/s on M4 16 GB)
RQ2 What is the minimum viable hardware for each model tier? E2B: 4 GB free RAM; E4B: 6 GB; 26B: 24 GB VRAM (requires GPU)
RQ3 How does MLX compare to other inference engines on Apple Silicon? 5-8x faster than llama.cpp, 55-63% less RAM
RQ4 Can a hybrid local+cloud architecture provide full model coverage? Yes, with Cloud Run GPU scale-to-zero at ~$0.003/request
RQ5 What is the total cost of ownership vs. hosted APIs? $3-5/mo vs $20-100+/mo for equivalent usage

1.3 Academic & Business Value

Academic applications:

  • Reproducible NLP experiments without GPU cluster access
  • Private inference for sensitive research data (medical, legal, educational)
  • Curriculum development: students can run state-of-the-art models on personal hardware
  • Benchmark framework for evaluating model-hardware efficiency frontiers

Business applications:

  • Edge AI deployment pattern for privacy-sensitive industries (healthcare, finance, legal)
  • Cost-optimized inference for startups and small teams
  • Hybrid cloud architecture template (local-first, cloud-burst)
  • Internal LLM infrastructure without SaaS vendor dependency

2. System Architecture

2.1 High-Level Topology

PUBLIC INTERNET                           TAILSCALE MESH (WireGuard, encrypted)
       |                                            |
       v                                            |
 GCP Cloud Run Proxy                                |
 (Firebase Auth,           .------------------------+
  rate limiting,           |                        |
  circuit breaker)   Mac Mini M4 16GB         Cloud Run GPU (L4)
       |             (always-on, $0/mo)       (scale-to-zero)
       |             :8082  E2B   126 tok/s   gemma4-heavy
       |             :8083  E4B    32 tok/s   26B-A4B  50-70 tok/s
       |             :8080  Gateway (FastAPI)
       |                    |routes|
       +----------> iPhone / Browser / API clients

2.2 Component Breakdown

Component Location Technology Purpose
Gateway Mac Mini :8080 FastAPI + uvicorn Request classification, tier routing, circuit breakers, health aggregation
Fast Tier Mac Mini :8082 MLX mlx_lm.server E2B 2.3B — greetings, simple Q&A, real-time chat
Primary Tier Mac Mini :8083 MLX mlx_lm.server E4B 4.5B — summarization, code assist, moderate reasoning
Heavy Tier Cloud Run GPU llama-cpp-python + CUDA 26B-A4B MoE — deep analysis, complex code gen, long-context
Cloud Proxy Cloud Run (CPU) FastAPI + Tailscale sidecar Public HTTPS endpoint, Firebase auth, Tailscale tunnel to Mac Mini
Web UI Served by proxy Single-file SPA (vanilla JS) Apple HIG-inspired chat interface, multimodal upload
Auth Firebase + Cloud Run Firebase Auth + Firestore Google Sign-In, per-user quotas, usage tracking

2.3 Request Lifecycle

1. Client sends POST /v1/chat/completions (OpenAI-compatible)
                    |
2. Cloud Proxy authenticates (Firebase JWT or API key)
   Rate-limits (30 text/min, 10 media/min per IP)
                    |
3. Proxy forwards via Tailscale SOCKS5 to Mac Mini gateway
                    |
4. Gateway classifies request complexity:
   - Simple/greeting   -> Fast (E2B)
   - Moderate/code     -> Primary (E4B)
   - Complex/analysis  -> Heavy (26B, Cloud Run GPU)
                    |
5. Circuit breaker checks tier health:
   - CLOSED: route normally
   - OPEN: skip to next tier
   - HALF-OPEN: probe with single request
                    |
6. Tier processes inference, returns OpenAI-format response
                    |
7. Gateway annotates response with _routing metadata:
   {tier, model, latency_ms, tokens, fallback_chain}

3. Hardware & Model Selection

3.1 Device Specifications

Mac Mini M4 Cloud Run L4
CPU 10-core (4P + 6E) 8 vCPU
GPU 10-core Metal NVIDIA L4 (24 GB VRAM)
Memory 16 GB unified 24 GB system + 24 GB VRAM
Memory Bandwidth 120 GB/s 300 GB/s (GDDR6)
Cost $599 one-time $0.23/hr (scale-to-zero)
Always On Yes (6W idle) No (0 cost when idle)

3.2 Model Selection Rationale

Gemma 4 E2B (2.3B active / 5.1B total)

  • MoE with 2 experts, 1 active per token
  • 4-bit MLX quantization: 1.15 GB working set
  • Multimodal: text + vision + audio
  • Use case: real-time chat, simple questions, triage

Gemma 4 E4B (4.5B active / 8.0B total)

  • MoE with 4 experts, 2 active per token
  • 4-bit MLX quantization: 2.1 GB working set
  • Multimodal: text + vision + audio
  • Use case: code generation, summarization, moderate reasoning

Gemma 4 26B-A4B (3.8B active / 26B total)

  • Dense MoE with larger expert capacity
  • 4-bit GGUF (Unsloth Dynamic UD-Q4_K_XL): 17.1 GB
  • Multimodal: text + vision
  • Use case: complex analysis, long-context (256K), heavy code generation

3.3 Why MoE Over Dense?

The 26B-A4B MoE activates only 3.8B parameters per token despite having 26B total. Compared to the 31B dense variant:

Metric 26B-A4B MoE 31B Dense
Active params/token 3.8B 30.7B
4-bit disk size 17.1 GB (UD) 18.4 GB
L4 throughput 50-70 tok/s 20-30 tok/s
Context window 256K tokens 128K
AIME 2026 88.3% 89.2%

The MoE delivers 97% of dense quality at 2-3x throughput — a decisive advantage for serving.


4. Inference Engine Analysis

4.1 MLX (Apple Silicon, Local Tiers)

MLX is Apple's machine learning framework optimized for unified memory:

Advantages:

  • Zero-copy Metal GPU dispatch (no PCIe transfer overhead)
  • Lazy evaluation with automatic kernel fusion
  • 4-bit quantized KV cache (--kv-bits 4 for E2B, --kv-bits 3.5 for E4B)
  • Native mlx_lm.server with OpenAI-compatible API
  • Integrated with macOS LaunchAgent for auto-start on boot

Limitations:

  • Apple Silicon only (no x86, no Windows, no Linux GPU)
  • Model ecosystem smaller than GGUF/Transformers
  • Cannot run 26B on 16 GB (needs ~15 GB Metal buffer, only 12.7 GB available)
  • Requires installing from GitHub main for Gemma 4 support (PyPI 0.31.1 lacks it)

4.2 llama.cpp / llama-cpp-python (CUDA, Cloud Tier)

llama.cpp is a C++ inference engine with broad hardware support:

Advantages:

  • Runs on NVIDIA GPUs, AMD GPUs, Apple Metal, CPU
  • Broad GGUF model ecosystem (Unsloth, TheBloke, bartowski)
  • Mature quantization: Q4_K_M, Q5_K, UD (Unsloth Dynamic/imatrix)
  • Memory-mapped model loading (mmap) for CPU fallback
  • Docker + CUDA build for cloud deployment

Limitations:

  • 5-8x slower than MLX on equivalent Apple Silicon hardware
  • CUDA build requires stub symlink workaround in Cloud Build (no GPU driver)
  • Cold start on Cloud Run: ~110s (download model + load into VRAM)
  • More complex deployment (multi-stage Dockerfile, Cloud Build)

4.3 Comparative Benchmarks (Same Hardware)

Engine Model Device tok/s RAM Notes
MLX E2B 4-bit Mac Mini M4 126 2.6 GB Best speed
MLX E4B 4-bit Mac Mini M4 32 4.3 GB Good quality/speed
llama.cpp (CPU) 26B Q4_K_M Mac Mini M4 1.59 11.2 GB Unusable interactively
llama.cpp (GPU) 26B Q4_K_M Mac Mini M4 crash - OOM on 16 GB
llama.cpp (CUDA) 26B UD-Q4_K_XL Cloud Run L4 50-70 17.1 GB VRAM Production viable
Ollama E4B Q4 Mac Mini M4 6.1 7.8 GB 5x slower than MLX

Key insight: MLX is the clear winner for Apple Silicon. For models that exceed local memory, cloud GPU with llama.cpp is the only practical option on a budget.


5. Network Topology & Routing

5.1 Tailscale Mesh

All inter-device communication uses Tailscale (WireGuard-based mesh VPN):

  • Mac Mini: 100.75.223.113
  • Cloud Run Proxy: Tailscale sidecar container via SOCKS5 :1055
  • Encryption: WireGuard (ChaCha20-Poly1305), zero-trust networking
  • No port forwarding, no public IP exposure for local devices

5.2 Gateway Routing Logic

The gateway implements intelligent request classification:

def classify_request(messages) -> str:
    """Classify request complexity for tier routing."""
    # Simple heuristics based on message content/length
    # greeting, fyi, simple question  -> "fast"
    # summarize, code, rewrite        -> "primary"
    # analysis, complex, long-form    -> "heavy"

5.3 Circuit Breaker Pattern

Each tier has an independent circuit breaker to prevent cascading failures:

CLOSED (healthy)
    | (3 failures in 30s window)
    v
OPEN (rejecting, skip tier)
    | (30s cooldown)
    v
HALF-OPEN (probe with 1 request)
    |-- success -> CLOSED
    |-- failure -> OPEN (restart timer)

5.4 Fallback Chain

Heavy (26B) -> Primary (E4B) -> Fast (E2B) -> 503 Service Unavailable

Response metadata includes "fallback": "heavy->primary" when a fallback occurred, providing transparency to clients.


6. Cloud Burst Tier

6.1 Cloud Run GPU Configuration

The heavy tier runs on Cloud Run with an NVIDIA L4 GPU:

Setting Value Rationale
GPU NVIDIA L4 (24 GB VRAM) Fits 17.1 GB model + KV cache
Min instances 0 Scale-to-zero ($0 when idle)
Max instances 1 Budget cap
CPU 4-8 vCPU L4 minimum requirement
Memory 16-24 GB Model download buffer
Startup probe 600s Cold start: download + load
Scale-down delay 15 min Avoid rapid cold starts
Container CUDA 12.4.1 (multi-stage) Build: devel, Run: runtime

6.2 Model Download Strategy

  1. Primary: GCS bucket (gs://gemma4good-models/) — 80s at ~200 MB/s
  2. Fallback: HuggingFace Hub (unsloth/gemma-4-26B-A4B-it-GGUF) — 2-5 min

Environment variables HF_REPO and HF_FILE allow switching quantizations without rebuilding the container image.

6.3 CUDA Build Workaround

Cloud Build machines have CUDA toolkit but no GPU driver. The linker fails resolving libcuda.so.1 (only a stub libcuda.so exists). Fix:

RUN ln -sf /usr/local/cuda/lib64/stubs/libcuda.so \
           /usr/local/cuda/lib64/stubs/libcuda.so.1 \
    && echo "/usr/local/cuda/lib64/stubs" >> /etc/ld.so.conf.d/cuda-stubs.conf \
    && ldconfig

This is a well-known Docker-build-without-GPU issue documented in NVIDIA forums.


7. Authentication & Access Control

7.1 Firebase Authentication

  • Google Sign-In via Firebase Auth SDK
  • JWT token verification on Cloud Run proxy
  • Per-user quotas stored in Firestore:
    • 100 text requests/day
    • 20 media requests/day
    • 500 MB media storage

7.2 API Key Fallback

For programmatic access without browser auth:

  • Heavy tier: Bearer token (HEAVY_API_KEY in Secret Manager)
  • Gateway: Internal (localhost only, no auth needed)

8. Multimodal Media Pipeline

8.1 Supported Formats

Type Formats Max Size Processing
Image JPEG, PNG, WebP, GIF, BMP, HEIC 20 MB Resize to 1280px, JPEG q=85
Audio WAV, MP3, M4A, AAC, OGG, OPUS, FLAC 50 MB Convert to 16kHz mono WAV
Video MP4, MOV, WebM, MKV, AVI 10 GB Segment + frame extraction

8.2 Video Chunking Strategy

Tier Segment Duration Frames/Segment Use Case
Fast 30s 4 Quick preview
Primary 60s 8 Standard analysis
Heavy 120s 16 Detailed analysis

9. Performance Results

9.1 Throughput

Tier Model Device Tokens/sec P50 Latency P99 Latency
Fast E2B 2.3B Mac Mini M4 126 12ms/tok 18ms/tok
Primary E4B 4.5B Mac Mini M4 32 31ms/tok 45ms/tok
Heavy 26B-A4B Cloud Run L4 50-70 15ms/tok 22ms/tok

9.2 Cold Start Analysis

Component Duration Notes
Cloud Run container start 3-5s Image pull from Artifact Registry
Model download (GCS) ~80s 17.1 GB at ~200 MB/s
Model download (HF) 2-5 min Variable, depends on CDN
VRAM load ~25s L4 PCIe bandwidth
Total cold start ~110s (GCS) First request after scale-to-zero
Warm request <1s Model already in VRAM

9.3 Memory Footprint

Model Quantization Disk Runtime RAM GPU VRAM
E2B 4-bit MLX 1.15 GB 2.6 GB shared (unified)
E4B 4-bit MLX 2.1 GB 4.3 GB shared (unified)
26B UD-Q4_K_XL GGUF 17.1 GB N/A 17.1 GB (L4)

10. Cost Analysis

10.1 Self-Hosted (This Project)

Item Monthly Cost
Mac Mini M4 electricity (6W idle, 15W active) ~$1.50
Cloud Run proxy (always-on, CPU only) ~$0 (free tier)
Cloud Run GPU (scale-to-zero, est. 2hr/day) ~$14/mo
Artifact Registry + GCS ~$0.50
Total $3-16/mo

10.2 Equivalent Hosted APIs

Service Equivalent Usage Monthly Cost
OpenAI GPT-4o 1M tokens/day ~$90/mo
Google Gemini API 1M tokens/day ~$45/mo
Anthropic Claude 1M tokens/day ~$75/mo

10.3 Break-Even Analysis

The Mac Mini M4 ($599) pays for itself in 3-6 months compared to API pricing at moderate usage (500K-1M tokens/day). The E2B and E4B tiers have zero marginal cost after hardware purchase.


11. Architecture Pros & Cons

11.1 Advantages

Category Advantage Impact
Cost Near-zero marginal cost for local tiers Enables unlimited experimentation
Privacy All data stays on your hardware (local tiers) HIPAA/GDPR viable
Latency Local inference: 0ms network overhead Real-time chat experience
Availability No API rate limits, no outages from upstream 24/7 for local tiers
Flexibility Swap models by changing env vars, no vendor lock-in Future-proof
Scale-to-zero Heavy tier costs $0 when idle Budget-friendly cloud burst
Multimodal Text, vision, audio, video in one stack Unified pipeline
Reproducible Fully open-source models + infra Academic citation-ready

11.2 Disadvantages

Category Disadvantage Mitigation
Hardware lock-in MLX requires Apple Silicon llama.cpp for cross-platform
16 GB ceiling 26B cannot run locally on Mac Mini Cloud burst tier
Cold start 110s for heavy tier from zero 15-min scale-down delay
Single point of failure Mac Mini offline = local tiers unavailable Cloud Run proxy caches health
Model freshness No fine-tuning pipeline Future work
Concurrency GPU inference is sequential Scale via Cloud Run instances
Complexity Multi-service architecture LaunchAgents + deploy scripts
Network dependency Heavy tier requires internet Local tiers work offline

11.3 Known Limitations

  1. Gemma 4 MLX support requires mlx-lm >= 0.31.2 from GitHub (not PyPI as of April 2026)
  2. 26B on 16 GB Apple Silicon hangs with GPU layers; CPU-only yields 1.59 tok/s (unusable)
  3. Tailscale on macOS requires the official .pkg installer (Homebrew version lacks Network Extension daemon)
  4. Cloud Build CUDA needs libcuda.so.1 stub symlink (no GPU driver in build environment)
  5. HuggingFace gated models require accepted license + valid HF_TOKEN

12. Reproducibility

12.1 Quick Start (Local Only)

See notebooks/gemma4_local_inference.ipynb for a guided setup notebook that:

  • Detects your OS (macOS/Windows) and hardware (RAM, GPU)
  • Recommends which models your system can run
  • Installs the appropriate inference engine (MLX or llama.cpp)
  • Downloads models and runs benchmark inference
  • Optionally sets up the full gateway

12.2 Full Cloud Deployment

# 1. Clone and enter project
git clone <this-repo> && cd gemma4-stack

# 2. Set up local tiers (macOS + Apple Silicon only)
python3 -m venv gateway-venv && source gateway-venv/bin/activate
pip install mlx-lm fastapi uvicorn

# 3. Start local inference servers
bash scripts/start_fast.sh &     # E2B on :8082
bash scripts/start_primary.sh &  # E4B on :8083
python scripts/gateway.py &      # Gateway on :8080

# 4. (Optional) Deploy cloud proxy
export HF_TOKEN="hf_..."
bash cloud/proxy/deploy.sh gemma4good

# 5. (Optional) Deploy heavy tier
export HEAVY_API_KEY=$(openssl rand -hex 32)
bash cloud/heavy/deploy.sh gemma4good

13. Future Directions

13.1 Near-Term

  • LoRA fine-tuning pipeline: Domain-specific adaptation using MLX on Apple Silicon
  • Streaming responses: Server-Sent Events for real-time token delivery
  • Model warm-swap: Hot-reload models without restarting servers
  • Prometheus metrics: Structured observability for all tiers

13.2 Medium-Term

  • Multi-node scaling: Additional Mac Minis via Tailscale for horizontal throughput
  • Speculative decoding: E2B as draft model for 26B verification
  • RAG integration: Local vector store (ChromaDB/FAISS) for domain knowledge
  • Edge caching: Frequently asked queries cached at proxy layer

13.3 Long-Term Research

  • Federated inference: Privacy-preserving multi-party model serving
  • Quantization-aware training: Custom GGUF quants optimized for specific hardware
  • Energy efficiency benchmarks: Tokens per watt across model sizes and hardware
  • Academic benchmark suite: Standardized evaluation of local vs. cloud inference

14. File Structure

gemma4-stack/
+-- README.md                          # This document
+-- notebooks/
|   +-- gemma4_local_inference.ipynb   # Guided setup notebook (Windows + Mac)
+-- scripts/
|   +-- gateway.py                     # FastAPI gateway (tier routing, circuit breakers)
|   +-- start_fast.sh                  # Launch E2B MLX server
|   +-- start_primary.sh              # Launch E4B MLX server
|   +-- benchmark.py                   # Throughput benchmarking tool
|   +-- media.py                       # Media processing (resize, convert, chunk)
+-- cloud/
|   +-- proxy/
|   |   +-- main.py                    # Cloud Run proxy (auth, rate limit, Tailscale tunnel)
|   |   +-- auth.py                    # Firebase authentication
|   |   +-- db.py                      # Firestore user/usage storage
|   |   +-- storage.py                 # GCS media storage
|   |   +-- deploy.sh                  # Proxy deployment script
|   +-- heavy/
|   |   +-- Dockerfile                 # Multi-stage CUDA build (devel -> runtime)
|   |   +-- main.py                    # llama-cpp-python FastAPI server
|   |   +-- service.yaml               # Cloud Run GPU service definition
|   |   +-- deploy.sh                  # Heavy tier deployment script (6 steps)
|   +-- web/
|       +-- index.html                 # Single-file SPA (Apple HIG, responsive)
+-- tests/
|   +-- test_e2e.py                    # End-to-end tests (health, media, cloud)
|   +-- conftest.py                    # Test fixtures
+-- docs/
    +-- API.md                         # OpenAI-compatible API reference
    +-- IMPLEMENTATION_LOG.md          # Detailed implementation journal

Citation

If you use this architecture or findings in academic work:

@misc{gemma4localstack2026,
  title={Gemma 4 Local Inference Stack: Evaluating Consumer Hardware
         for Self-Hosted Large Language Model Serving},
  author={Jemma Project Contributors},
  year={2026},
  howpublished={\url{https://github.com/jemma/gemma4-stack}},
  note={Three-tier MLX + llama.cpp inference architecture on Apple Silicon
        and Cloud Run GPU}
}

License

Apache 2.0. Model weights are subject to Google's Gemma Terms of Use.

About

Three-tier Gemma 4 local inference on Mac Mini M4 — MLX + llama.cpp + Tailscale

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors