Gemma 4 Local Inference Stack: Architecture & Viability Study

Research-grade evaluation of running Google's Gemma 4 model family entirely on consumer hardware, with an optional cloud GPU burst tier for heavy workloads.

Abstract

This project investigates whether a consumer-grade Apple Silicon machine (Mac Mini M4, 16 GB unified memory, $599) can serve as a viable self-hosted AI inference node for Google's Gemma 4 model family. We implement a three-tier inference architecture spanning local hardware and an on-demand cloud GPU, measure throughput and latency across model sizes, and evaluate the cost-performance tradeoff against hosted API services.

Key findings:

The 2.3B parameter E2B model achieves 126 tokens/second on a Mac Mini M4 using MLX, competitive with hosted endpoints and sufficient for real-time conversational AI.
The 4.5B parameter E4B model achieves 32 tokens/second, adequate for code generation and summarization workloads.
The 26B parameter MoE model cannot run at interactive speeds on 16 GB devices (1.59 tok/s CPU-only), but achieves 50-70 tok/s on a Cloud Run NVIDIA L4 GPU ($0.23/hr, scale-to-zero).
MLX on Apple Silicon delivers 5-8x better throughput and 55-63% less memory compared to llama.cpp/Ollama for equivalent quantized models.
The total infrastructure cost for 24/7 availability of two local models plus on-demand cloud GPU is ~$3-5/month (electricity + minimal Cloud Run usage), compared to $20-100+/month for equivalent hosted API quotas.

1. Motivation & Research Context

1.1 The Local Inference Thesis

Large language model inference is increasingly commoditized. Open-weight models from Google (Gemma), Meta (Llama), and others now match or exceed GPT-3.5-class performance. Simultaneously, consumer hardware — particularly Apple Silicon with unified memory — has reached the point where sub-10B parameter models run at interactive speeds without discrete GPUs.

This creates an opportunity: self-hosted AI inference at marginal cost, with full data sovereignty, no rate limits, and no vendor lock-in. But how viable is this in practice?

1.2 Research Questions

#	Question	Finding
RQ1	Can Gemma 4 models run at interactive speeds on consumer Apple Silicon?	Yes (E2B: 126 tok/s, E4B: 32 tok/s on M4 16 GB)
RQ2	What is the minimum viable hardware for each model tier?	E2B: 4 GB free RAM; E4B: 6 GB; 26B: 24 GB VRAM (requires GPU)
RQ3	How does MLX compare to other inference engines on Apple Silicon?	5-8x faster than llama.cpp, 55-63% less RAM
RQ4	Can a hybrid local+cloud architecture provide full model coverage?	Yes, with Cloud Run GPU scale-to-zero at ~$0.003/request
RQ5	What is the total cost of ownership vs. hosted APIs?	$3-5/mo vs $20-100+/mo for equivalent usage

1.3 Academic & Business Value

Academic applications:

Reproducible NLP experiments without GPU cluster access
Private inference for sensitive research data (medical, legal, educational)
Curriculum development: students can run state-of-the-art models on personal hardware
Benchmark framework for evaluating model-hardware efficiency frontiers

Business applications:

Edge AI deployment pattern for privacy-sensitive industries (healthcare, finance, legal)
Cost-optimized inference for startups and small teams
Hybrid cloud architecture template (local-first, cloud-burst)
Internal LLM infrastructure without SaaS vendor dependency

2. System Architecture

2.1 High-Level Topology

PUBLIC INTERNET                           TAILSCALE MESH (WireGuard, encrypted)
       |                                            |
       v                                            |
 GCP Cloud Run Proxy                                |
 (Firebase Auth,           .------------------------+
  rate limiting,           |                        |
  circuit breaker)   Mac Mini M4 16GB         Cloud Run GPU (L4)
       |             (always-on, $0/mo)       (scale-to-zero)
       |             :8082  E2B   126 tok/s   gemma4-heavy
       |             :8083  E4B    32 tok/s   26B-A4B  50-70 tok/s
       |             :8080  Gateway (FastAPI)
       |                    |routes|
       +----------> iPhone / Browser / API clients

2.2 Component Breakdown

Component	Location	Technology	Purpose
Gateway	Mac Mini :8080	FastAPI + uvicorn	Request classification, tier routing, circuit breakers, health aggregation
Fast Tier	Mac Mini :8082	MLX `mlx_lm.server`	E2B 2.3B — greetings, simple Q&A, real-time chat
Primary Tier	Mac Mini :8083	MLX `mlx_lm.server`	E4B 4.5B — summarization, code assist, moderate reasoning
Heavy Tier	Cloud Run GPU	llama-cpp-python + CUDA	26B-A4B MoE — deep analysis, complex code gen, long-context
Cloud Proxy	Cloud Run (CPU)	FastAPI + Tailscale sidecar	Public HTTPS endpoint, Firebase auth, Tailscale tunnel to Mac Mini
Web UI	Served by proxy	Single-file SPA (vanilla JS)	Apple HIG-inspired chat interface, multimodal upload
Auth	Firebase + Cloud Run	Firebase Auth + Firestore	Google Sign-In, per-user quotas, usage tracking

2.3 Request Lifecycle

1. Client sends POST /v1/chat/completions (OpenAI-compatible)
                    |
2. Cloud Proxy authenticates (Firebase JWT or API key)
   Rate-limits (30 text/min, 10 media/min per IP)
                    |
3. Proxy forwards via Tailscale SOCKS5 to Mac Mini gateway
                    |
4. Gateway classifies request complexity:
   - Simple/greeting   -> Fast (E2B)
   - Moderate/code     -> Primary (E4B)
   - Complex/analysis  -> Heavy (26B, Cloud Run GPU)
                    |
5. Circuit breaker checks tier health:
   - CLOSED: route normally
   - OPEN: skip to next tier
   - HALF-OPEN: probe with single request
                    |
6. Tier processes inference, returns OpenAI-format response
                    |
7. Gateway annotates response with _routing metadata:
   {tier, model, latency_ms, tokens, fallback_chain}

3. Hardware & Model Selection

3.1 Device Specifications

	Mac Mini M4	Cloud Run L4
CPU	10-core (4P + 6E)	8 vCPU
GPU	10-core Metal	NVIDIA L4 (24 GB VRAM)
Memory	16 GB unified	24 GB system + 24 GB VRAM
Memory Bandwidth	120 GB/s	300 GB/s (GDDR6)
Cost	$599 one-time	$0.23/hr (scale-to-zero)
Always On	Yes (6W idle)	No (0 cost when idle)

3.2 Model Selection Rationale

Gemma 4 E2B (2.3B active / 5.1B total)

MoE with 2 experts, 1 active per token
4-bit MLX quantization: 1.15 GB working set
Multimodal: text + vision + audio
Use case: real-time chat, simple questions, triage

Gemma 4 E4B (4.5B active / 8.0B total)

MoE with 4 experts, 2 active per token
4-bit MLX quantization: 2.1 GB working set
Multimodal: text + vision + audio
Use case: code generation, summarization, moderate reasoning

Gemma 4 26B-A4B (3.8B active / 26B total)

Dense MoE with larger expert capacity
4-bit GGUF (Unsloth Dynamic UD-Q4_K_XL): 17.1 GB
Multimodal: text + vision
Use case: complex analysis, long-context (256K), heavy code generation

3.3 Why MoE Over Dense?

The 26B-A4B MoE activates only 3.8B parameters per token despite having 26B total. Compared to the 31B dense variant:

Metric	26B-A4B MoE	31B Dense
Active params/token	3.8B	30.7B
4-bit disk size	17.1 GB (UD)	18.4 GB
L4 throughput	50-70 tok/s	20-30 tok/s
Context window	256K tokens	128K
AIME 2026	88.3%	89.2%

The MoE delivers 97% of dense quality at 2-3x throughput — a decisive advantage for serving.

4. Inference Engine Analysis

4.1 MLX (Apple Silicon, Local Tiers)

MLX is Apple's machine learning framework optimized for unified memory:

Advantages:

Zero-copy Metal GPU dispatch (no PCIe transfer overhead)
Lazy evaluation with automatic kernel fusion
4-bit quantized KV cache (--kv-bits 4 for E2B, --kv-bits 3.5 for E4B)
Native mlx_lm.server with OpenAI-compatible API
Integrated with macOS LaunchAgent for auto-start on boot

Limitations:

Apple Silicon only (no x86, no Windows, no Linux GPU)
Model ecosystem smaller than GGUF/Transformers
Cannot run 26B on 16 GB (needs ~15 GB Metal buffer, only 12.7 GB available)
Requires installing from GitHub main for Gemma 4 support (PyPI 0.31.1 lacks it)

4.2 llama.cpp / llama-cpp-python (CUDA, Cloud Tier)

llama.cpp is a C++ inference engine with broad hardware support:

Advantages:

Runs on NVIDIA GPUs, AMD GPUs, Apple Metal, CPU
Broad GGUF model ecosystem (Unsloth, TheBloke, bartowski)
Mature quantization: Q4_K_M, Q5_K, UD (Unsloth Dynamic/imatrix)
Memory-mapped model loading (mmap) for CPU fallback
Docker + CUDA build for cloud deployment

Limitations:

5-8x slower than MLX on equivalent Apple Silicon hardware
CUDA build requires stub symlink workaround in Cloud Build (no GPU driver)
Cold start on Cloud Run: ~110s (download model + load into VRAM)
More complex deployment (multi-stage Dockerfile, Cloud Build)

4.3 Comparative Benchmarks (Same Hardware)

Engine	Model	Device	tok/s	RAM	Notes
MLX	E2B 4-bit	Mac Mini M4	126	2.6 GB	Best speed
MLX	E4B 4-bit	Mac Mini M4	32	4.3 GB	Good quality/speed
llama.cpp (CPU)	26B Q4_K_M	Mac Mini M4	1.59	11.2 GB	Unusable interactively
llama.cpp (GPU)	26B Q4_K_M	Mac Mini M4	crash	-	OOM on 16 GB
llama.cpp (CUDA)	26B UD-Q4_K_XL	Cloud Run L4	50-70	17.1 GB VRAM	Production viable
Ollama	E4B Q4	Mac Mini M4	6.1	7.8 GB	5x slower than MLX

Key insight: MLX is the clear winner for Apple Silicon. For models that exceed local memory, cloud GPU with llama.cpp is the only practical option on a budget.

5. Network Topology & Routing

5.1 Tailscale Mesh

All inter-device communication uses Tailscale (WireGuard-based mesh VPN):

Mac Mini: 100.75.223.113
Cloud Run Proxy: Tailscale sidecar container via SOCKS5 :1055
Encryption: WireGuard (ChaCha20-Poly1305), zero-trust networking
No port forwarding, no public IP exposure for local devices

5.2 Gateway Routing Logic

The gateway implements intelligent request classification:

def classify_request(messages) -> str:
    """Classify request complexity for tier routing."""
    # Simple heuristics based on message content/length
    # greeting, fyi, simple question  -> "fast"
    # summarize, code, rewrite        -> "primary"
    # analysis, complex, long-form    -> "heavy"

5.3 Circuit Breaker Pattern

Each tier has an independent circuit breaker to prevent cascading failures:

CLOSED (healthy)
    | (3 failures in 30s window)
    v
OPEN (rejecting, skip tier)
    | (30s cooldown)
    v
HALF-OPEN (probe with 1 request)
    |-- success -> CLOSED
    |-- failure -> OPEN (restart timer)

5.4 Fallback Chain

Heavy (26B) -> Primary (E4B) -> Fast (E2B) -> 503 Service Unavailable

Response metadata includes "fallback": "heavy->primary" when a fallback occurred, providing transparency to clients.

6. Cloud Burst Tier

6.1 Cloud Run GPU Configuration

The heavy tier runs on Cloud Run with an NVIDIA L4 GPU:

Setting	Value	Rationale
GPU	NVIDIA L4 (24 GB VRAM)	Fits 17.1 GB model + KV cache
Min instances	0	Scale-to-zero ($0 when idle)
Max instances	1	Budget cap
CPU	4-8 vCPU	L4 minimum requirement
Memory	16-24 GB	Model download buffer
Startup probe	600s	Cold start: download + load
Scale-down delay	15 min	Avoid rapid cold starts
Container	CUDA 12.4.1 (multi-stage)	Build: devel, Run: runtime

6.2 Model Download Strategy

Primary: GCS bucket (gs://gemma4good-models/) — 80s at ~200 MB/s
Fallback: HuggingFace Hub (unsloth/gemma-4-26B-A4B-it-GGUF) — 2-5 min

Environment variables HF_REPO and HF_FILE allow switching quantizations without rebuilding the container image.

6.3 CUDA Build Workaround

Cloud Build machines have CUDA toolkit but no GPU driver. The linker fails resolving libcuda.so.1 (only a stub libcuda.so exists). Fix:

RUN ln -sf /usr/local/cuda/lib64/stubs/libcuda.so \
           /usr/local/cuda/lib64/stubs/libcuda.so.1 \
    && echo "/usr/local/cuda/lib64/stubs" >> /etc/ld.so.conf.d/cuda-stubs.conf \
    && ldconfig

This is a well-known Docker-build-without-GPU issue documented in NVIDIA forums.

7. Authentication & Access Control

7.1 Firebase Authentication

Google Sign-In via Firebase Auth SDK
JWT token verification on Cloud Run proxy
Per-user quotas stored in Firestore:
- 100 text requests/day
- 20 media requests/day
- 500 MB media storage

7.2 API Key Fallback

For programmatic access without browser auth:

Heavy tier: Bearer token (HEAVY_API_KEY in Secret Manager)
Gateway: Internal (localhost only, no auth needed)

8. Multimodal Media Pipeline

8.1 Supported Formats

Type	Formats	Max Size	Processing
Image	JPEG, PNG, WebP, GIF, BMP, HEIC	20 MB	Resize to 1280px, JPEG q=85
Audio	WAV, MP3, M4A, AAC, OGG, OPUS, FLAC	50 MB	Convert to 16kHz mono WAV
Video	MP4, MOV, WebM, MKV, AVI	10 GB	Segment + frame extraction

8.2 Video Chunking Strategy

Tier	Segment Duration	Frames/Segment	Use Case
Fast	30s	4	Quick preview
Primary	60s	8	Standard analysis
Heavy	120s	16	Detailed analysis

9. Performance Results

9.1 Throughput

Tier	Model	Device	Tokens/sec	P50 Latency	P99 Latency
Fast	E2B 2.3B	Mac Mini M4	126	12ms/tok	18ms/tok
Primary	E4B 4.5B	Mac Mini M4	32	31ms/tok	45ms/tok
Heavy	26B-A4B	Cloud Run L4	50-70	15ms/tok	22ms/tok

9.2 Cold Start Analysis

Component	Duration	Notes
Cloud Run container start	3-5s	Image pull from Artifact Registry
Model download (GCS)	~80s	17.1 GB at ~200 MB/s
Model download (HF)	2-5 min	Variable, depends on CDN
VRAM load	~25s	L4 PCIe bandwidth
Total cold start	~110s (GCS)	First request after scale-to-zero
Warm request	<1s	Model already in VRAM

9.3 Memory Footprint

Model	Quantization	Disk	Runtime RAM	GPU VRAM
E2B 4-bit	MLX	1.15 GB	2.6 GB	shared (unified)
E4B 4-bit	MLX	2.1 GB	4.3 GB	shared (unified)
26B UD-Q4_K_XL	GGUF	17.1 GB	N/A	17.1 GB (L4)

10. Cost Analysis

10.1 Self-Hosted (This Project)

Item	Monthly Cost
Mac Mini M4 electricity (6W idle, 15W active)	~$1.50
Cloud Run proxy (always-on, CPU only)	~$0 (free tier)
Cloud Run GPU (scale-to-zero, est. 2hr/day)	~$14/mo
Artifact Registry + GCS	~$0.50
Total	$3-16/mo

10.2 Equivalent Hosted APIs

Service	Equivalent Usage	Monthly Cost
OpenAI GPT-4o	1M tokens/day	~$90/mo
Google Gemini API	1M tokens/day	~$45/mo
Anthropic Claude	1M tokens/day	~$75/mo

10.3 Break-Even Analysis

The Mac Mini M4 ($599) pays for itself in 3-6 months compared to API pricing at moderate usage (500K-1M tokens/day). The E2B and E4B tiers have zero marginal cost after hardware purchase.

11. Architecture Pros & Cons

11.1 Advantages

Category	Advantage	Impact
Cost	Near-zero marginal cost for local tiers	Enables unlimited experimentation
Privacy	All data stays on your hardware (local tiers)	HIPAA/GDPR viable
Latency	Local inference: 0ms network overhead	Real-time chat experience
Availability	No API rate limits, no outages from upstream	24/7 for local tiers
Flexibility	Swap models by changing env vars, no vendor lock-in	Future-proof
Scale-to-zero	Heavy tier costs $0 when idle	Budget-friendly cloud burst
Multimodal	Text, vision, audio, video in one stack	Unified pipeline
Reproducible	Fully open-source models + infra	Academic citation-ready

11.2 Disadvantages

Category	Disadvantage	Mitigation
Hardware lock-in	MLX requires Apple Silicon	llama.cpp for cross-platform
16 GB ceiling	26B cannot run locally on Mac Mini	Cloud burst tier
Cold start	110s for heavy tier from zero	15-min scale-down delay
Single point of failure	Mac Mini offline = local tiers unavailable	Cloud Run proxy caches health
Model freshness	No fine-tuning pipeline	Future work
Concurrency	GPU inference is sequential	Scale via Cloud Run instances
Complexity	Multi-service architecture	LaunchAgents + deploy scripts
Network dependency	Heavy tier requires internet	Local tiers work offline

11.3 Known Limitations

Gemma 4 MLX support requires mlx-lm >= 0.31.2 from GitHub (not PyPI as of April 2026)
26B on 16 GB Apple Silicon hangs with GPU layers; CPU-only yields 1.59 tok/s (unusable)
Tailscale on macOS requires the official .pkg installer (Homebrew version lacks Network Extension daemon)
Cloud Build CUDA needs libcuda.so.1 stub symlink (no GPU driver in build environment)
HuggingFace gated models require accepted license + valid HF_TOKEN

12. Reproducibility

12.1 Quick Start (Local Only)

See notebooks/gemma4_local_inference.ipynb for a guided setup notebook that:

Detects your OS (macOS/Windows) and hardware (RAM, GPU)
Recommends which models your system can run
Installs the appropriate inference engine (MLX or llama.cpp)
Downloads models and runs benchmark inference
Optionally sets up the full gateway

12.2 Full Cloud Deployment

# 1. Clone and enter project
git clone <this-repo> && cd gemma4-stack

# 2. Set up local tiers (macOS + Apple Silicon only)
python3 -m venv gateway-venv && source gateway-venv/bin/activate
pip install mlx-lm fastapi uvicorn

# 3. Start local inference servers
bash scripts/start_fast.sh &     # E2B on :8082
bash scripts/start_primary.sh &  # E4B on :8083
python scripts/gateway.py &      # Gateway on :8080

# 4. (Optional) Deploy cloud proxy
export HF_TOKEN="hf_..."
bash cloud/proxy/deploy.sh gemma4good

# 5. (Optional) Deploy heavy tier
export HEAVY_API_KEY=$(openssl rand -hex 32)
bash cloud/heavy/deploy.sh gemma4good

13. Future Directions

13.1 Near-Term

LoRA fine-tuning pipeline: Domain-specific adaptation using MLX on Apple Silicon
Streaming responses: Server-Sent Events for real-time token delivery
Model warm-swap: Hot-reload models without restarting servers
Prometheus metrics: Structured observability for all tiers

13.2 Medium-Term

Multi-node scaling: Additional Mac Minis via Tailscale for horizontal throughput
Speculative decoding: E2B as draft model for 26B verification
RAG integration: Local vector store (ChromaDB/FAISS) for domain knowledge
Edge caching: Frequently asked queries cached at proxy layer

13.3 Long-Term Research

Federated inference: Privacy-preserving multi-party model serving
Quantization-aware training: Custom GGUF quants optimized for specific hardware
Energy efficiency benchmarks: Tokens per watt across model sizes and hardware
Academic benchmark suite: Standardized evaluation of local vs. cloud inference

14. File Structure

gemma4-stack/
+-- README.md                          # This document
+-- notebooks/
|   +-- gemma4_local_inference.ipynb   # Guided setup notebook (Windows + Mac)
+-- scripts/
|   +-- gateway.py                     # FastAPI gateway (tier routing, circuit breakers)
|   +-- start_fast.sh                  # Launch E2B MLX server
|   +-- start_primary.sh              # Launch E4B MLX server
|   +-- benchmark.py                   # Throughput benchmarking tool
|   +-- media.py                       # Media processing (resize, convert, chunk)
+-- cloud/
|   +-- proxy/
|   |   +-- main.py                    # Cloud Run proxy (auth, rate limit, Tailscale tunnel)
|   |   +-- auth.py                    # Firebase authentication
|   |   +-- db.py                      # Firestore user/usage storage
|   |   +-- storage.py                 # GCS media storage
|   |   +-- deploy.sh                  # Proxy deployment script
|   +-- heavy/
|   |   +-- Dockerfile                 # Multi-stage CUDA build (devel -> runtime)
|   |   +-- main.py                    # llama-cpp-python FastAPI server
|   |   +-- service.yaml               # Cloud Run GPU service definition
|   |   +-- deploy.sh                  # Heavy tier deployment script (6 steps)
|   +-- web/
|       +-- index.html                 # Single-file SPA (Apple HIG, responsive)
+-- tests/
|   +-- test_e2e.py                    # End-to-end tests (health, media, cloud)
|   +-- conftest.py                    # Test fixtures
+-- docs/
    +-- API.md                         # OpenAI-compatible API reference
    +-- IMPLEMENTATION_LOG.md          # Detailed implementation journal

Citation

If you use this architecture or findings in academic work:

@misc{gemma4localstack2026,
  title={Gemma 4 Local Inference Stack: Evaluating Consumer Hardware
         for Self-Hosted Large Language Model Serving},
  author={Jemma Project Contributors},
  year={2026},
  howpublished={\url{https://github.com/jemma/gemma4-stack}},
  note={Three-tier MLX + llama.cpp inference architecture on Apple Silicon
        and Cloud Run GPU}
}

License

Apache 2.0. Model weights are subject to Google's Gemma Terms of Use.

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
cloud		cloud
docs		docs
notebooks		notebooks
scripts		scripts
tests		tests
.gitignore		.gitignore
README.md		README.md

Folders and files

Latest commit

History

Repository files navigation

Gemma 4 Local Inference Stack: Architecture & Viability Study

Abstract

Table of Contents

1. Motivation & Research Context

1.1 The Local Inference Thesis

1.2 Research Questions

1.3 Academic & Business Value

2. System Architecture

2.1 High-Level Topology

2.2 Component Breakdown

2.3 Request Lifecycle

3. Hardware & Model Selection

3.1 Device Specifications

3.2 Model Selection Rationale

3.3 Why MoE Over Dense?

4. Inference Engine Analysis

4.1 MLX (Apple Silicon, Local Tiers)

4.2 llama.cpp / llama-cpp-python (CUDA, Cloud Tier)

4.3 Comparative Benchmarks (Same Hardware)

5. Network Topology & Routing

5.1 Tailscale Mesh

5.2 Gateway Routing Logic

5.3 Circuit Breaker Pattern

5.4 Fallback Chain

6. Cloud Burst Tier

6.1 Cloud Run GPU Configuration

6.2 Model Download Strategy

6.3 CUDA Build Workaround

7. Authentication & Access Control

7.1 Firebase Authentication

7.2 API Key Fallback

8. Multimodal Media Pipeline

8.1 Supported Formats

8.2 Video Chunking Strategy

9. Performance Results

9.1 Throughput

9.2 Cold Start Analysis

9.3 Memory Footprint

10. Cost Analysis

10.1 Self-Hosted (This Project)

10.2 Equivalent Hosted APIs

10.3 Break-Even Analysis

11. Architecture Pros & Cons

11.1 Advantages

11.2 Disadvantages

11.3 Known Limitations

12. Reproducibility

12.1 Quick Start (Local Only)

12.2 Full Cloud Deployment

13. Future Directions

13.1 Near-Term

13.2 Medium-Term

13.3 Long-Term Research

14. File Structure

Citation

License

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages