Model request: Gemma 4 E4B (google/gemma-4-E4B-it) on iOS #20

@AndreiAniukou

Source link

https://huggingface.co/google/gemma-4-E4B-it

Target platform

iOS

Use case

We ship a production iOS app with an on-device AI agent: tool calling against
wallet/market tools, streaming chat, and a strict privacy requirement. Portfolio
data must never leave the device, so on-device inference is the product, not an
optimization.

Today we run Gemma 4 E4B-it (GGUF, imatrix Q4_K_M / Q5_K_M) via llama.cpp with
Metal on 8–12 GB iPhones (context 4096). It works, but CPU/GPU decoding drains
the battery and heats the device during long agent sessions. A Core AI iOS preset
for the same model would let us migrate as a pure runtime swap (same weights,
same prompts, same tool-call format) and pick up the ANE static-shape path,
framework-managed KV cache, and FoundationModels tool calling / guided generation.

Preferred precision / compression

mixed 4/8-bit (like the qwen3-4b iOS preset), max context 4096

Additional context

  • Gemma 4 E4B is the current flagship open on-device model (Apache 2.0,
    agentic/tool-calling focus, native function calling + configurable thinking),
    so iOS support would likely serve many apps beyond ours.
  • The catalog currently has Gemma 3 4B/12B as macOS-only and no Gemma on iOS;
    the iOS LLM presets are Qwen-only. I assume the blocker for the E-series is
    the per-layer embeddings (PLE) tensor — for reference, llama.cpp exports and
    runs it fine on iPhone hardware (the per_layer_token_embd tensor maps as a
    plain weight), and at 4-bit the E4B footprint (~5 GB) fits 8 GB devices.
  • E2B would be a nice-to-have for lower-RAM devices, but E4B is the ask here.
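
The footprint claim above is back-of-the-envelope arithmetic, and a rough sketch of it is below. The function names, the parameter count, the bits-per-weight value, and the 0.7 usable-RAM fraction are all illustrative assumptions, not published figures for Gemma 4 E4B or documented iOS limits. The sketch counts weights only; KV cache and runtime overhead come on top.

```python
def weight_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


def fits_on_device(footprint_gb: float, device_ram_gb: float,
                   usable_fraction: float = 0.7) -> bool:
    """True if the footprint is within the share of RAM assumed usable by one app.

    usable_fraction is a hypothetical headroom factor, not an iOS-documented limit.
    """
    return footprint_gb <= device_ram_gb * usable_fraction


# Placeholder inputs chosen to land near the ~5 GB quoted above;
# substitute the real stored-parameter count and average bits per weight.
footprint = weight_footprint_gb(n_params=8e9, bits_per_weight=5.0)
print(f"estimated weights: {footprint:.1f} GB")
print("fits 8 GB device:", fits_on_device(footprint, device_ram_gb=8))
```

Measuring the actual GGUF file size and the app's peak resident memory on a device would give a more reliable number than this estimate.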
