Source link
https://huggingface.co/google/gemma-4-E4B-it
Target platform
iOS
Use case
We ship a production iOS app with an on-device AI agent: tool calling against
wallet/market tools, streaming chat, and a strict privacy requirement. Portfolio
data must never leave the device, so on-device inference is the product, not an
optimization.
Today we run Gemma 4 E4B-it (GGUF, imatrix Q4_K_M / Q5_K_M) via llama.cpp with
Metal on 8–12 GB iPhones (context 4096). It works, but CPU/GPU decoding drains
the battery and heats the device in long agent sessions. A Core AI iOS preset for the same
model would let us migrate as a pure runtime swap — same weights, same prompts,
same tool-call format — and pick up the ANE static-shape path, framework-managed
KV cache, and FoundationModels tool calling / guided generation.
Preferred precision / compression
mixed 4/8-bit (like the qwen3-4b iOS preset), max context 4096
Additional context
- Gemma 4 E4B is the current flagship open on-device model (Apache 2.0,
agentic/tool-calling focus, native function calling + configurable thinking),
so iOS support would likely serve many apps beyond ours.
- The catalog currently has Gemma 3 4B/12B as macOS-only and no Gemma on iOS;
the iOS LLM presets are Qwen-only. I assume the blocker for the E-series is
the per-layer embeddings (PLE) tensor — for reference, llama.cpp exports and
runs it fine on iPhone hardware (the per_layer_token_embd tensor maps as a
plain weight), and at 4-bit the E4B footprint (~5 GB) fits 8 GB devices.
- E2B would be a nice-to-have for lower-RAM devices, but E4B is the ask here.
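
The footprint point above can be sanity-checked with back-of-envelope arithmetic: quantized weights plus an fp16 KV cache at the requested context length. This is a minimal sketch, not a measurement. The function names are hypothetical, and the numbers in the usage section are illustrative placeholders, not Gemma 4 E4B's actual configuration; substitute the values from the model's config before relying on the result.

```python
GIB = 1024 ** 3


def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in bytes."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context: int, bytes_per_elem: int = 2) -> int:
    """Size of a dense KV cache: one K and one V vector per layer,
    per KV head, per position (fp16 by default)."""
    return 2 * n_layers * n_kv_heads * head_dim * context * bytes_per_elem


def fits(total_bytes: float, device_ram_gib: float, usable_fraction: float) -> bool:
    """True if the footprint fits in the share of RAM the app may use.
    usable_fraction is an assumption; the real per-app limit depends on
    the device and OS."""
    return total_bytes <= device_ram_gib * GIB * usable_fraction


if __name__ == "__main__":
    # Placeholder inputs (assumptions, NOT the real model configuration).
    weights = weight_bytes(n_params=8e9, bits_per_weight=4.5)
    kv = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, context=4096)
    total = weights + kv
    print(f"weights ~ {weights / GIB:.2f} GiB, KV cache ~ {kv / GIB:.2f} GiB")
    print("fits on 8 GB device:", fits(total, device_ram_gib=8, usable_fraction=0.6))
```

The KV term assumes every layer caches the full context. Models that use sliding-window or shared-KV layers need less, so treat it as an upper bound.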