Gemma4.java


Fast, zero-dependency inference engine for Gemma 4 in pure Java.


Features

  • Single file, no dependencies
  • GGUF format parser
  • Gemma 4 tokenizer
  • Supports all Gemma 4 model families: E2B, E4B, 31B, and 26B-A4B (MoE)
  • Mixture of Experts routing and execution
  • Sliding Window Attention (SWA) and full-attention layers
  • Per-layer KV cache sharing and per-head Q/K RMS normalization
  • Supported dtypes/quantizations: F16, BF16, F32, Q4_0, Q4_1, Q4_K, Q5_K, Q6_K, Q8_0
  • Thinking mode control with --think off|on|inline
  • Matrix-vector kernels using Java's Vector API
  • CLI with --chat and --instruct modes
  • GraalVM Native Image support
  • AOT model preloading for lower time-to-first-token
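The GGUF parser above starts from a small fixed header: the magic bytes "GGUF", a version, then the tensor count and metadata key/value count, all little-endian. A minimal sketch of that check (class name, record, and the sample counts are illustrative, not the repo's actual parser):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch of the fixed-size GGUF header: magic "GGUF" (uint32), version (uint32),
// tensor count (uint64), metadata KV count (uint64), all little-endian.
public class GgufHeader {
    record Header(int version, long tensorCount, long metadataKvCount) {}

    static final int GGUF_MAGIC = 0x46554747; // "GGUF" read as a little-endian uint32

    static Header parse(ByteBuffer buf) {
        buf.order(ByteOrder.LITTLE_ENDIAN);
        if (buf.getInt() != GGUF_MAGIC) {
            throw new IllegalArgumentException("not a GGUF file");
        }
        return new Header(buf.getInt(), buf.getLong(), buf.getLong());
    }

    public static void main(String[] args) {
        // Synthetic header: version 3, 291 tensors, 24 metadata entries.
        ByteBuffer buf = ByteBuffer.allocate(24).order(ByteOrder.LITTLE_ENDIAN);
        buf.putInt(GGUF_MAGIC).putInt(3).putLong(291L).putLong(24L).flip();
        Header h = parse(buf);
        System.out.println("version=" + h.version()
                + " tensors=" + h.tensorCount()
                + " metadata=" + h.metadataKvCount());
    }
}
```

After the header come the metadata key/value pairs and tensor descriptors, which is where the dtype and quantization information lives.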

Setup

Download GGUF models from Hugging Face:

Model    Architecture              GGUF Repository
E2B      Dense, ~5B total params   unsloth/gemma-4-E2B-it-GGUF
E4B      Dense, ~8B total params   unsloth/gemma-4-E4B-it-GGUF
31B      Dense                     unsloth/gemma-4-31B-it-GGUF
26B-A4B  Mixture of Experts (MoE)  unsloth/gemma-4-26B-A4B-it-GGUF

Optional: pure quantizations

Q4_0 files are often mixed-quant in practice (for example, token_embd.weight and output.weight may use Q6_K). A pure quantization is not required but can be generated from an F32/F16/BF16 GGUF source with llama-quantize from llama.cpp:

./llama-quantize --pure ./gemma-4-E2B-it-BF16.gguf ./gemma-4-E2B-it-Q4_0.gguf Q4_0

Pick any supported target quantization, for example Q4_0, Q4_1, Q4_K, Q5_K, Q6_K, or Q8_0.
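For reference, Q4_0 as defined by GGML packs 32 weights per block: a 2-byte fp16 scale followed by 16 bytes of packed 4-bit quants, each weight decoding as (nibble - 8) * scale, with low nibbles holding the first half of the block and high nibbles the second. A standalone sketch of a block decoder (illustrative, not the repo's implementation):

```java
// Decode one Q4_0 block (GGML layout): 2-byte little-endian fp16 scale
// followed by 16 bytes of packed 4-bit values; 32 weights per block.
public class Q4_0 {
    static final int BLOCK_SIZE = 32;

    static float[] dequantize(byte[] block) {
        float d = Float.float16ToFloat(
                (short) ((block[0] & 0xFF) | (block[1] & 0xFF) << 8));
        float[] out = new float[BLOCK_SIZE];
        for (int j = 0; j < 16; j++) {
            int b = block[2 + j] & 0xFF;
            out[j]      = ((b & 0x0F) - 8) * d; // low nibble -> first half of block
            out[j + 16] = ((b >>> 4)  - 8) * d; // high nibble -> second half
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] block = new byte[18];
        block[0] = 0x00; block[1] = 0x3C; // fp16 scale 1.0 (0x3C00), little-endian
        block[2] = (byte) 0x9F;           // low nibble 0xF -> 7.0, high nibble 0x9 -> 1.0
        float[] w = dequantize(block);
        System.out.println(w[0] + " " + w[16]); // 7.0 1.0
    }
}
```

The K-quants (Q4_K, Q5_K, Q6_K) use a more elaborate super-block layout with per-sub-block scales, but follow the same decode-on-the-fly idea.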

Build and run

Java 21+ is required, in particular for memory-mapping models via MemorySegment.
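The mmap-ing in question uses java.lang.foreign: FileChannel.map can return a MemorySegment bound to an Arena, so tensor data is paged in on demand instead of being copied onto the heap. A small standalone sketch (class name and the temp-file demo are illustrative):

```java
import java.io.IOException;
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Memory-map a file into a MemorySegment; the mapping stays valid after
// the channel is closed, for as long as the Arena is open.
public class MmapDemo {
    static MemorySegment map(Path path, Arena arena) throws IOException {
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size(), arena);
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("demo", ".bin");
        Files.write(tmp, "GGUF".getBytes());
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment seg = map(tmp, arena);
            // Read the first byte straight from the mapping, no heap copy.
            System.out.println((char) seg.get(ValueLayout.JAVA_BYTE, 0)); // G
        } finally {
            Files.delete(tmp);
        }
    }
}
```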

jbang is a good fit for this use case:

jbang Gemma4.java --help
jbang Gemma4.java --model ./gemma-4-E2B-it-Q4_0.gguf --chat
jbang Gemma4.java --model ./gemma-4-E2B-it-Q4_0.gguf --prompt "Explain quantum computing like I'm five"

Or run it directly (still via jbang):

chmod +x Gemma4.java
./Gemma4.java --help

Optional: Makefile

A simple Makefile is provided. Run make jar to produce gemma4.jar.

Run the resulting gemma4.jar as follows:

java --enable-preview --add-modules jdk.incubator.vector -jar gemma4.jar --help

GraalVM Native Image

Compile with make native to produce a gemma4 executable, then:

./gemma4 --model ./gemma-4-E2B-it-Q4_0.gguf --chat

AOT model preloading

Gemma4.java supports AOT model preloading to reduce parse overhead and time-to-first-token (TTFT).

To AOT pre-load a GGUF model:

PRELOAD_GGUF=/path/to/model.gguf make native

This produces a larger, specialized binary with parsing overhead removed for that specific model; other models still run with the normal parsing path.

Performance

GraalVM 25+ (JIT) is recommended for the best performance; its support for the Vector API is partial but good.

By default, the platform's preferred vector size is used; it can be forced with -Dllama.VectorBitSize=0|128|256|512, where 0 disables the Vector API.
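For context, a matrix-vector kernel of this kind reduces to dot products over FloatVector lanes, with SPECIES_PREFERRED selecting the platform's vector width. A simplified sketch (not the repo's actual kernel, which operates on quantized blocks):

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

// Dot product using the Vector API: process full lanes vectorized,
// then finish the remainder with a scalar tail loop.
public class DotProduct {
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    static float dot(float[] a, float[] b) {
        float sum = 0f;
        int i = 0;
        int upper = SPECIES.loopBound(a.length);
        for (; i < upper; i += SPECIES.length()) {
            FloatVector va = FloatVector.fromArray(SPECIES, a, i);
            FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
            sum += va.mul(vb).reduceLanes(VectorOperators.ADD);
        }
        for (; i < a.length; i++) { // scalar tail
            sum += a[i] * b[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        float[] a = {1, 2, 3, 4, 5, 6, 7, 8, 9};
        float[] b = {1, 1, 1, 1, 1, 1, 1, 1, 1};
        System.out.println(dot(a, b)); // 45.0
    }
}
```

Run it with --add-modules jdk.incubator.vector, as in the jar invocation above; without the module the API falls back to being unavailable at compile time.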

License

Apache 2.0
