Compiles HuggingFace transformer models into optimised native Metal inference binaries. No runtime framework, no Python — just a compiled binary that runs your model at near-hardware-limit speed on Apple Silicon, using ~20% less GPU power and 1.7x better energy efficiency than mlx-lm. See the full resource & power report.
TinyLlama 1.1B on Apple M1 Pro (16GB, 200 GB/s):
UNC Q4_0 ████████████████████████████████████████████████████████████ 152.0 tok/s
mlx-lm Q4 ████████████████████████████████████████████ 112.7 tok/s
UNC Q8_0 ███████████████████████████████ 76.6 tok/s
UNC F16 ███████████████████ 47.9 tok/s
Qwen3-4B on Apple M1 Pro (Q4_0):
mlx-lm Q4 ████████████████████████████████████████████████████ 49.2 tok/s
UNC Q4_0 ██████████████████████████████████████████ 38.7 tok/s
| Metric | UNC Metal | mlx-lm Q4 |
|---|---|---|
| Throughput | 152 tok/s | 113 tok/s |
| GPU power (decode) | 11.3W | 14.1W |
| Energy per token | 74 mJ | 125 mJ |
| Tokens per watt-hour | 12,800 | 8,000 |
| CPU instructions (200 tok) | 5.3B | 31.4B |
| Peak memory | 4.2 GB | 0.9 GB |
UNC is 1.35x faster while using ~20% less GPU power, resulting in 1.7x better energy efficiency. The compiled approach eliminates Python runtime and framework dispatch overhead entirely — 5.9x fewer CPU instructions means less heat, less power, and more headroom for the GPU. See unc_resource_usage.md for full methodology and traces.
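The efficiency figures follow directly from the measured power and decode throughput: energy per token is simply power divided by tokens per second. A quick arithmetic check (values taken from the table above, not from UNC's tooling):

```rust
// Sanity-check the table's derived energy figures: energy/token = GPU power / throughput.
fn energy_per_token_mj(gpu_watts: f64, tok_per_s: f64) -> f64 {
    gpu_watts / tok_per_s * 1000.0 // joules per token -> millijoules per token
}

fn main() {
    let unc = energy_per_token_mj(11.3, 152.0); // ~74 mJ, matching the table
    let mlx = energy_per_token_mj(14.1, 113.0); // ~125 mJ, matching the table
    println!("UNC {unc:.0} mJ/tok, mlx-lm {mlx:.0} mJ/tok, ratio {:.1}x", mlx / unc);
}
```

The 1.7x efficiency claim is exactly this ratio (124.8 / 74.3 ≈ 1.68).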
HuggingFace model
        |
  [ Frontend ]    Parse config.json + safetensors
        |
  [ IR Graph ]    Hardware-agnostic tensor graph
        |
  [ Compiler ]    Fusion, quantization, memory planning
        |
        +---------------+---------------+---------------+
        |               |               |               |
    [ Metal ]       [ CUDA ]        [ ROCm ]        [ WASM ]
    Obj-C + Metal   PTX kernels     HIP kernels     WebGPU shaders
    shaders         (planned)       (planned)       (planned)
        |
   Native binary
   Mach-O (AOT) or
   .unc bundle (JIT)
IR: Hardware-agnostic typed tensor graph with BatchMatMul, QuantizedMatVec, RMSNorm, LayerNorm, QKNorm, RoPE, SDPA, SwiGLU, KVCacheAppend, Gather, etc. The IR is target-independent — the same graph can be lowered to Metal (current), CUDA, ROCm, WASM, or CPU-only backends with acceleration providers like Intel oneDNN.
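As an illustration only, a typed tensor IR along these lines could be sketched in Rust. The names mirror the ops listed above, but these are not the actual unc types:

```rust
// Hypothetical sketch of a typed, hardware-agnostic tensor IR, loosely
// modeled on the op list above. Illustrative only — not the unc source.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DType { F16, F32, Q4_0, Q8_0 }

#[derive(Debug, Clone)]
struct TensorType { dtype: DType, shape: Vec<usize> }

// Operands are indices of earlier nodes, so the graph is target-independent;
// lowering walks it and emits backend kernels (Metal today, CUDA/ROCm/WASM later).
#[derive(Debug)]
enum Op {
    QuantizedMatVec { weight: usize, input: usize },
    RmsNorm { input: usize, eps: f32 },
    Rope { input: usize, theta: f32 },
    Sdpa { q: usize, k: usize, v: usize },
    SwiGlu { gate: usize, up: usize },
    KvCacheAppend { k: usize, v: usize, layer: usize },
}

struct Graph { nodes: Vec<(Op, TensorType)> }

fn main() {
    let g = Graph {
        nodes: vec![(
            Op::RmsNorm { input: 0, eps: 1e-5 },
            TensorType { dtype: DType::F16, shape: vec![1, 2048] },
        )],
    };
    println!("{} node(s)", g.nodes.len());
}
```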
Compiler passes: Weight binding, dead code elimination, QKV fusion, Gate+Up fusion, SwiGLU fusion, Add+RMSNorm fusion, RoPE+KV fusion, PSQ pipeline, dual-path (GEMM/GEMV), kernel matching, barrier analysis, memory planning with buffer aliasing.
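To make one of these passes concrete, here is a hypothetical sketch of Add+RMSNorm fusion as a peephole rewrite over a node list, so the backend can emit one fused kernel instead of two. The names are illustrative, not the actual unc pass API, and operand-index remapping after fusion is elided:

```rust
// Hypothetical peephole pass: fuse a residual Add that is immediately
// consumed by an RMSNorm into a single AddRmsNorm node.
#[derive(Debug, Clone)]
enum Node {
    Add { lhs: usize, rhs: usize },
    RmsNorm { input: usize, eps: f32 },
    AddRmsNorm { lhs: usize, rhs: usize, eps: f32 },
}

// Sketch only: a real pass would also remap downstream operand indices
// after removing a node.
fn fuse_add_rmsnorm(nodes: &mut Vec<Node>) {
    let mut i = 0;
    while i + 1 < nodes.len() {
        if let (Node::Add { lhs, rhs }, Node::RmsNorm { input, eps }) =
            (nodes[i].clone(), nodes[i + 1].clone())
        {
            // Fuse only when the norm consumes this add's result (node index i).
            if input == i {
                nodes.splice(i..=i + 1, [Node::AddRmsNorm { lhs, rhs, eps }]);
            }
        }
        i += 1;
    }
}

fn main() {
    let mut graph = vec![
        Node::Add { lhs: 0, rhs: 1 },          // residual add
        Node::RmsNorm { input: 0, eps: 1e-5 }, // consumes the add at index 0
    ];
    fuse_add_rmsnorm(&mut graph);
    println!("{graph:?}"); // a single AddRmsNorm node remains
}
```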
Output modes:
| Mode | Output | Use case |
|---|---|---|
| JIT (default) | .unc bundle — JIT-compiled via clang at first run, cached thereafter | Development, iteration |
| AOT (--binary) | Standalone Mach-O with embedded weights — zero dependencies | Deployment, distribution |
| GGUF export (quantize) | GGUF v3 file — compatible with llama.cpp, ollama, LM Studio | Interop, distribution |
# Prerequisites: Rust toolchain, Xcode Command Line Tools (macOS)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# Clone and build
git clone <repo-url> && cd unc
cargo build --release
# JIT bundle (default)
unc compile --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --quant q4-0 -o ./tinyllama
unc compile --model Qwen/Qwen3-4B --quant q4-0 -o ./qwen3
# AOT standalone binary (single Mach-O, zero dependencies)
unc compile --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --quant q4-0 --binary -o ./tinyllama
# Run a compiled bundle
unc run ./tinyllama.unc --prompt "The history of" --max-tokens 200
# Quantize and export to GGUF v3
unc quantize --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --quant q4-0 -o tinyllama-q4_0.gguf
unc quantize --model Qwen/Qwen3-4B --quant q8-0 -o qwen3-q8_0.gguf
# Use with llama.cpp / ollama
llama-cli -m tinyllama-q4_0.gguf -p "Hello" -n 100
ollama create mymodel -f Modelfile   # with: FROM ./tinyllama-q4_0.gguf

GPU-accelerated quantization via Metal compute kernels with streaming double-buffered I/O. 1.8s end-to-end for 1.1B params — 9x faster than llama.cpp's full HF→GGUF pipeline (16.5s). No Python required.
| Tool | HF safetensors → Q4_0 | Notes |
|---|---|---|
| UNC | 1.8s | GPU Metal, streaming, no Python |
| mlx-lm | 2.1s | Outputs MLX format, not GGUF |
| llama.cpp | 16.5s | Python convert + CPU quantize |
See benchmark/quantization/results.md for full methodology.
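For reference, the Q4_0 target format stores each block of 32 weights as one scale plus 32 four-bit values (zero point 8). A minimal CPU sketch of the quantize/dequantize round trip — UNC performs the equivalent math in Metal compute kernels, so this is reference arithmetic only:

```rust
// Minimal CPU sketch of GGUF-style Q4_0 block quantization:
// 32 weights per block, one scale plus 32 four-bit quants.
const QK4_0: usize = 32;

fn quantize_q4_0(block: &[f32; QK4_0]) -> (f32, [u8; QK4_0]) {
    // The signed value with the largest magnitude maps to quant level -8.
    let max = block.iter().copied()
        .max_by(|a, b| a.abs().total_cmp(&b.abs())).unwrap();
    let d = max / -8.0;
    let id = if d != 0.0 { 1.0 / d } else { 0.0 };
    let mut q = [0u8; QK4_0];
    for (i, &x) in block.iter().enumerate() {
        // Map to [0, 15]; 8 is the zero point.
        q[i] = ((x * id + 8.5).floor() as i32).clamp(0, 15) as u8;
    }
    (d, q)
}

fn dequantize_q4_0(d: f32, q: &[u8; QK4_0]) -> [f32; QK4_0] {
    let mut out = [0.0f32; QK4_0];
    for i in 0..QK4_0 {
        out[i] = (q[i] as i32 - 8) as f32 * d;
    }
    out
}

fn main() {
    let block: [f32; QK4_0] = std::array::from_fn(|i| (i as f32 - 16.0) / 4.0);
    let (d, q) = quantize_q4_0(&block);
    let back = dequantize_q4_0(d, &q);
    let max_err = block.iter().zip(&back)
        .map(|(a, b)| (a - b).abs()).fold(0.0f32, f32::max);
    println!("scale {d:.4}, max round-trip error {max_err:.4}");
}
```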
| Flag | Precision | Size (1.1B) | Inference | Export |
|---|---|---|---|---|
| f16 | 16-bit float | 2.2 GB | 47.9 tok/s | compile only |
| q8-0 | 8-bit | 1.1 GB | 76.6 tok/s | compile + GGUF |
| q4-0 | 4-bit | 0.6 GB | 152.0 tok/s | compile + GGUF |
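The sizes above follow from each format's per-weight storage cost: f16 uses 2 bytes per weight, while the GGUF-style block formats cost 34 bytes (Q8_0) or 18 bytes (Q4_0) per 32-weight block once the per-block scale is included. A quick check for 1.1B parameters:

```rust
// Approximate weight storage for 1.1B parameters in each format.
// Q8_0 block: f16 scale (2 B) + 32 x 8-bit quants (32 B) = 34 B / 32 weights.
// Q4_0 block: f16 scale (2 B) + 32 x 4-bit quants (16 B) = 18 B / 32 weights.
fn size_gb(params: f64, bytes_per_block: f64, weights_per_block: f64) -> f64 {
    params * bytes_per_block / weights_per_block / 1e9
}

fn main() {
    let n = 1.1e9;
    println!("f16:  {:.2} GB", size_gb(n, 2.0, 1.0));   // ~2.20 GB
    println!("q8-0: {:.2} GB", size_gb(n, 34.0, 32.0)); // ~1.17 GB
    println!("q4-0: {:.2} GB", size_gb(n, 18.0, 32.0)); // ~0.62 GB
}
```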
unc list-architectures

Supported architectures: LLaMA, Mistral, Qwen, Phi, Gemma.
src/
frontend/ HuggingFace config parsing, model templates
ir/ Typed tensor IR (ops, graph, types)
compile/ Optimization passes, memory planner
kernel/ Kernel registry, Metal kernel definitions
emit/ Metal orchestrator codegen, AOT binary emission
export/ GGUF writer, GPU quantization pipeline, tensor name mapping
runtime/ JIT compilation, weight loading, tokenizer
target/ Apple Silicon target detection
unc_format/ .unc bundle serialization
kernel_sources/
metal/
unc_kernels/ Custom Metal shaders (fused GEMV, SDPA, RoPE, RMSNorm, etc.)
upstream_mlx/ MLX reference kernels (QMV, sdpa_vector headers)
MIT — see LICENSE.