# VectorPrime
Compiler-style, hardware-aware LLM inference optimizer
VectorPrime takes a model file and your hardware, then finds the fastest way to run it. It profiles your CPU, GPU, and RAM; parses the model's intermediate representation to extract architecture metadata; generates every valid combination of runtime, quantization, thread count, and GPU offload layers; benchmarks candidates in parallel; and hands you back the configuration that maximizes tokens per second within your memory budget. The result is a ready-to-use Ollama bundle — no guesswork required.
VectorPrime is built for developers and researchers who run inference locally and want reproducible, hardware-specific performance without manually tuning runtime flags or hunting for the right quantization format. The Rust backend handles parallel benchmarking and hardware detection; a PyO3 native extension exposes everything through a clean Python API and a single pip install vectorprime.
| Feature | Description | Status |
|---|---|---|
| Hardware profiling | Detects CPU core count, SIMD level (AVX/AVX2/AVX512), GPU VRAM and compute capability, and available RAM | Stable |
| Model IR analysis | Reads GGUF and ONNX model files to extract parameter count, architecture, context length, layer count, hidden size, attention heads, KV cache size, memory footprint, and FLOPs without running inference | Stable |
| Multi-runtime support | Benchmarks Ollama (primary), TensorRT (primary), ONNX Runtime (secondary), and llama.cpp (deprioritized) against each other on your hardware | Stable |
| Automatic quantization selection | Evaluates F16, Q8_0, Q4_K_M, Q4_0, Int8, and Int4 and picks the fastest that fits in memory | Stable |
| Parallel benchmarking | Tokio-based async executor runs up to 3 configurations concurrently | Stable |
| Optimization result caching | Caches results to `~/.llmforge/cache/` keyed by model identity and hardware profile; skips benchmarking entirely on a cache hit | Stable |
| Ollama export | Generates a Modelfile with tuned `num_thread` and `num_gpu` values, ready for `ollama create` | Stable |
| Format conversion | Bidirectional GGUF-to-ONNX and ONNX-to-GGUF conversion with full metadata round-trip | Stable |
| Python API | PyO3 native extension — import and call from any Python script or notebook | Stable |
| CLI interface | `profile` (with `--json`, `--verbose`, `--save`), `optimize`, `convert-to-onnx`, `convert-to-gguf`, and `doctor` subcommands | Stable |
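The memory-fit check behind quantization selection comes down to simple arithmetic: weight size scales with bits per weight. A rough sketch follows; the per-format byte counts are approximations (Q8_0 and Q4_0 follow llama.cpp's block layouts, Q4_K_M is a mixed-precision average), not VectorPrime's exact accounting:

```python
# Approximate bytes per weight for the formats in the table above.
# Q8_0 stores 34 bytes per 32 weights; Q4_0 stores 18 bytes per 32 weights.
# Q4_K_M mixes block types, so 0.60 bytes/weight is only a ballpark figure.
BYTES_PER_WEIGHT = {"F16": 2.0, "Q8_0": 34 / 32, "Q4_K_M": 0.60, "Q4_0": 18 / 32}

def weight_size_gb(param_count: int, quant: str) -> float:
    """Estimated size of the weights alone, in GiB."""
    return param_count * BYTES_PER_WEIGHT[quant] / 1024**3

def fits(param_count: int, quant: str, budget_gb: float,
         overhead_gb: float = 1.5) -> bool:
    """Does the model fit the budget, leaving headroom for KV cache etc.?"""
    return weight_size_gb(param_count, quant) + overhead_gb <= budget_gb
```

For an 8B-parameter model this gives roughly 14.9 GiB at F16 and about 4.5 GiB at Q4_K_M, which is why Q4_K_M is the usual pick on 16 GB GPUs.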
```shell
pip install vectorprime

# See what hardware VectorPrime detected
vectorprime profile

# Find the best inference configuration for a model
vectorprime optimize model.gguf

# Export the result as an Ollama bundle (Python API)
# See the Python API section below
```

```shell
pip install vectorprime
```

No Rust toolchain required! Pre-built wheels are available for:
- Python 3.9, 3.10, 3.11, 3.12
- Linux (x86-64, Arm64), macOS (x86-64, Arm64), Windows (x86-64)
Requirements:
- Python 3.9 or later
- At least one supported inference runtime installed and on PATH
Optional runtime prerequisites:

```shell
# Ollama — recommended for most users
# https://ollama.com/download

# ONNX Runtime
pip install onnxruntime      # CPU
pip install onnxruntime-gpu  # CUDA GPU

# TensorRT (NVIDIA only, compute capability >= 7.0)
# https://developer.nvidia.com/tensorrt

# llama.cpp (provides llama-cli and llama-quantize)
# https://github.com/ggml-org/llama.cpp
```

VectorPrime detects which runtimes are available at startup and silently skips any whose binary is not found. `vectorprime profile` works with no runtimes installed.
```shell
vectorprime profile
```

Prints a human-readable hardware summary to stdout:

```
VectorPrime Hardware Profile
────────────────────────────────────────
CPU
  Model: Intel Core i9-14900HX
  Cores: 32 threads
  SIMD Support: AVX2

GPU
  Model: NVIDIA GeForce RTX 4090 Laptop GPU
  Vendor: NVIDIA
  VRAM: 16.0 GB
  Compute Capability: 8.9
  Tensor Cores: Yes

Memory
  Total RAM: 31.9 GB
  Available RAM: 29.9 GB

Acceleration Support
  ✓ GPU inference available
  ✓ FP16 supported
  ✓ INT8 supported
  ✓ TensorRT compatible

Recommended Inference Setup
  Runtime: TensorRT / llama.cpp
  Precision: FP16 or INT8
  Estimated Model Capacity:
    • ~70B quantized
    • ~13B full GPU

Tip: run `vectorprime profile --verbose` for full hardware diagnostics.
```
Options:

```
vectorprime profile [OPTIONS]

Options:
  --json         Output the full hardware profile as structured JSON to stdout.
  --verbose      Show a detailed hardware diagnostic report including clock speed,
                 SIMD features, CUDA/driver versions, runtime compatibility, and
                 optimization hints.
  --save PATH    Save the JSON profile to a file. Can be combined with --json to
                 print JSON to stdout and save to a file simultaneously.
```
```shell
vectorprime profile --json
```

```json
{
  "cpu": {
    "brand": "Intel(R) Core(TM) i9-14900HX",
    "core_count": 32,
    "simd_level": "AVX2"
  },
  "gpu": {
    "name": "NVIDIA GeForce RTX 4090 Laptop GPU",
    "vendor": "Nvidia",
    "vram_mb": 16376,
    "compute_capability": [8, 9],
    "tensor_cores": true
  },
  "ram": {
    "total_mb": 31941,
    "available_mb": 29935
  },
  "capabilities": {
    "gpu_inference": true,
    "fp16": true,
    "int8": true,
    "tensorrt_supported": true,
    "tensor_cores": true
  },
  "recommendation": {
    "preferred_runtime": ["TensorRT", "llama.cpp"],
    "preferred_precision": ["FP16", "INT8"]
  }
}
```

```shell
vectorprime profile --save hw.json
# Hardware profile saved to: hw.json

vectorprime profile --json --save hw.json
# Prints JSON to stdout AND saves to hw.json
```

```shell
vectorprime profile --verbose
```

```
VectorPrime Hardware Diagnostic Report
═══════════════════════════════════════
CPU
  Model: Intel Core i9-14900HX
  Architecture: x86_64
  Physical Cores: 24
  Logical Threads: 32
  Base Clock: 2.2 GHz
  SIMD Features: SSE4, AVX, AVX2
  L3 Cache: 36 MB

GPU
  Model: NVIDIA GeForce RTX 4090 Laptop GPU
  Vendor: NVIDIA
  Compute Capability: 8.9
  VRAM: 16.0 GB
  Tensor Cores: Yes
  CUDA Version: 12.4
  Driver Version: 550.xx
  Memory Bandwidth: ~576 GB/s

System Memory
  Total RAM: 31.9 GB
  Available RAM: 29.9 GB
  Swap: 8.0 GB

Acceleration Support
  CUDA: Available
  TensorRT: Compatible
  FP16 Inference: Supported
  INT8 Inference: Supported

Runtime Compatibility
  llama.cpp: Supported (CPU + GPU offload)
  ONNX Runtime: Supported
  TensorRT: Supported
  vLLM: Supported

VectorPrime Optimization Hints
  Recommended Runtime: TensorRT / llama.cpp
  Recommended Precision: FP16 / INT8
  Suggested Threads: 16–32
  GPU Offload Capacity: High

System Readiness
  ✓ CUDA driver detected
  ✓ GPU compute capability supported
  ✓ Sufficient VRAM for large LLMs

System ready for optimized LLM inference.
```
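Because `--json` emits plain JSON, the profile is easy to consume from scripts, for example to gate a deployment step. A small sketch (the helper name, the VRAM threshold, and the abbreviated sample are illustrative, not part of VectorPrime):

```python
import json

# Abbreviated sample mirroring the --json fields shown above.
PROFILE_JSON = """
{
  "gpu": {"name": "NVIDIA GeForce RTX 4090 Laptop GPU", "vram_mb": 16376},
  "capabilities": {"gpu_inference": true, "fp16": true, "tensorrt_supported": true}
}
"""

def gpu_ready(profile: dict, min_vram_mb: int = 8192) -> bool:
    """Gate on GPU inference being available with a minimum VRAM floor."""
    return (profile["capabilities"]["gpu_inference"]
            and profile["gpu"]["vram_mb"] >= min_vram_mb)

profile = json.loads(PROFILE_JSON)
print(gpu_ready(profile))  # True for the sample profile above
```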
```shell
vectorprime doctor
```

Probes for required inference components and reports which are available:

```
VectorPrime System Check
────────────────────────
✓ CUDA installed
✓ GPU driver detected
✓ TensorRT available
✓ llama.cpp GPU support

System ready for optimized inference.
```
If a component is missing, its line shows ✗ and the summary changes to:

```
✗ TensorRT available
✗ llama.cpp GPU support

Some components missing — see above.
```
The doctor command checks:

| Component | Detection method |
|---|---|
| CUDA installed | `nvidia-smi` on PATH |
| GPU driver detected | `nvidia-smi` on PATH |
| TensorRT available | `trtexec` on PATH |
| llama.cpp GPU support | `llama-cli` or `llama-server` on PATH |
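Every check in the table reduces to a PATH lookup, so the probe logic can be sketched with `shutil.which`. The function name and the component-to-binary mapping below are taken from the table, but the code itself is an illustration, not VectorPrime's implementation:

```python
import shutil

# Component -> binaries that satisfy it, per the table above.
CHECKS = {
    "CUDA installed": ("nvidia-smi",),
    "GPU driver detected": ("nvidia-smi",),
    "TensorRT available": ("trtexec",),
    "llama.cpp GPU support": ("llama-cli", "llama-server"),
}

def run_checks() -> dict:
    """Return {component: found?} by probing PATH for each candidate binary."""
    return {name: any(shutil.which(b) is not None for b in bins)
            for name, bins in CHECKS.items()}

for name, ok in run_checks().items():
    print(("✓" if ok else "✗"), name)
```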
```shell
vectorprime optimize model.gguf
```

```
─────────────────────────────────────
 VectorPrime Optimization Result
─────────────────────────────────────
 Runtime:      Ollama
 Quantization: Q4_K_M
 Threads:      16
 GPU Layers:   20
 Throughput:   110.3 tokens/sec
 Latency:      91.2 ms
 Memory:       8.2 GB peak
─────────────────────────────────────

Optimized model written to: model-optimized.gguf
```
Options:

```
vectorprime optimize <model_path> [OPTIONS]

Arguments:
  model_path            Path to the model file (.gguf or .onnx).

Options:
  --format {gguf,onnx}  Model format. Auto-detected from extension when omitted.
  --max-memory MB       Warn if peak memory exceeds this limit (MB).
  --gpu MODEL           Target GPU model (e.g. 4090, a100, h100, or 'cpu' for
                        CPU-only). Overrides auto-detected hardware.
  --latency MS          Maximum tolerated latency (ms). Configurations above
                        this threshold are excluded.
  --output PATH         Destination path for the re-quantized output model.
  --no-cache            Bypass the result cache and run benchmarking even if
                        a cached result exists. The new result is stored after
                        completion.
```
Ollama export is available via the Python API. Call `vectorprime.export_ollama(result, output_dir)` to produce a `Modelfile`, `model.gguf`, and `metadata.json` bundle ready for `ollama create`. See the Python API section for a full example.
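The bundle layout implied above, and a Modelfile matching the tuned values from the sample optimization result, would look roughly like this (illustrative sketch; `FROM` and `PARAMETER num_thread`/`num_gpu` are standard Ollama Modelfile directives, but the exact file contents VectorPrime emits may differ):

```
optimized_model/
├── Modelfile       # FROM ./model.gguf plus tuned num_thread / num_gpu
├── model.gguf
└── metadata.json
```

```
FROM ./model.gguf
PARAMETER num_thread 16
PARAMETER num_gpu 20
```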
```shell
# GGUF → ONNX
vectorprime convert-to-onnx model.gguf --output model.onnx

# ONNX → GGUF (metadata is round-tripped from the original GGUF when available)
vectorprime convert-to-gguf model.onnx --output model.gguf
```

| Runtime | Priority | Backend Binary | Model Format | Notes |
|---|---|---|---|---|
| Ollama | Primary | `ollama` | GGUF | Recommended for most users |
| TensorRT | Primary | `trtexec` | ONNX | NVIDIA GPU, compute capability >= 7.0 |
| ONNX Runtime | Secondary | `python3` + `onnxruntime` | ONNX | CPU and CUDA execution providers |
| llama.cpp | Deprioritized | `llama-cli` | GGUF | CPU + GPU offload via `--n-gpu-layers` |
Missing binaries return a structured `NotInstalled` error and are skipped — VectorPrime benchmarks whatever runtimes are present.
VectorPrime caches optimization results so repeated runs on the same model and hardware return instantly without re-running benchmarks.
Cache location: `~/.llmforge/cache/`

Cache key: SHA-256 of `{model_mtime}_{model_size}_{hardware_profile_json}`. The key encodes both the model's identity (modification time and file size) and the full hardware profile. A result cached on one machine is not reused on a different machine, and a result cached for one model version is invalidated when the model file changes.
On cache hit: All benchmarking is skipped; the stored OptimizationResult is returned immediately.
On cache miss or read error: VectorPrime runs normally and writes the result to the cache after benchmarking completes.
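The documented key can be reproduced in a few lines. This sketch assumes a plausible layout for the hashed string and a JSON-serializable profile dict; the internal implementation may format things differently:

```python
import hashlib
import json
import os

def cache_key(model_path: str, hardware_profile: dict) -> str:
    """SHA-256 over model mtime, model size, and the hardware profile JSON."""
    st = os.stat(model_path)
    material = "_".join([
        str(st.st_mtime),
        str(st.st_size),
        json.dumps(hardware_profile, sort_keys=True),  # stable field ordering
    ])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()
```

Touching the model file or changing any hardware field yields a new key, which is exactly the invalidation behavior described above.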
Disabling the cache:

```shell
# CLI
vectorprime optimize model.gguf --no-cache
```

```python
# Python API
result = vectorprime.optimize("model.gguf", use_cache=False)
```

VectorPrime runs a 4-stage Bayesian optimization pipeline. Before Stage 1, a cache lookup is performed — if a result for the same model and hardware already exists, all benchmarking is skipped entirely.
```
[Cache] SHA-256 lookup in ~/.llmforge/cache/ — returns immediately on hit

[1] Hardware Profiling          (0 benchmarks)
    CPU cores, SIMD extensions (via raw-cpuid), GPU VRAM and compute
    capability (via nvidia-smi), available RAM (via sysinfo).

[2] Model Graph Analysis        (0 benchmarks)
    Parses the model file — GGUF via a custom byte reader, ONNX via
    protobuf — to extract parameter count, architecture, hidden size,
    attention heads, KV cache size, and FLOPs per token without running
    inference. Classifies workload as Memory-bound, Compute-bound, or
    Balanced to guide quantization selection.

[3] Runtime Preselection        (0 benchmarks)
    Selects viable runtimes based on model format (GGUF or ONNX) and
    available hardware. Prunes quantization options by VRAM/RAM budget.
    Computes the search space: runtimes × quantizations × gpu_layers ×
    threads × batch_size.

[4] Bayesian Optimization       (≤ 12 benchmarks)
    Runs 5 quasi-random Halton samples across the search space, then 7
    Tree-structured Parzen Estimator (TPE) refinement iterations.
    Each benchmark shells out to the runtime adapter (Ollama, TensorRT,
    ONNX Runtime, or llama.cpp) and collects tokens/sec, latency, and
    peak memory. The best configuration is returned and cached.
    Falls back to full cartesian search if all 12 evaluations fail.
```
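Stage 4's quasi-random warm-up can be illustrated with a one-dimensional Halton (radical-inverse) generator. Mapping its output onto discrete options is shown with made-up option lists, not VectorPrime's actual search space:

```python
def halton(index: int, base: int) -> float:
    """Radical inverse of `index` in `base`: a low-discrepancy point in [0, 1)."""
    result, f = 0.0, 1.0
    while index > 0:
        f /= base
        result += f * (index % base)
        index //= base
    return result

def sample_config(i: int) -> dict:
    """Map one Halton point per dimension onto discrete candidate options."""
    threads = [4, 8, 16, 32]
    gpu_layers = [0, 10, 20, 33]
    quants = ["F16", "Q8_0", "Q4_K_M"]
    return {
        "threads": threads[int(halton(i, 2) * len(threads))],
        "gpu_layers": gpu_layers[int(halton(i, 3) * len(gpu_layers))],
        "quant": quants[int(halton(i, 5) * len(quants))],
    }

print([round(halton(i, 2), 3) for i in range(1, 6)])  # [0.5, 0.25, 0.75, 0.125, 0.625]
```

Unlike pure random sampling, consecutive Halton points fill the space evenly, so even 5 warm-up benchmarks touch distinct corners of the configuration grid before TPE refinement starts.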
The result is cached to ~/.llmforge/cache/ after benchmarking, keyed by model identity and hardware profile.
```python
import vectorprime

# Profile hardware
hw = vectorprime.profile_hardware()
print(hw.cpu_cores, hw.gpu_model, hw.ram_total_mb)

# Inspect a model's architecture without running inference
# Returns a dict with: format, param_count, architecture, context_length,
# layer_count, hidden_size, attention_head_count, attention_head_count_kv,
# feed_forward_length, kv_cache_size_mb, memory_footprint_mb, flops_per_token
model_info = vectorprime.analyze_model("model.gguf")
print(model_info["param_count"], model_info["architecture"], model_info["context_length"])

# Run optimization (results are cached by default in ~/.llmforge/cache/)
result = vectorprime.optimize("model.gguf", use_cache=True)
print(result.runtime, result.tokens_per_sec, result.latency_ms)
# Ollama 110.3 91.2

# Bypass the cache to force a fresh benchmark run
result = vectorprime.optimize("model.gguf", use_cache=False)

# Export an Ollama-ready bundle
manifest_json = vectorprime.export_ollama(result, "./optimized_model")

# Convert formats
vectorprime.convert_gguf_to_onnx("model.gguf", "model.onnx")
vectorprime.convert_onnx_to_gguf("model.onnx", "model-roundtrip.gguf")
```

Results from `vectorprime optimize` on a system with an Intel Core i9-13900K (16 cores, AVX-512), an NVIDIA RTX 4090 (24 GB VRAM), and 64 GB DDR5 RAM. Your results will vary.
| Model | Runtime | Quantization | Threads | GPU Layers | Throughput (tok/s) | Latency (ms) | Memory (GB) |
|---|---|---|---|---|---|---|---|
| Llama 3.1 8B | LlamaCpp | Q4_K_M | 16 | 20 | 110.3 | 91.2 | 8.2 |
| Llama 3.1 8B | LlamaCpp | Q8_0 | 16 | 10 | 74.1 | 135.4 | 12.8 |
| Mistral 7B | LlamaCpp | Q4_K_M | 16 | 20 | 118.7 | 84.2 | 7.4 |
| Mistral 7B | OnnxRuntime | Int8 | 8 | 0 | 42.3 | 236.8 | 9.1 |
| Phi-3 Mini 3.8B | TensorRT | Int8 | 8 | 33 | 198.4 | 50.4 | 5.6 |
VectorPrime is a Rust workspace. The Python layer (CLI + helpers) sits on top of a cdylib native extension compiled via PyO3 and maturin.
```
python/vectorprime/cli.py  (argparse CLI — 5 subcommands)
        |
        v
vectorprime-bindings  (PyO3 cdylib — _vectorprime.so)
        |
        +---> vectorprime-export     (Ollama bundle generation)
        |            |
        +---> vectorprime-optimizer  (search + parallel benchmark loop)
        |            |
        |      +-----+-----+
        |      |           |
        +---> vectorprime-hardware   vectorprime-runtime  (adapter dispatch)
        |            |                       |
        +---> vectorprime-model-ir          |
        |            |                       |
        +---> vectorprime-core <------------+
              (shared types/traits/errors)
```
| Crate | Responsibility |
|---|---|
| `vectorprime-core` | `HardwareProfile`, `OptimizationResult`, `RuntimeAdapter` trait, `GpuProbe` trait, `RuntimeError` |
| `vectorprime-hardware` | CPU detection (`raw-cpuid`), NVIDIA GPU detection (`nvidia-smi`), RAM (`sysinfo`) |
| `vectorprime-model-ir` | GGUF byte reader and ONNX protobuf parser; extracts architecture metadata without inference |
| `vectorprime-runtime` | `LlamaCppAdapter`, `OnnxAdapter`, `TensorRtAdapter`; adapter registry and dispatch |
| `vectorprime-optimizer` | 4-stage Bayesian/TPE optimization pipeline (hardware context, model context, runtime preselection, TPE search); result caching via `~/.llmforge/cache/` |
| `vectorprime-export` | Modelfile writer, GGUF copy, metadata.json serialization |
| `vectorprime-bindings` | PyO3 `#[pymodule]` wiring every crate into the `_vectorprime` extension module |
For end-users: use `pip install vectorprime` instead. For developers and contributors who want to modify the codebase, follow the setup below. Building from source requires the Rust toolchain.
| Tool | Version | Install |
|---|---|---|
| Rust toolchain | 1.75+ | `curl https://sh.rustup.rs -sSf \| sh` |
| Python | 3.9+ | python.org |
| maturin | 1.0+ | `pip install maturin` |
| Python dev headers | — | `sudo apt install python3-dev` (Debian/Ubuntu) |
```shell
git clone https://github.com/TheRadDani/llm-forge
cd llm-forge

python -m venv .venv && source .venv/bin/activate
pip install maturin pytest numpy onnxruntime

# Compile the Rust extension and install into the active venv
maturin develop

# Verify
vectorprime profile
```

```shell
# All Rust unit tests
cargo test --workspace

# Code style and lint
cargo fmt --all -- --check
cargo clippy --all-targets --all-features -- -D warnings

# Python integration tests (no fixtures or GPU required)
pytest tests/ -v
```

Contributions are welcome — bug reports, feature requests, documentation improvements, and new runtime adapters.
- Fork the repository and create a branch from `main`
- Make your changes with tests
- Confirm `cargo test --workspace` and `pytest tests/` both pass
- Open a pull request with a clear description
Adding a new runtime: Implement `RuntimeAdapter` in `crates/vectorprime-runtime/src/` and register the adapter in the `AdapterRegistry`. The optimizer and Python binding layers require no changes.
See open issues for contribution ideas.
MIT. See LICENSE for the full text.
VectorPrime builds on:
- llama.cpp — GGUF format specification and the `llama-cli`/`llama-quantize` binaries
- ONNX Runtime — inference engine behind the ONNX adapter
- TensorRT — NVIDIA's high-performance inference library
- Ollama — local model runner that VectorPrime exports to
- PyO3 and maturin — Rust/Python interop and packaging
- Tokio — async runtime powering parallel benchmarking
- anyhow and thiserror — structured error handling
