Juno

Java Unified Neural Orchestration

Distributed LLM inference and fine-tuning. Pure Java - No Python, no GIL, no Spring.

1. What is Juno

Distributed inference

Pipeline parallel — contiguous layer blocks across JVM nodes; activations flow serially over gRPC.
Tensor parallel — full depth on each node with head/FFN slices; coordinator AllReduce on logits.
Zero sidecar processes: coordinator (juno-master) and workers (juno-node) are shaded JVM jars.

GPU acceleration

NVIDIA CUDA 12.x / cuBLAS and AMD ROCm 6+ / rocBLAS via Panama FFI (java.lang.foreign).
Auto-selection at startup: CUDA → ROCm → CPU. Override with -Djuno.gpu.backend=cuda|rocm|auto.
Device-resident FP16 weights; automatic CPU quantised fallback on VRAM OOM.

LoRA fine-tuning

In-process training REPL: ./juno lora
Inference overlay: --lora-play PATH (local, cluster, AWS)
Native merge to standalone GGUF: ./juno merge (patched tensors stored as F32)

OpenAI-compatible REST

POST /v1/chat/completions (blocking + SSE)
GET /v1/models, GET /v1/models/{model}
Enable with --api-port N on ./juno local or cluster mode
Juno extensions: x_juno_priority, x_juno_session_id, x_juno_top_k

JVM integration

Maven BOM: cab.ml:juno-bom:0.1.0
Facade API: JunoPlayer, LoraTrainer, JunoHttpClient
See docs/howto.md JVM integration section

Observability

Custom JFR events across matmul, forward pass, token generation, LoRA training
Health dashboard with per-node CPU load, coordinator P99 latency, node throughput
Performance matrix: docs/juno_test_matrix.html

Please see full feature list here

2. How to use

2.1 JVM Integration

Integrate on **cab.ml artifacts at version 0.1.0** from Maven Central:

	<dependencyManagement>
	  <dependencies>
	    <dependency>
	      <groupId>cab.ml</groupId>
	      <artifactId>juno-bom</artifactId>
	      <version>0.1.0</version>
	      <type>pom</type>
	      <scope>import</scope>
	    </dependency>
	  </dependencies>
	</dependencyManagement>

for more info please refer to Juno cookbook

Then simply use LocalChat.java if you are going to use only current JVM, reading a model:


	private static LocalChat lc;

	@BeforeAll
	static void buildPipline() throws Exception {
		lc = LocalChat.builder(Path.of(MODEL_PATH)).nodeCount(1).useGpu(false)
				.samplingParams(SamplingParams.defaults().withMaxTokens(64).withTemperature(0.7f)).build();
	}

	@AfterAll
	static void closePipeline() {
		if (lc != null) {
			lc.close();
		}
	}

	@Test
	@Order(1)
	@DisplayName("single turn returns a non-empty reply")
	void singleTurnReturnsNonEmptyReply() {
		String reply = lc.chat("Hello, how are you?");
		assertThat(reply).isNotBlank();
	}

Then follow docs/howto.md JVM integration section.

2.2 Local player and LoRA (including Hugging Face–origin weights)

Contributors and enthusiasts can build from source: mvn clean package -DskipTests.

Clone the repo:

git clone https://github.com/ml-cab/juno.git && cd juno
mvn clean package -DskipTests

Download a GGUF (replace the URL with your chosen model):

cd juno/models
wget https://huggingface.co/.../tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf

then run local Juno interactive console to try and train inference

./juno local --model-path models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf

**--model-path** is relative for juno project dir. REST alongside REPL: add **--api-port 8080**.

Training: ./juno lora --model-path ... (see docs/LoRA.md).

Optional **./juno merge** bakes a trained .lora into a new GGUF, so that inference needs no sidecar adapter

More at howto.md.

2.3 On-prem orchestration

Run **juno-master** as the coordinator and **juno-node** on each worker with gRPC between them (systemd or your own process manager). Parallelism modes and byte-order flags match local cluster harness behaviour described in docs/howto.md; topology and components are in docs/arch.md. AWS automation under **scripts/aws/** is optional cloud packaging of the same roles.

3. Stack

Node coordination and inference RPCs use gRPC with protobuf contracts from the api module. GPU matmul is backed by Panama FFI (java.lang.foreign) against two vendor libraries:

NVIDIA: CUDA 12.x + cuBLAS — CudaBindings resolves libcudart.so.12 and libcublas.so.12; CudaMatVec owns all device memory and stream lifecycle.
AMD: ROCm 6+ + rocBLAS — RocmBindings resolves libamdhip64.so and librocblas.so; RocmMatVec mirrors the same device-resident FP32/FP16 paths.

Backend is auto-selected at startup: CUDA first, then ROCm, then CPU. Override with -Djuno.gpu.backend=cuda|rocm|auto. A CPU quantised path is used when GPU is off or unavailable. The coordinator HTTP surface (REST and SSE) is implemented with Javalin.

4. Useful refs

Release notes: RELEASE_NOTES.md
Security: SECURITY.md
Performance matrix: docs/juno_test_matrix.html - methodology companion docs/performance.md
Legal Q&A: docs/legal.md

Requirements

JDK 25+, Maven 3.9+. GPU nodes: CUDA 12.x + NVIDIA driver or ROCm 6+ + AMD driver (optional — CPU-only inference requires neither).

Supported models

GGUF with LLaMA-compatible architectures (quantizations include F32, F16, BF16, Q8_0, Q4_0, Q2_K, Q3_K, Q4_K, Q5_K, Q6_K). Chat templates: llama3, mistral, gemma, tinyllama/zephyr, chatml, phi3. phi3 (Phi-3 / Phi-3.5) is supported via a dedicated handler and template. Gemma, Qwen 2, Qwen3, and Qwen3.5 (gemma, qwen2, qwen3, qwen3moe, qwen35) are under development — template and handler groundwork exists for some paths; end-to-end validation is in progress (no LoRA on Gemma/Qwen). Examples (heap hints): TinyLlama Q4_K_M (~637 MB, 2g), Mistral-7B Q4_K_M (~4.1 GB, 8g), Phi-3.5-mini Q4_K_M (~2.2 GB, 4g), Llama-3.1-70B Q4_K_M distributed.

Modules (overview)

Module	Role
`juno-bom`	Maven BOM — aligned versions for all `cab.ml` artifacts
`api`	OpenAPI spec, protobuf/gRPC API
`registry`	Shard planning, model registry
`coordinator`	Scheduler, generation loop, REST
`node`	Transformer handlers, GGUF, GPU matmul (CUDA + ROCm via Panama FFI)
`lora`	Adapter tensors, optimizer
`tokenizer`, `sampler`, `kvcache`, `health`, `metrics`	Shared infrastructure
`juno-player`	CLI REPL and cluster harness
`juno-node`, `juno-master`	Shaded deploy jars

Details: docs/arch.md.

License

Apache 2.0 - see LICENSE.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Juno

1. What is Juno

Distributed inference

GPU acceleration

LoRA fine-tuning

OpenAI-compatible REST

JVM integration

Observability

2. How to use

2.1 JVM Integration

2.2 Local player and LoRA (including Hugging Face–origin weights)

2.3 On-prem orchestration

3. Stack

4. Useful refs

Requirements

Supported models

Modules (overview)

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 146 Commits
api		api
coordinator		coordinator
docs		docs
health		health
juno-bom		juno-bom
juno-master		juno-master
juno-node		juno-node
juno-player		juno-player
kvcache		kvcache
lora		lora
metrics		metrics
node		node
registry		registry
sampler		sampler
scripts		scripts
tokenizer		tokenizer
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
CONTRIBUTORS.md		CONTRIBUTORS.md
FUNDING.md		FUNDING.md
LICENSE		LICENSE
Legal.md		Legal.md
README.md		README.md
RELEASE_NOTES.md		RELEASE_NOTES.md
SECURITY.md		SECURITY.md
juno		juno
juno.bat		juno.bat
pom.xml		pom.xml

Folders and files

Latest commit

History

Repository files navigation

Juno

1. What is Juno

Distributed inference

GPU acceleration

LoRA fine-tuning

OpenAI-compatible REST

JVM integration

Observability

2. How to use

2.1 JVM Integration

2.2 Local player and LoRA (including Hugging Face–origin weights)

2.3 On-prem orchestration

3. Stack

4. Useful refs

Requirements

Supported models

Modules (overview)

License

About

Topics

Resources

License

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages