BeeLlama Docker images are published by the push/manual Docker workflow to
ghcr.io/anbeeld/beellama.cpp. CUDA images are built from the stock
.devops/cuda.Dockerfile; pass CUDA_DOCKER_ARCH when building locally for a
specific GPU generation, for example 120 for Blackwell or 86 for RTX 3090.
- Docker must be installed and running on your system.
- Create a folder to store big models & intermediate files (ex. /llama/models)
The CI workflow publishes llama-server images to
ghcr.io/anbeeld/beellama.cpp:
server/server-cpu: CPU backend. (linux/amd64,linux/arm64)server-cuda/server-cuda12: CUDA 12.4 backend. (linux/amd64)server-cuda13: CUDA 13.1 backend. (linux/amd64)server-rocm: ROCm backend. (linux/amd64)server-vulkan: Vulkan backend. (linux/amd64)server-sycl: SYCL backend for Intel GPUs. (linux/amd64)
Release tags are published as both floating tags such as server-sycl and
versioned tags such as server-sycl-v0.3.0. Branch builds use development tags
such as server-sycl-v0.3.0-dev and commit-specific tags such as
server-sycl-v0.3.0-<short-sha>.
The SYCL image is built as a generic Intel SYCL target. It intentionally does
not set GGML_SYCL_DEVICE_ARCH, so device code is selected by the oneAPI
runtime for the host Intel GPU instead of being pinned to one GPU generation.
Replace /path/to/models below with the actual path where you downloaded the models.
docker run --rm -it \
-v /path/to/models:/models \
-p 8080:8080 \
ghcr.io/anbeeld/beellama.cpp:server \
-m /models/model.gguf --port 8080 --host 0.0.0.0 -n 512Use the backend-specific server tag for GPU acceleration. For example:
docker run --gpus all \
-v /path/to/models:/models \
-p 8080:8080 \
ghcr.io/anbeeld/beellama.cpp:server-cuda13 \
-m /models/model.gguf --port 8080 --host 0.0.0.0 -ngl 999Assuming one has the nvidia-container-toolkit properly installed on Linux, or is using a GPU enabled cloud, cuBLAS should be accessible inside the container.
The SYCL image targets Intel GPUs through oneAPI and Level Zero. The host still needs a working Intel GPU driver stack, and the container needs access to the DRI devices:
docker run --rm -it \
--device /dev/dri \
-v /path/to/models:/models \
-p 8080:8080 \
ghcr.io/anbeeld/beellama.cpp:server-sycl \
-m /models/model.gguf --port 8080 --host 0.0.0.0 -ngl 999Set ONEAPI_DEVICE_SELECTOR=level_zero:<index> if the host has multiple Intel
GPU devices and you need to choose one explicitly.
docker build -t local/beellama.cpp:full-cuda --target full -f .devops/cuda.Dockerfile .
docker build -t local/beellama.cpp:light-cuda --target light -f .devops/cuda.Dockerfile .
docker build -t local/beellama.cpp:server-cuda --target server -f .devops/cuda.Dockerfile .You may want to pass in some different ARGS, depending on the CUDA environment supported by your container host, as well as the GPU architecture.
The defaults are:
CUDA_VERSIONset to12.4.1UBUNTU_VERSIONset to22.04CUDA_DOCKER_ARCHset to the cmake build default, which includes all the supported architectures
The resulting images, are essentially the same as the non-CUDA images:
local/beellama.cpp:full-cuda: This image includes both thellama-cliandllama-completionexecutables and the tools to convert LLaMA models into ggml and convert into 4-bit quantization.local/beellama.cpp:light-cuda: This image only includes thellama-cliandllama-completionexecutables.local/beellama.cpp:server-cuda: This image only includes thellama-serverexecutable.
After building locally, Usage is similar to the non-CUDA examples, but you'll need to add the --gpus flag. You will also want to use the --n-gpu-layers flag.
docker run --gpus all -v /path/to/models:/models local/beellama.cpp:full-cuda --run -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512 --n-gpu-layers 1
docker run --gpus all -v /path/to/models:/models local/beellama.cpp:light-cuda -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512 --n-gpu-layers 1
docker run --gpus all -v /path/to/models:/models local/beellama.cpp:server-cuda -m /models/7B/ggml-model-q4_0.gguf --port 8080 --host 0.0.0.0 -n 512 --n-gpu-layers 1Assuming one has the mt-container-toolkit properly installed on Linux, muBLAS should be accessible inside the container.
docker build -t local/beellama.cpp:full-musa --target full -f .devops/musa.Dockerfile .
docker build -t local/beellama.cpp:light-musa --target light -f .devops/musa.Dockerfile .
docker build -t local/beellama.cpp:server-musa --target server -f .devops/musa.Dockerfile .You may want to pass in some different ARGS, depending on the MUSA environment supported by your container host, as well as the GPU architecture.
The defaults are:
MUSA_VERSIONset torc4.3.0
The resulting images, are essentially the same as the non-MUSA images:
local/beellama.cpp:full-musa: This image includes both thellama-cliandllama-completionexecutables and the tools to convert LLaMA models into ggml and convert into 4-bit quantization.local/beellama.cpp:light-musa: This image only includes thellama-cliandllama-completionexecutables.local/beellama.cpp:server-musa: This image only includes thellama-serverexecutable.
After building locally, Usage is similar to the non-MUSA examples, but you'll need to set mthreads as default Docker runtime. This can be done by executing (cd /usr/bin/musa && sudo ./docker setup $PWD) and verifying the changes by executing docker info | grep mthreads on the host machine. You will also want to use the --n-gpu-layers flag.
docker run -v /path/to/models:/models local/beellama.cpp:full-musa --run -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512 --n-gpu-layers 1
docker run -v /path/to/models:/models local/beellama.cpp:light-musa -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512 --n-gpu-layers 1
docker run -v /path/to/models:/models local/beellama.cpp:server-musa -m /models/7B/ggml-model-q4_0.gguf --port 8080 --host 0.0.0.0 -n 512 --n-gpu-layers 1