From a77a93a1120db8a9c4e96827788873bb22a187d9 Mon Sep 17 00:00:00 2001 From: Datta Rajpure Date: Thu, 26 Mar 2026 13:37:38 -0700 Subject: [PATCH] Blog: AI Inference on AKS Arc - Part 0 and Part 1 --- .../index.md | 59 ++++ .../index.md | 169 +++++++++ .../index.md | 321 ++++++++++++++++++ website/blog/authors.yml | 12 +- website/blog/tags.yml | 12 +- 5 files changed, 571 insertions(+), 2 deletions(-) create mode 100644 website/blog/2026-04-01-ai-inference-on-aks-arc/index.md create mode 100644 website/blog/2026-04-02-ai-inference-on-aks-arc-part-0/index.md create mode 100644 website/blog/2026-04-03-ai-inference-on-aks-arc-part-1/index.md diff --git a/website/blog/2026-04-01-ai-inference-on-aks-arc/index.md b/website/blog/2026-04-01-ai-inference-on-aks-arc/index.md new file mode 100644 index 000000000..f039232a1 --- /dev/null +++ b/website/blog/2026-04-01-ai-inference-on-aks-arc/index.md @@ -0,0 +1,59 @@ +--- +title: "AI Inference on AKS Arc: Empowering Customers to Explore AI at the Edge" +date: 2026-04-01 +description: "Discover how to bring AI inference closer to your data with Azure Arc–enabled AKS, and explore practical scenarios for hybrid and edge deployments." +authors: +- datta-rajpure +tags: ["aks-arc", "ai", "ai-inference"] +--- +This blog post explores why AI inferencing on AKS Arc is critical for hybrid and edge deployments, enabling low-latency, secure, and scalable AI workloads close to where data is generated. It introduces practical, step-by-step guidance for running generative and predictive AI inference workloads on Azure Arc–enabled AKS clusters using CPUs, GPUs, and NPUs in repeatable, production‑oriented scenarios. + + + +## Introduction + +As organizations increasingly seek to run artificial intelligence (AI) closer to where their data is generated – from factory floors and retail stores to hospital data centers – they face unique challenges around connectivity, latency, and data governance. 
High-end cloud GPUs are not always practical in these on-premises or edge locations due to cost, power, or privacy constraints. At the same time, there is an explosion of demand for hybrid AI: enterprises want to deploy advanced models wherever their data lives, yet with cloud-like performance and manageability. +Azure Arc–enabled Kubernetes is designed to meet this need. It extends Azure’s management capabilities to distributed Kubernetes clusters, enabling customers to deploy and operate AI workloads on infrastructure running in datacenters, branch offices, or edge locations. This blog post explores the strategic importance of AI inferencing on AKS Arc–enabled Azure Local and introduces a hands-on tutorial series that empowers customers to explore and validate AI workloads in real-world hybrid scenarios. + +## Why AI Inferencing on AKS Arc Matters + +Running AI inference on Arc-enabled Kubernetes clusters addresses several urgent customer needs and industry trends: + +- **Low Latency & Data Residency –** +Inference workloads can run locally on-premises or at the edge, ensuring real-time responsiveness and compliance with data sovereignty requirements. This is essential for scenarios like factory automation, medical imaging, or retail analytics, where data must remain on-site and latency is a key constraint. + +- **Existing Hardware Utilization –** +Many organizations operate in environments without access to GPUs. By deploying optimized AI runtimes such as Intel OpenVINO or ONNX Runtime on Arc-managed clusters, customers can run inference workloads on CPU-only servers or other available hardware. This allows them to leverage existing infrastructure while maintaining flexibility to scale with GPUs or other accelerators in the future. + +- **Hybrid & Disconnected Operations –** +AKS Arc provides a consistent deployment and governance experience across connected and disconnected environments. 
Customers can centrally manage AI workloads from Azure while ensuring local execution continues even during network outages. + +- **Aligned with Industry Trends –** +The shift toward hybrid and edge AI is driven by trends like data gravity, regulatory compliance, and the need for real-time insights. AKS Arc aligns with these trends by enabling scalable, secure, and flexible AI deployments across industries such as manufacturing, healthcare, retail, and logistics. + +## A Platform for Distributed AI Operations + +AKS Arc enables customers to bring their own AI runtimes and models to Kubernetes clusters running in hybrid environments. It provides: + +- A consistent DevOps experience for deploying and managing AI models across environments +- Centralized governance, monitoring, and security via Azure +- Integration with Azure ML and Microsoft Foundry for model lifecycle management +- Support for diverse hardware configurations, including CPUs, GPUs, and NPUs + +By managing Kubernetes clusters across hybrid and edge environments, AKS Arc helps customers operationalize AI workloads using the tools and runtimes that best fit their infrastructure and use cases. + +## Explore AI Inference with Step-by-Step Tutorials + +To help customers explore and validate AI inference on AKS Arc, we’ve created a series of scenario-driven tutorials that demonstrate how to run both generative and predictive AI inference on AKS Arc–enabled clusters. This series walks through concrete examples step-by-step, using open-source tools and real models to showcase Arc’s hybrid AI capabilities in action. 
Each tutorial focuses on a different AI inference pattern and technology stack, reflecting the diverse options available for edge inferencing: + +- Deploy open-source large language models (LLMs) using GPU-accelerated inference engines +- Serve predictive models like ResNet-50 using a unified model server +- Configure and validate inference workloads across different hardware types +- Manage and monitor inference services using Azure-native tools + +These tutorials are designed to help you build confidence in running AI at the edge using your existing Kubernetes skills and Arc-enabled infrastructure. The examples use off-the-shelf assets (open-source models and containers) to highlight Arc’s open and flexible approach: you can bring your own models and choose the best inference engine for the task, whether it’s a lightweight CPU-friendly runtime or a vendor-optimized GPU server. + +## Get Started + +AI inferencing on AKS Arc empowers you to experiment with cutting-edge AI in your own environment, free from cloud limitations but still under Azure’s management umbrella. With data staying where it’s most useful – whether for compliance, latency, or efficiency – you can unlock new scenarios and value from AI that were previously out of reach. The convergence of cloud-trained models and edge deployment via Arc represents a significant industry shift toward hybrid AI solutions that meet enterprises where they are. +To get started, follow the accompanying tutorial series. By the end of the series, you’ll have first-hand experience operationalizing AI models across hybrid cloud and edge – gaining practical skills to bring the “AI anywhere” vision to life on Azure Arc. 
diff --git a/website/blog/2026-04-02-ai-inference-on-aks-arc-part-0/index.md b/website/blog/2026-04-02-ai-inference-on-aks-arc-part-0/index.md new file mode 100644 index 000000000..bd61ecf58 --- /dev/null +++ b/website/blog/2026-04-02-ai-inference-on-aks-arc-part-0/index.md @@ -0,0 +1,169 @@ +--- +title: "AI Inference on AKS Arc — Part 0: Introduction, Audience, and Series Scope" +date: 2026-04-02 +description: "Scenario-driven series for generative and predictive AI inference on Azure Arc–enabled AKS, covering CPUs, GPUs, and NPUs in on-premises and edge environments." +authors: +- datta-rajpure +tags: ["aks-arc", "ai", "ai-inference"] +--- +This blog series provides **practical, step-by-step guidance** for running generative and predictive AI inference workloads on Azure Arc–enabled AKS clusters using CPUs, GPUs, and NPUs. The scenarios are designed to run in on‑premises and edge environments—specifically Azure Local (Azure Stack HCI)—and focus on **repeatable, production‑oriented validation scenarios** rather than abstract examples. + + + +## Introduction + +This series explores emerging patterns for running generative and predictive AI inference workloads on Azure Arc–enabled AKS clusters in on-premises and edge environments. As organizations increasingly look to deploy AI closer to where data is generated—on factory floors, in retail stores, across manufacturing lines, and within infrastructure monitoring systems—they face unique challenges: limited connectivity, diverse hardware, and constrained resources. +High-end GPUs may not always be available or practical in these environments due to cost, power, or space limitations. This has led to growing interest in leveraging existing infrastructure—such as CPU-based clusters—or exploring new accelerators like NPUs to enable scalable, low-latency inference at the edge. 
+The series focuses on scenario-driven experimentation with AI inference on AKS Arc, validating real-world deployments that go beyond traditional cloud-centric patterns. From deploying open-source LLM servers like **Ollama** and **vLLM** to integrating **NVIDIA Triton** with custom backends, each entry provides a structured approach to evaluating feasibility, performance, and operational readiness. The goal is to equip practitioners with practical insights and repeatable strategies for enabling AI inference in hybrid and edge-native environments. + +## Audience and Assumptions + +This series is written for readers who meet all of the following criteria: + +- You are already familiar with Kubernetes concepts such as pods, deployments, services, and node scheduling. +- You are operating, or plan to operate, AKS enabled by Azure Arc on Azure Local or a comparable on‑premises / edge environment. +- You are comfortable using command‑line tools such as kubectl, Azure CLI, and Helm. +- You are evaluating AI inference workloads (LLMs or predictive models) from an infrastructure and platform perspective, not from a data science or model‑training perspective. 
+ +### Explicit Non‑Goals + +To keep this series focused and actionable, the following topics are intentionally **not** covered: + +- **Kubernetes fundamentals or onboarding:** + Readers new to Kubernetes should complete foundational material first: + - [Introduction to Kubernetes (Microsoft Learn)](https://learn.microsoft.com/training/modules/intro-to-kubernetes/) + - [Kubernetes Basics Tutorial (Upstream)](https://kubernetes.io/docs/tutorials/kubernetes-basics/) + +- **Azure Arc conceptual overview or onboarding:** + This series assumes you already understand what Azure Arc provides and how Arc-enabled Kubernetes works: + - [Azure Arc–enabled Kubernetes overview](https://learn.microsoft.com/azure/azure-arc/kubernetes/overview) + - [AKS enabled by Azure Arc documentation](https://learn.microsoft.com/azure/aks/aksarc/) + +- **Model training, fine‑tuning, or data preparation:** + All scenarios assume models are already trained and packaged in formats supported by the selected inference engine. + +- **Deep internals of inference engines:** + Engine-specific internals are referenced only where required for deployment or configuration. For deeper learning: + - [NVIDIA Triton Inference Server documentation](https://docs.nvidia.com/deeplearning/triton-inference-server/) + - [NVIDIA GPU Operator documentation](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html) + +If you’re looking for conceptual comparisons, performance benchmarks, or model‑level optimizations, those topics are intentionally out of scope for this series. + +## Series Ground Rules (What This Series Guarantees) + +Part 0 outlines a set of baseline guarantees and assumptions that apply to all subsequent parts of the series: + +- All scenarios use the same Arc–enabled AKS cluster environment unless explicitly noted otherwise. +- Azure Arc is used as the management and control plane only; inference execution always occurs locally on the cluster. 
+- No managed Azure AI services are used to execute inference. +- Each scenario follows a consistent, repeatable structure so results can be compared across inference engines and hardware types. + +### Standard Workflow + +We will follow the same high‑level workflow in each scenario: + +- **Connect & Verify:** + Log in to Azure and get cluster credentials. Inspect available compute resources (CPU, GPU, NPU) and node labels/capabilities. + +- **Prepare the Accelerator (If Required):** + Install or validate the required accelerator enablement based on the scenario. + - GPU: NVIDIA GPU Operator + - NPU: Vendor‑specific enablement (future) + - CPU: No accelerator setup required +- **Deploy the Inference Workload:** + - Deploy the model server or inference pipeline (LLM server, Triton, or other engine) + - Configure runtime parameters appropriate to the selected hardware +- **Validate Inference:** + - Send a test request (prompt, image, or payload) + - Confirm that the service responds and the inference output is as expected +- **Clean Up Resources:** + - Remove deployed workloads + - Release cluster resources (compute, storage, accelerator allocations) + +## Series Outline + +In this blog series, we explore a range of AI inference patterns on Azure Arc–enabled Kubernetes clusters, spanning both generative and predictive AI workloads. The series is designed to evolve over time, and additional topics will be added as new scenarios, runtimes, and architectures are validated. 
+ +### Topics covered in this series + +| Topic | AI Type (Generative/Predictive) | Description | +| ---------------------------------------------------- | ------------------------------- | ---------------------------------------------------------------------------------------------------------------- | +| AI Inference with **Ollama** on Azure Arc | Generative | Deploying an open‑source LLM server (Ollama) on an Arc–enabled cluster | +| AI Inference with **vLLM** on Azure Arc | Generative | Using the high‑throughput vLLM engine to serve large language models on Arc | +| AI Inference with Triton (**ONNX**) on Azure Arc | Predictive | Running an ONNX‑based ResNet‑50 vision model on Arc using **NVIDIA Triton** | +| AI Inference with Triton (**TensorRT‑LLM**) on Arc | Generative | Deploying a TensorRT‑LLM pipeline for optimized large‑model inference on Arc | +| AI Inference with Triton (**vLLM backend**) on Arc | Generative | Serving vision‑language and large language models on Arc using **Triton** with the **vLLM** backend | + +This series will continue to grow as we introduce new inference engines, hardware configurations, and real‑world deployment patterns across edge, on‑premises, and hybrid environments. + +## Prerequisites + +All scenarios in this series run on a common Arc–enabled AKS cluster environment. Before you begin, make sure you have the following in place: + +- **Arc-enabled AKS cluster with a GPU node:** A Kubernetes cluster enabled for Azure Arc on Azure Local (Azure Stack HCI) with at least one GPU node and appropriate NVIDIA drivers installed. The GPU node needs the NVIDIA device plugin (via the NVIDIA GPU Operator) running so pods can access nvidia.com/gpu resources. + +- **Azure CLI with Arc extensions:** The [Azure CLI](https://learn.microsoft.com/cli/azure/install-azure-cli) installed on your admin machine and either the `aksarc` or `connectedk8s` extensions (for Arc-enabled Kubernetes). 
Use `az extension list -o table` to confirm these are installed. + +- **kubectl:** The Kubernetes CLI installed on your workstation for applying manifests and managing cluster resources. + +- **Helm:** The [Helm](https://helm.sh/docs/intro/install/) package manager installed (v3), for deploying the GPU Operator and other Helm charts as needed. + +- **PowerShell 7+ (optional):** If using PowerShell for CLI steps and REST calls, upgrade to PowerShell 7.4 or later (older Windows PowerShell 5.1 may cause JSON quoting issues in our examples). + +- **Cluster access:** Ensure you can connect to your Arc-enabled cluster (e.g. same network or VPN to the Azure Local environment). After logging in to Azure and retrieving cluster credentials, verify access by listing nodes: + +```powershell +az login +# Replace <resource-group> and <cluster-name> with your own values. +# If this command does not find your AKS Arc cluster, try the equivalent +# 'az aksarc get-credentials' from the aksarc extension. +az aks get-credentials --resource-group <resource-group> --name <cluster-name> +kubectl get nodes +# This should show your cluster’s nodes, including any GPU node(s). +``` + +Note: On Windows 11, you can use `winget` to quickly install prerequisites. For example: + +```powershell +# Install PowerShell +winget install -e --id Microsoft.PowerShell +pwsh -v + +# Install or Update - Azure CLI, Kubectl, Helm, Git +winget install -e --id Microsoft.AzureCLI +winget install -e --id Kubernetes.kubectl +winget install -e --id Helm.Helm +winget install -e --id Git.Git +winget update -e --id Microsoft.AzureCLI +winget update -e --id Kubernetes.kubectl +winget update -e --id Helm.Helm +winget update -e --id Git.Git + +# Install or Update – Azure CLI Extensions (AKS Arc) +az extension add --name aksarc +az extension add --name connectedk8s +az extension update --name aksarc +az extension update --name connectedk8s +``` + +### Install the NVIDIA GPU Operator + +Next, install the NVIDIA GPU Operator on the cluster. This operator installs the necessary drivers and the Kubernetes device plugin that exposes GPU resources to your workloads. vLLM requires the NVIDIA device plugin to access the GPU hardware. 
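
To make "exposes GPU resources to your workloads" concrete, here is a minimal sketch of a pod that requests one `nvidia.com/gpu`. It is not part of the original walkthrough: the pod name `gpu-smoke-test` is hypothetical, and the sample image tag is the one used in NVIDIA's GPU Operator getting-started guide, so check it against the current documentation before relying on it. A pod like this stays Pending until the device plugin advertises the resource, which makes it usable as a smoke test once the installation steps that follow are complete.

```yaml
# Hypothetical smoke-test pod; the name and image tag are assumptions, not part of this series.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vectoradd
      # Sample image referenced in NVIDIA's GPU Operator docs; verify the tag before use.
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
      resources:
        limits:
          nvidia.com/gpu: 1   # schedulable only on a node where the device plugin advertises a GPU
```

After the operator is installed, you could apply this with `kubectl apply -f`, inspect the result with `kubectl logs gpu-smoke-test`, and then delete the pod so the GPU is free for the inference scenarios.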
+ +- **Add the NVIDIA Helm repository:** If you haven’t already, add NVIDIA’s Helm chart repository and update it: + +```powershell +helm repo add nvidia https://helm.ngc.nvidia.com/nvidia +helm repo update +``` + +This adds the official NVIDIA chart source (which contains the GPU operator chart) to your Helm client. + +- **Install the GPU operator:** Use Helm to install the NVIDIA GPU Operator onto your cluster: + +```powershell +helm install --wait --generate-name nvidia/gpu-operator +``` + +This will install the GPU operator into your cluster (in its default namespace) and wait for all components to be ready. The --generate-name flag automatically assigns a name to the Helm release. The operator will set up the NVIDIA device plugin and drivers on your cluster nodes. + +:::note +Ensure your cluster nodes have internet connectivity to pull the necessary container images for the operator. This may take a few minutes the first time as images are downloaded. +::: diff --git a/website/blog/2026-04-03-ai-inference-on-aks-arc-part-1/index.md b/website/blog/2026-04-03-ai-inference-on-aks-arc-part-1/index.md new file mode 100644 index 000000000..459020f05 --- /dev/null +++ b/website/blog/2026-04-03-ai-inference-on-aks-arc-part-1/index.md @@ -0,0 +1,321 @@ +--- +title: "AI Inference on AKS Arc - Part 1: Generative AI with Open‑Source LLM Server" +date: 2026-04-03 +description: "Deploying an open-source Large Language Model on an Azure Arc–enabled AKS cluster using the Ollama runtime and a GPU for generative AI inference." +authors: +- datta-rajpure +tags: ["aks-arc", "ai", "ai-inference"] +--- + +This part of the series explores how to deploy and run generative AI inference workloads using open‑source large language model (LLM) servers on AKS clusters enabled by Azure Arc. The focus is on executing these workloads locally—on-premises or at the edge—using GPU acceleration, while leveraging Azure Arc for centralized management. 
This approach is particularly valuable for scenarios where cloud-based AI services are not viable due to constraints like data sovereignty, latency sensitivity, cost, or limited internet connectivity. + + +## Introduction + +In Part 1, we dive into the practicalities of running generative AI inference using open‑source LLM servers such as Ollama and vLLM on Arc‑enabled Kubernetes clusters. Rather than optimizing for performance or benchmarking throughput, the emphasis here is on establishing a clear, repeatable, and debuggable foundation for GPU‑accelerated inference in hybrid environments. +By deploying standalone LLM servers directly as Kubernetes workloads, we avoid platform-specific abstractions and managed services. This practical approach promotes transparency and operational insight—giving you a clear view into how model serving, GPU scheduling, and inference requests function within an Arc-managed Kubernetes environment. These foundational insights will prepare you for more advanced inference architectures in later parts of the series. + +:::note +Before you begin, ensure the prerequisites described in **Part 0: Introduction, Audience, and Series Scope** are fully met. +You should have an Arc-enabled AKS cluster (on Azure Local or similar) with a **GPU node** available and configured for **nvidia.com/gpu**. +The cluster nodes must have **internet access** to download the model. If restricted, you must manually provide the model files via a Persistent Volume. **Expect a delay** during the initial deployment while the **pod downloads** and caches the large model files. +::: + +## AI Inference with Ollama on Azure Arc (Generative LLM) + +With the environment ready, we can now deploy the Ollama model server on the cluster. We will use Ollama’s official container image to set up a server that can host a large language model. 
In this example, we’ll target a relatively small LLM (**Phi-3 Mini** model with 4-bit quantization, ~2.2 GB footprint) so it can run on a single **16 GB GPU**. The deployment provides a unified endpoint supporting both Ollama’s native REST API and an OpenAI-compatible interface for interacting with the model. + +### Deploying the Ollama Model Server + +First, ensure you have connected to your Arc-enabled cluster (see Prerequisites) and that it has a GPU node with the NVIDIA device plugin ready (the GPU Operator should be installed). Then label the GPU node with accelerator=nvidia-gpu: the Deployment below selects nodes by this label, so the Ollama pod stays Pending until a node carries it. If your cluster has multiple GPU nodes, label only the node you want to target. + +```powershell +# 1. FIND THE GPU NODE NAME +# Lists every node that advertises at least one allocatable 'nvidia.com/gpu' resource. +kubectl get nodes -o json | ConvertFrom-Json | ForEach-Object { $_.items } | Where-Object { $_.status.allocatable.'nvidia.com/gpu' -gt 0 } | Select-Object -ExpandProperty metadata | Select-Object -ExpandProperty name + +# 2. APPLY THE APPLICATION LABEL +# Replace <gpu-node-name> with a node name printed by step 1. +# This identifies the node as the designated "home" for the Ollama application. +# Useful for organizational filtering: 'kubectl get nodes -l app=ollama' +kubectl label node <gpu-node-name> app=ollama + +# 3. APPLY THE HARDWARE LABEL FOR SCHEDULING AFFINITY +# This matches the 'nodeSelector' in the Deployment YAML below. +# This creates a "Hardware Requirement" tag on the Node's metadata. +kubectl label node <gpu-node-name> accelerator=nvidia-gpu +``` + +Next, create a Kubernetes manifest (e.g. ollama-deployment.yaml) for the Ollama Deployment and Service: + +```yaml +# 1. NAMESPACE: Creates a dedicated logical "room" for your Ollama resources. +# This prevents your models and services from cluttering the 'default' namespace. +apiVersion: v1 +kind: Namespace +metadata: + name: ollama-inference +--- +# 2. DEPLOYMENT: This manages the lifecycle of your Ollama container. +# It ensures that if the pod crashes, a new one is automatically started. 
+apiVersion: apps/v1 +kind: Deployment +metadata: + name: ollama + namespace: ollama-inference +spec: + replicas: 1 + # SELECTOR: The Deployment controller uses this to find the Pods it owns. + selector: + matchLabels: + app: ollama + template: + metadata: + # POD LABELS: These are the "tags" applied to the actual running container. + # These MUST match the selector above and the Service selector below. + labels: + app: ollama + spec: + # NODE SELECTOR: This is the hardware "constraint." + # It forces the pod to land ONLY on a node you have labeled 'accelerator: nvidia-gpu'. + nodeSelector: + accelerator: nvidia-gpu + containers: + - name: ollama + image: ollama/ollama:0.18.3 + ports: + # PORT NAME + # Note: Kubernetes still uses TCP under the hood (the default protocol). + - name: http + containerPort: 11434 + resources: + # RESOURCE LIMITS: This is the actual "handshake" with the NVIDIA driver. + # It tells the cluster to carve out 1 physical GPU for this pod. + limits: + nvidia.com/gpu: 1 +--- +# 3. SERVICE: This acts as the "Front Door" or Load Balancer for the Pod. +# It provides a stable IP address so you can talk to the Ollama API. +apiVersion: v1 +kind: Service +metadata: + name: ollama-service + namespace: ollama-inference +spec: + # TYPE: LoadBalancer requests a public/external IP from your cloud provider. + # Use 'ClusterIP' instead if you only want to access this from inside the cluster. + type: LoadBalancer + # SERVICE SELECTOR: This tells the Service, "Find any pod with the label 'app: ollama' + # in this namespace and send traffic to it." + selector: + app: ollama + ports: + - name: http + port: 11434 # The port you hit on the LoadBalancer IP. + targetPort: 11434 # The port the Ollama application is listening on inside the pod. +``` + +This defines a **Deployment** running one instance of the ollama/ollama:0.18.3 container image, exposing the server on port **11434**, and requesting 1 GPU (nvidia.com/gpu: 1) so it runs on your GPU node. 
A LoadBalancer Service on port 11434 forwards requests to the pod; on Azure Stack HCI, if no external load balancer is available, you can use port-forwarding to access the service. Apply the manifest to start the Ollama server: + +```powershell +kubectl apply -f ollama-deployment.yaml # apply deployment yaml +kubectl get pods -l app=ollama -n ollama-inference -w # watch pod status +``` + +Wait until the ollama pod reports Running, then press Ctrl+C to stop watching. + +### Loading a Model and Testing Inference + +Once the server is running, load a test model and send an inference API request. The example below uses a small (~2.2 GB) model called “phi3”. Run the following to pull the model weights inside the running Ollama pod: + +```powershell +$podName = kubectl get pods -n ollama-inference -l app=ollama -o jsonpath='{.items[0].metadata.name}' +kubectl exec -it $podName -n ollama-inference -- ollama pull phi3 +``` + +After the ollama pull command prints “success,” the model is ready. Now issue a test generate request to the server’s HTTP API (port 11434). For example, using PowerShell: + +```powershell +# Set up port forwarding if your client machine and AKS Arc cluster are not on the same network. +# Run this in a separate terminal: the command blocks for as long as the forward is active. +kubectl port-forward svc/ollama-service -n ollama-inference 11434 + +# Use localhost with port-forward (if using external IP, replace URI accordingly): +# To use the OpenAI-compatible interface, switch the URI to "http://localhost:11434/v1/chat/completions" +# and send a "messages" array in the body instead of "prompt". +Invoke-RestMethod -Method Post -Uri "http://localhost:11434/api/generate" ` + -ContentType "application/json" ` + -Body '{"model": "phi3", "prompt": "What is Azure Kubernetes Service (AKS) Arc?", "stream": false}' + +# Example output: +model : phi3 +created_at : 2026-03-26T16:41:37.52900295Z +response : Azure Kubernetes Service (AKS) Arc, also known as AKS Arc managed cluster or simply "Arc" in some + discussions within the Microsoft community and among early access program participants for preview + features of upcoming services. ....... 
+done : True +done_reason : stop +context : {32010, 29871, 13, 5618...} +total_duration : 6388662955 +load_duration : 181600130 +prompt_eval_count : 21 +prompt_eval_duration : 34590990 +eval_count : 328 +eval_duration : 5894037253 +``` + +### Clean Up + +When finished, remove the Ollama resources to free up the GPU. + +```powershell +# REMOVE ALL RESOURCES: +# Deletes the 'ollama-inference' namespace and everything inside it. +# This includes the Deployment (Ollama pod), the Service (LoadBalancer/IP), +# and any local configurations. This is the "factory reset" for this app. +kubectl delete namespace ollama-inference + +# remove node labels if added +$nodeName = (kubectl get nodes -l app=ollama -o jsonpath='{.items[0].metadata.name}') +kubectl label node $nodeName app- +$nodeName = (kubectl get nodes -l accelerator=nvidia-gpu -o jsonpath='{.items[0].metadata.name}') +kubectl label node $nodeName accelerator- +``` + +## AI Inference with vLLM on Azure Arc (Generative LLM) + +**Scenario:** Serve a local large language model using the vLLM inference engine on an Arc-enabled AKS cluster. vLLM is a high-performance LLM serving engine that uses an optimized memory management algorithm (PagedAttention) to support efficient text generation with large models. Here we deploy a sample Mistral 7B model (quantized ~4 GB) on Arc using vLLM’s OpenAI-like API, then query it with a prompt to verify the response. + +### Deploying the vLLM Model Server + +After connecting to your Arc-enabled cluster (see Prerequisites), confirm the cluster’s GPU node is ready and run the NVIDIA GPU Operator if not already installed (to provide the device plugin). + +```powershell +# 1. FIND THE GPU NODE NAME +# This script performs a deep search of your cluster's hardware resources. 
+kubectl get nodes -o json | ConvertFrom-Json | ForEach-Object { $_.items } | Where-Object { $_.status.allocatable.'nvidia.com/gpu' -gt 0 } | Select-Object -ExpandProperty metadata | Select-Object -ExpandProperty name + +# 2. APPLY THE APPLICATION LABEL +# Replace <gpu-node-name> with a node name printed by step 1. +kubectl label node <gpu-node-name> app=vllm-mistral + +# 3. APPLY THE HARDWARE LABEL FOR SCHEDULING AFFINITY +kubectl label node <gpu-node-name> accelerator=nvidia-gpu +``` + +Next, prepare a Kubernetes manifest (e.g. vllm-deploy.yaml) to run the vLLM server and expose it: + +```yaml +# 1. NAMESPACE: Creates a dedicated "room" for vLLM resources. +apiVersion: v1 +kind: Namespace +metadata: + name: vllm-inference +--- +# 2. DEPLOYMENT: Manages the vLLM inference engine pod. +apiVersion: apps/v1 +kind: Deployment +metadata: + name: vllm-mistral + namespace: vllm-inference # Scopes this deployment to the vllm-inference namespace +spec: + replicas: 1 + # SELECTOR: Connects the Deployment controller to the specific Pods it manages. + selector: + matchLabels: + app: vllm-mistral + template: + metadata: + # POD LABELS: Provides identity for the Pod. + # The Service uses 'app: vllm-mistral' to route incoming API requests here. + labels: + app: vllm-mistral + spec: + # NODE SELECTOR: Hardware targeting. + # Forces the Pod to land on a node you have labeled 'accelerator: nvidia-gpu'. + nodeSelector: + accelerator: nvidia-gpu + containers: + - name: vllm-container + image: vllm/vllm-openai:v0.18.0 + command: ["python3", "-m", "vllm.entrypoints.openai.api_server"] + args: ["--model", "TheBloke/Mistral-7B-v0.1-AWQ", + "--quantization", "awq", "--dtype", "float16", + "--host", "0.0.0.0", "--port", "8000", + "--max-model-len", "4096", "--gpu-memory-utilization", "0.80", + "--enforce-eager"] + ports: + - name: http + containerPort: 8000 + resources: + # RESOURCE LIMITS: Ensures 1 physical GPU is reserved for this pod. 
+ limits: + nvidia.com/gpu: 1 + volumeMounts: + - name: shm + mountPath: /dev/shm + volumes: + - name: shm + # SHM (Shared Memory): Required by vLLM/PyTorch for fast data exchange + # between GPU and CPU. 'Memory' medium uses RAM instead of disk. + emptyDir: + medium: Memory + sizeLimit: "2Gi" +--- +# 3. SERVICE: Provides a stable entry point for the vLLM API. +apiVersion: v1 +kind: Service +metadata: + name: vllm-service + namespace: vllm-inference +spec: + # TYPE: LoadBalancer requests an external IP from your provider. + type: LoadBalancer + # SERVICE SELECTOR: Routes traffic to any pod carrying the 'app: vllm-mistral' label. + selector: + app: vllm-mistral + ports: + - name: http + protocol: TCP + port: 80 # The port you access externally (e.g., http://EXTERNAL_IP:80) + targetPort: 8000 # The port the vLLM container is actually listening on +``` + +This Deployment launches one vllm/vllm-openai:v0.18.0 container that runs vLLM’s OpenAI-compatible API server for the Mistral-7B model (TheBloke/Mistral-7B-v0.1-AWQ from Hugging Face). The container is configured with a 4096 token context, uses 80% of GPU memory (--gpu-memory-utilization 0.80), and employs AWQ 4-bit quantized weights (to fit in a ~16 GB GPU). It requests 1 GPU, and mounts a 2 GiB emptyDir at /dev/shm for fast memory access. A Service vllm-service is used to forward port 80 to the container’s port 8000 (the API) as a LoadBalancer. + +Apply the manifest to start the vLLM server: + +```powershell +kubectl apply -f vllm-deploy.yaml # apply deployment yaml +kubectl get pods -l app=vllm-mistral -n vllm-inference -w # wait for vllm-mistral pod to run +``` + +Kubernetes will pull the container image and start the server. Wait for the vllm-mistral pod to reach Running. Once running, if no external IP address is assigned to vllm-service, open a terminal and port-forward it (e.g. `kubectl port-forward svc/vllm-service -n vllm-inference 8080:80`) to access the API at `http://localhost:8080`. 
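
A pod in the Running state has only started its container; vLLM may still be downloading and loading the model. One way to surface that difference is a readiness probe against the server's health endpoint. The fragment below is a sketch rather than part of the original manifest: it assumes the `/health` endpoint on container port 8000 used elsewhere in this post, and the timing values are illustrative guesses that should be tuned to your model size and network speed.

```yaml
# Sketch: merge under spec.template.spec.containers[0] of the vllm-mistral Deployment.
# Timing values are assumptions; adjust them to how long the model takes to download and load.
readinessProbe:
  httpGet:
    path: /health        # vLLM OpenAI-compatible server health endpoint
    port: 8000           # matches containerPort in the Deployment
  initialDelaySeconds: 30  # give the container time to start before the first check
  periodSeconds: 10        # re-check every 10 seconds until the server answers
```

With a probe like this in place, the pod shows READY 0/1 until the API answers, the Service only routes to it once it is ready, and a command such as `kubectl wait --for=condition=Ready pod -l app=vllm-mistral -n vllm-inference --timeout=15m` can replace manual watching.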
+ +### Testing the LLM Endpoint + +With the vLLM server ready, send a test completion request to verify the deployed model. Using PowerShell’s Invoke-RestMethod, call the /v1/completions endpoint with a JSON body specifying the model and a prompt: + +```powershell +# Uses localhost via port-forward; if the Service has an external LoadBalancer IP, replace localhost:8080 with that IP + Invoke-RestMethod -Method Post -Uri "http://localhost:8080/v1/completions" ` + -ContentType "application/json" ` + -Body '{"model": "TheBloke/Mistral-7B-v0.1-AWQ", "prompt": "What is Azure Kubernetes Service (AKS) Arc", "max_tokens": 100}' | + Select-Object -ExpandProperty choices | Select-Object -ExpandProperty text + +# Example output: +Azure Kubernetes Service (AKS) Arc is a managed service provided by Microsoft that allows you to manage your Kubernetes deployment and monitor metrics across multiple clusters using Azure Portal. +``` + +This OpenAI-style API call asks the model (Mistral-7B) to complete the prompt “What is Azure Kubernetes Service (AKS) Arc” with up to 100 tokens. The server should return a JSON response with a "choices" array containing the model’s generated text (e.g., a sentence describing what AKS Arc is). You can also check the health endpoint (GET /health) for an OK status to confirm that the service is up. + +### Clean Up vLLM + +```powershell +# When finished, remove the vLLM resources to free up the GPU. +kubectl delete namespace vllm-inference + +# Remove node labels if added +$nodeName = (kubectl get nodes -l app=vllm-mistral -o jsonpath='{.items[0].metadata.name}') +kubectl label node $nodeName app- +$nodeName = (kubectl get nodes -l accelerator=nvidia-gpu -o jsonpath='{.items[0].metadata.name}') +kubectl label node $nodeName accelerator- +``` + +Deleting the namespace removes the vllm-mistral Deployment (stopping the pod) and the Service. If no more GPU inference is needed, you may also remove the GPU Operator (`helm uninstall <release-name>`) to reclaim cluster resources. 
diff --git a/website/blog/authors.yml b/website/blog/authors.yml index fd0f6b60b..05b4e756c 100644 --- a/website/blog/authors.yml +++ b/website/blog/authors.yml @@ -473,4 +473,14 @@ jaiveer-katariya: image_url: https://github.com/jaiveerk.png page: true socials: - github: jaiveerk \ No newline at end of file + github: jaiveerk + +datta-rajpure: + name: Datta Rajpure + title: Principal Group Eng Manager at Microsoft Azure Core + url: https://www.linkedin.com/in/dattarajpure/ + image_url: https://github.com/drajpure.png + page: true + socials: + linkedin: dattarajpure + github: drajpure diff --git a/website/blog/tags.yml b/website/blog/tags.yml index c1ad44fde..19aa25bcc 100644 --- a/website/blog/tags.yml +++ b/website/blog/tags.yml @@ -22,11 +22,21 @@ ai: permalink: /ai description: Artificial intelligence workloads, patterns, model deployment, and orchestration on AKS. +ai-inference: + label: AI Inference + permalink: /ai-inference + description: AI inference is the process of using a trained AI model to make predictions, generate content, or make decisions on new, unseen data. + airflow: label: Airflow permalink: /airflow description: Using Apache Airflow for orchestrating data and machine learning workflows on AKS. +aks-arc: + label: AKS Arc + permalink: /aks-arc + description: Azure Kubernetes Service enabled by Azure Arc. + aks-automatic: label: AKS Automatic permalink: /aks-automatic @@ -336,4 +346,4 @@ workload-identity: label: Workload Identity permalink: /workload-identity description: Using Microsoft Entra Workload Identity for secure pod identity and access management in AKS. -# NOTE: If you remove or rename a tag key, search the repo for usages first to avoid broken references. +# NOTE: If you remove or rename a tag key, search the repo for usages first to avoid broken references. \ No newline at end of file