diff --git a/plugins/dotnet-ai/skills/local-llm-inference/SKILL.md b/plugins/dotnet-ai/skills/local-llm-inference/SKILL.md
new file mode 100644
index 0000000000..489e848939
--- /dev/null
+++ b/plugins/dotnet-ai/skills/local-llm-inference/SKILL.md
@@ -0,0 +1,151 @@
+---
+name: local-llm-inference
+description: >
+  USE FOR: Running LLMs locally in .NET without cloud API calls, privacy-sensitive or air-gapped
+  scenarios, reducing inference costs, offline development, on-device inference with Ollama or
+  Foundry Local.
+  DO NOT USE FOR: Cloud-based LLM calls (use meai-chat-integration), ONNX model inference for
+  non-LLM models (use onnx-runtime-inference), classical ML tasks (use mlnet).
+---
+
+# Local LLM Inference
+
+Run large language models entirely on-device in .NET applications — no cloud dependencies, no API
+keys, no network calls after model download. This skill covers two approaches that both wire through
+Microsoft.Extensions.AI `IChatClient` for provider-agnostic application code.
+
+## When to Use
+
+- Running LLMs locally without cloud API calls
+- Privacy-sensitive or air-gapped environments
+- Reducing inference costs by avoiding per-token billing
+- Offline development and testing with real models
+
+## When Not to Use
+
+- Cloud-based LLM calls → use **meai-chat-integration**
+- ONNX model inference for non-LLM models → use **onnx-runtime-inference**
+- ML.NET classical/traditional ML → use **mlnet**
+
+## Decision Tree
+
+| Criteria | Ollama | Foundry Local |
+|---|---|---|
+| Hosting | Separate server process | On-device service with OpenAI-compatible API |
+| Model management | `ollama pull` | `foundry model run` with curated catalog |
+| Hardware optimization | Manual (select model variant) | Automatic (selects optimal execution provider for CPU/GPU/NPU) |
+| Model switching | Hot-switch via API | Load/unload via SDK or CLI |
+| Platforms | Windows, macOS, Linux | Windows, macOS |
+
+- **IF** using Ollama → follow the **Ollama** workflow below
+- **IF** using Foundry Local → follow the **Foundry Local** workflow below
+
+Both approaches produce an `IChatClient` — application code is identical regardless of provider.
+
+## Workflow — Ollama
+
+### Step 1: Install Ollama and Pull a Model
+
+Install Ollama from https://ollama.ai, then pull a model:
+
+```bash
+ollama pull phi3:mini
+```
+
+### Step 2: Install OllamaSharp NuGet Package
+
+```
+dotnet add package OllamaSharp
+```
+
+### Step 3: Register as IChatClient
+
+```csharp
+builder.Services.AddChatClient(
+    new OllamaApiClient("http://localhost:11434", "phi3:mini")
+        .AsIChatClient());
+```
+
+### Step 4: Use Through IChatClient
+
+The `IChatClient` API is identical to cloud providers — swap the backing implementation without
+changing application code.
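+
+A minimal sketch of provider-agnostic application code, assuming the Microsoft.Extensions.AI package
+is referenced. `SummaryService` and its system prompt are hypothetical names used only for
+illustration; the class depends on `IChatClient` alone and never references the provider package.
+
+```csharp
+using System.Collections.Generic;
+using System.Threading;
+using System.Threading.Tasks;
+using Microsoft.Extensions.AI;
+
+// Hypothetical service: receives whichever IChatClient was registered in DI.
+public sealed class SummaryService(IChatClient chatClient)
+{
+    public async Task<string> SummarizeAsync(string text, CancellationToken ct = default)
+    {
+        // Build a short conversation: one system instruction plus the user's text.
+        var messages = new List<ChatMessage>
+        {
+            new(ChatRole.System, "Summarize the user's text in one sentence."),
+            new(ChatRole.User, text),
+        };
+
+        // Send the conversation to the registered provider and return the reply text.
+        ChatResponse response = await chatClient.GetResponseAsync(messages, cancellationToken: ct);
+        return response.Text;
+    }
+}
+```
+
+Because the service only sees the abstraction, changing the registered client should not require
+changes to this class.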
+
+## Workflow — Foundry Local
+
+### Step 1: Install Foundry Local CLI
+
+```bash
+# Windows
+winget install Microsoft.FoundryLocal
+
+# macOS
+brew tap microsoft/foundrylocal
+brew install foundrylocal
+```
+
+Verify: `foundry --version`
+
+### Step 2: Install NuGet Packages
+
+```
+dotnet add package Microsoft.AI.Foundry.Local
+dotnet add package OpenAI
+```
+
+### Step 3: Initialize Foundry Local and Load a Model
+
+```csharp
+using Microsoft.AI.Foundry.Local;
+using OpenAI;
+
+await FoundryLocalManager.CreateAsync(config, logger); // config and logger are assumed to be defined earlier
+var mgr = FoundryLocalManager.Instance;
+var catalog = await mgr.GetCatalogAsync();
+var model = await catalog.GetModelAsync("phi-4-mini")
+    ?? throw new Exception("Model not found");
+await model.DownloadAsync();
+await model.LoadAsync();
+await mgr.StartWebServiceAsync();
+```
+
+> Browse available models: `foundry model list`
+
+### Step 4: Wire to IChatClient via OpenAI SDK
+
+Foundry Local exposes an OpenAI-compatible endpoint. Use the `OpenAI` NuGet package and wrap the client as an `IChatClient`:
+
+```csharp
+var client = new OpenAIClient(
+    new System.ClientModel.ApiKeyCredential("notneeded"),
+    new OpenAIClientOptions { Endpoint = new Uri(endpoint + "/v1") }); // endpoint: base URL of the local web service, assumed to be defined
+
+builder.Services.AddChatClient(
+    client.GetChatClient(model.Id).AsIChatClient());
+```
+
+### Step 5: Use Through IChatClient
+
+Application code is identical to the Ollama path and cloud providers.
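+
+As an illustration of code that stays the same across providers, here is a hedged sketch that streams
+a reply token by token. `StreamingDemo` and `StreamToConsoleAsync` are hypothetical names; the
+`GetStreamingResponseAsync` extension and `ChatResponseUpdate.Text` are assumed to come from the
+Microsoft.Extensions.AI package.
+
+```csharp
+using System;
+using System.Threading;
+using System.Threading.Tasks;
+using Microsoft.Extensions.AI;
+
+public static class StreamingDemo
+{
+    // Writes the model's reply to the console as updates arrive.
+    public static async Task StreamToConsoleAsync(
+        IChatClient chatClient, string prompt, CancellationToken ct = default)
+    {
+        await foreach (ChatResponseUpdate update in
+            chatClient.GetStreamingResponseAsync(prompt, cancellationToken: ct))
+        {
+            // Each update carries a fragment of the response text.
+            Console.Write(update.Text);
+        }
+
+        Console.WriteLine();
+    }
+}
+```
+
+Streaming tends to matter more for local models, where a full response can take noticeably longer to
+complete on modest hardware.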
+
+## Validation
+
+- [ ] Model loads without errors
+- [ ] Model generates coherent text responses
+- [ ] No outbound network calls are made during inference (after model download)
+- [ ] Response quality is acceptable for the target use case
+- [ ] Memory usage stays within hardware limits
+
+## Common Pitfalls
+
+| Pitfall | Guidance |
+|---|---|
+| Model too large for available RAM | 7B Q4 needs ~4 GB, 13B needs ~8 GB — size accordingly |
+| Expecting cloud-model quality from small local models | Local models are capable but not equivalent to GPT-4-class models for complex tasks |
+| Using full-precision models | Always use quantized models — full precision needs ~4× more memory |
+| Not checking hardware compatibility | Foundry Local auto-selects execution providers; for Ollama, verify GPU support manually |
+
+## More Info
+
+- [Foundry Local](https://learn.microsoft.com/azure/foundry-local/get-started)
+- [Ollama](https://ollama.ai)
diff --git a/plugins/dotnet-ai/skills/onnx-runtime-inference/SKILL.md b/plugins/dotnet-ai/skills/onnx-runtime-inference/SKILL.md
new file mode 100644
index 0000000000..0bf4c76db1
--- /dev/null
+++ b/plugins/dotnet-ai/skills/onnx-runtime-inference/SKILL.md
@@ -0,0 +1,128 @@
+---
+name: onnx-runtime-inference
+description: >
+  USE FOR: Loading and running pre-trained ONNX models, hardware-accelerated inference (CPU, GPU, DirectML),
+  custom model deployment, HuggingFace model inference, combining ONNX scoring with ML.NET pipelines.
+  DO NOT USE FOR: Training models from scratch (use mlnet for classical ML and supported deep learning tasks; TorchSharp for custom architectures),
+  LLM text generation (use meai-chat-integration or local-llm-inference),
+  classical ML without a pre-trained model (use mlnet).
+---
+
+# ONNX Runtime Inference in .NET
+
+This skill guides running pre-trained ONNX model inference in .NET.
+
+## Choose Your Approach
+
+Decide which approach fits your scenario:
+
+- **Approach A: Standalone ONNX Runtime** — Use when you need maximum control over tensors and execution providers.
+- **Approach B: ML.NET + ONNX Integration** — Use when you need to combine ONNX with ML.NET data transforms and pipelines.
+
+---
+
+## Approach A: Standalone ONNX Runtime
+
+### Step A1: Install Packages
+
+- `Microsoft.ML.OnnxRuntime` — CPU inference
+- IF CUDA GPU → `Microsoft.ML.OnnxRuntime.Gpu` instead
+- IF Windows GPU → `Microsoft.ML.OnnxRuntime.DirectML` instead
+
+### Step A2: Create InferenceSession
+
+> **SINGLETON** — Create once, reuse. Do not create per request.
+
+```csharp
+using Microsoft.ML.OnnxRuntime;
+using Microsoft.ML.OnnxRuntime.Tensors;
+
+var session = new InferenceSession("model.onnx");
+// Register as singleton in DI
+```
+
+### Step A3: Prepare Input Tensors
+
+- Create `DenseTensor<T>` with the correct shape.
+- Wrap in `NamedOnnxValue.CreateFromTensor("input_name", tensor)`.
+- For detailed tensor operations, read `references/tensors.md`.
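+
+The two bullets above can be sketched as a small helper. This is an illustration under stated
+assumptions, not the only way to do it: `InputBuilder` and `CreateImageInput` are hypothetical names,
+and the `[1, 3, 224, 224]` layout is assumed for a typical image model with a single float input.
+
+```csharp
+using System;
+using System.Linq;
+using Microsoft.ML.OnnxRuntime;
+using Microsoft.ML.OnnxRuntime.Tensors;
+
+public static class InputBuilder
+{
+    // Wraps a flat pixel buffer as the model's single named input.
+    public static NamedOnnxValue CreateImageInput(InferenceSession session, float[] pixels)
+    {
+        // Read the input name from the model instead of hard-coding it.
+        string inputName = session.InputMetadata.Keys.First();
+
+        // Assumed layout: [batch, channels, height, width].
+        int[] shape = { 1, 3, 224, 224 };
+        int expectedLength = shape.Aggregate(1, (product, dim) => product * dim);
+        if (pixels.Length != expectedLength)
+            throw new ArgumentException($"Expected {expectedLength} values, got {pixels.Length}.");
+
+        // DenseTensor wraps the existing array; no copy is made here.
+        var tensor = new DenseTensor<float>(pixels, shape);
+        return NamedOnnxValue.CreateFromTensor(inputName, tensor);
+    }
+}
+```
+
+Checking the buffer length up front gives a clearer error than a shape mismatch reported later by
+the runtime.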
+
+### Step A4: Run Inference
+
+```csharp
+var inputs = new List<NamedOnnxValue> { NamedOnnxValue.CreateFromTensor("input", inputTensor) };
+using var results = session.Run(inputs);
+var output = results.First().AsTensor<float>();
+```
+
+### Step A5: Configure Execution Providers
+
+Only if you need GPU or specialized hardware acceleration:
+
+```csharp
+var options = new SessionOptions();
+// IF CUDA: options.AppendExecutionProvider_CUDA();
+// IF DirectML: options.AppendExecutionProvider_DML();
+// IF TensorRT: options.AppendExecutionProvider_Tensorrt(new());
+var session = new InferenceSession("model.onnx", options);
+```
+
+---
+
+## Approach B: ML.NET + ONNX Integration
+
+### Step B1: Install Packages
+
+- `Microsoft.ML`
+- `Microsoft.ML.OnnxRuntime`
+- `Microsoft.ML.OnnxTransformer`
+
+### Step B2: Build Pipeline with ONNX Scoring
+
+```csharp
+var mlContext = new MLContext(seed: 0);
+var pipeline = mlContext.Transforms.ApplyOnnxModel(
+    modelFile: "model.onnx",
+    outputColumnNames: new[] { "output" },
+    inputColumnNames: new[] { "input" });
+var model = pipeline.Fit(emptyData);
+```
+
+### Step B3: Combine with ML.NET Transforms
+
+Add tokenization, normalization, or post-processing transforms before/after ONNX scoring in the pipeline.
+
+### Text Preprocessing for ONNX Models
+
+When running NLP ONNX models (BERT, DistilBERT, MiniLM), use `Microsoft.ML.Tokenizers` to convert text to token IDs before inference:
+
+```
+dotnet add package Microsoft.ML.Tokenizers
+```
+
+```csharp
+var tokenizer = BertTokenizer.Create("vocab.txt");
+IReadOnlyList<int> ids = tokenizer.EncodeToIds(text);
+// Feed ids into input tensor (see references/tensors.md)
+```
+
+---
+
+## Validation
+
+- Model loads without errors.
+- Inference produces output tensors with expected shape.
+- Execution provider correctly selected.
+- Singleton pattern used for `InferenceSession`.
+
+## Pitfalls
+
+- **Creating InferenceSession per request** — Expensive; use singleton.
+- **Wrong input tensor shape** — Check model's expected input with `session.InputMetadata`.
+- **Not matching preprocessing to what the model expects** — Verify normalization, channel order, etc.
+- **Missing execution provider packages** — Falls back to CPU silently.
+- **Not disposing `IDisposableReadOnlyCollection<DisposableNamedOnnxValue>` from `session.Run`** — Always use `using`.
+
+## More Info
+
+- https://onnxruntime.ai/docs/get-started/with-csharp.html
diff --git a/plugins/dotnet-ai/skills/onnx-runtime-inference/references/tensors.md b/plugins/dotnet-ai/skills/onnx-runtime-inference/references/tensors.md
new file mode 100644
index 0000000000..9f6a5289d3
--- /dev/null
+++ b/plugins/dotnet-ai/skills/onnx-runtime-inference/references/tensors.md
@@ -0,0 +1,82 @@
+# System.Numerics.Tensors for ONNX Input/Output
+
+Tensor types needed when working with ONNX Runtime's standalone API (Approach A). Use this reference when creating and manipulating input/output tensors.
+
+## Key Types
+
+### DenseTensor\<T> (Microsoft.ML.OnnxRuntime.Tensors)
+
+ONNX Runtime's tensor type for creating inputs.
+
+```csharp
+// Create from data and dimensions
+var data = new float[] { 1f, 2f, 3f, 4f };
+var dimensions = new[] { 1, 4 };
+var tensor = new DenseTensor<float>(data, dimensions);
+```
+
+Common shapes:
+- `[1, 3, 224, 224]` — images (batch, channels, height, width)
+- `[1, sequenceLength]` — text token IDs
+
+### Tensor\<T> (System.Numerics.Tensors, .NET 10+)
+
+.NET's native tensor type. Heap-allocated N-dimensional array with rich slicing and reshape operations.
+
+```csharp
+var tensor = Tensor.Create(data, shape);
+```
+
+### TensorSpan\<T> (System.Numerics.Tensors)
+
+Mutable zero-copy view for in-place operations on tensor data. `ref struct` — cannot be stored in fields.
+
+### TensorPrimitives (System.Numerics.Tensors)
+
+SIMD-accelerated math for pre/post-processing:
+
+```csharp
+TensorPrimitives.SoftMax(input, output);          // Post-process logits
+TensorPrimitives.CosineSimilarity(a, b);          // Similarity
+TensorPrimitives.Multiply(a, scalar, result);     // Scaling
+```
+
+## Common Patterns
+
+### Normalize Image
+
+Scale pixel values to `[0, 1]`, subtract mean, divide by std:
+
+```csharp
+var pixels = new float[1 * 3 * 224 * 224];
+// Load and scale pixels to [0, 1]
+// Subtract channel means: [0.485, 0.456, 0.406]
+// Divide by channel stds: [0.229, 0.224, 0.225]
+var input = new DenseTensor<float>(pixels, new[] { 1, 3, 224, 224 });
+```
+
+### Post-process Classification
+
+Apply softmax to logits and find argmax:
+
+```csharp
+var logits = output.ToArray();
+var probabilities = new float[logits.Length];
+TensorPrimitives.SoftMax(logits, probabilities);
+int predictedClass = TensorPrimitives.IndexOfMax(probabilities);
+```
+
+### Extract Embeddings
+
+Take mean or CLS pooling of model output:
+
+```csharp
+// CLS pooling: use the first token's embedding
+var cls = outputTensor.Buffer.Span.Slice(0, hiddenSize).ToArray();
+
+// Mean pooling: average across the sequence dimension
+var embedding = new float[hiddenSize];
+for (int i = 0; i < sequenceLength; i++)
+    TensorPrimitives.Add(embedding, tokenEmbeddings.AsSpan(i * hiddenSize, hiddenSize), embedding);
+TensorPrimitives.Divide(embedding, sequenceLength, embedding);
+```
diff --git a/tests/dotnet-ai/local-llm-inference/eval.yaml b/tests/dotnet-ai/local-llm-inference/eval.yaml
new file mode 100644
index 0000000000..7256a4efae
--- /dev/null
+++ b/tests/dotnet-ai/local-llm-inference/eval.yaml
@@ -0,0 +1,56 @@
+scenarios:
+  - name: "Set up Ollama with MEAI"
+    prompt: "Set up local LLM inference in this .NET 10 app using Ollama with the phi3:mini model. Wire it through Microsoft.Extensions.AI so the rest of the app uses IChatClient."
+    setup:
+      files:
+        - path: "LocalAI/LocalAI.csproj"
+          content: |
+            <Project Sdk="Microsoft.NET.Sdk">
+              <PropertyGroup>
+                <OutputType>Exe</OutputType>
+                <TargetFramework>net10.0</TargetFramework>
+              </PropertyGroup>
+            </Project>
+        - path: "LocalAI/Program.cs"
+          content: |
+            Console.WriteLine("TODO: Local LLM");
+    assertions:
+      - type: "output_contains"
+        value: "Ollama"
+      - type: "output_contains"
+        value: "IChatClient"
+      - type: "exit_success"
+    rubric:
+      - "Uses OllamaSharp package for Ollama integration"
+      - "Wires through IChatClient via MEAI abstraction"
+      - "Does not require cloud API keys"
+      - "Specifies model name (phi3:mini or similar)"
+    timeout: 360
+
+  - name: "Set up Foundry Local with MEAI"
+    prompt: "Set up local LLM inference in this .NET 10 app using Foundry Local with the phi-4-mini model. Wire it through Microsoft.Extensions.AI so the rest of the app uses IChatClient."
+    setup:
+      files:
+        - path: "LocalAI/LocalAI.csproj"
+          content: |
+            <Project Sdk="Microsoft.NET.Sdk">
+              <PropertyGroup>
+                <OutputType>Exe</OutputType>
+                <TargetFramework>net10.0</TargetFramework>
+              </PropertyGroup>
+            </Project>
+        - path: "LocalAI/Program.cs"
+          content: |
+            Console.WriteLine("TODO: Local LLM with Foundry Local");
+    assertions:
+      - type: "output_contains"
+        value: "FoundryLocalManager"
+      - type: "output_contains"
+        value: "IChatClient"
+      - type: "exit_success"
+    rubric:
+      - "Uses Microsoft.AI.Foundry.Local package"
+      - "Uses OpenAI package for OpenAI-compatible API integration"
+      - "Wires through IChatClient via MEAI abstraction"
+      - "Does not require cloud API keys"
+    timeout: 360
diff --git a/tests/dotnet-ai/onnx-runtime-inference/eval.yaml b/tests/dotnet-ai/onnx-runtime-inference/eval.yaml
new file mode 100644
index 0000000000..e80027762d
--- /dev/null
+++ b/tests/dotnet-ai/onnx-runtime-inference/eval.yaml
@@ -0,0 +1,55 @@
+scenarios:
+  - name: "Standalone ONNX model inference"
+    prompt: "Load an ONNX image classification model and run inference on an input image in this .NET 10 console app. Use the standalone ONNX Runtime API."
+    setup:
+      files:
+        - path: "OnnxApp/OnnxApp.csproj"
+          content: |
+            <Project Sdk="Microsoft.NET.Sdk">
+              <PropertyGroup>
+                <OutputType>Exe</OutputType>
+                <TargetFramework>net10.0</TargetFramework>
+              </PropertyGroup>
+            </Project>
+        - path: "OnnxApp/Program.cs"
+          content: |
+            Console.WriteLine("TODO: Run ONNX inference");
+    assertions:
+      - type: "output_contains"
+        value: "InferenceSession"
+      - type: "output_contains"
+        value: "Microsoft.ML.OnnxRuntime"
+      - type: "exit_success"
+    rubric:
+      - "Uses Microsoft.ML.OnnxRuntime (standalone, not ML.NET integration)"
+      - "Creates InferenceSession as a reusable singleton"
+      - "Prepares input tensors with correct shape"
+      - "Extracts and processes output tensors"
+    timeout: 360
+
+  - name: "ML.NET with ONNX integration"
+    prompt: "I have a text classification ONNX model. Integrate it into an ML.NET pipeline that tokenizes input text, scores with the ONNX model, and returns the predicted class."
+    setup:
+      files:
+        - path: "TextClassifier/TextClassifier.csproj"
+          content: |
+            <Project Sdk="Microsoft.NET.Sdk">
+              <PropertyGroup>
+                <OutputType>Exe</OutputType>
+                <TargetFramework>net10.0</TargetFramework>
+              </PropertyGroup>
+            </Project>
+        - path: "TextClassifier/Program.cs"
+          content: |
+            Console.WriteLine("TODO: ML.NET + ONNX text classification");
+    assertions:
+      - type: "output_contains"
+        value: "MLContext"
+      - type: "output_contains"
+        value: "Onnx"
+      - type: "exit_success"
+    rubric:
+      - "Uses ML.NET's ONNX integration (ApplyOnnxModel or OnnxScoringEstimator)"
+      - "Combines ML.NET transforms with ONNX scoring in a pipeline"
+      - "Creates MLContext with seed for reproducibility"
+    timeout: 360