dotnet · luisquintanilla · Mar 5, 2026 · Mar 6, 2026
@@ -0,0 +1,151 @@
+---
+name: local-llm-inference
+description: >
+  USE FOR: Running LLMs locally in .NET without cloud API calls, privacy-sensitive or air-gapped
+  scenarios, reducing inference costs, offline development, on-device inference with Ollama or
+  Foundry Local.
+  DO NOT USE FOR: Cloud-based LLM calls (use meai-chat-integration), ONNX model inference for
+  non-LLM models (use onnx-runtime-inference), classical ML tasks (use mlnet).
+---
+
+# Local LLM Inference
+
+Run large language models entirely on-device in .NET applications — no cloud dependencies, no API
+keys, no network calls after model download. This skill covers two approaches that both wire through
+Microsoft.Extensions.AI `IChatClient` for provider-agnostic application code.
+
+## When to Use
+
+- Running LLMs locally without cloud API calls
+- Privacy-sensitive or air-gapped environments
+- Reducing inference costs by avoiding per-token billing
+- Offline development and testing with real models
+
+## When Not to Use
+
+- Cloud-based LLM calls → use **meai-chat-integration**
+- ONNX model inference for non-LLM models → use **onnx-runtime-inference**
+- ML.NET classical/traditional ML → use **mlnet**
+
+## Decision Tree
+
+| Criteria | Ollama | Foundry Local |
+|---|---|---|
+| Hosting | Separate server process | On-device service with OpenAI-compatible API |
+| Model management | `ollama pull` | `foundry model run` with curated catalog |
+| Hardware optimization | Manual (select model variant) | Automatic (selects optimal execution provider for CPU/GPU/NPU) |
+| Model switching | Hot-switch via API | Load/unload via SDK or CLI |
+| Platforms | Windows, macOS, Linux | Windows, macOS |
+
+- **IF** using Ollama → follow the **Ollama** workflow below
+- **IF** using Foundry Local → follow the **Foundry Local** workflow below
+
+Both approaches produce an `IChatClient` — application code is identical regardless of provider.
+
+## Workflow — Ollama
+
+### Step 1: Install Ollama and Pull a Model
+
+Install Ollama from https://ollama.ai, then pull a model:
+
+```bash
+ollama pull phi3:mini
+```
+
+### Step 2: Install OllamaSharp NuGet Package
+
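+```bash
+dotnet add package OllamaSharp
+# Provides the AddChatClient registration extension used in Step 3
+dotnet add package Microsoft.Extensions.AI
+```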
+
+### Step 3: Register as IChatClient
+
+```csharp
+using OllamaSharp;
+
+// OllamaApiClient implements IChatClient, so it can be registered directly.
+builder.Services.AddChatClient(
+    new OllamaApiClient("http://localhost:11434", "phi3:mini"));
+```
+
+### Step 4: Use Through IChatClient
+
+The `IChatClient` API is identical to cloud providers — swap the backing implementation without
+changing application code.
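+
+A minimal sketch of a consumer, assuming the registration from Step 3 is in place. The `Summarizer` class and its prompt are hypothetical names used only for illustration:
+
+```csharp
+using System.Threading;
+using System.Threading.Tasks;
+using Microsoft.Extensions.AI;
+
+// Hypothetical service: depends on IChatClient only, not on OllamaSharp.
+public sealed class Summarizer
+{
+    private readonly IChatClient _chat;
+
+    public Summarizer(IChatClient chat) => _chat = chat;
+
+    public async Task<string> SummarizeAsync(string text, CancellationToken ct = default)
+    {
+        // GetResponseAsync(string, ...) is an extension method from Microsoft.Extensions.AI.
+        ChatResponse response = await _chat.GetResponseAsync(
+            $"Summarize in one sentence: {text}", cancellationToken: ct);
+        return response.Text;
+    }
+}
+```
+
+Because the class receives `IChatClient` from DI, pointing it at a different provider should only require changing the registration.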
+
+## Workflow — Foundry Local
+
+### Step 1: Install Foundry Local CLI
+
+```bash
+# Windows
+winget install Microsoft.FoundryLocal
+
+# macOS
+brew tap microsoft/foundrylocal
+brew install foundrylocal
+```
+
+Verify: `foundry --version`
+
+### Step 2: Install NuGet Packages
+
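+```bash
+dotnet add package Microsoft.AI.Foundry.Local
+dotnet add package OpenAI
+# Provides AsIChatClient() for OpenAI SDK clients (Step 4)
+dotnet add package Microsoft.Extensions.AI.OpenAI
+# Provides the AddChatClient registration extension (Step 4)
+dotnet add package Microsoft.Extensions.AI
+```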
+
+### Step 3: Initialize Foundry Local and Load a Model
+
+```csharp
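+using Microsoft.AI.Foundry.Local;
+using OpenAI;
+
+// `config` and `logger` are not defined in this snippet; create them before this call.
+await FoundryLocalManager.CreateAsync(config, logger);
+var mgr = FoundryLocalManager.Instance;
+var catalog = await mgr.GetCatalogAsync();
+var model = await catalog.GetModelAsync("phi-4-mini")
+    ?? throw new InvalidOperationException("Model 'phi-4-mini' not found in the catalog.");
+await model.DownloadAsync();       // The only step that needs network access
+await model.LoadAsync();
+await mgr.StartWebServiceAsync();  // Starts the OpenAI-compatible endpoint used in Step 4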
+```
+
+> Browse available models: `foundry model list`
+
+### Step 4: Wire to IChatClient via OpenAI SDK
+
+Foundry Local exposes an OpenAI-compatible endpoint. Use the `OpenAI` NuGet package and wrap as `IChatClient`:
+
+```csharp
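+// `endpoint` is the base URL of the Foundry Local web service started in Step 3.
+// The OpenAI client requires a non-empty key, but its value is not used locally.
+var client = new OpenAIClient(
+    new System.ClientModel.ApiKeyCredential("notneeded"),
+    new OpenAIClientOptions { Endpoint = new Uri(endpoint + "/v1") });
+
+builder.Services.AddChatClient(
+    client.GetChatClient(model.Id).AsIChatClient());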
+```
+
+### Step 5: Use Through IChatClient
+
+Application code is identical to the Ollama path and cloud providers.
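+
+Streaming is one case worth sketching, since local models often produce tokens slowly enough that incremental output helps. The `StreamingDemo` class below is a hypothetical helper and works with any registered `IChatClient`:
+
+```csharp
+using System;
+using System.Threading.Tasks;
+using Microsoft.Extensions.AI;
+
+public static class StreamingDemo
+{
+    // Writes each partial update to the console as the model produces it.
+    public static async Task StreamAnswerAsync(IChatClient chat, string prompt)
+    {
+        await foreach (ChatResponseUpdate update in chat.GetStreamingResponseAsync(prompt))
+        {
+            Console.Write(update.Text);
+        }
+        Console.WriteLine();
+    }
+}
+```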
+
+## Validation
+
+- [ ] Model loads without errors
+- [ ] Model generates coherent text responses
+- [ ] No outbound network calls are made during inference (after model download)
+- [ ] Response quality is acceptable for the target use case
+- [ ] Memory usage stays within hardware limits
+
+## Common Pitfalls
+
+| Pitfall | Guidance |
+|---|---|
+| Model too large for available RAM | A 7B model at Q4 needs ~4 GB and a 13B model at Q4 needs ~8 GB; pick a model that fits with headroom for context |
+| Expecting cloud-model quality from small local models | Local models are capable but not equivalent to GPT-4-class models for complex tasks |
+| Using full-precision models | Prefer quantized models: FP16 weights need ~4× the memory of 4-bit weights, and FP32 weights ~8× |
+| Not checking hardware compatibility | Foundry Local auto-selects execution providers; for Ollama, verify GPU support manually |
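+
+The sizing figures in the table can be approximated with simple arithmetic: weight memory is roughly parameters × bits per weight ÷ 8. The sketch below uses a hypothetical `ModelMemory` helper and counts weights only; runtime overhead and the context cache come on top, which is why the table rounds up.
+
+```csharp
+public static class ModelMemory
+{
+    // Approximate weight size in decimal GB:
+    // (billions of parameters) * (bits per weight) / (8 bits per byte).
+    // Excludes KV cache and runtime overhead.
+    public static double WeightGigabytes(double billionsOfParams, int bitsPerWeight)
+        => billionsOfParams * bitsPerWeight / 8.0;
+}
+
+// ModelMemory.WeightGigabytes(7, 4)   -> 3.5  (7B at 4-bit)
+// ModelMemory.WeightGigabytes(13, 4)  -> 6.5  (13B at 4-bit)
+// ModelMemory.WeightGigabytes(7, 16)  -> 14   (7B at FP16, 4x the 4-bit size)
+```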
+
+## More Info
+
+- [Foundry Local](https://learn.microsoft.com/azure/foundry-local/get-started)
+- [Ollama](https://ollama.ai)
@@ -0,0 +1,128 @@
+---
+name: onnx-runtime-inference
+description: >
+  USE FOR: Loading and running pre-trained ONNX models, hardware-accelerated inference (CPU, GPU, DirectML),
+  custom model deployment, HuggingFace model inference, combining ONNX scoring with ML.NET pipelines.
+  DO NOT USE FOR: Training models from scratch (use mlnet for classical ML and supported deep learning tasks; TorchSharp for custom architectures),
+  LLM text generation (use meai-chat-integration or local-llm-inference),
+  classical ML without a pre-trained model (use mlnet).
+---
+
+# ONNX Runtime Inference in .NET
+
+This skill covers running inference with pre-trained ONNX models in .NET.
+
+## Choose Your Approach
+
+Decide which approach fits your scenario:
+
+- **Approach A: Standalone ONNX Runtime** — Use when you need maximum control over tensors and execution providers.
+- **Approach B: ML.NET + ONNX Integration** — Use when you need to combine ONNX with ML.NET data transforms and pipelines.
+
+---
+
+## Approach A: Standalone ONNX Runtime
+
+### Step A1: Install Packages
+
+- `Microsoft.ML.OnnxRuntime` — CPU inference
+- IF CUDA GPU → `Microsoft.ML.OnnxRuntime.Gpu` instead
+- IF Windows GPU → `Microsoft.ML.OnnxRuntime.DirectML` instead
+
+### Step A2: Create InferenceSession
+
+> **SINGLETON** — Create once, reuse. Do not create per request.
+
+```csharp
+using Microsoft.ML.OnnxRuntime;
+using Microsoft.ML.OnnxRuntime.Tensors;
+
+var session = new InferenceSession("model.onnx");
+// Register as singleton in DI
+```
+
+### Step A3: Prepare Input Tensors
+
+- Create `DenseTensor<float>` with the correct shape.
+- Wrap in `NamedOnnxValue.CreateFromTensor("input_name", tensor)`.
+- For detailed tensor operations, read `references/tensors.md`.
+
+### Step A4: Run Inference
+
+```csharp
+// Needs System.Linq for First(); inputTensor is the DenseTensor<float> from Step A3.
+var inputs = new List<NamedOnnxValue> { NamedOnnxValue.CreateFromTensor("input", inputTensor) };
+using var results = session.Run(inputs);
+var output = results.First().AsTensor<float>();
+```
+
+### Step A5: Configure Execution Providers
+
+Only if you need GPU or specialized hardware acceleration:
+
+```csharp
+var options = new SessionOptions();
+// IF CUDA: options.AppendExecutionProvider_CUDA();
+// IF DirectML: options.AppendExecutionProvider_DML();
+// IF TensorRT: options.AppendExecutionProvider_Tensorrt();
+var session = new InferenceSession("model.onnx", options);
+```
+
+---
+
+## Approach B: ML.NET + ONNX Integration
+
+### Step B1: Install Packages
+
+- `Microsoft.ML`
+- `Microsoft.ML.OnnxRuntime`
+- `Microsoft.ML.OnnxTransformer`
+
+### Step B2: Build Pipeline with ONNX Scoring
+
+```csharp
+var mlContext = new MLContext(seed: 0);
+var pipeline = mlContext.Transforms.ApplyOnnxModel(
+    modelFile: "model.onnx",
+    outputColumnNames: new[] { "output" },
+    inputColumnNames: new[] { "input" });
+var model = pipeline.Fit(emptyData);
+```
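+
+`emptyData` is not defined in the snippet above. One way to create it is sketched here, assuming the model takes a single float input named `input` with shape `[1, 3, 224, 224]`. The `OnnxInput` and `EmptyDataFactory` names and the shape are illustrative assumptions:
+
+```csharp
+using System;
+using System.Collections.Generic;
+using Microsoft.ML;
+using Microsoft.ML.Data;
+
+// Hypothetical input schema; the column name must match inputColumnNames.
+public class OnnxInput
+{
+    [ColumnName("input")]
+    [VectorType(1, 3, 224, 224)]
+    public float[] Input { get; set; } = Array.Empty<float>();
+}
+
+public static class EmptyDataFactory
+{
+    // The ONNX transform is not trained, so a zero-row view that carries
+    // the right schema is typically enough for Fit().
+    public static IDataView Create(MLContext mlContext) =>
+        mlContext.Data.LoadFromEnumerable(new List<OnnxInput>());
+}
+```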
+
+### Step B3: Combine with ML.NET Transforms
+
+Add tokenization, normalization, or post-processing transforms before/after ONNX scoring in the pipeline.
+
+### Text Preprocessing for ONNX Models
+
+When running NLP ONNX models (BERT, DistilBERT, MiniLM), use `Microsoft.ML.Tokenizers` to convert text to token IDs before inference:
+
+```bash
+dotnet add package Microsoft.ML.Tokenizers
+```
+
+```csharp
+var tokenizer = BertTokenizer.Create("vocab.txt");
+IReadOnlyList<int> ids = tokenizer.EncodeToIds(text);
+// Feed ids into input tensor (see references/tensors.md)
+```
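+
+Many BERT-style ONNX exports declare `int64` inputs, so the `int` IDs usually need widening before they go into a tensor. The sketch below assumes the input names `input_ids` and `attention_mask` and a batch size of 1; `BertInputs` is a hypothetical helper, and the real names should be read from `session.InputMetadata`:
+
+```csharp
+using System.Collections.Generic;
+using System.Linq;
+using Microsoft.ML.OnnxRuntime;
+using Microsoft.ML.OnnxRuntime.Tensors;
+
+public static class BertInputs
+{
+    // Builds batch-size-1 inputs from token IDs.
+    // "input_ids" and "attention_mask" are assumed names, not guaranteed by the model.
+    public static List<NamedOnnxValue> FromTokenIds(IReadOnlyList<int> ids)
+    {
+        long[] inputIds = ids.Select(id => (long)id).ToArray();
+        long[] mask = Enumerable.Repeat(1L, ids.Count).ToArray(); // no padding: attend to every token
+        int[] shape = { 1, ids.Count };
+
+        return new List<NamedOnnxValue>
+        {
+            NamedOnnxValue.CreateFromTensor("input_ids", new DenseTensor<long>(inputIds, shape)),
+            NamedOnnxValue.CreateFromTensor("attention_mask", new DenseTensor<long>(mask, shape)),
+        };
+    }
+}
+```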
+
+---
+
+## Validation
+
+- Model loads without errors.
+- Inference produces output tensors with expected shape.
+- Execution provider correctly selected.
+- Singleton pattern used for `InferenceSession`.
+
+## Pitfalls
+
+- **Creating InferenceSession per request** — Expensive; use singleton.
+- **Wrong input tensor shape** — Check model's expected input with `session.InputMetadata`.
+- **Not matching preprocessing to what the model expects** — Verify normalization, channel order, etc.
+- **Missing execution provider packages** — Falls back to CPU silently.
+- **Not disposing `IDisposableReadOnlyCollection` from `session.Run`** — Always use `using`.
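+
+For the input-shape pitfall, a short diagnostic sketch that lists what the model expects. `model.onnx` is a placeholder path, and the session is disposed here only because this is a one-off check rather than the application's singleton:
+
+```csharp
+using System;
+using Microsoft.ML.OnnxRuntime;
+
+using var session = new InferenceSession("model.onnx");
+foreach (var input in session.InputMetadata)
+{
+    // Dimensions that are dynamic in the model typically show up as -1.
+    Console.WriteLine(
+        $"{input.Key}: {input.Value.ElementType.Name} [{string.Join(", ", input.Value.Dimensions)}]");
+}
+```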
+
+## More Info
+
+- https://onnxruntime.ai/docs/get-started/with-csharp.html
@@ -0,0 +1,82 @@
+# System.Numerics.Tensors for ONNX Input/Output
+
+Tensor types needed when working with ONNX Runtime's standalone API (Approach A). Use this reference when creating and manipulating input/output tensors.
+
+## Key Types
+
+### DenseTensor\<T\> (Microsoft.ML.OnnxRuntime.Tensors)
+
+ONNX Runtime's tensor type for creating inputs.
+
+```csharp
+// Create from data and dimensions
+var data = new float[] { 1f, 2f, 3f, 4f };
+var dimensions = new[] { 1, 4 };
+var tensor = new DenseTensor<float>(data, dimensions);
+```
+
+Common shapes:
+- `[1, 3, 224, 224]` — images (batch, channels, height, width)
+- `[1, sequenceLength]` — text token IDs
+
+### Tensor\<T\> (System.Numerics.Tensors, .NET 10+)
+
+.NET's native tensor type. Heap-allocated N-dimensional array with rich slicing and reshape operations.
+
+```csharp
+var tensor = Tensor.Create<float>(data, shape);
+```
+
+### TensorSpan\<T\> (System.Numerics.Tensors)
+
+Mutable zero-copy view for in-place operations on tensor data. `ref struct` — cannot be stored in fields.
+
+### TensorPrimitives (System.Numerics.Tensors)
+
+SIMD-accelerated math for pre/post-processing:
+
+```csharp
+TensorPrimitives.SoftMax(input, output);              // Post-process logits
+TensorPrimitives.CosineSimilarity(a, b);               // Similarity
+TensorPrimitives.Multiply(a, scalar, result);           // Scaling
+```
+
+## Common Patterns
+
+### Normalize Image
+
+Scale pixel values to `[0, 1]`, subtract mean, divide by std:
+
+```csharp
+var pixels = new float[1 * 3 * 224 * 224];
+// Load and scale pixels to [0, 1]
+// Subtract channel means: [0.485, 0.456, 0.406]
+// Divide by channel stds: [0.229, 0.224, 0.225]
+var input = new DenseTensor<float>(pixels, new[] { 1, 3, 224, 224 });
+```
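+
+The commented steps above can be sketched as follows, assuming the source pixels arrive as interleaved 8-bit RGB (height × width × 3). `ImagePreprocessing` is a hypothetical helper; the mean and std values are the ones listed above:
+
+```csharp
+using System;
+
+public static class ImagePreprocessing
+{
+    private static readonly float[] Mean = { 0.485f, 0.456f, 0.406f };
+    private static readonly float[] Std = { 0.229f, 0.224f, 0.225f };
+
+    // Converts interleaved RGB bytes (HWC) into normalized planar floats (CHW).
+    public static float[] ToNormalizedChw(ReadOnlySpan<byte> rgb, int height, int width)
+    {
+        if (rgb.Length != height * width * 3)
+            throw new ArgumentException("Expected height * width * 3 bytes.", nameof(rgb));
+
+        int plane = height * width;
+        var result = new float[3 * plane];
+        for (int p = 0; p < plane; p++)
+        {
+            for (int c = 0; c < 3; c++)
+            {
+                // Scale to [0, 1], subtract the channel mean, divide by the channel std.
+                result[c * plane + p] = (rgb[p * 3 + c] / 255f - Mean[c]) / Std[c];
+            }
+        }
+        return result;
+    }
+}
+```
+
+The returned array can be passed to `new DenseTensor<float>(result, new[] { 1, 3, height, width })`.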
+
+### Post-process Classification
+
+Apply softmax to logits and find argmax:
+
+```csharp
+var logits = output.ToArray();
+var probabilities = new float[logits.Length];
+TensorPrimitives.SoftMax(logits, probabilities);
+int predictedClass = TensorPrimitives.IndexOfMax(probabilities);
+```
+
+### Extract Embeddings
+
+Take mean or CLS pooling of model output:
+
+```csharp
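+// outputTensor: DenseTensor<float> of shape [1, sequenceLength, hiddenSize]
+// CLS pooling: use the first token's embedding
+var cls = outputTensor.Buffer.Span.Slice(0, hiddenSize).ToArray();
+
+// Mean pooling: average across the sequence dimension.
+// tokenEmbeddings: the same output flattened to float[], e.g. outputTensor.Buffer.ToArray().
+// Assumes no padding tokens; with padding, weight each token by the attention mask.
+var embedding = new float[hiddenSize];
+for (int i = 0; i < sequenceLength; i++)
+    TensorPrimitives.Add(embedding, tokenEmbeddings.AsSpan(i * hiddenSize, hiddenSize), embedding);
+TensorPrimitives.Divide(embedding, sequenceLength, embedding);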
+```