151 changes: 151 additions & 0 deletions plugins/dotnet-ai/skills/local-llm-inference/SKILL.md
@@ -0,0 +1,151 @@
---
name: local-llm-inference
description: >
USE FOR: Running LLMs locally in .NET without cloud API calls, privacy-sensitive or air-gapped
scenarios, reducing inference costs, offline development, on-device inference with Ollama or
Foundry Local.
DO NOT USE FOR: Cloud-based LLM calls (use meai-chat-integration), ONNX model inference for
non-LLM models (use onnx-runtime-inference), classical ML tasks (use mlnet).
---

# Local LLM Inference

Run large language models entirely on-device in .NET applications — no cloud dependencies, no API
keys, no network calls after model download. This skill covers two approaches that both wire through
Microsoft.Extensions.AI `IChatClient` for provider-agnostic application code.

## When to Use

- Running LLMs locally without cloud API calls
- Privacy-sensitive or air-gapped environments
- Reducing inference costs by avoiding per-token billing
- Offline development and testing with real models

## When Not to Use

- Cloud-based LLM calls → use **meai-chat-integration**
- ONNX model inference for non-LLM models → use **onnx-runtime-inference**
- ML.NET classical/traditional ML → use **mlnet**

## Decision Tree

| Criteria | Ollama | Foundry Local |
|---|---|---|
| Hosting | Separate server process | On-device service with OpenAI-compatible API |
| Model management | `ollama pull` | `foundry model run` with curated catalog |
| Hardware optimization | Manual (select model variant) | Automatic (selects optimal execution provider for CPU/GPU/NPU) |
| Model switching | Hot-switch via API | Load/unload via SDK or CLI |
| Platforms | Windows, macOS, Linux | Windows, macOS |

- **IF** using Ollama → follow the **Ollama** workflow below
- **IF** using Foundry Local → follow the **Foundry Local** workflow below

Both approaches produce an `IChatClient` — application code is identical regardless of provider.

## Workflow — Ollama

### Step 1: Install Ollama and Pull a Model

Install Ollama from https://ollama.ai, then pull a model:

```bash
ollama pull phi3:mini
```

### Step 2: Install OllamaSharp NuGet Package

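```bash
dotnet add package OllamaSharp
# Provides the AddChatClient() registration extension used in Step 3
dotnet add package Microsoft.Extensions.AI
```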

### Step 3: Register as IChatClient

```csharp
using OllamaSharp;

// OllamaApiClient implements IChatClient, so it can be registered directly.
builder.Services.AddChatClient(
    new OllamaApiClient("http://localhost:11434", "phi3:mini"));
```

### Step 4: Use Through IChatClient

The `IChatClient` API is identical to cloud providers — swap the backing implementation without
changing application code.
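
As a minimal sketch of what provider-agnostic application code can look like, the hypothetical `SummaryService` below depends only on `IChatClient`. It assumes a client was registered as in Step 3 and uses the `GetResponseAsync` and `GetStreamingResponseAsync` extension methods from Microsoft.Extensions.AI; the class name, method names, and prompt are illustrative and not part of any library.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.AI;

// Hypothetical service: it only knows about IChatClient, not about Ollama or Foundry Local.
public sealed class SummaryService
{
    private readonly IChatClient _chatClient;

    public SummaryService(IChatClient chatClient) => _chatClient = chatClient;

    // Sends one prompt and returns the complete response text.
    public async Task<string> SummarizeAsync(string text, CancellationToken ct = default)
    {
        ChatResponse response = await _chatClient.GetResponseAsync(
            $"Summarize in one sentence: {text}", cancellationToken: ct);
        return response.Text;
    }

    // Streams the response and writes each chunk as it arrives.
    public async Task StreamAsync(string prompt, CancellationToken ct = default)
    {
        await foreach (ChatResponseUpdate update in
            _chatClient.GetStreamingResponseAsync(prompt, cancellationToken: ct))
        {
            Console.Write(update.Text);
        }
    }
}
```

Because the service takes `IChatClient` through its constructor, switching between Ollama, Foundry Local, or a cloud provider only changes the registration, not this class.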

## Workflow — Foundry Local

### Step 1: Install Foundry Local CLI

```bash
# Windows
winget install Microsoft.FoundryLocal

# macOS
brew tap microsoft/foundrylocal
brew install foundrylocal
```

Verify: `foundry --version`

### Step 2: Install NuGet Packages

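```bash
dotnet add package Microsoft.AI.Foundry.Local
dotnet add package OpenAI
# Provides the AddChatClient() registration extension used in Step 4
dotnet add package Microsoft.Extensions.AI
# Provides AsIChatClient() for OpenAI clients; add --prerelease if only a preview build is published
dotnet add package Microsoft.Extensions.AI.OpenAI
```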

### Step 3: Initialize Foundry Local and Load a Model

```csharp
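using Microsoft.AI.Foundry.Local;
using OpenAI;

// `config` and `logger` are not defined in this snippet; the caller supplies them.
await FoundryLocalManager.CreateAsync(config, logger);
var mgr = FoundryLocalManager.Instance;

// Look up the model in the catalog, download it if needed, then load it.
var catalog = await mgr.GetCatalogAsync();
var model = await catalog.GetModelAsync("phi-4-mini")
    ?? throw new Exception("Model not found");
await model.DownloadAsync();
await model.LoadAsync();

// Start the local OpenAI-compatible web service.
await mgr.StartWebServiceAsync();
```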

> Browse available models: `foundry model list`

### Step 4: Wire to IChatClient via OpenAI SDK

Foundry Local exposes an OpenAI-compatible endpoint. Use the `OpenAI` NuGet package and wrap as `IChatClient`:

```csharp
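// `endpoint` is not defined in this snippet: it is the base URL of the
// Foundry Local web service started in Step 3.
// The API key is a placeholder; the OpenAI client requires a non-empty value.
var client = new OpenAIClient(
    new System.ClientModel.ApiKeyCredential("notneeded"),
    new OpenAIClientOptions { Endpoint = new Uri(endpoint + "/v1") });

builder.Services.AddChatClient(
    client.GetChatClient(model.Id).AsIChatClient());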
```

### Step 5: Use Through IChatClient

Application code is identical to the Ollama path and cloud providers.

## Validation

- [ ] Model loads without errors
- [ ] Model generates coherent text responses
- [ ] No outbound network calls are made during inference (after model download)
- [ ] Response quality is acceptable for the target use case
- [ ] Memory usage stays within hardware limits

## Common Pitfalls

| Pitfall | Guidance |
|---|---|
| Model too large for available RAM | A 7B model at Q4 needs ~4 GB and a 13B model at Q4 needs ~8 GB; size accordingly |
| Expecting cloud-model quality from small local models | Local models are capable but not equivalent to GPT-4-class models for complex tasks |
| Using full-precision models | Always use quantized models; 16-bit weights need ~4× the memory of a 4-bit quantization |
| Not checking hardware compatibility | Foundry Local auto-selects execution providers; for Ollama, verify GPU support manually |
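
The RAM figures in the table can be sanity-checked with simple arithmetic: weight size is roughly parameter count times bits per weight, divided by 8. The helper below is a rough sketch with a hypothetical name (`ModelMemory`); it covers the weights only, so the KV cache, activations, and runtime overhead need extra headroom on top.

```csharp
using System;

public static class ModelMemory
{
    // Approximate weight size in GB (10^9 bytes):
    // billions of parameters × bits per weight / 8 bits per byte.
    // Assumption: ignores KV cache, activations, and runtime overhead.
    public static double WeightGigabytes(double billionsOfParameters, int bitsPerWeight)
        => billionsOfParameters * bitsPerWeight / 8.0;
}
```

For example, `ModelMemory.WeightGigabytes(7, 4)` gives 3.5 and `ModelMemory.WeightGigabytes(13, 4)` gives 6.5, consistent with the ~4 GB and ~8 GB guidance once overhead is added; `ModelMemory.WeightGigabytes(7, 16)` gives 14, four times the 4-bit figure.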

## More Info

- [Foundry Local](https://learn.microsoft.com/azure/foundry-local/get-started)
- [Ollama](https://ollama.ai)
128 changes: 128 additions & 0 deletions plugins/dotnet-ai/skills/onnx-runtime-inference/SKILL.md
@@ -0,0 +1,128 @@
---
name: onnx-runtime-inference
description: >
USE FOR: Loading and running pre-trained ONNX models, hardware-accelerated inference (CPU, GPU, DirectML),
custom model deployment, HuggingFace model inference, combining ONNX scoring with ML.NET pipelines.
DO NOT USE FOR: Training models from scratch (use mlnet for classical ML and supported deep learning tasks; TorchSharp for custom architectures),
LLM text generation (use meai-chat-integration or local-llm-inference),
classical ML without a pre-trained model (use mlnet).
---

# ONNX Runtime Inference in .NET

This skill guides running pre-trained ONNX model inference in .NET.

## Choose Your Approach

Decide which approach fits your scenario:

- **Approach A: Standalone ONNX Runtime** — Use when you need maximum control over tensors and execution providers.
- **Approach B: ML.NET + ONNX Integration** — Use when you need to combine ONNX with ML.NET data transforms and pipelines.

---

## Approach A: Standalone ONNX Runtime

### Step A1: Install Packages

- `Microsoft.ML.OnnxRuntime` — CPU inference
- IF CUDA GPU → `Microsoft.ML.OnnxRuntime.Gpu` instead
- IF Windows GPU → `Microsoft.ML.OnnxRuntime.DirectML` instead

### Step A2: Create InferenceSession

> **SINGLETON** — Create once, reuse. Do not create per request.

```csharp
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;

var session = new InferenceSession("model.onnx");
// Register as singleton in DI
```
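
One possible way to follow the singleton guidance is to register the session with Microsoft.Extensions.DependencyInjection. This is a sketch, assuming the `Microsoft.Extensions.DependencyInjection` package is referenced and that `model.onnx` is a placeholder for your model path:

```csharp
using Microsoft.Extensions.DependencyInjection;
using Microsoft.ML.OnnxRuntime;

var services = new ServiceCollection();

// One InferenceSession for the whole application; the factory runs on first resolve.
// "model.onnx" is a placeholder path.
services.AddSingleton(_ => new InferenceSession("model.onnx"));

using var provider = services.BuildServiceProvider();

// Every resolve returns the same instance; the container disposes it when it is disposed.
var session = provider.GetRequiredService<InferenceSession>();
```

In an ASP.NET Core app the same `AddSingleton` call would go on `builder.Services`, and consumers would take `InferenceSession` as a constructor parameter.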

### Step A3: Prepare Input Tensors

- Create `DenseTensor<float>` with the correct shape.
- Wrap in `NamedOnnxValue.CreateFromTensor("input_name", tensor)`.
- For detailed tensor operations, read `references/tensors.md`.

### Step A4: Run Inference

```csharp
var inputs = new List<NamedOnnxValue> { NamedOnnxValue.CreateFromTensor("input", inputTensor) };
using var results = session.Run(inputs);
var output = results.First().AsTensor<float>();
```

### Step A5: Configure Execution Providers

Only if you need GPU or specialized hardware acceleration:

```csharp
var options = new SessionOptions();
// IF CUDA: options.AppendExecutionProvider_CUDA();
// IF DirectML: options.AppendExecutionProvider_DML();
// IF TensorRT: options.AppendExecutionProvider_Tensorrt(new());
var session = new InferenceSession("model.onnx", options);
```

---

## Approach B: ML.NET + ONNX Integration

### Step B1: Install Packages

- `Microsoft.ML`
- `Microsoft.ML.OnnxRuntime`
- `Microsoft.ML.OnnxTransformer`

### Step B2: Build Pipeline with ONNX Scoring

```csharp
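var mlContext = new MLContext(seed: 0);
var pipeline = mlContext.Transforms.ApplyOnnxModel(
    modelFile: "model.onnx",
    outputColumnNames: new[] { "output" },
    inputColumnNames: new[] { "input" });

// Fit needs an IDataView with the input schema but no rows, for example
// mlContext.Data.LoadFromEnumerable(new List<ModelInput>()), where ModelInput
// is your own (hypothetical) input class with an "input" column.
var model = pipeline.Fit(emptyData);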
```

### Step B3: Combine with ML.NET Transforms

Add tokenization, normalization, or post-processing transforms before/after ONNX scoring in the pipeline.

### Text Preprocessing for ONNX Models

When running NLP ONNX models (BERT, DistilBERT, MiniLM), use `Microsoft.ML.Tokenizers` to convert text to token IDs before inference:

```bash
dotnet add package Microsoft.ML.Tokenizers
```

```csharp
var tokenizer = BertTokenizer.Create("vocab.txt");
IReadOnlyList<int> ids = tokenizer.EncodeToIds(text);
// Feed ids into input tensor (see references/tensors.md)
```
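
Token IDs usually need padding and an attention mask before they can be placed in a tensor. The sketch below uses a hypothetical helper (`BertInputs.Build`) and assumes a BERT-style export that expects fixed-length int64 inputs with pad ID 0; check `session.InputMetadata` for the names, element types, and shapes your model actually expects.

```csharp
using System;
using System.Collections.Generic;

public static class BertInputs
{
    // Converts token IDs to fixed-length int64 arrays: IDs padded with padId,
    // and an attention mask with 1 for real tokens and 0 for padding.
    // Assumption: naive truncation, which may cut off a trailing [SEP] token.
    public static (long[] InputIds, long[] AttentionMask) Build(
        IReadOnlyList<int> ids, int maxLength, int padId = 0)
    {
        var inputIds = new long[maxLength];
        var attentionMask = new long[maxLength];
        int count = Math.Min(ids.Count, maxLength);

        for (int i = 0; i < maxLength; i++)
        {
            bool isToken = i < count;
            inputIds[i] = isToken ? ids[i] : padId;
            attentionMask[i] = isToken ? 1 : 0;
        }
        return (inputIds, attentionMask);
    }
}
```

Each array can then be wrapped as `new DenseTensor<long>(inputIds, new[] { 1, maxLength })` and passed to `NamedOnnxValue.CreateFromTensor` under the input name the model declares.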

---

## Validation

- Model loads without errors.
- Inference produces output tensors with expected shape.
- Execution provider correctly selected.
- Singleton pattern used for `InferenceSession`.

## Pitfalls

- **Creating InferenceSession per request** — Expensive; use singleton.
- **Wrong input tensor shape** — Check model's expected input with `session.InputMetadata`.
- **Not matching preprocessing to what the model expects** — Verify normalization, channel order, etc.
- **Missing execution provider packages** — Falls back to CPU silently.
- **Not disposing `IDisposableReadOnlyCollection` from `session.Run`** — Always use `using`.
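
To catch shape and type mismatches early, the model's declared inputs can be printed once at startup. This is a small sketch using `session.InputMetadata`; `model.onnx` is a placeholder path, and dynamic dimensions are assumed to be reported as `-1`.

```csharp
using System;
using Microsoft.ML.OnnxRuntime;

// "model.onnx" is a placeholder path.
using var session = new InferenceSession("model.onnx");

// Print each input's name, element type, and shape.
foreach (var (name, meta) in session.InputMetadata)
{
    Console.WriteLine(
        $"{name}: {meta.ElementType.Name} [{string.Join(", ", meta.Dimensions)}]");
}
```

The same loop over `session.OutputMetadata` shows what to expect from `session.Run`.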

## More Info

- https://onnxruntime.ai/docs/get-started/with-csharp.html
@@ -0,0 +1,82 @@
# System.Numerics.Tensors for ONNX Input/Output

Tensor types needed when working with ONNX Runtime's standalone API (Approach A). Use this reference when creating and manipulating input/output tensors.

## Key Types

### DenseTensor\<T\> (Microsoft.ML.OnnxRuntime.Tensors)

ONNX Runtime's tensor type for creating inputs.

```csharp
// Create from data and dimensions
var data = new float[] { 1f, 2f, 3f, 4f };
var dimensions = new[] { 1, 4 };
var tensor = new DenseTensor<float>(data, dimensions);
```

Common shapes:
- `[1, 3, 224, 224]` — images (batch, channels, height, width)
- `[1, sequenceLength]` — text token IDs

### Tensor\<T\> (System.Numerics.Tensors, .NET 10+)

.NET's native tensor type. Heap-allocated N-dimensional array with rich slicing and reshape operations.

```csharp
var tensor = Tensor.Create<float>(data, shape);
```

### TensorSpan\<T\> (System.Numerics.Tensors)

Mutable zero-copy view for in-place operations on tensor data. `ref struct` — cannot be stored in fields.

### TensorPrimitives (System.Numerics.Tensors)

SIMD-accelerated math for pre/post-processing:

```csharp
TensorPrimitives.SoftMax(input, output);        // Post-process logits
TensorPrimitives.CosineSimilarity(a, b); // Similarity
TensorPrimitives.Multiply(a, scalar, result); // Scaling
```

## Common Patterns

### Normalize Image

Scale pixel values to `[0, 1]`, subtract mean, divide by std:

```csharp
var pixels = new float[1 * 3 * 224 * 224];
// Load and scale pixels to [0, 1]
// Subtract channel means: [0.485, 0.456, 0.406]
// Divide by channel stds: [0.229, 0.224, 0.225]
var input = new DenseTensor<float>(pixels, new[] { 1, 3, 224, 224 });
```
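
The commented steps above can be sketched as a plain helper. This version assumes the source image is already decoded into interleaved 8-bit RGB bytes (height × width × 3) and that the model wants planar channel-first floats; the class and method names are hypothetical, and image decoding and resizing are out of scope.

```csharp
using System;

public static class ImagePreprocessing
{
    // Channel statistics from the snippet above (the commonly used ImageNet values).
    private static readonly float[] Mean = { 0.485f, 0.456f, 0.406f };
    private static readonly float[] Std = { 0.229f, 0.224f, 0.225f };

    // Converts interleaved RGB bytes (height × width × 3) into a planar
    // float buffer (3 × height × width), normalized per channel.
    public static float[] ToNormalizedChw(byte[] rgb, int height, int width)
    {
        if (rgb.Length != height * width * 3)
            throw new ArgumentException("Expected height * width * 3 bytes.", nameof(rgb));

        var result = new float[3 * height * width];
        for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++)
        for (int c = 0; c < 3; c++)
        {
            // Scale to [0, 1], then subtract the channel mean and divide by the channel std.
            float scaled = rgb[(y * width + x) * 3 + c] / 255f;
            result[c * height * width + y * width + x] = (scaled - Mean[c]) / Std[c];
        }
        return result;
    }
}
```

The result can be wrapped as `new DenseTensor<float>(ImagePreprocessing.ToNormalizedChw(rgb, 224, 224), new[] { 1, 3, 224, 224 })`. Verify the channel order and statistics against the model's documentation, since not every model uses these values.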

### Post-process Classification

Apply softmax to logits and find argmax:

```csharp
var logits = output.ToArray();
var probabilities = new float[logits.Length];
TensorPrimitives.SoftMax(logits, probabilities);
int predictedClass = TensorPrimitives.IndexOfMax(probabilities);
```

### Extract Embeddings

Take mean or CLS pooling of model output:

```csharp
// CLS pooling: use the first token's embedding
var cls = outputTensor.Buffer.Span.Slice(0, hiddenSize).ToArray();

// Mean pooling: average across the sequence dimension
var embedding = new float[hiddenSize];
for (int i = 0; i < sequenceLength; i++)
    TensorPrimitives.Add(embedding, tokenEmbeddings.AsSpan(i * hiddenSize, hiddenSize), embedding);
TensorPrimitives.Divide(embedding, sequenceLength, embedding);
```