Releases: microsoft/Foundry-Local
v1.1.0 Foundry Local
🚀 Foundry Local v1.1.0 Release Notes
We're excited to announce Foundry Local v1.1.0 — packed with new capabilities for on-device AI! This release brings expanded platform support, new model types, and performance improvements across the board.
🆕 What's New
🎯 .NET netstandard2.0 / net8.0 Support
The C# SDK now targets both net8.0 and netstandard2.0, broadening compatibility to .NET Framework 4.6.1+, .NET Core 2.0+, Xamarin, Unity, and more. Ship on-device AI to virtually any .NET application!
```xml
<PackageReference Include="Microsoft.AI.Foundry.Local" Version="1.1.0" />
```

🎙️ Live Audio Transcription
Real-time speech-to-text is here! Stream microphone audio directly to the SDK and receive transcription results as they arrive — no cloud round-trips, no network latency. Built on the Nemotron ASR model with an OpenAI Realtime-compatible API surface.
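The examples below feed raw PCM bytes into the session via `append`. As an illustration (this helper is not part of the SDK), converting float microphone samples in [-1.0, 1.0] to 16-bit little-endian PCM bytes might look like:

```python
import struct

def floats_to_pcm16(samples):
    """Convert float samples in [-1.0, 1.0] to 16-bit little-endian PCM bytes.

    Illustrative helper, not part of the Foundry Local SDK: the session
    consumes raw PCM bytes, while many audio capture libraries yield floats.
    """
    clamped = (max(-1.0, min(1.0, s)) for s in samples)
    return b"".join(struct.pack("<h", int(s * 32767)) for s in clamped)

pcm_bytes = floats_to_pcm16([0.0, 0.5, -0.5, 1.0])
```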
Python

```python
audio_client = model.get_audio_client()
session = audio_client.create_live_transcription_session()
session.settings.sample_rate = 16000
session.settings.channels = 1
session.settings.language = "en"
session.start()

# Push audio
session.append(pcm_bytes)

# Read results (typically on a background thread)
for result in session.get_stream():
    print(result.content[0].text)  # transcribed text
    print(result.is_final)  # True for final results

session.stop()
```

JavaScript

```javascript
const audioClient = model.createAudioClient();
const session = audioClient.createLiveTranscriptionSession();
session.settings.sampleRate = 16000;
session.settings.channels = 1;
session.settings.language = 'en';
await session.start();

// Push audio
await session.append(pcmBytes);

// Read results
for await (const result of session.getStream()) {
  console.log(result.content[0].text); // transcribed text
  console.log(result.is_final); // true for final results
}

await session.stop();
```

C#

```csharp
var audioClient = await model.GetAudioClientAsync();
var session = audioClient.CreateLiveTranscriptionSession();
session.Settings.SampleRate = 16000;
session.Settings.Channels = 1;
session.Settings.Language = "en";
await session.StartAsync();

// Push audio
await session.AppendAsync(pcmBytes);

// Read results
await foreach (var result in session.GetStream())
{
    Console.WriteLine(result.Content[0].Text); // transcribed text
    Console.WriteLine(result.IsFinal); // true for final results
}

await session.StopAsync();
```

Rust

```rust
let audio_client = model.create_audio_client();
let session = audio_client.create_live_transcription_session();
session.start(None).await?;

// Push audio
session.append(&pcm_bytes).await?;

// Read results
let mut stream = session.get_stream().await?;
while let Some(result) = stream.next().await {
    let r = result?;
    if let Some(content) = r.content.first() {
        println!("{}", content.text); // transcribed text
        println!("{}", r.is_final); // true for final results
    }
}

session.stop().await?;
```

📐 Embeddings
Generate text embeddings entirely on-device for semantic search, RAG, clustering, and more. The new qwen3-0.6b-embedding model delivers high-quality vector representations in a compact footprint.
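Once you have vectors, semantic search is typically a cosine-similarity ranking. A minimal, SDK-independent sketch using toy 3-dimensional vectors in place of real embeddings:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy 3-dimensional vectors standing in for real embedding output
query = [1.0, 0.0, 1.0]
docs = {"doc_a": [0.9, 0.1, 0.8], "doc_b": [0.0, 1.0, 0.0]}

# Rank candidate documents by similarity to the query vector
best = max(docs, key=lambda name: cosine_similarity(query, docs[name]))
# best is "doc_a"
```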
Python

```python
model = manager.catalog.get_model("qwen3-0.6b-embedding")
model.download()
model.load()
client = model.get_embedding_client()

# Single embedding
response = client.generate_embedding("The quick brown fox jumps over the lazy dog")
embedding = response.data[0].embedding
print(f"Dimensions: {len(embedding)}")

# Batch embeddings
batch_response = client.generate_embeddings([
    "Machine learning is a subset of artificial intelligence",
    "The capital of France is Paris",
    "Rust is a systems programming language",
])
```

JavaScript

```javascript
const model = await manager.catalog.getModel('qwen3-0.6b-embedding');
await model.download();
await model.load();
const embeddingClient = model.createEmbeddingClient();

// Single embedding
const response = await embeddingClient.generateEmbedding(
  'The quick brown fox jumps over the lazy dog'
);
console.log(`Dimensions: ${response.data[0].embedding.length}`);

// Batch embeddings
const batchResponse = await embeddingClient.generateEmbeddings([
  'Machine learning is a subset of artificial intelligence',
  'The capital of France is Paris',
  'Rust is a systems programming language'
]);
```

C#

```csharp
var model = await catalog.GetModelAsync("qwen3-0.6b-embedding");
await model.DownloadAsync();
await model.LoadAsync();
var embeddingClient = await model.GetEmbeddingClientAsync();

// Single embedding
var response = await embeddingClient.GenerateEmbeddingAsync(
    "The quick brown fox jumps over the lazy dog");
var embedding = response.Data[0].Embedding;
Console.WriteLine($"Dimensions: {embedding.Count}");

// Batch embeddings
var batchResponse = await embeddingClient.GenerateEmbeddingsAsync([
    "Machine learning is a subset of artificial intelligence",
    "The capital of France is Paris",
    "Rust is a systems programming language"
]);
```
👁️ Qwen 3.5 Vision Language Model
Introducing Qwen 3.5 VL — a multimodal vision-language model that runs entirely on-device. Analyze images, understand visual content, and answer questions about what's in a picture — all without sending data to the cloud.
```python
model = manager.catalog.get_model("qwen3.5-vision")
model.download()
model.load()
```

📦 JavaScript SDK — Koffi Dependency Removed
The JavaScript SDK no longer depends on koffi for native interop. This results in a leaner dependency tree, faster installs, and fewer compatibility issues across platforms and Node.js versions.
- ✅ Smaller `node_modules` — no more large native FFI dependency
- ✅ Fewer platform quirks — prebuilt N-API addon replaces runtime FFI binding
- ✅ Faster install times — less to download, nothing to compile
📖 JavaScript SDK Documentation
🖥️ WebGPU Plugin Execution Provider
The new WebGPU execution provider is delivered as a plug-in — it's not bundled with the core runtime, keeping your base binary small. When WebGPU acceleration is needed, Foundry Local automatically downloads and registers the EP on the fly, so your users only pay the size cost if their hardware benefits from it.
This approach means:
- ✅ Smaller default install — the core package stays lean (~20 MB)
- ✅ On-demand download — the WebGPU EP is fetched and registered at runtime only when needed
- ✅ Broader GPU coverage — unlocks hardware acceleration through the WebGPU Execution Provider for our cross-platform solution
📚 Resources
| Resource | Link |
|---|---|
| 📖 MSLearn Docs | learn.microsoft.com/en-us/azure/foundry-local/get-started |
| 🐙 GitHub | github.com/microsoft/Foundry-Local |
| 🧪 Samples | samples/ |
💙 Thank You
Thank you to our community for your feedback and contributions!
v1.0.0 Foundry Local - General Availability
We are excited to announce the General Availability of Foundry Local, a unified on-device AI runtime that brings generative AI directly into your applications. All inference runs locally: user data never leaves the device, responses are instant with zero network latency, and everything works offline. No per-token costs, no backend infrastructure.
SDKs
Foundry Local ships production SDKs for C#, JavaScript, Python, and Rust, each providing a consistent API surface for model management, chat completions, audio transcription, and tool calling.
| SDK | Package |
|---|---|
| C# | Microsoft.AI.Foundry.Local |
| JavaScript | foundry-local-sdk |
| 🐍 Python | foundry-local-sdk |
| 🦀 Rust | foundry-local-sdk |
WinML Variants
Each SDK also ships a WinML variant that unlocks more GPU and NPU devices on Windows, available through the Windows ML execution provider catalog.
| SDK | Package |
|---|---|
| C# | Microsoft.AI.Foundry.Local.WinML |
| JavaScript | foundry-local-sdk-winml |
| 🐍 Python | foundry-local-sdk-winml |
| 🦀 Rust | foundry-local-sdk with `winml` feature flag |
Platform Support
| OS | Architectures |
|---|---|
| Windows | x64, ARM64 |
| macOS | ARM64 |
| Linux | x64 |
What You Can Build
Chat Completions
Full OpenAI-compatible chat completions API with multi-turn conversations and configurable inference parameters (temperature, top-k, top-p, max tokens, frequency/presence penalty, random seed).
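Because the surface is OpenAI-compatible, a request body can be assembled exactly as for the OpenAI API. A sketch (the alias `my-local-model` is a placeholder; parameter names follow the OpenAI chat completions schema):

```python
import json

# Illustrative request body; "my-local-model" is a placeholder alias and
# the parameter names follow the OpenAI chat completions schema.
request = {
    "model": "my-local-model",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize ONNX Runtime in one sentence."},
    ],
    "temperature": 0.7,
    "top_p": 0.9,
    "max_tokens": 256,
    "frequency_penalty": 0.0,
    "presence_penalty": 0.0,
    "seed": 42,
}
body = json.dumps(request)  # ready to POST to an OpenAI-compatible endpoint
```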
Audio Transcription
On-device speech-to-text. Transcribe audio files with language selection and temperature control.
Embedded Web Server
Start an OpenAI-compatible HTTP server from your application with a single call. Useful for multi-process architectures or bridging to tools that speak the OpenAI REST protocol.
Hardware Acceleration
Powered by ONNX Runtime, Foundry Local automatically detects available hardware and selects the best execution provider, with zero hardware detection code needed in your application.
Supported execution providers:
| Execution Provider | Hardware | Platform |
|---|---|---|
| CPU | Universal fallback | All platforms |
| WebGPU | GPU acceleration | Windows x64, macOS arm64 |
| CUDA | NVIDIA GPUs | Windows x64, Linux x64 |
| OpenVINO | Intel GPUs and NPUs | Windows x64 |
| QNN | Qualcomm NPUs | Windows ARM64 |
| TensorRT RTX | NVIDIA GPUs | Windows x64 |
| VitisAI | AMD NPUs | Windows x64 |
Execution providers can be discovered, downloaded, and registered at runtime through the SDK's `discoverEps()` and `downloadAndRegisterEps()` APIs, with per-provider progress callbacks.
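The provider table above can be read as a preference order. Purely as an illustration (the actual detection and ranking heuristics are internal to Foundry Local and ONNX Runtime, and the priority list here is made up), automatic selection amounts to picking the highest-priority provider the machine supports:

```python
# Hypothetical priority order, best acceleration first; CPU is the
# universal fallback per the table above.
EP_PRIORITY = ["TensorRT RTX", "CUDA", "QNN", "VitisAI", "OpenVINO", "WebGPU", "CPU"]

def select_ep(available):
    """Return the highest-priority execution provider present on the machine."""
    for ep in EP_PRIORITY:
        if ep in available:
            return ep
    return "CPU"  # universal fallback

select_ep({"WebGPU", "CPU"})  # -> "WebGPU"
select_ep({"CUDA", "CPU"})    # -> "CUDA"
```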
Model Catalog & Management
Foundry Local includes a built-in model catalog with popular open-source models, optimized with state-of-the-art quantization and compression for on-device performance.
Model management features:
- Browse & search the catalog programmatically
- Multi-variant models - each alias maps to multiple variants optimized for different hardware (CPU, GPU, NPU)
- Automatic variant selection - the SDK picks the best variant based on what's cached and what hardware is available, with manual override via `selectVariant()`
- Download with progress tracking - real-time percentage callbacks
- Load / unload lifecycle - explicit control over which models are in memory
- Version management - query the catalog for the latest version of any model
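Automatic variant selection can be pictured roughly like this (an illustrative sketch with made-up variant names and a simplified policy, not the SDK's actual algorithm): prefer an already-cached variant that the hardware supports, otherwise fall back to the best supported variant.

```python
def pick_variant(variants, available_devices, cached):
    """Pick a model variant; variants are ordered best-first (e.g. NPU > GPU > CPU)."""
    # Prefer a variant that is both hardware-compatible and already downloaded
    for v in variants:
        if v["device"] in available_devices and v["name"] in cached:
            return v["name"]
    # Otherwise take the best hardware-compatible variant
    for v in variants:
        if v["device"] in available_devices:
            return v["name"]
    raise LookupError("no compatible variant")

variants = [
    {"name": "model-npu", "device": "NPU"},
    {"name": "model-gpu", "device": "GPU"},
    {"name": "model-cpu", "device": "CPU"},
]
pick_variant(variants, {"GPU", "CPU"}, cached={"model-cpu"})  # -> "model-cpu"
```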
Get Started
| Language | Cross-platform | Windows ML |
|---|---|---|
| JavaScript | `npm install foundry-local-sdk` | `npm install foundry-local-sdk-winml` |
| C# | `dotnet add package Microsoft.AI.Foundry.Local` | `dotnet add package Microsoft.AI.Foundry.Local.WinML` |
| Python | `pip install foundry-local-sdk` | `pip install foundry-local-sdk-winml` |
| Rust | `cargo add foundry-local-sdk` | `cargo add foundry-local-sdk --features winml` |
Foundry Local Release 0.8.119
Foundry Local Release 0.8.117
Foundry Local Release 0.8.115
Foundry Local Release Notes: v0.8.115 🚀
This release is an incremental build targeting tool calling scenarios.
🐛 Bug fixes
#335 Guidance error when tool_choice=required
#336 Foundry Local enforcing "required" field of function parameters
📝 Known issues
#346 Tool calling doesn't return tool_calls results in streaming mode
Foundry Local Release 0.8.113
Foundry Local Release Notes: v0.8.113 🚀
✨ New Features
Add support for tool calling. Models that support tool calling have the `supportsToolCalling` tag, which is also exposed via the SDKs.
🐛 Bug fixes
Fix crash on context length exhaustion. CLI now exits when context length is exhausted and the REST API returns an error if the request requires more tokens than max_length configuration allows.
📝 Known issues
This release only allows one tool call per request.
Foundry Local Release 0.8.103
Foundry Local Release Notes: v0.8.103 🚀
🔨 Filter out automatic speech recognition models from foundry model list
These models can be listed using the /foundry/list endpoint and run using the standalone SDK.
⭐ Sign Up for Foundry Local SDK vNext Private Preview – Fill in form ⭐
Foundry Local Release 0.8.101
Foundry Local Release Notes: v0.8.101 🚀
✨ New Features
Improve performance for multi-turn conversations on macOS, especially time to first token, with the addition of the continuous decoding feature. Only new tokens are sent to the model instead of the entire conversation. The previous inputs and responses are saved by the model in the KV-cache.
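Conceptually, continuous decoding means computing only the suffix of the conversation that the KV-cache has not already seen. A toy sketch (the token ids are made up; the real logic lives inside the runtime):

```python
def tokens_to_send(cached_tokens, full_conversation_tokens):
    """Return only the new suffix tokens; the prefix is already in the KV-cache."""
    assert full_conversation_tokens[: len(cached_tokens)] == cached_tokens
    return full_conversation_tokens[len(cached_tokens):]

cached = [101, 7, 42]        # prior turns already held in the KV-cache
full = [101, 7, 42, 55, 99]  # conversation plus the new user turn
tokens_to_send(cached, full)  # -> [55, 99]
```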
📝 Known issues
When the context length is exhausted (set by the `max_length` value), an exception is thrown instead of a warning / error message.
Foundry Local Release 0.8.94
Foundry Local Release Notes: v0.8.94 🚀
✨ New Features
Improve performance for multi-turn conversations, especially time to first token, with the addition of the continuous decoding feature. Only new tokens are sent to the model instead of the entire conversation. The previous inputs and responses are saved by the model in the KV-cache.
Website showing full model list with hardware variants: https://foundrylocal.ai/models
🐛 Bug fixes
- Foundry Local now defaults to `--default-log-level` instead of `Information` if `--log-level` is not provided. Foundry Local also elevates the level with which some errors were being written from `Information` to `Error`.
- #265
- #263
- #71
📝 Known issues
- This version is not supported on macOS. Please use the previous release for macOS. Support coming soon!
- If a model is not found in the catalog, an exception is thrown instead of a warning / suggestion message and a graceful exit.
- When the context length is exhausted (set by the `max_length` value), an exception is thrown instead of a warning / error message.