- Everywhere: web, iOS, macOS, Android, Windows, Linux.
- Fast: exceeds average reading speed on all platforms except web.
- Private: no network connection, server, or cloud required.
- Forward compatible: Any model compatible with llama.cpp. (so, every model.)
- Full OpenAI compatibility: chat messages, multimodal/image support via LLaVA models, and function calling (constrain outputs to valid JSON based on a JSON schema).
- Bare metal interface: call LLMs without being constrained to a chat implementation.
- Use with FONNX for RAG
- Combine with FONNX to have a full retrieval-augmented generation stack available on all platforms.
| Platform | Status |
|---|---|
| Android | |
| iOS | |
| Linux | |
| macOS | |
| Web | |
| Windows | |
- Web is now based on WASM compiled from FLLAMA itself, rather than just llama.cpp, guaranteeing native/web parity.
- Tokenizing strings with the model's tokenizer is about 1000x faster because the model is now cached. Latency went from roughly 300 ms (native only) to roughly 0.2 ms on web and 0.00001 ms on native. This makes it practical to calculate which strings will fit in context for a given context size.
- Method renames for consistency, correctness, and clarity (e.g. remove `*Async` from names, because all methods are async; rename `fllamaChatCompletionAsync` to `fllamaChat`).
- Documented methods and updated the example. TL;DR: use `fllamaChat` unless you're doing something unusual with LLMs that isn't user-facing; in that case the non-chat interface acts like a true text completion engine instead of a chatbot.
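Because tokenizing is now cheap, an app can count tokens per message and decide which messages fit before sending a request. The sketch below keeps the most recent messages that fit in a token budget. Everything in it is illustrative: `fitToContext` and `reservedForOutput` are hypothetical names, and `countTokens` is a stand-in callback for whatever tokenizer call the app uses, not an fllama API (a real tokenizer call would likely be asynchronous).

```dart
/// Keeps the newest messages whose combined token count fits in the budget.
///
/// [countTokens] is a hypothetical stand-in for a model-specific tokenizer
/// call; it is not part of fllama's API.
List<String> fitToContext({
  required List<String> messages,
  required int contextSize,
  required int reservedForOutput,
  required int Function(String) countTokens,
}) {
  // Leave room for the reply the model will generate.
  final budget = contextSize - reservedForOutput;
  final kept = <String>[];
  var used = 0;
  // Walk from newest to oldest so recent turns survive trimming.
  for (final message in messages.reversed) {
    final cost = countTokens(message);
    if (used + cost > budget) break;
    kept.add(message);
    used += cost;
  }
  // Restore chronological order.
  return kept.reversed.toList();
}
```

Dropping the oldest messages first is only one possible policy; an app might instead always keep the system message, or summarize older turns.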
- Add this to your package's pubspec.yaml file:
dependencies:
  fllama:
    git:
      url: https://github.com/Telosnex/fllama.git
      ref: main

- Run inference:
import 'package:fllama/fllama.dart';

String latestResult = "";

final request = OpenAiRequest(
  maxTokens: 256,
  messages: [
    Message(Role.system, 'You are a chatbot.'),
    Message(Role.user, messageText),
  ],
  // This seems to have no adverse effects in environments without GPU
  // support, e.g. Android and web.
  numGpuLayers: 99,
  modelPath: _modelPath!,
  mmprojPath: _mmprojPath,
  frequencyPenalty: 0.0,
  // Don't use below 1.1, LLMs without a repeat penalty
  // will repeat the same token.
  presencePenalty: 1.1,
  topP: 1.0,
  // Proportional to RAM use.
  // 4096 is a good default.
  // 2048 should be considered on devices with low RAM (<8 GB).
  // 8192 and higher can be considered on devices with high RAM (>16 GB).
  // Models are trained on <= a certain context size. Exceeding that number
  // can/will lead to completely incoherent output.
  contextSize: 2048,
  // Don't use 0.0, some models will repeat the same token.
  temperature: 0.1,
  logger: (log) {
    // ignore: avoid_print
    print('[llama.cpp] $log');
  },
);

fllamaChat(request, (response, done) {
  setState(() {
    latestResult = response;
  });
});

Web uses the package-bundled wllama/WebGPU runtime. Apps should import the package asset bridge from web/index.html:
<script type="module">
  import './assets/packages/fllama/assets/web/fllama_web_init.js';
</script>

For best performance, serve the app with cross-origin isolation enabled so the multi-threaded WebAssembly backend can run.
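Cross-origin isolation comes down to two response headers: `Cross-Origin-Opener-Policy: same-origin` and `Cross-Origin-Embedder-Policy: require-corp`. As a rough illustration, a minimal static file server in Dart that sets them might look like the sketch below. The names `crossOriginIsolationHeaders`, `contentTypeFor`, and `serveIsolated` are hypothetical, and this is not the server bundled with the example app.

```dart
import 'dart:io';

/// Headers that enable cross-origin isolation (needed for SharedArrayBuffer).
const crossOriginIsolationHeaders = <String, String>{
  'Cross-Origin-Opener-Policy': 'same-origin',
  'Cross-Origin-Embedder-Policy': 'require-corp',
};

/// Picks a content type from the file extension. Browsers expect
/// `application/wasm` for streaming WebAssembly compilation.
ContentType contentTypeFor(String path) {
  if (path.endsWith('.html')) return ContentType.html;
  if (path.endsWith('.js')) {
    return ContentType('text', 'javascript', charset: 'utf-8');
  }
  if (path.endsWith('.wasm')) return ContentType('application', 'wasm');
  if (path.endsWith('.json')) return ContentType.json;
  return ContentType.binary;
}

/// Serves static files from [root] on localhost, adding the isolation
/// headers to every response. Hypothetical helper for local testing only.
Future<HttpServer> serveIsolated(String root, {int port = 8080}) async {
  final server = await HttpServer.bind(InternetAddress.loopbackIPv4, port);
  server.listen((request) async {
    final response = request.response;
    crossOriginIsolationHeaders
        .forEach((name, value) => response.headers.set(name, value));

    final segments =
        request.uri.pathSegments.where((s) => s.isNotEmpty).toList();
    // Reject anything that could escape the root directory.
    final unsafe = segments
        .any((s) => s == '..' || s.contains('/') || s.contains(r'\'));
    if (unsafe) {
      response.statusCode = HttpStatus.forbidden;
      await response.close();
      return;
    }
    if (segments.isEmpty) segments.add('index.html');

    final file = File([root, ...segments].join(Platform.pathSeparator));
    if (!await file.exists()) {
      response.statusCode = HttpStatus.notFound;
      await response.close();
      return;
    }
    response.headers.contentType = contentTypeFor(file.path);
    await response.addStream(file.openRead());
    await response.close();
  });
  return server;
}
```

Calling `await serveIsolated('build/web')` would serve a built web app from that directory on port 8080; both the directory and the port are assumptions for illustration.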
Before publishing or after replacing the web runtime, run:
scripts/check_web_assets.sh

Three top-tier open models are in the fllama HuggingFace repo.
- Stable LM 3B is the first LLM that can handle RAG, using documents such as web pages to answer a query, on all devices.
The Mistral models come via Nous Research, which trained and fine-tuned the Mistral base models for chat to create the OpenHermes series of models.
- Mistral 7B is best on 2023 iPhones or 2024 Androids or better. It's about 2/3 the speed of Stable LM 3B and requires 5 GB of RAM.
- Mixtral 8x7B should only be considered on a premium laptop or desktop, such as an M-series MacBook or a premium desktop purchased in 2023 or later. It's about 1/3 the speed of Stable LM 3B and requires 26 GB of RAM.
Roughly: you'll need as much RAM as the model file size. If inference runs on CPU, that much regular RAM is required. If inference runs on GPU, that much GPU RAM is required.
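The RAM rule of thumb above, together with the context size guidance in the inference example's comments, can be turned into a small helper. The sketch below is only an illustration: `suggestContextSize` and `modelLikelyFits` are hypothetical names, and how an app learns the device's available memory is left out.

```dart
/// Suggests a context size from device RAM in gigabytes, following the
/// guidance in the inference example: 2048 below 8 GB, 4096 by default,
/// and 8192 above 16 GB.
int suggestContextSize(double ramGb) {
  if (ramGb < 8) return 2048; // low-RAM devices
  if (ramGb > 16) return 8192; // high-RAM devices
  return 4096; // a good default
}

/// Rough check: a model needs about as much memory as its file size.
/// [availableBytes] is regular RAM for CPU inference, or GPU memory for
/// GPU inference.
bool modelLikelyFits({
  required int modelFileBytes,
  required int availableBytes,
}) {
  return modelFileBytes <= availableBytes;
}
```

The suggested value is still capped by the model itself: exceeding the context size a model was trained on can produce incoherent output regardless of available RAM.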
HuggingFace, among many things, can be thought of as GitHub for AI models. You can download a model from HuggingFace and use it with fllama. To get a download URL at runtime, see below.
String getHuggingFaceUrl({
  required String repoId,
  required String filename,
  String? revision,
  String? subfolder,
}) {
  // Default values
  const String defaultEndpoint = 'https://huggingface.co';
  const String defaultRevision = 'main';
  // URI-encode the revision and filename; encode the subfolder only if provided
  final String encodedRevision =
      Uri.encodeComponent(revision ?? defaultRevision);
  final String encodedFilename = Uri.encodeComponent(filename);
  final String? encodedSubfolder =
      subfolder != null ? Uri.encodeComponent(subfolder) : null;
  // Handle subfolder if provided
  final String fullPath = encodedSubfolder != null
      ? '$encodedSubfolder/$encodedFilename'
      : encodedFilename;
  // Construct the URL
  final String url =
      '$defaultEndpoint/$repoId/resolve/$encodedRevision/$fullPath';
  return url;
}

FLLAMA is licensed under a dual-license model.
The code as-is on GitHub is licensed under GPL v2. That requires distribution of the integrating app's source code, and this is unlikely to be desirable for commercial entities. See LICENSE.md.
Commercial licenses are also available. Contact info@telosnex.com. Expect very fair terms: our intent is to charge only entities that have a launched app, are making a lot of money, and use FLLAMA as a core dependency. The base agreement is here: https://github.com/lawndoc/dual-license-templates/blob/main/pdf/Basic-Yearly.pdf
fllama has two llama.cpp delivery paths:
- Native platforms compile the vendored source tree in this repo:
  src/llama.cpp
- Web uses prebuilt wllama artifacts bundled in this repo:
  assets/web/wllama/index.js
  assets/web/wllama/wasm/wllama.js
  assets/web/wllama/wasm/wllama.wasm

Those artifacts are built from the separate wllama checkout. In the Telosnex dev workspace, that source is usually:
/Users/jpo/dev/ngxson_wllama/llama.cpp
That means web's underlying llama.cpp is whatever the wllama checkout contains at build time. To keep native and web matched, refresh fllama's native vendored tree from wllama's llama.cpp tree, then rebuild/copy the web artifacts.
export FLLAMA=/Users/jpo/dev/fllama
export WLLAMA=/Users/jpo/dev/ngxson_wllama
cd "$FLLAMA"
git status --short
# Dry run first. Review every changed/deleted file.
# -i itemizes each change; without it (or -v) a dry run prints nothing to review.
rsync -nrci --delete \
  --exclude='.git/' \
  --exclude='build/' \
  --exclude='FLLAMA_LLAMA_CPP_DROP.txt' \
  "$WLLAMA/llama.cpp/" \
  "$FLLAMA/src/llama.cpp/"
# If the dry run is sane, do the copy.
rsync -a --delete \
  --exclude='.git/' \
  --exclude='build/' \
  --exclude='FLLAMA_LLAMA_CPP_DROP.txt' \
  "$WLLAMA/llama.cpp/" \
  "$FLLAMA/src/llama.cpp/"

Then update the native build metadata/documentation:
- `src/CMakeLists.txt`: update `LLAMA_BUILD_COMMIT`.
- `src/llama.cpp/FLLAMA_LLAMA_CPP_DROP.txt`: record where the drop came from and why.
If the wllama llama.cpp directory is a real Git checkout, use:
git -C "$WLLAMA/llama.cpp" rev-parse --short HEAD

If it is a copied/submodule tree without .git metadata, use the recorded source commit in FLLAMA_LLAMA_CPP_DROP.txt or the wllama submodule gitlink:
git -C "$WLLAMA" ls-files -s llama.cpp

Clean stale native build outputs before validating:
cd "$FLLAMA"
rm -rf src/build example/build
cmake -S src -B src/build -DCMAKE_BUILD_TYPE=Release
cmake --build src/build -j

Then validate through a Flutter host, for example:
cd "$FLLAMA/example"
flutter build macos --debug

After changing the wllama checkout, rebuild its generated glue, WASM, worker, and ESM bundle:
cd /Users/jpo/dev/ngxson_wllama
npm run build:glue
npx tsc --noEmit -p tsconfig.build.json
npm run build:wasm
npm run build:worker
npm run build:tsup

Copy the generated artifacts into fllama:
cd /Users/jpo/dev/fllama
cp /Users/jpo/dev/ngxson_wllama/esm/index.js assets/web/wllama/index.js
cp /Users/jpo/dev/ngxson_wllama/src/wasm/wllama.js assets/web/wllama/wasm/wllama.js
cp /Users/jpo/dev/ngxson_wllama/src/wasm/wllama.wasm assets/web/wllama/wasm/wllama.wasm

Validate the web assets:
node --check assets/web/fllama_web_init.js
scripts/check_web_assets.sh

- The web example uses wllama's server backend and requires cross-origin isolation for SharedArrayBuffer/pthreads.
- From the example directory, run debug mode with:
flutter run -d chrome --cross-origin-isolation
- Or run a release-style local build with the bundled header-setting server:
flutter build web
node web/server.js
- `flutter run -d web` is not a valid invocation because `web` is not a Flutter device id; use `chrome` for an auto-launched browser or `web-server` if you want to open the URL yourself.
- If your Flutter version does not support `--cross-origin-isolation`, pass the headers explicitly:
flutter run -d chrome \
  --web-header=Cross-Origin-Opener-Policy=same-origin \
  --web-header=Cross-Origin-Embedder-Policy=require-corp
- When changes are made to C++ bindings, run `flutter pub run ffigen --config ffigen.yaml` to make them available in Dart.
- Run `rm -rvf Podfile.lock && rm -rvf Podfile && rm -rvf Pods && flutter clean` in example/macos or example/ios when upgrading cpp files, or when getting cryptic errors about the build cache.
