
Everywhere: web, iOS, macOS, Android, Windows, Linux.
Fast: generates text faster than average reading speed on all platforms except web.
Private: no network connection, server, or cloud required.
Forward compatible: any model compatible with llama.cpp. (So, every model.)
Full OpenAI compatibility: chat messages, multimodal/image support via LLaVA models, and function calling (outputs can be constrained to valid JSON based on a JSON schema).
Bare metal interface: call LLMs without being constrained to a chat implementation.
Use with FONNX for RAG
- Combine with FONNX to have a full retrieval-augmented generation stack available on all platforms.

CI builds

Platform	Status
Android
iOS
Linux
macOS
Web
Windows

Recent updates

26 Feb 2024

Web is now based on WASM compiled from FLLAMA itself, rather than just llama.cpp, guaranteeing native/web parity.
Tokenizing strings with the model's tokenizer is 1000x faster, because the model is now cached. It went from roughly 300 ms (native only) to roughly 0.2 ms on web and 0.00001 ms on native. This makes it practical to calculate which strings will fit in context for a given context size.
Method renames for consistency, correctness, and clarity. (e.g. remove *Async from names, because all methods are async; rename fllamaChatCompletionAsync to fllamaChat)
Documented methods and updated the example. TL;DR: use fllamaChat unless you're doing something funny with LLMs that isn't user-facing; the non-chat call acts like a true text completion engine instead of a chatbot.
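
With tokenization this cheap, trimming chat history to the context window becomes a simple loop. The sketch below is not part of fllama: `fitToContext` and its parameters are hypothetical names, and it assumes the caller has already obtained a token count for each message from the model's tokenizer. It keeps the most recent messages whose counts fit within the context size, minus a reserve for the reply.

```dart
/// Returns the most recent messages that fit in the context window.
///
/// Hypothetical helper, not a fllama API. [tokenCounts] must hold one
/// precomputed token count per entry of [messages], in the same order
/// (oldest first).
List<String> fitToContext({
  required List<String> messages,
  required List<int> tokenCounts,
  required int contextSize,
  int reservedForReply = 256,
}) {
  if (messages.length != tokenCounts.length) {
    throw ArgumentError('messages and tokenCounts must have equal length');
  }
  // Tokens left for history after setting aside room for the reply.
  var budget = contextSize - reservedForReply;
  final kept = <String>[];
  // Walk from newest to oldest and stop at the first message that
  // no longer fits, so the kept history stays contiguous.
  for (var i = messages.length - 1; i >= 0; i--) {
    if (tokenCounts[i] > budget) break;
    budget -= tokenCounts[i];
    kept.add(messages[i]);
  }
  // Restore chronological order.
  return kept.reversed.toList();
}
```

Stopping at the first message that does not fit, rather than skipping it and continuing, is a deliberate choice here: it avoids handing the model a history with holes in it.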

Integrate

Add this to your package's pubspec.yaml file:

dependencies:
  fllama:
    git:
      url: https://github.com/Telosnex/fllama.git
      ref: main

Run inference:

import 'package:fllama/fllama.dart';

String latestResult = "";

final request = OpenAiRequest(
  maxTokens: 256,
  messages: [
    Message(Role.system, 'You are a chatbot.'),
    Message(Role.user, messageText),
  ],
  numGpuLayers: 99, /* this seems to have no adverse effects in environments w/o GPU support, ex. Android and web */
  modelPath: _modelPath!,
  mmprojPath: _mmprojPath,
  frequencyPenalty: 0.0,
  // Don't use below 1.1, LLMs without a repeat penalty
  // will repeat the same token.
  presencePenalty: 1.1,
  topP: 1.0,
  // Proportional to RAM use.
  // 4096 is a good default.
  // 2048 should be considered on devices with low RAM (<8 GB).
  // 8192 and higher can be considered on devices with high RAM (>16 GB).
  // Models are trained on <= a certain context size. Exceeding that # can/will lead to completely incoherent output.
  contextSize: 2048,
  // Don't use 0.0, some models will repeat the same token.
  temperature: 0.1,
  logger: (log) {
    // ignore: avoid_print
    print('[llama.cpp] $log');
  },
);
fllamaChat(request, (response, done) {
    setState(() {
      latestResult = response;
    });
});

Tips & Tricks

Web

Web uses the package-bundled wllama/WebGPU runtime. Apps should import the package asset bridge from web/index.html:

<script type="module">
  import './assets/packages/fllama/assets/web/fllama_web_init.js';
</script>

For best performance, serve the app with cross-origin isolation enabled so the multi-threaded WebAssembly backend can run.
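
Cross-origin isolation comes down to two response headers, the same two passed via --web-header in the Web development section. As a minimal sketch, assuming a local static server written with dart:io is acceptable for development (the repo's own header-setting server is web/server.js), something like the following would serve a build directory with both headers set. The names `serveIsolated` and `contentTypeFor` are illustrative, not part of fllama.

```dart
import 'dart:io';

/// Headers that opt a page into cross-origin isolation.
const Map<String, String> crossOriginIsolationHeaders = {
  'Cross-Origin-Opener-Policy': 'same-origin',
  'Cross-Origin-Embedder-Policy': 'require-corp',
};

/// Small extension-to-MIME map; browsers expect correct types for
/// module scripts and WebAssembly.
const Map<String, String> _mimeTypes = {
  '.html': 'text/html; charset=utf-8',
  '.js': 'text/javascript; charset=utf-8',
  '.wasm': 'application/wasm',
  '.json': 'application/json',
};

String contentTypeFor(String path) {
  final dot = path.lastIndexOf('.');
  final ext = dot < 0 ? '' : path.substring(dot).toLowerCase();
  return _mimeTypes[ext] ?? 'application/octet-stream';
}

/// Serves files under [rootDir] on localhost with isolation headers.
/// Development-only sketch: no caching, no directory listings.
Future<HttpServer> serveIsolated(String rootDir, {int port = 8080}) async {
  final root = Directory(rootDir).absolute.path;
  final server = await HttpServer.bind(InternetAddress.loopbackIPv4, port);
  server.listen((HttpRequest request) async {
    final response = request.response;
    crossOriginIsolationHeaders
        .forEach((name, value) => response.headers.set(name, value));
    final relative =
        request.uri.path == '/' ? '/index.html' : request.uri.path;
    final file = File('$root$relative');
    // Reject parent-directory segments and missing files.
    if (relative.contains('..') || !await file.exists()) {
      response.statusCode = HttpStatus.notFound;
      await response.close();
      return;
    }
    response.headers
        .set(HttpHeaders.contentTypeHeader, contentTypeFor(relative));
    await response.addStream(file.openRead());
    await response.close();
  });
  return server;
}
```

After `flutter build web`, calling `serveIsolated('build/web')` and opening http://localhost:8080 should let the page report `crossOriginIsolated === true` in the browser console.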

Before publishing or after replacing the web runtime, run:

scripts/check_web_assets.sh

Recommended models

Three top-tier open models are in the fllama HuggingFace repo.

Stable LM 3B is the first LLM that can handle RAG (using documents such as web pages to answer a query) on all devices.

Mistral models via Nous Research. They trained and finetuned the Mistral base models for chat to create the OpenHermes series of models.

Mistral 7B is best on 2023 iPhones or 2024 Androids or better. It's about 2/3 the speed of Stable LM 3B and requires 5 GB of RAM.
Mixtral 8x7B should only be considered on a premium laptop or desktop, such as an M-series MacBook or a premium desktop purchased in 2023 or later. It's about 1/3 the speed of Stable LM 3B and requires 26 GB of RAM.

RAM Requirements

Roughly: you'll need as much RAM as the model file size. If inference runs on CPU, that much regular RAM is required. If inference runs on GPU, that much GPU RAM is required.
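
The rule of thumb above, together with the context-size guidance in the Integrate example's comments, can be expressed as a pair of small helpers. This is a rough sketch: `modelLikelyFits` and `suggestContextSize` are hypothetical names, the thresholds are copied from those comments, and the check ignores the extra memory the context itself consumes.

```dart
/// Rule of thumb: a model needs roughly its file size in RAM
/// (regular RAM for CPU inference, GPU RAM for GPU inference).
/// Hypothetical helper, not a fllama API.
bool modelLikelyFits({
  required int modelFileBytes,
  required int availableRamBytes,
}) =>
    modelFileBytes <= availableRamBytes;

/// Picks a context size from total device RAM in GB, using the
/// thresholds from the Integrate example: 2048 below 8 GB,
/// 8192 above 16 GB, and 4096 otherwise.
int suggestContextSize(num deviceRamGb) {
  if (deviceRamGb < 8) return 2048;
  if (deviceRamGb > 16) return 8192;
  return 4096;
}
```

Whatever value this suggests still has to stay at or below the context size the model was trained on.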

Download files from HuggingFace at runtime

HuggingFace, among many things, can be thought of as GitHub for AI models. You can download a model from HuggingFace and use it with fllama. To get a download URL at runtime, see below.

String getHuggingFaceUrl(
    {required String repoId,
    required String filename,
    String? revision,
    String? subfolder}) {
  // Default values
  const String defaultEndpoint = 'https://huggingface.co';
  const String defaultRevision = 'main';

  // Ensure the revision and subfolder are not null and are URI encoded
  final String encodedRevision =
      Uri.encodeComponent(revision ?? defaultRevision);
  final String encodedFilename = Uri.encodeComponent(filename);
  final String? encodedSubfolder =
      subfolder != null ? Uri.encodeComponent(subfolder) : null;

  // Handle subfolder if provided
  final String fullPath = encodedSubfolder != null
      ? '$encodedSubfolder/$encodedFilename'
      : encodedFilename;

  // Construct the URL
  final String url =
      '$defaultEndpoint/$repoId/resolve/$encodedRevision/$fullPath';

  return url;
}
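
Once a URL is available, the file still has to be fetched and stored. The following is a minimal sketch for native platforms, assuming dart:io is available (it is not on web) and that streaming straight to disk is acceptable. `downloadToFile` and `partialPathFor` are hypothetical names, not fllama APIs. The file is written to a temporary `.part` path and renamed only after the transfer completes, so an interrupted download is never mistaken for a usable model.

```dart
import 'dart:io';

/// Temporary path used while a download is still in flight.
String partialPathFor(String destPath) => '$destPath.part';

/// Streams [url] to [destPath] without buffering the whole file in
/// memory. Throws an [HttpException] on any non-200 response.
Future<File> downloadToFile(Uri url, String destPath) async {
  final client = HttpClient();
  try {
    // HttpClient follows redirects for GET requests by default.
    final request = await client.getUrl(url);
    final response = await request.close();
    if (response.statusCode != HttpStatus.ok) {
      // Discard the body so the connection can be released.
      await response.drain<void>();
      throw HttpException(
        'Unexpected HTTP status ${response.statusCode}',
        uri: url,
      );
    }
    final partial = File(partialPathFor(destPath));
    // pipe() writes every chunk to the sink and closes it when done.
    await response.pipe(partial.openWrite());
    return await partial.rename(destPath);
  } finally {
    client.close();
  }
}
```

A call would look like `downloadToFile(Uri.parse(getHuggingFaceUrl(repoId: '<owner>/<repo>', filename: '<file>.gguf')), '/path/to/model.gguf')`, with the placeholders replaced by a real repository and file.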

Licensing

FLLAMA is licensed under a dual-license model.

The code as-is on GitHub is licensed under GPL v2. That requires distribution of the integrating app's source code, and this is unlikely to be desirable for commercial entities. See LICENSE.md.

Commercial licenses are also available. Contact info@telosnex.com. Expect very fair terms: our intent is to charge only entities that have a launched app, are making a lot of money, and have FLLAMA as a core dependency. The base agreement is here: https://github.com/lawndoc/dual-license-templates/blob/main/pdf/Basic-Yearly.pdf

llama.cpp source layout and update process

fllama has two llama.cpp delivery paths:

Native platforms compile the vendored source tree in this repo:
```
src/llama.cpp
```
Web uses prebuilt wllama artifacts bundled in this repo:
```
assets/web/wllama/index.js
assets/web/wllama/wasm/wllama.js
assets/web/wllama/wasm/wllama.wasm
```
Those artifacts are built from the separate wllama checkout. In the Telosnex dev workspace, that source is usually:
```
/Users/jpo/dev/ngxson_wllama/llama.cpp
```

That means web's underlying llama.cpp is whatever the wllama checkout contains at build time. To keep native and web matched, refresh fllama's native vendored tree from wllama's llama.cpp tree, then rebuild/copy the web artifacts.

Refresh native llama.cpp from wllama

export FLLAMA=/Users/jpo/dev/fllama
export WLLAMA=/Users/jpo/dev/ngxson_wllama

cd "$FLLAMA"
git status --short

# Dry run first. Review every changed/deleted file.
# -i (--itemize-changes) lists what would change; without -i or -v a dry run prints nothing.
rsync -nrci --delete \
  --exclude='.git/' \
  --exclude='build/' \
  --exclude='FLLAMA_LLAMA_CPP_DROP.txt' \
  "$WLLAMA/llama.cpp/" \
  "$FLLAMA/src/llama.cpp/"

# If the dry run is sane, do the copy.
rsync -a --delete \
  --exclude='.git/' \
  --exclude='build/' \
  --exclude='FLLAMA_LLAMA_CPP_DROP.txt' \
  "$WLLAMA/llama.cpp/" \
  "$FLLAMA/src/llama.cpp/"

Then update the native build metadata/documentation:

src/CMakeLists.txt: update LLAMA_BUILD_COMMIT.
src/llama.cpp/FLLAMA_LLAMA_CPP_DROP.txt: record where the drop came from and why.

If the wllama llama.cpp directory is a real Git checkout, use:

git -C "$WLLAMA/llama.cpp" rev-parse --short HEAD

If it is a copied/submodule tree without .git metadata, use the recorded source commit in FLLAMA_LLAMA_CPP_DROP.txt or the wllama submodule gitlink:

git -C "$WLLAMA" ls-files -s llama.cpp

Clean stale native build outputs before validating:

cd "$FLLAMA"
rm -rf src/build example/build
cmake -S src -B src/build -DCMAKE_BUILD_TYPE=Release
cmake --build src/build -j

Then validate through a Flutter host, for example:

cd "$FLLAMA/example"
flutter build macos --debug

Rebuild and copy the web wllama artifacts

After changing the wllama checkout, rebuild its generated glue, WASM, worker, and ESM bundle:

cd /Users/jpo/dev/ngxson_wllama
npm run build:glue
npx tsc --noEmit -p tsconfig.build.json
npm run build:wasm
npm run build:worker
npm run build:tsup

Copy the generated artifacts into fllama:

cd /Users/jpo/dev/fllama
cp /Users/jpo/dev/ngxson_wllama/esm/index.js assets/web/wllama/index.js
cp /Users/jpo/dev/ngxson_wllama/src/wasm/wllama.js assets/web/wllama/wasm/wllama.js
cp /Users/jpo/dev/ngxson_wllama/src/wasm/wllama.wasm assets/web/wllama/wasm/wllama.wasm

Validate the web assets:

node --check assets/web/fllama_web_init.js
scripts/check_web_assets.sh

Web development

The web example uses wllama's server backend and requires cross-origin isolation for SharedArrayBuffer/pthreads.
From the example directory, run debug mode with:
```
flutter run -d chrome --cross-origin-isolation
```
Or run a release-style local build with the bundled header-setting server:
```
flutter build web
node web/server.js
```
web is not a valid Flutter device id, so flutter run -d web fails. Use chrome for an auto-launched browser, or web-server if you want to open the URL yourself.

If your Flutter version does not support --cross-origin-isolation, pass the headers explicitly:

flutter run -d chrome \
  --web-header=Cross-Origin-Opener-Policy=same-origin \
  --web-header=Cross-Origin-Embedder-Policy=require-corp

FFI development

When changes are made to C++ bindings, run flutter pub run ffigen --config ffigen.yaml to make them available in Dart.
When upgrading C++ files, or when getting cryptic errors about the build cache, run rm -rvf Podfile.lock && rm -rvf Podfile && rm -rvf Pods && flutter clean in example/macos or example/ios.
