Skip to content

Decentralised-AI/cross-platform-llm-client

 
 

Repository files navigation

PrivateLM

Live Web App

A production-ready, cross-platform AI chat client built with Flutter. It unifies local on-device LLM inference (Android) with cloud API access, giving users full control over how their models run.

Image generation tested on Moto G71 (Snapdragon), Oneplus 10r (Mediatek), Pixel 6A (Tensor), Poco F1 (Snapdragon), Samsung s23 (Snapdragon) 4 steps fast Image generation tested on Moto G71 (Snapdragon), Oneplus 10r (Mediatek), Pixel 6A (Tensor), Poco F1 (Snapdragon), Samsung s23 (Snapdragon) 4 steps fast

Generated on pixel 6 with 20 step Generated on pixel 6 with 20 step


What It Does

  • Local Inference on Android — Download and run GGUF models directly on your phone using GPU-accelerated inference (Vulkan). No internet required after download.
  • Cloud API Fallback — Seamlessly switch to OpenAI, Anthropic, Google Gemini, or Kimi (Moonshot AI) when you need more power or are on unsupported platforms.
  • Multimodal Chat — Send text and images in conversations. Vision support works with both local models (Qwen2-VL) and cloud providers.
  • Persistent Sessions — All chats, tasks, and settings are stored locally via Hive. Nothing leaves your device unless you explicitly choose cloud mode.
  • Background Services — Firebase Cloud Messaging integration for push updates and background task handling.
  • Smart Auto-Configuration — On first launch, the app detects your device's RAM and recommends optimal context size and token limits automatically.
  • Task Management — A dedicated task view for structured AI-assisted workflows alongside free-form chat.

Technical Architecture

Stack

  • Framework: Flutter 3.x (Dart >=3.3.0)
  • State Management: GetX
  • Local Storage: Hive
  • Networking: Dio + package:http
  • Background Execution: flutter_background_service + flutter_local_notifications
  • Push Notifications: Firebase Core + Firebase Messaging

Inference Pipeline

┌─────────────────────────────────────────────────────────────┐
│                        UI Layer                              │
│   ChatView / TaskView / ModelView / SettingsView            │
└──────────────────────────┬──────────────────────────────────┘
                           │
┌──────────────────────────▼──────────────────────────────────┐
│                    Controllers (GetX)                        │
│   ChatController · TaskController · ModelController         │
│   SettingsController · HomeController                       │
└──────────────────────────┬──────────────────────────────────┘
                           │
┌──────────────────────────▼──────────────────────────────────┐
│                      Services                                │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────┐ │
│  │ InferenceService│  │  CloudService   │  │DownloadSvc  │ │
│  │  (local GGUF)   │  │ (OpenAI/Claude/ │  │ (model dl)  │ │
│  │                 │  │  Gemini/Kimi)   │  │             │ │
│  └─────────────────┘  └─────────────────┘  └─────────────┘ │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────┐ │
│  │  HiveService    │  │ DeviceInfoSvc   │  │ExecutionSvc │ │
│  │  ( persistence) │  │  (RAM/GPU tier) │  │ (bg tasks)  │ │
│  └─────────────────┘  └─────────────────┘  └─────────────┘ │
└─────────────────────────────────────────────────────────────┘

Local Inference (Android)

The app uses llama_flutter_android, a custom Flutter plugin wrapping llama.cpp for ARM64 devices. At runtime it:

  1. Detects GPU capabilities via Vulkan to determine offload layers.
  2. Selects thread count based on device tier (ultra / high / mid / low).
  3. Loads the GGUF model with progress streaming.
  4. Generates tokens via generateChat() with native chat-template support (ChatML, Llama-3, Gemma, Phi).
  5. Falls back to manual prompt construction if native templates fail.

Idle detection (5s) and hard timeouts (180s) keep the UX responsive even on underpowered hardware.

Cloud Inference

CloudService normalizes four different API shapes into a single interface:

  • OpenAI — standard /v1/chat/completions
  • Anthropic — Messages API with separate system param
  • Google GeminigenerateContent with inline image base64
  • Kimi — OpenAI-compatible endpoint from Moonshot AI

API keys are stored in Hive and never transmitted anywhere except to the provider's endpoint.

Cross-Platform Abstraction

Local inference is conditionally compiled:

  • Androidinference_android.dart (full llama.cpp engine)
  • Webinference_stub.dart (cloud-only, local coming soon)
  • iOSinference_android.dart (full llama.cpp engine via Metal GPU)

The InferenceService exposes supportsLocalInference so the UI can hide local-model UI on unsupported platforms.


Supported Platforms

Platform Local Inference Cloud APIs Notes
Android ✅ Yes ✅ Yes CPU offload via NEON; minSdk 28
iOS ✅ Yes ✅ Yes Metal GPU acceleration
Web ❌ No ✅ Yes Cloud-only (local coming soon)

iOS / iPad

The iPad release is distributed as a standalone ZIP package for sideloading. Download the latest PrivateLM-iOS.zip from the Releases page, extract it, and install the .ipa via AltStore, Sideloadly, or Xcode. iPhone support is experimental — iPad is the recommended iOS target due to RAM requirements for local models.


Build Configuration

Prerequisites

  • Flutter SDK >=3.3.0
  • Android SDK (API 26+)
  • JDK 17
  • NDK (bundled with Android SDK)

Android

flutter pub get
cd android
./gradlew assembleDebug   # or assembleRelease

For release builds you should configure your own signing in android/app/build.gradle.kts:

buildTypes {
    release {
        signingConfig = signingConfigs.getByName("release")
        isMinifyEnabled = true
        isShrinkResources = true
        proguardFiles(
            getDefaultProguardFile("proguard-android-optimize.txt"),
            "proguard-rules.pro"
        )
    }
}

iOS

flutter pub get
cd ios
pod install
flutter build ios

Web

flutter pub get
flutter build web --release

License

MIT — see LICENSE for details.

About

A unified cross-platform AI client supporting seamless transitions between standard cloud APIs and on-device, offline execution of custom and uncensored language models.

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages

  • C++ 61.3%
  • C 18.2%
  • Cuda 8.5%
  • Metal 4.2%
  • GLSL 1.7%
  • Dart 1.5%
  • Other 4.6%