manojkumarredbus/mkLLM

Java Markov Chain Engine (Trigram)

A text generation engine built on trigram Markov chains, with MapDB for persistent storage and a configuration-driven function-calling system. The project demonstrates how a Markov text model can perform real-world tasks, such as reporting the current time, through custom function tokens.

System Architecture

The system is built on a modular architecture that separates data storage, transition logic, and post-processing:

  1. Trigram Markov Engine: Unlike a bigram model, which conditions on a single preceding word, this engine conditions on the two preceding words (a trigram model) to predict the next word. The longer context generally yields more coherent and grammatical output.
  2. Persistent Storage (MapDB): Transitions are stored in a MapDB BTreeMap. This allows the model to scale to massive datasets (millions of transitions) without exceeding RAM limits, as data is paged to disk.
  3. Function Execution Pipeline: A post-generation phase that scans generated text for "tool tokens" (e.g., [TIME]) and resolves them using a configuration-driven execution engine.
  4. Interactive GUI: A Swing-based application providing separate environments for model training (including synthetic data generation) and real-time inference with "Thought Trace" visibility.
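
To make the trigram idea in item 1 concrete, here is a minimal in-memory sketch. It is not the project's code: the class name TrigramSketch, the whitespace tokenizer, and the nested HashMap are assumptions chosen for illustration, whereas the real engine keeps its transitions in MapDB.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

/** Hypothetical in-memory trigram model; the project itself persists transitions in MapDB. */
final class TrigramSketch {
    // Maps a 2-word prefix ("w1 w2") to the counts of each observed next word.
    private final Map<String, Map<String, Integer>> transitions = new HashMap<>();

    /** Records every (w1, w2) -> w3 transition found in whitespace-separated text. */
    void train(String text) {
        String[] words = text.trim().split("\\s+");
        for (int i = 0; i + 2 < words.length; i++) {
            String prefix = words[i] + " " + words[i + 1];
            transitions.computeIfAbsent(prefix, k -> new HashMap<>())
                       .merge(words[i + 2], 1, Integer::sum);
        }
    }

    /** Picks the next word with probability proportional to its count, or null for an unseen prefix. */
    String next(String w1, String w2, Random rng) {
        Map<String, Integer> counts = transitions.get(w1 + " " + w2);
        if (counts == null) return null;
        int total = 0;
        for (int c : counts.values()) total += c;
        int r = rng.nextInt(total);
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            r -= e.getValue();
            if (r < 0) return e.getKey();
        }
        throw new IllegalStateException("unreachable: counts sum to total");
    }

    /** Walks the chain from a 2-word seed until maxWords is reached or no transition exists. */
    List<String> generate(String w1, String w2, int maxWords, Random rng) {
        List<String> out = new ArrayList<>(List.of(w1, w2));
        while (out.size() < maxWords) {
            String w3 = next(out.get(out.size() - 2), out.get(out.size() - 1), rng);
            if (w3 == null) break;
            out.add(w3);
        }
        return out;
    }
}
```

After training on "the cat sat on the mat", calling generate("the", "cat", 10, new Random()) walks the chain one word at a time and stops early once the prefix "the mat" has no recorded successor.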

Dynamic Token Resolution (Function Calling)

The engine supports extensible tokens that trigger real-world actions when the generated text is post-processed. Tokens are defined in src/main/resources/functions.yml.

Key Features

  • Config-Driven: Define new tokens, their types, input parameters, and output templates without changing core code.
  • Parameter Support: Tokens can accept arguments, e.g., [ROLL_DICE:min=1,max=20].
  • Templated Output: Resolve results into human-readable strings using ${variable} syntax.
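
As an illustration of the parameter syntax, the sketch below parses a token such as [ROLL_DICE:min=1,max=20] into a name and an argument map. The TokenParser class, the Token record, and the exact regular expression are hypothetical; the project's FunctionExecutor may parse tokens differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical parser for [NAME] and [NAME:key=value,key=value] tokens. */
final class TokenParser {
    // Assumed grammar: NAME is upper-case letters, digits, and underscores; arguments are optional.
    private static final Pattern TOKEN =
            Pattern.compile("\\[([A-Z][A-Z0-9_]*)(?::([^\\]]*))?\\]");

    record Token(String name, Map<String, String> args) {}

    /** Parses the first token in the text, or returns null if none is present. */
    static Token parseFirst(String text) {
        Matcher m = TOKEN.matcher(text);
        if (!m.find()) return null;
        Map<String, String> args = new LinkedHashMap<>();
        String raw = m.group(2);
        if (raw != null && !raw.isBlank()) {
            for (String pair : raw.split(",")) {
                String[] kv = pair.split("=", 2);
                // Pairs without '=' are silently skipped in this sketch.
                if (kv.length == 2) args.put(kv[0].trim(), kv[1].trim());
            }
        }
        return new Token(m.group(1), args);
    }
}
```

Argument values stay as strings here; converting them to the types declared in functions.yml would be a separate step.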

Supported Operations

  • datetime: Returns current time or date (e.g., [TIME], [DATE]). Supports custom format strings.
  • random: Generates random numbers (e.g., [ROLL_DICE]). Supports min and max parameters.
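
The two operation types could be implemented along these lines. This is a sketch under assumptions: the OperationSketch class and its method signatures are invented for illustration, and the fallback values 1 and 6 simply mirror the ROLL_DICE defaults in the configuration example.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Map;
import java.util.Random;

/** Hypothetical handlers for the "datetime" and "random" operation types. */
final class OperationSketch {
    /** Formats the given moment with a java.time pattern such as "HH:mm" or "yyyy-MM-dd". */
    static String datetime(LocalDateTime now, String format) {
        return now.format(DateTimeFormatter.ofPattern(format));
    }

    /** Returns a random integer in [min, max]; absent parameters fall back to 1 and 6. */
    static int random(Map<String, String> args, Random rng) {
        int min = Integer.parseInt(args.getOrDefault("min", "1"));
        int max = Integer.parseInt(args.getOrDefault("max", "6"));
        if (min > max) throw new IllegalArgumentException("min must not exceed max");
        return min + rng.nextInt(max - min + 1);
    }
}
```

Passing the clock value and the Random instance in as arguments keeps both handlers deterministic under test.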

Configuration Example (functions.yml)

functions:
  - token: "ROLL_DICE"
    type: "random"
    input:
      properties:
        min: { type: "integer", default: 1 }
        max: { type: "integer", default: 6 }
    template: "${result}"
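
The template field uses ${variable} placeholders. One way such a template might be rendered is shown below; TemplateSketch is a hypothetical helper, and leaving unknown variables untouched is a design assumption rather than documented behavior.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical renderer for ${name} placeholders in output templates. */
final class TemplateSketch {
    private static final Pattern VAR = Pattern.compile("\\$\\{([A-Za-z_][A-Za-z0-9_]*)\\}");

    /** Replaces each ${name} with its value; unknown names are left as written. */
    static String render(String template, Map<String, String> values) {
        Matcher m = VAR.matcher(template);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String value = values.getOrDefault(m.group(1), m.group());
            // quoteReplacement keeps '$' and '\' in the value from being treated as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

With the ROLL_DICE entry above, rendering "${result}" against a map that holds the rolled number under "result" yields just that number as text.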

GUI Inner Workings (Thought Trace)

The Interactive Chat GUI includes a "Thought Trace" panel that provides a real-time window into the engine's decision-making process:

  • Prefix State: Displays the current 2-word context used for the next prediction.
  • Transition Probabilities: Lists all possible next words discovered in the model and their relative weights/probabilities.
  • Resolution Logs: Shows the internal steps of the FunctionExecutor as it identifies and resolves tokens, including argument parsing and final string replacement.
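
The transition probabilities shown in the panel can be derived from raw counts by dividing each count by the total for the current prefix. A minimal sketch, assuming counts are available as a map (ProbabilitySketch is a hypothetical name, not a class from the project):

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that turns next-word counts into display probabilities. */
final class ProbabilitySketch {
    /** Converts raw next-word counts into probabilities that sum to 1 (empty input gives an empty map). */
    static Map<String, Double> normalize(Map<String, Integer> counts) {
        long total = 0;
        for (int c : counts.values()) total += c;
        Map<String, Double> probs = new LinkedHashMap<>();
        if (total == 0) return probs;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            probs.put(e.getKey(), e.getValue() / (double) total);
        }
        return probs;
    }
}
```

For example, if the prefix "the cat" was followed by "sat" three times and "ran" once, the panel would show 0.75 for "sat" and 0.25 for "ran".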

Core Features

  • Synthetic Data Injection: A dedicated SyntheticDataFactory programmatically generates training samples to teach the model how to use function tokens in context (e.g., answering "What time is it?" with the [TIME] token).
  • Threaded Inference: Generation occurs in background threads to keep the UI responsive, even for long sequences.
  • Flexible Training: Import text files or use the built-in synthetic data generator to populate the model.
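
To show what synthetic data injection might look like, the sketch below pairs question phrasings with answer patterns that contain the [TIME] token. The class name, the sample sentences, and the space before the final period (so that a whitespace tokenizer sees the token as its own word) are all assumptions; the actual SyntheticDataFactory is not shown in this README.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical generator of tool-use training lines for the [TIME] token. */
final class SyntheticSamples {
    /** Pairs each question with each answer pattern to produce plain-text training lines. */
    static List<String> timeSamples() {
        List<String> questions = List.of("What time is it?", "Can you tell me the time?");
        List<String> answers = List.of("The time is [TIME] .", "It is currently [TIME] .");
        List<String> samples = new ArrayList<>();
        for (String q : questions) {
            for (String a : answers) {
                samples.add(q + " " + a);
            }
        }
        return samples;
    }
}
```

Training on many such lines raises the count of transitions that lead from time-related questions to the [TIME] token, which is what lets the post-processing step fire in the right context.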

Technical Requirements

  • Java 17+
  • Maven 3.8+

How to Run

Quick Start

Use the provided Windows batch file to build and launch the application:

run.bat

The script offers two modes:

  • Option 1 (GUI Mode): Launches the Swing-based chat interface. Recommended for most users.
  • Option 2 (CLI Mode): Launches an interactive command-line session.

Advanced Usage (Maven)

Train the model:

mvn exec:java -Dexec.mainClass="com.markov.Application" -Dexec.args="--train data"

Generate text with custom seeds:

mvn exec:java -Dexec.mainClass="com.markov.Application" -Dexec.args="--generate 50 --seed 'What is'"

Launch GUI directly:

mvn exec:java -Dexec.mainClass="com.markov.Application" -Dexec.args="--gui"

Project Structure

  • com.markov.gui: Swing components (MainFrame, InferencePanel, TrainingPanel, StatusBar).
  • MarkovTrainer: Processes text into trigram transitions and stores them in MapDB.
  • MarkovGenerator: Executes the weighted random walk using a 2-word context.
  • FunctionExecutor: Detects and resolves dynamic tokens via functions.yml.
  • SyntheticDataFactory: Programmatically generates training samples for tool-use scenarios.
  • ModelStore: Manages the persistent MapDB storage layer.
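
One way to lay out transitions in a sorted key-value store is to flatten each (prefix, next word) pair into a single composite key, so that all successors of a prefix occupy one contiguous key range. The sketch below uses java.util.TreeMap as a stand-in for a disk-backed sorted map. The key layout and the FlatTransitionStore class are assumptions; the layout ModelStore actually uses is not described here.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Hypothetical flat layout for transitions, using TreeMap in place of a persistent sorted map. */
final class FlatTransitionStore {
    // Assumed key layout: "w1 w2" + '\t' + nextWord -> count. Words must not contain whitespace.
    private static final char SEP = '\t';
    private final NavigableMap<String, Integer> counts = new TreeMap<>();

    /** Adds one observation of the transition (w1, w2) -> next. */
    void increment(String w1, String w2, String next) {
        counts.merge(w1 + " " + w2 + SEP + next, 1, Integer::sum);
    }

    /** Returns next-word counts for a prefix by scanning one contiguous key range. */
    Map<String, Integer> successors(String w1, String w2) {
        String lo = w1 + " " + w2 + SEP;
        String hi = w1 + " " + w2 + (char) (SEP + 1); // first possible key past the range
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, Integer> e : counts.subMap(lo, true, hi, false).entrySet()) {
            result.put(e.getKey().substring(lo.length()), e.getValue());
        }
        return result;
    }
}
```

Because a lookup touches only the keys for one prefix, a disk-backed sorted map would need to page in only that range rather than the whole model.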

Generation is probabilistic, so output varies between runs and changes as you train the model on more data.
