A high-performance text generation engine using Trigram Markov Chains, MapDB for persistent storage, and a dynamic configuration-driven function calling system. This project demonstrates how to build a coherent text model that can perform real-world tasks through custom function tokens.
The system is built on a modular architecture that separates data storage, transition logic, and post-processing:
- Trigram Markov Engine: Unlike basic bigram models, this engine uses a 2-word context (trigrams) to predict the next word. This significantly improves coherence and grammatical consistency.
- Persistent Storage (MapDB): Transitions are stored in a MapDB BTreeMap. This allows the model to scale to massive datasets (millions of transitions) without exceeding RAM limits, as data is paged to disk.
- Function Execution Pipeline: A post-generation phase that scans generated text for "tool tokens" (e.g., `[TIME]`) and resolves them using a configuration-driven execution engine.
- Interactive GUI: A Swing-based application providing separate environments for model training (including synthetic data generation) and real-time inference with "Thought Trace" visibility.
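The trigram walk described above can be sketched in a few lines. This is a minimal, self-contained illustration (class and method names are hypothetical): it keeps transitions in an in-memory `TreeMap`, whereas the real engine persists them in a MapDB `BTreeMap` so they can be paged to disk.

```java
import java.util.*;

// Minimal in-memory sketch of the trigram engine (illustrative names, not
// the project's API). The real engine stores transitions in a MapDB BTreeMap.
public class TrigramSketch {
    // Key: 2-word prefix (e.g. "the quick"); value: next-word counts.
    static final Map<String, Map<String, Integer>> transitions = new TreeMap<>();

    static void train(String text) {
        String[] w = text.split("\\s+");
        for (int i = 0; i + 2 < w.length; i++) {
            String prefix = w[i] + " " + w[i + 1];
            transitions.computeIfAbsent(prefix, k -> new HashMap<>())
                       .merge(w[i + 2], 1, Integer::sum);
        }
    }

    // Weighted random walk: pick the next word proportionally to its count.
    static String next(String prefix, Random rng) {
        Map<String, Integer> candidates = transitions.get(prefix);
        if (candidates == null) return null;
        int total = candidates.values().stream().mapToInt(Integer::intValue).sum();
        int roll = rng.nextInt(total);
        for (Map.Entry<String, Integer> e : candidates.entrySet()) {
            roll -= e.getValue();
            if (roll < 0) return e.getKey();
        }
        return null; // unreachable: roll is always below the total count
    }
}
```

Because the prefix is two words rather than one, each predicted word must be plausible given a longer context, which is what gives trigram models their improved coherence over bigram models.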
The engine supports extensible tokens that trigger real-world actions during the generation phase. These are defined in `src/main/resources/functions.yml`.
- Config-Driven: Define new tokens, their types, input parameters, and output templates without changing core code.
- Parameter Support: Tokens can accept arguments, e.g., `[ROLL_DICE:min=1,max=20]`.
- Templated Output: Resolve results into human-readable strings using `${variable}` syntax.
- `datetime`: Returns the current time or date (e.g., `[TIME]`, `[DATE]`). Supports custom format strings.
- `random`: Generates random numbers (e.g., `[ROLL_DICE]`). Supports `min` and `max` parameters.
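A rough sketch of what these two built-in handler types might do, using only the JDK's `java.time` and `java.util.Random`. The method names (`applyRandom`, `applyDatetime`) and the `format` argument key are illustrative assumptions, not the project's actual API:

```java
import java.time.LocalTime;
import java.time.format.DateTimeFormatter;
import java.util.Map;
import java.util.Random;

// Hypothetical sketch of the two built-in handler types; names are
// illustrative, not the project's real code.
public class BuiltinHandlers {
    // "random": inclusive [min, max] roll, defaulting to 1..6 as in functions.yml.
    static int applyRandom(Map<String, String> args, Random rng) {
        int min = Integer.parseInt(args.getOrDefault("min", "1"));
        int max = Integer.parseInt(args.getOrDefault("max", "6"));
        return min + rng.nextInt(max - min + 1);
    }

    // "datetime": formats the current time with an optional custom pattern.
    static String applyDatetime(Map<String, String> args) {
        String pattern = args.getOrDefault("format", "HH:mm");
        return LocalTime.now().format(DateTimeFormatter.ofPattern(pattern));
    }
}
```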
```yaml
functions:
  - token: "ROLL_DICE"
    type: "random"
    input:
      properties:
        min: { type: "integer", default: 1 }
        max: { type: "integer", default: 6 }
    template: "${result}"
```

The Interactive Chat GUI includes a "Thought Trace" panel that provides a real-time window into the engine's decision-making process:
- Prefix State: Displays the current 2-word context used for the next prediction.
- Transition Probabilities: Lists all possible next words discovered in the model and their relative weights/probabilities.
- Resolution Logs: Shows the internal steps of the `FunctionExecutor` as it identifies and resolves tokens, including argument parsing and final string replacement.
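The token-resolution pass logged in the trace can be sketched with a regex scan over the generated text. The pattern, class, and method names below are assumptions for illustration, not the `FunctionExecutor`'s actual implementation:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch: find [TOKEN:key=val,...] markers, parse the arguments,
// and splice each handler's result back into the text.
public class TokenResolver {
    static final Pattern TOKEN = Pattern.compile("\\[([A-Z_]+)(?::([^\\]]*))?\\]");

    static String resolve(String text, BiFunction<String, Map<String, String>, String> handler) {
        Matcher m = TOKEN.matcher(text);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            // Parse "min=1,max=20" style arguments into a map (empty if absent).
            Map<String, String> args = new HashMap<>();
            if (m.group(2) != null) {
                for (String pair : m.group(2).split(",")) {
                    String[] kv = pair.split("=", 2);
                    if (kv.length == 2) args.put(kv[0].trim(), kv[1].trim());
                }
            }
            // Replace the token with whatever the handler produces.
            m.appendReplacement(out, Matcher.quoteReplacement(handler.apply(m.group(1), args)));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Running the scan after generation, rather than during it, keeps the Markov walk itself pure: the model only ever learns and emits plain tokens.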
- Synthetic Data Injection: A dedicated `SyntheticDataFactory` programmatically generates training samples to teach the model how to use function tokens in context (e.g., answering "What time is it?" with the `[TIME]` token).
- Threaded Inference: Generation occurs in background threads to keep the UI responsive, even for long sequences.
- Flexible Training: Import text files or use the built-in synthetic data generator to populate the model.
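Synthetic data injection amounts to feeding the trainer prompt/answer pairs that embed the tokens. The templates below are a hedged illustration of the idea, not the `SyntheticDataFactory`'s real corpus:

```java
import java.util.List;

// Illustrative synthetic samples: natural-language prompts paired with answers
// that embed function tokens, so the trigram model learns to emit the tokens
// in context. These strings are examples, not the project's actual templates.
public class SyntheticSamples {
    static List<String> samples() {
        return List.of(
            "What time is it ? It is [TIME] right now .",
            "Can you tell me the time ? Sure , it is [TIME] .",
            "What is today's date ? Today is [DATE] ."
        );
    }
}
```

Because the token appears in a consistent 2-word context ("it is", "Today is"), a handful of such samples is enough to bias the trigram walk toward emitting it at the right moment.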
- Java 17+
- Maven 3.8+
Use the provided batch file to build and launch the application:
```
run.bat
```

- Option 1 (GUI Mode): Launches the Swing-based chat interface. Recommended for most users.
- Option 2 (CLI Mode): Launches an interactive command-line session.
Train the model:

```
mvn exec:java -Dexec.mainClass="com.markov.Application" -Dexec.args="--train data"
```

Generate text with custom seeds:

```
mvn exec:java -Dexec.mainClass="com.markov.Application" -Dexec.args="--generate 50 --seed 'What is'"
```

Launch GUI directly:

```
mvn exec:java -Dexec.mainClass="com.markov.Application" -Dexec.args="--gui"
```

- `com.markov.gui`: Swing components (`MainFrame`, `InferencePanel`, `TrainingPanel`, `StatusBar`).
- `MarkovTrainer`: Processes text into trigram transitions and stores them in MapDB.
- `MarkovGenerator`: Executes the weighted random walk using a 2-word context.
- `FunctionExecutor`: Detects and resolves dynamic tokens via `functions.yml`.
- `SyntheticDataFactory`: Programmatically generates training samples for tool-use scenarios.
- `ModelStore`: Manages the persistent MapDB storage layer.
The generated text is probabilistic and evolves as you train the model with more data.