A high-performance text generation engine using Trigram Markov Chains, MapDB for persistent storage, and a dynamic configuration-driven function calling system. This project demonstrates how to build a coherent text model that can perform real-world tasks through custom function tokens.
The system is built on a modular architecture that separates data storage, transition logic, and post-processing:
- Trigram Markov Engine: Unlike basic bigram models, this engine uses a 2-word context (trigrams) to predict the next word. This significantly improves coherence and grammatical consistency.
- Persistent Storage (MapDB): Transitions are stored in a MapDB BTreeMap. This allows the model to scale to massive datasets (millions of transitions) without exceeding RAM limits, as data is paged to disk.
- Function Execution Pipeline: A post-generation phase that scans generated text for "tool tokens" (e.g., `[TIME]`) and resolves them using a configuration-driven execution engine.
- Interactive GUI: A Swing-based application providing separate environments for model training (including synthetic data generation) and real-time inference with "Thought Trace" visibility.
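The trigram walk described above can be sketched in a few lines. This is a minimal, self-contained illustration (class and method names are hypothetical): it keeps transitions in an in-memory `TreeMap`, whereas the real engine persists them in a MapDB `BTreeMap` so they can be paged to disk.

```java
import java.util.*;

// Minimal in-memory sketch of the trigram engine (illustrative names, not
// the project's API). The real engine stores transitions in a MapDB BTreeMap.
public class TrigramSketch {
    // Key: 2-word prefix (e.g. "the quick"); value: next-word counts.
    static final Map<String, Map<String, Integer>> transitions = new TreeMap<>();

    static void train(String text) {
        String[] w = text.split("\\s+");
        for (int i = 0; i + 2 < w.length; i++) {
            String prefix = w[i] + " " + w[i + 1];
            transitions.computeIfAbsent(prefix, k -> new HashMap<>())
                       .merge(w[i + 2], 1, Integer::sum);
        }
    }

    // Weighted random walk: pick the next word proportionally to its count.
    static String next(String prefix, Random rng) {
        Map<String, Integer> candidates = transitions.get(prefix);
        if (candidates == null) return null;
        int total = candidates.values().stream().mapToInt(Integer::intValue).sum();
        int roll = rng.nextInt(total);
        for (Map.Entry<String, Integer> e : candidates.entrySet()) {
            roll -= e.getValue();
            if (roll < 0) return e.getKey();
        }
        return null; // unreachable: roll is always below the total count
    }
}
```

Because the prefix is two words rather than one, each predicted word must be plausible given a longer context, which is what gives trigram models their improved coherence over bigram models.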
The engine supports extensible tokens that trigger real-world actions during the generation phase. These are defined in `src/main/resources/functions.yml`.
- Config-Driven: Define new tokens, their types, input parameters, and output templates without changing core code.
- Parameter Support: Tokens can accept arguments, e.g., `[ROLL_DICE:min=1,max=20]`.
- Templated Output: Resolve results into human-readable strings using `${variable}` syntax.
- `datetime`: Returns the current time or date (e.g., `[TIME]`, `[DATE]`). Supports custom format strings.
- `random`: Generates random numbers (e.g., `[ROLL_DICE]`). Supports `min` and `max` parameters.
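A rough sketch of what these two built-in handler types might do, using only the JDK's `java.time` and `java.util.Random`. The method names (`applyRandom`, `applyDatetime`) and the `format` argument key are illustrative assumptions, not the project's actual API:

```java
import java.time.LocalTime;
import java.time.format.DateTimeFormatter;
import java.util.Map;
import java.util.Random;

// Hypothetical sketch of the two built-in handler types; names are
// illustrative, not the project's real code.
public class BuiltinHandlers {
    // "random": inclusive [min, max] roll, defaulting to 1..6 as in functions.yml.
    static int applyRandom(Map<String, String> args, Random rng) {
        int min = Integer.parseInt(args.getOrDefault("min", "1"));
        int max = Integer.parseInt(args.getOrDefault("max", "6"));
        return min + rng.nextInt(max - min + 1);
    }

    // "datetime": formats the current time with an optional custom pattern.
    static String applyDatetime(Map<String, String> args) {
        String pattern = args.getOrDefault("format", "HH:mm");
        return LocalTime.now().format(DateTimeFormatter.ofPattern(pattern));
    }
}
```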
```yaml
functions:
  - token: "ROLL_DICE"
    type: "random"
    input:
      properties:
        min: { type: "integer", default: 1 }
        max: { type: "integer", default: 6 }
    template: "${result}"
```

The Interactive Chat GUI includes a "Thought Trace" panel that provides a real-time window into the engine's decision-making process:
- Prefix State: Displays the current 2-word context used for the next prediction.
- Transition Probabilities: Lists all possible next words discovered in the model and their relative weights/probabilities.
- Resolution Logs: Shows the internal steps of the `FunctionExecutor` as it identifies and resolves tokens, including argument parsing and final string replacement.
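The token-resolution pass logged in the trace can be sketched with a regex scan over the generated text. The pattern, class, and method names below are assumptions for illustration, not the `FunctionExecutor`'s actual implementation:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch: find [TOKEN:key=val,...] markers, parse the arguments,
// and splice each handler's result back into the text.
public class TokenResolver {
    static final Pattern TOKEN = Pattern.compile("\\[([A-Z_]+)(?::([^\\]]*))?\\]");

    static String resolve(String text, BiFunction<String, Map<String, String>, String> handler) {
        Matcher m = TOKEN.matcher(text);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            // Parse "min=1,max=20" style arguments into a map (empty if absent).
            Map<String, String> args = new HashMap<>();
            if (m.group(2) != null) {
                for (String pair : m.group(2).split(",")) {
                    String[] kv = pair.split("=", 2);
                    if (kv.length == 2) args.put(kv[0].trim(), kv[1].trim());
                }
            }
            // Replace the token with whatever the handler produces.
            m.appendReplacement(out, Matcher.quoteReplacement(handler.apply(m.group(1), args)));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Running the scan after generation, rather than during it, keeps the Markov walk itself pure: the model only ever learns and emits plain tokens.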
- Synthetic Data Injection: A dedicated `SyntheticDataFactory` programmatically generates training samples to teach the model how to use function tokens in context (e.g., answering "What time is it?" with the `[TIME]` token).
- Threaded Inference: Generation occurs in background threads to keep the UI responsive, even for long sequences.
- Flexible Training: Import text files or use the built-in synthetic data generator to populate the model.
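Synthetic data injection amounts to feeding the trainer prompt/answer pairs that embed the tokens. The templates below are a hedged illustration of the idea, not the `SyntheticDataFactory`'s real corpus:

```java
import java.util.List;

// Illustrative synthetic samples: natural-language prompts paired with answers
// that embed function tokens, so the trigram model learns to emit the tokens
// in context. These strings are examples, not the project's actual templates.
public class SyntheticSamples {
    static List<String> samples() {
        return List.of(
            "What time is it ? It is [TIME] right now .",
            "Can you tell me the time ? Sure , it is [TIME] .",
            "What is today's date ? Today is [DATE] ."
        );
    }
}
```

Because the token appears in a consistent 2-word context ("it is", "Today is"), a handful of such samples is enough to bias the trigram walk toward emitting it at the right moment.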
- Java 17+
- Maven 3.8+
Use the provided batch file to build and launch the application:
```
run.bat
```

- Option 1 (GUI Mode): Launches the Swing-based chat interface. Recommended for most users.
- Option 2 (CLI Mode): Launches an interactive command-line session.
Train the model:

```
mvn exec:java -Dexec.mainClass="com.markov.Application" -Dexec.args="--train data"
```

Generate text with custom seeds:

```
mvn exec:java -Dexec.mainClass="com.markov.Application" -Dexec.args="--generate 50 --seed 'What is'"
```

Launch GUI directly:

```
mvn exec:java -Dexec.mainClass="com.markov.Application" -Dexec.args="--gui"
```

- `com.markov.gui`: Swing components (`MainFrame`, `InferencePanel`, `TrainingPanel`, `StatusBar`).
- `MarkovTrainer`: Processes text into trigram transitions and stores them in MapDB.
- `MarkovGenerator`: Executes the weighted random walk using a 2-word context.
- `FunctionExecutor`: Detects and resolves dynamic tokens via `functions.yml`.
- `SyntheticDataFactory`: Programmatically generates training samples for tool-use scenarios.
- `ModelStore`: Manages the persistent MapDB storage layer.
The generated text is probabilistic and evolves as you train the model with more data.