Topic-Lode is a production-grade synthetic data generation pipeline designed to create high-quality, technically accurate datasets for LLM fine-tuning. It features two distinct modes: Knowledge Mining for deep technical Q&A and Agentic Mode for tool-use/function-calling training (optimized for Open WebUI).
- Hierarchical Generation: Explores subjects by generating sub-topics (breadth) and then deep technical questions (depth).
- Knowledge Mining Mode: Scrapes real-world technical documentation with a multi-stage fallback (DuckDuckGo -> Wikipedia -> Synthetic LLM Research).
- Agentic/Glaive Mode: Generates specialized datasets for tool-use, including JSON Schema tool definitions and conversational patterns compatible with Open WebUI.
- Quality First:
  - PII Sanitization: Automatic redaction of names, emails, and phone numbers.
  - Deduplication: MinHash-based near-duplicate detection for unique samples.
  - LLM-as-a-Judge: Integrated validation step that filters out low-accuracy or hallucinated answers.
- Multi-Arch Container: Built for both NVIDIA DGX (ARM64) and standard x86 environments.
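To illustrate the deduplication feature, here is a minimal, self-contained sketch of MinHash-based near-duplicate detection. This is illustrative only; the pipeline's actual implementation may use a dedicated library and different parameters.

```python
# Minimal MinHash sketch for near-duplicate detection (illustrative only;
# not the pipeline's actual implementation).
import hashlib

def shingles(text, k=3):
    """Break text into overlapping k-word shingles."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash(text, num_perm=64):
    """Build a MinHash signature: the minimum hash per seeded hash function."""
    return [
        min(
            int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingles(text)
        )
        for seed in range(num_perm)
    ]

def jaccard_estimate(sig_a, sig_b):
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = minhash("the quick brown fox jumps over the lazy dog")
b = minhash("the quick brown fox jumps over a lazy dog")   # near-duplicate
c = minhash("completely unrelated sentence about kubernetes pods")
print(jaccard_estimate(a, b) > jaccard_estimate(a, c))
```

Samples whose estimated similarity to an already-accepted sample exceeds a threshold can then be dropped.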
- Docker
- Ollama (running locally or accessible via API)
```shell
docker build -t topic-pipeline .
```

Use this mode to create technical Q&A datasets.
```shell
docker run --rm --network host -v $(pwd):/app topic-pipeline \
    --subject "Transformer Architecture" \
    --breadth 3 \
    --depth 5 \
    --output transformer_qa.parquet
```

- `--breadth`: Number of sub-topics to explore.
- `--depth`: Number of questions per sub-topic.
- `--output`: Parquet file for training.
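Conceptually, hierarchical generation expands the subject into `breadth` sub-topics, then writes `depth` questions for each. A toy sketch, with stub functions standing in for the LLM calls the real pipeline makes:

```python
# Toy sketch of hierarchical generation. The stubs below stand in for LLM
# calls; they only exist to show the breadth-then-depth control flow.
def generate_subtopics(subject, breadth):
    # Stub: the pipeline would ask an LLM to propose sub-topics here.
    return [f"{subject} - subtopic {i}" for i in range(1, breadth + 1)]

def generate_questions(subtopic, depth):
    # Stub: the pipeline would ask an LLM for deep technical questions here.
    return [f"Question {j} about {subtopic}" for j in range(1, depth + 1)]

def hierarchical_generate(subject, breadth, depth):
    samples = []
    for sub in generate_subtopics(subject, breadth):
        samples.extend(generate_questions(sub, depth))
    return samples

samples = hierarchical_generate("Transformer Architecture", breadth=3, depth=5)
print(len(samples))  # 3 sub-topics x 5 questions = 15
```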
Use this mode to train models to trigger built-in tools in Open WebUI.
```shell
docker run --rm --network host -v $(pwd):/app --entrypoint python topic-pipeline \
    src/agent_pipeline.py \
    --subject "Kubernetes Management" \
    --scenarios 10 \
    --output k8s_tools.parquet
```

The pipeline outputs Parquet files, a standard format for LLM training data.
| Column | Description |
|---|---|
| `instruction` | The user question. |
| `context` | The sanitized technical background. |
| `response` | The factual answer for training. |
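A row with these columns can be flattened into a single training prompt. The sample row and prompt template below are made up for illustration; they are not the pipeline's own formatting.

```python
# Format one Knowledge Mining row (columns from the schema above) into a
# single prompt string. Row contents and template are illustrative only.
row = {
    "instruction": "What does multi-head attention compute?",
    "context": "Attention maps queries against keys to weight values.",
    "response": "It runs several attention heads in parallel and concatenates them.",
}

def to_prompt(row):
    return (
        f"### Context\n{row['context']}\n\n"
        f"### Question\n{row['instruction']}\n\n"
        f"### Answer\n{row['response']}"
    )

print(to_prompt(row))
```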
| Column | Description |
|---|---|
| `system` | Prompt containing Tool JSON Schemas. |
| `user` | Natural language request requiring a tool. |
| `assistant_call` | `<functioncall>{"name": "...", "arguments": {...}}</functioncall>` |
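A hypothetical Agentic Mode record might look like the following. The tool name, schema, and argument values here are invented for illustration; only the three-column layout and the `<functioncall>` wrapper come from the schema above.

```python
import json

# Illustrative Agentic Mode record; the get_pod_status tool is hypothetical.
record = {
    "system": "You have access to the following tools:\n"
              + json.dumps({
                  "name": "get_pod_status",
                  "description": "Return the status of a Kubernetes pod",
                  "parameters": {
                      "type": "object",
                      "properties": {"pod_name": {"type": "string"}},
                      "required": ["pod_name"],
                  },
              }, indent=2),
    "user": "Is the nginx pod healthy?",
    "assistant_call": '<functioncall>{"name": "get_pod_status", '
                      '"arguments": {"pod_name": "nginx"}}</functioncall>',
}
print(record["assistant_call"])
```

Note that the JSON payload inside `assistant_call` must parse cleanly once the wrapper tags are stripped, so the trained model emits machine-readable calls.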
```python
from datasets import load_dataset

ds = load_dataset("parquet", data_files="transformer_qa.parquet")
print(ds["train"][0])
```

| Option | Default | Description |
|---|---|---|
| `--model` | `llama3.1:8b` | The model to use for generation and judgment. |
| `--api-url` | `http://localhost:11434/v1` | Ollama/OpenAI-compatible API endpoint. |
| `--push-to-hub` | `False` | Automatically push the dataset to Hugging Face. |
MIT