
# Topic-Lode: Production AI Dataset Miner

Topic-Lode is a production-grade synthetic data generation pipeline designed to create high-quality, technically accurate datasets for LLM fine-tuning. It features two distinct modes: Knowledge Mining for deep technical Q&A and Agentic Mode for tool-use/function-calling training (optimized for Open WebUI).

## 🚀 Key Features

- **Hierarchical Generation**: explores a subject by generating sub-topics (breadth), then deep technical questions for each (depth).
- **Knowledge Mining Mode**: scrapes real-world technical documentation with a multi-stage fallback (DuckDuckGo -> Wikipedia -> synthetic LLM research).
- **Agentic/Glaive Mode**: generates specialized datasets for tool-use, including JSON Schema tool definitions and conversational patterns compatible with Open WebUI.
- **Quality First**:
  - **PII Sanitization**: automatic redaction of names, emails, and phone numbers.
  - **Deduplication**: MinHash-based near-duplicate detection for unique samples.
  - **LLM-as-a-Judge**: an integrated validation step that filters out low-accuracy or hallucinated answers.
- **Multi-Arch Container**: built for both NVIDIA DGX (ARM64) and standard x86 environments.
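The sanitization idea can be illustrated with a minimal regex-based redactor. This is a sketch of the technique only, not the pipeline's actual implementation; the `redact_pii` helper and both patterns are hypothetical:

```python
import re

# Hypothetical patterns illustrating regex-based PII redaction;
# the pipeline's real rules (including name redaction) may differ.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact_pii(text: str) -> str:
    """Replace emails and phone-number-like spans with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text

print(redact_pii("Contact jane.doe@example.com or +1 (555) 123-4567."))
# Contact [EMAIL] or [PHONE].
```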

## 🛠 Installation & Setup

### Prerequisites

- Docker
- Ollama (running locally or accessible via API)

### Build the Image

```shell
docker build -t topic-pipeline .
```

## 📖 Usage Modes

### 1. Knowledge Mining (SFT/DPO)

Use this mode to create technical Q&A datasets.

```shell
docker run --rm --network host -v $(pwd):/app topic-pipeline \
  --subject "Transformer Architecture" \
  --breadth 3 \
  --depth 5 \
  --output transformer_qa.parquet
```

- `--breadth`: number of sub-topics to explore.
- `--depth`: number of questions per sub-topic.
- `--output`: Parquet file for training.
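Breadth and depth multiply: `--breadth 3 --depth 5` yields 15 question slots. A rough sketch of the hierarchy, where `generate_subtopics` and `generate_questions` are hypothetical stand-ins for the pipeline's LLM calls:

```python
def generate_subtopics(subject: str, breadth: int) -> list[str]:
    # Stand-in for an LLM call that proposes `breadth` sub-topics.
    return [f"{subject}: aspect {i}" for i in range(1, breadth + 1)]

def generate_questions(subtopic: str, depth: int) -> list[str]:
    # Stand-in for an LLM call that drills `depth` questions into a sub-topic.
    return [f"Q{i} about {subtopic}" for i in range(1, depth + 1)]

def mine(subject: str, breadth: int, depth: int) -> list[str]:
    """Expand a subject into breadth * depth question slots."""
    questions = []
    for sub in generate_subtopics(subject, breadth):
        questions.extend(generate_questions(sub, depth))
    return questions

qs = mine("Transformer Architecture", breadth=3, depth=5)
print(len(qs))  # 3 sub-topics x 5 questions = 15
```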

### 2. Agentic Mode (Tool-Use/Open WebUI)

Use this mode to train models to trigger built-in tools in Open WebUI.

```shell
docker run --rm --network host -v $(pwd):/app --entrypoint python topic-pipeline \
  src/agent_pipeline.py \
  --subject "Kubernetes Management" \
  --scenarios 10 \
  --output k8s_tools.parquet
```
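In Agentic mode, the system prompt embeds tool definitions as JSON Schema. A representative definition for the Kubernetes example might look like the following; the `scale_deployment` tool and its fields are hypothetical, laid out in the common OpenAI-style function-calling shape:

```python
import json

# Hypothetical tool definition; the pipeline generates these per scenario.
scale_deployment = {
    "name": "scale_deployment",
    "description": "Scale a Kubernetes deployment to a given replica count.",
    "parameters": {
        "type": "object",
        "properties": {
            "namespace": {"type": "string"},
            "deployment": {"type": "string"},
            "replicas": {"type": "integer", "minimum": 0},
        },
        "required": ["deployment", "replicas"],
    },
}

# The `system` column embeds one or more such schemas as text.
system_prompt = "You have access to the following tools:\n" + json.dumps(
    scale_deployment, indent=2
)
print(scale_deployment["parameters"]["required"])  # ['deployment', 'replicas']
```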

## 📂 Data Format & Training

The pipeline outputs Parquet files, a columnar format widely used for LLM training data.

### Schema (Knowledge)

| Column | Description |
| --- | --- |
| `instruction` | The user question. |
| `context` | The sanitized technical background. |
| `response` | The factual answer for training. |
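A record destined for the knowledge dataset should carry exactly these three string fields. A minimal validity check, with a hypothetical `is_valid_record` helper and made-up row content for illustration:

```python
REQUIRED_COLUMNS = {"instruction", "context", "response"}

def is_valid_record(record: dict) -> bool:
    """A record is valid if every schema column is a non-empty string."""
    return all(
        isinstance(record.get(col), str) and record[col].strip()
        for col in REQUIRED_COLUMNS
    )

row = {
    "instruction": "What is multi-head attention?",
    "context": "Transformers split attention into parallel heads...",
    "response": "Each head projects queries, keys, and values into a subspace...",
}
print(is_valid_record(row))  # True
```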

### Schema (Agentic)

| Column | Description |
| --- | --- |
| `system` | Prompt containing tool JSON Schemas. |
| `user` | Natural-language request requiring a tool. |
| `assistant_call` | `<functioncall>{"name": "...", "arguments": {...}}</functioncall>` |
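At training or evaluation time, the `assistant_call` payload can be parsed back into a structured call. A minimal extraction sketch; the tag format follows the schema above, while the `parse_call` helper itself is hypothetical:

```python
import json
import re

# Matches the <functioncall>...</functioncall> wrapper used in assistant_call.
CALL_RE = re.compile(r"<functioncall>(.*?)</functioncall>", re.DOTALL)

def parse_call(assistant_call: str) -> dict:
    """Extract and decode the JSON body of a <functioncall> tag."""
    match = CALL_RE.search(assistant_call)
    if match is None:
        raise ValueError("no <functioncall> tag found")
    return json.loads(match.group(1))

call = parse_call(
    '<functioncall>{"name": "scale_deployment", '
    '"arguments": {"deployment": "web", "replicas": 3}}</functioncall>'
)
print(call["name"], call["arguments"]["replicas"])  # scale_deployment 3
```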

### Quick Verification (Python)

```python
from datasets import load_dataset

ds = load_dataset("parquet", data_files="transformer_qa.parquet")
print(ds["train"][0])
```

## 🏗 Configuration

| Option | Default | Description |
| --- | --- | --- |
| `--model` | `llama3.1:8b` | The model to use for generation and judgment. |
| `--api-url` | `http://localhost:11434/v1` | Ollama/OpenAI-compatible API endpoint. |
| `--push-to-hub` | `False` | Automatically push the dataset to Hugging Face. |
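The defaults above target Ollama's OpenAI-compatible API. A sketch of the request body the pipeline would send to the `/chat/completions` route under that base URL; the message contents are made up, and this shows the wire format rather than the pipeline's actual client code:

```python
import json

# Default --api-url plus the OpenAI-compatible chat route.
API_URL = "http://localhost:11434/v1/chat/completions"

payload = {
    "model": "llama3.1:8b",  # default --model
    "messages": [
        {"role": "system", "content": "You are a technical dataset generator."},
        {"role": "user", "content": "List three sub-topics of CUDA memory."},
    ],
    "temperature": 0.7,
}

body = json.dumps(payload)
print(body[:40])
```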

## 📜 License

MIT

## About

Production-grade synthetic data pipeline: turns topics into sanitized datasets using Distilabel, vLLM (FlashAttn-2), and async scraping. Optimized for NVIDIA DGX.
