
# Topic-Lode: Production AI Dataset Miner

Topic-Lode is a production-grade synthetic data generation pipeline designed to create high-quality, technically accurate datasets for LLM fine-tuning. It features two distinct modes: Knowledge Mining for deep technical Q&A and Agentic Mode for tool-use/function-calling training (optimized for Open WebUI).

## 🚀 Key Features

- **Hierarchical Generation**: explores a subject by generating sub-topics (breadth), then deep technical questions for each (depth).
- **Knowledge Mining Mode**: scrapes real-world technical documentation with a multi-stage fallback (DuckDuckGo -> Wikipedia -> synthetic LLM research).
- **Agentic/Glaive Mode**: generates specialized datasets for tool-use, including JSON Schema tool definitions and conversational patterns compatible with Open WebUI.
- **Quality First**:
  - **PII Sanitization**: automatic redaction of names, emails, and phone numbers.
  - **Deduplication**: MinHash-based near-duplicate detection for unique samples.
  - **LLM-as-a-Judge**: an integrated validation step that filters out low-accuracy or hallucinated answers.
- **Multi-Arch Container**: built for both NVIDIA DGX (ARM64) and standard x86 environments.
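The sanitization idea can be illustrated with a minimal regex-based redactor. This is a sketch of the technique only, not the pipeline's actual implementation; the `redact_pii` helper and both patterns are hypothetical:

```python
import re

# Hypothetical patterns illustrating regex-based PII redaction;
# the pipeline's real rules (including name redaction) may differ.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact_pii(text: str) -> str:
    """Replace emails and phone-number-like spans with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text

print(redact_pii("Contact jane.doe@example.com or +1 (555) 123-4567."))
# Contact [EMAIL] or [PHONE].
```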

## 🛠 Installation & Setup

### Prerequisites

- Docker
- Ollama (running locally or accessible via API)

### Build the Image

```shell
docker build -t topic-pipeline .
```

## 📖 Usage Modes

### 1. Knowledge Mining (SFT/DPO)

Use this mode to create technical Q&A datasets.

```shell
docker run --rm --network host -v $(pwd):/app topic-pipeline \
  --subject "Transformer Architecture" \
  --breadth 3 \
  --depth 5 \
  --output transformer_qa.parquet
```

- `--breadth`: number of sub-topics to explore.
- `--depth`: number of questions per sub-topic.
- `--output`: Parquet file for training.
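Breadth and depth multiply: `--breadth 3 --depth 5` yields 15 question slots. A rough sketch of the hierarchy, where `generate_subtopics` and `generate_questions` are hypothetical stand-ins for the pipeline's LLM calls:

```python
def generate_subtopics(subject: str, breadth: int) -> list[str]:
    # Stand-in for an LLM call that proposes `breadth` sub-topics.
    return [f"{subject}: aspect {i}" for i in range(1, breadth + 1)]

def generate_questions(subtopic: str, depth: int) -> list[str]:
    # Stand-in for an LLM call that drills `depth` questions into a sub-topic.
    return [f"Q{i} about {subtopic}" for i in range(1, depth + 1)]

def mine(subject: str, breadth: int, depth: int) -> list[str]:
    """Expand a subject into breadth * depth question slots."""
    questions = []
    for sub in generate_subtopics(subject, breadth):
        questions.extend(generate_questions(sub, depth))
    return questions

qs = mine("Transformer Architecture", breadth=3, depth=5)
print(len(qs))  # 3 sub-topics x 5 questions = 15
```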

### 2. Agentic Mode (Tool-Use/Open WebUI)

Use this mode to train models to trigger built-in tools in Open WebUI.

```shell
docker run --rm --network host -v $(pwd):/app --entrypoint python topic-pipeline \
  src/agent_pipeline.py \
  --subject "Kubernetes Management" \
  --scenarios 10 \
  --output k8s_tools.parquet
```
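In Agentic mode, the system prompt embeds tool definitions as JSON Schema. A representative definition for the Kubernetes example might look like the following; the `scale_deployment` tool and its fields are hypothetical, laid out in the common OpenAI-style function-calling shape:

```python
import json

# Hypothetical tool definition; the pipeline generates these per scenario.
scale_deployment = {
    "name": "scale_deployment",
    "description": "Scale a Kubernetes deployment to a given replica count.",
    "parameters": {
        "type": "object",
        "properties": {
            "namespace": {"type": "string"},
            "deployment": {"type": "string"},
            "replicas": {"type": "integer", "minimum": 0},
        },
        "required": ["deployment", "replicas"],
    },
}

# The `system` column embeds one or more such schemas as text.
system_prompt = "You have access to the following tools:\n" + json.dumps(
    scale_deployment, indent=2
)
print(scale_deployment["parameters"]["required"])  # ['deployment', 'replicas']
```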

## 📂 Data Format & Training

The pipeline outputs Parquet files, a columnar format widely used for LLM training data.

### Schema (Knowledge)

| Column | Description |
| --- | --- |
| `instruction` | The user question. |
| `context` | The sanitized technical background. |
| `response` | The factual answer for training. |
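A record destined for the knowledge dataset should carry exactly these three string fields. A minimal validity check, with a hypothetical `is_valid_record` helper and made-up row content for illustration:

```python
REQUIRED_COLUMNS = {"instruction", "context", "response"}

def is_valid_record(record: dict) -> bool:
    """A record is valid if every schema column is a non-empty string."""
    return all(
        isinstance(record.get(col), str) and record[col].strip()
        for col in REQUIRED_COLUMNS
    )

row = {
    "instruction": "What is multi-head attention?",
    "context": "Transformers split attention into parallel heads...",
    "response": "Each head projects queries, keys, and values into a subspace...",
}
print(is_valid_record(row))  # True
```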

### Schema (Agentic)

| Column | Description |
| --- | --- |
| `system` | Prompt containing tool JSON Schemas. |
| `user` | Natural-language request requiring a tool. |
| `assistant_call` | `<functioncall>{"name": "...", "arguments": {...}}</functioncall>` |
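At training or evaluation time, the `assistant_call` payload can be parsed back into a structured call. A minimal extraction sketch; the tag format follows the schema above, while the `parse_call` helper itself is hypothetical:

```python
import json
import re

# Matches the <functioncall>...</functioncall> wrapper used in assistant_call.
CALL_RE = re.compile(r"<functioncall>(.*?)</functioncall>", re.DOTALL)

def parse_call(assistant_call: str) -> dict:
    """Extract and decode the JSON body of a <functioncall> tag."""
    match = CALL_RE.search(assistant_call)
    if match is None:
        raise ValueError("no <functioncall> tag found")
    return json.loads(match.group(1))

call = parse_call(
    '<functioncall>{"name": "scale_deployment", '
    '"arguments": {"deployment": "web", "replicas": 3}}</functioncall>'
)
print(call["name"], call["arguments"]["replicas"])  # scale_deployment 3
```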

### Quick Verification (Python)

```python
from datasets import load_dataset

ds = load_dataset("parquet", data_files="transformer_qa.parquet")
print(ds["train"][0])
```

## 🏗 Configuration

| Option | Default | Description |
| --- | --- | --- |
| `--model` | `llama3.1:8b` | The model to use for generation and judgment. |
| `--api-url` | `http://localhost:11434/v1` | Ollama/OpenAI-compatible API endpoint. |
| `--push-to-hub` | `False` | Automatically push the dataset to Hugging Face. |
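The defaults above target Ollama's OpenAI-compatible API. A sketch of the request body the pipeline would send to the `/chat/completions` route under that base URL; the message contents are made up, and this shows the wire format rather than the pipeline's actual client code:

```python
import json

# Default --api-url plus the OpenAI-compatible chat route.
API_URL = "http://localhost:11434/v1/chat/completions"

payload = {
    "model": "llama3.1:8b",  # default --model
    "messages": [
        {"role": "system", "content": "You are a technical dataset generator."},
        {"role": "user", "content": "List three sub-topics of CUDA memory."},
    ],
    "temperature": 0.7,
}

body = json.dumps(payload)
print(body[:40])
```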

## 📜 License

MIT

## About

Production-grade synthetic data pipeline: turns topics into sanitized datasets using Distilabel, vLLM (FlashAttn-2), and async scraping. Optimized for NVIDIA DGX.
