High-level reference for FineFoundry's data collection interfaces.
This page provides an overview of the scraper and generator modules and links to more detailed per-source pages:
- 4chan Scraper —
src/scrapers/fourchan_scraper.py - Reddit Scraper —
src/scrapers/reddit_scraper.py - Stack Exchange Scraper —
src/scrapers/stackexchange_scraper.py
- Synthetic Generator —
src/helpers/synthetic.py
Generate training data from your own documents using local LLMs powered by Unsloth's SyntheticDataKit.
- PDF documents
- DOCX (Word documents)
- PPTX (PowerPoint)
- HTML/HTM web pages
- TXT plain text
- URLs (fetched and parsed)
- qa — Question-answer pairs from document content
- cot — Chain-of-thought reasoning examples
- summary — Document summaries
- Select Synthetic in the Data Sources tab
- Add files or URLs
- Configure model, generation type, and parameters
- Click Start
from helpers.synthetic import run_synthetic_generation
# Called internally by the UI - async function
await run_synthetic_generation(
page=page,
log_view=log_list,
prog=progress_bar,
labels={"threads": threads_label, "pairs": pairs_label},
preview_host=preview_host,
cancel_flag=cancel_state,
sources=["document.pdf", "https://example.com/article"],
gen_type="qa",
num_pairs=25,
max_chunks=10,
curate=False,
curate_threshold=7.5,
multimodal=False,
dataset_format="ChatML",
model="unsloth/Llama-3.2-3B-Instruct",
)Synthetic generation results are saved to the SQLite database (finefoundry.db).
- GPU with sufficient VRAM (8GB+ recommended)
- Unsloth package with SyntheticDataKit
- vLLM for local model serving
Future iterations can add full parameter tables, examples, and CLI usage for each source.