Scrapers API

High-level reference for FineFoundry's data collection interfaces.

This page provides an overview of the scraper and generator modules and links to more detailed per-source pages:

Network Scrapers

4chan Scraper — src/scrapers/fourchan_scraper.py
Reddit Scraper — src/scrapers/reddit_scraper.py
Stack Exchange Scraper — src/scrapers/stackexchange_scraper.py

Synthetic Data Generation

Synthetic Generator — src/helpers/synthetic.py

Generate training data from your own documents using local LLMs powered by Unsloth's SyntheticDataKit.

Supported Input Formats

PDF documents
DOCX (Word documents)
PPTX (PowerPoint)
HTML/HTM web pages
TXT plain text
URLs (fetched and parsed)
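The formats listed above can be told apart by file extension or URL scheme. The following is a minimal sketch of that idea, not the code FineFoundry uses: `classify_source` and `SUPPORTED_EXTENSIONS` are hypothetical names introduced for illustration, and only `http`/`https` URLs are assumed to count as URLs.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Extensions come from the list above; the mapping itself is illustrative.
SUPPORTED_EXTENSIONS = {
    ".pdf": "pdf",
    ".docx": "docx",
    ".pptx": "pptx",
    ".html": "html",
    ".htm": "html",
    ".txt": "txt",
}


def classify_source(source: str) -> str:
    """Return a coarse input kind for a file path or URL (hypothetical helper)."""
    # Assumption: anything with an http(s) scheme is treated as a URL to fetch.
    if urlparse(source).scheme in ("http", "https"):
        return "url"
    # Otherwise fall back to a case-insensitive extension lookup.
    suffix = PurePosixPath(source).suffix.lower()
    try:
        return SUPPORTED_EXTENSIONS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported input format: {source!r}") from None
```

A check like this can reject unsupported inputs before any model is loaded, which is cheaper than failing partway through generation.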

Generation Types

qa — Question-answer pairs from document content
cot — Chain-of-thought reasoning examples
summary — Document summaries

Basic Usage (via UI)

Select Synthetic in the Data Sources tab
Add files or URLs
Configure model, generation type, and parameters
Click Start

Programmatic Usage

from helpers.synthetic import run_synthetic_generation

# Called internally by the UI. This is an async function, so it must be
# awaited from inside a coroutine that already has the UI objects below.
await run_synthetic_generation(
    page=page,
    log_view=log_list,
    prog=progress_bar,
    labels={"threads": threads_label, "pairs": pairs_label},
    preview_host=preview_host,
    cancel_flag=cancel_state,
    sources=["document.pdf", "https://example.com/article"],
    gen_type="qa",
    num_pairs=25,
    max_chunks=10,
    curate=False,
    curate_threshold=7.5,
    multimodal=False,
    dataset_format="ChatML",
    model="unsloth/Llama-3.2-3B-Instruct",
)
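Because the call above uses `await`, it only works inside a running coroutine. The sketch below shows the general pattern for driving an async function from synchronous code with `asyncio.run`. It uses a stand-in coroutine, `generate_stub`, which is hypothetical and does not call FineFoundry; the real function also needs the UI objects shown above, which this sketch does not attempt to construct.

```python
import asyncio


async def generate_stub(sources: list[str], gen_type: str, num_pairs: int) -> dict:
    """Stand-in coroutine for illustration; NOT the real run_synthetic_generation."""
    await asyncio.sleep(0)  # yield to the event loop once, as async I/O would
    return {"sources": len(sources), "gen_type": gen_type, "num_pairs": num_pairs}


def run_blocking() -> dict:
    """Drive the coroutine to completion from synchronous code."""
    return asyncio.run(
        generate_stub(
            sources=["document.pdf", "https://example.com/article"],
            gen_type="qa",
            num_pairs=25,
        )
    )
```

`asyncio.run` starts a fresh event loop, so it should not be called from code that is already running inside one, such as a UI event handler.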

Synthetic generation results are saved to the SQLite database (finefoundry.db).
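Since results land in a SQLite file, the standard `sqlite3` module can be used to see what was written. This sketch only lists table names and assumes nothing about FineFoundry's schema; `list_tables` is a hypothetical helper, and the path to `finefoundry.db` depends on where the application stores it.

```python
import sqlite3


def list_tables(db_path: str) -> list[str]:
    """Return the names of user tables in a SQLite database file."""
    # Note: sqlite3.connect creates an empty file if db_path does not exist.
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
            "ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```

Calling `list_tables("finefoundry.db")` from the directory that holds the database would show which tables exist; inspect those before writing queries against specific columns.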

Requirements

GPU with sufficient VRAM (8GB+ recommended)
Unsloth package with SyntheticDataKit
vLLM for local model serving
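A quick way to check the package requirements above before starting a run is to test whether the modules can be found. This is a sketch under assumptions: the import names `unsloth` and `vllm` are assumed to match the installed packages, `missing_modules` is a hypothetical helper, and GPU memory is not checked here.

```python
import importlib.util

# Assumption: the packages are importable under these top-level names.
REQUIRED_MODULES = ("unsloth", "vllm")


def missing_modules(names: tuple[str, ...] = REQUIRED_MODULES) -> list[str]:
    """Return the module names that cannot be located, without importing them."""
    return [name for name in names if importlib.util.find_spec(name) is None]
```

An empty list from `missing_modules()` means both modules were found; it does not confirm that they are compatible versions or that a suitable GPU is present.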

Full parameter tables, examples, and CLI usage for each source are not yet documented here and may be added in future iterations.
