🚀 LLMForge

LLMForge is a modular LLMOps pipeline originally designed for Reddit-based dataset curation and LoRA fine-tuning of TinyLlama-1.1B. It automates ingestion, cleaning, training, and Hugging Face syncing, all backed by Prefect, Modal, and CI/CD best practices.

✅ What’s New (June 2025)

We’ve added a one-line fine-tuning CLI flow using Modal GPU compute and any Hugging Face dataset.

./llmforge finetune \
  --dataset-repo yourusername/your-dataset \
  --model-repo yourusername/your-model-name \
  --hf-auth ./hf_token.txt \
  --push-to-hub

No setup is needed beyond a Hugging Face token file and a basic venv install. A full walkthrough and screenshots follow below 👇

🧠 Why This Update Matters

This update streamlines LLMForge into a modular LoRA trainer for the OSS community:

🔥 No MLOps knowledge needed to fine-tune & push your own model
🧼 Old full-pipeline logic is retained but commented out for now
🔁 Fast launch, easy to build on top of

⚙️ Key Features

🔎 Reddit-based prompt-completion dataset creation
🧹 Toxicity filtering using Detoxify
⚙️ LoRA fine-tuning on Modal with GPU (T4)
☁️ Hugging Face Dataset + Model Hub syncing
🔁 Prefect orchestration pipeline (legacy)
🚀 New! One-command fine-tuning using Unsloth via CLI + Modal
🧪 CI setup with pytest and GitHub Actions
📧 Email alerts for pipeline (optional)
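
The toxicity-filtering step listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names `filter_toxic` and `detoxify_scorer` and the 0.5 threshold are assumptions. The only library call used is Detoxify's documented `Detoxify("original").predict(text)`, which returns a dict that includes a `toxicity` score.

```python
from typing import Callable, Iterable, List


def filter_toxic(
    texts: Iterable[str],
    score_fn: Callable[[str], float],
    threshold: float = 0.5,  # assumed cutoff, not taken from the repo
) -> List[str]:
    """Keep only texts whose toxicity score is below the threshold."""
    return [text for text in texts if score_fn(text) < threshold]


def detoxify_scorer() -> Callable[[str], float]:
    """Build a scoring function backed by Detoxify (hypothetical helper).

    The import is lazy so that the filter itself can be used and tested
    without downloading the Detoxify model.
    """
    from detoxify import Detoxify

    model = Detoxify("original")
    return lambda text: float(model.predict(text)["toxicity"])
```

Passing the scorer in as a callable keeps the filter easy to unit-test with a stub; in a real run you would call `filter_toxic(texts, detoxify_scorer())`.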

🆕 New: Simple Finetune CLI

After cloning the repo:

python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Place your Hugging Face token in a file named hf_token.txt.

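Then run the command below, replacing the placeholder repo names with your own. The dataset repo can be whatever name you want to give it, and the model repo follows the same naming convention:

./llmforge finetune \
  --dataset-repo yourusername/your-dataset \
  --model-repo yourusername/your-model-name \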
  --hf-auth ./hf_token.txt \
  --push-to-hub

That's it. Modal handles the GPU job and pushes the model to your Hugging Face account.
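
For readers who want to extend the CLI, here is a sketch of how a command with these flags could be parsed using `argparse`. The flag names come from the command above; everything else (the `build_parser` function, the help strings, and the use of `argparse` itself) is an assumption and may differ from how the `llmforge` script is actually written.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags of `./llmforge finetune`."""
    parser = argparse.ArgumentParser(prog="llmforge")
    subcommands = parser.add_subparsers(dest="command", required=True)

    finetune = subcommands.add_parser("finetune", help="Launch a fine-tuning job")
    finetune.add_argument("--dataset-repo", required=True,
                          help="Hugging Face dataset repo, e.g. user/dataset")
    finetune.add_argument("--model-repo", required=True,
                          help="Hugging Face model repo to push to")
    finetune.add_argument("--hf-auth", required=True,
                          help="Path to a file containing the HF token")
    finetune.add_argument("--push-to-hub", action="store_true",
                          help="Push the fine-tuned model after training")
    return parser


# Parse the same arguments shown in the command above.
args = build_parser().parse_args([
    "finetune",
    "--dataset-repo", "yourusername/your-dataset",
    "--model-repo", "yourusername/your-model-name",
    "--hf-auth", "./hf_token.txt",
    "--push-to-hub",
])
```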

🧼 Notes on Code Cleanup

TinyLlama and the old full-pipeline fine-tuning logic are commented out (not deleted).
The 12-hour looped pipeline is inactive, but its logic is retained for future revival.
The Modal secret is created dynamically from your HF token.
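
As an illustration of the dynamic secret creation mentioned above, the sketch below reads the token file and wraps it in a Modal secret. The helper names and the `HF_TOKEN` environment variable name are assumptions, not taken from the repo; `modal.Secret.from_dict` is the Modal call for building a secret from an in-memory dict.

```python
from pathlib import Path


def read_hf_token(path: str) -> str:
    """Read a Hugging Face token from a text file, stripping whitespace."""
    token = Path(path).read_text(encoding="utf-8").strip()
    if not token:
        raise ValueError(f"No token found in {path}")
    return token


def make_modal_secret(token: str):
    """Wrap the token in a Modal secret (hypothetical helper).

    The import is lazy so that token handling can be tested without Modal
    installed. HF_TOKEN is an assumed variable name.
    """
    import modal

    return modal.Secret.from_dict({"HF_TOKEN": token})
```

Building the secret at launch time means the token never has to be registered in the Modal dashboard beforehand; it only needs to exist in the local hf_token.txt file.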

🔧 Architecture Overview

Module	Role
`src/`	Pipeline + finetuning logic
`tasks/`	Prefect-wrapped tasks for orchestration
`tests/`	Test coverage
`llmforge`	CLI to trigger training

🛠️ Features Implemented

Feature	Status
Reddit Ingestion with PRAW	✅
Toxicity Filtering (Detoxify)	✅
Prompt-Completion Generation	✅
Hugging Face Dataset Push	✅
LoRA Fine-tuning on Modal	✅
Hugging Face Model Push	✅
CLI-based Finetune Trigger	✅
CI + `pytest` test integration	✅
Email Alerts	✅
Streamlit Inference UI (local)	✅

📁 Project Structure

.
├── hf_uploader.py
├── inference.py
├── LICENSE
├── llmforge                   # CLI wrapper for launching fine-tune jobs
├── prefect.yaml
├── __pycache__
├── README.md
├── requirements.txt
├── run_pipeline.py
├── src/
│   ├── data_ingestion.py
│   ├── data_processor.py
│   ├── email.py
│   ├── finetune.py
│   ├── __init__.py
│   └── __pycache__
├── tasks/
│   ├── data_ingestion_task.py
│   ├── data_processor_task.py
│   ├── finetune_task.py
│   └── __pycache__
├── tests/
│   ├── __init__.py
│   └── test_pipeline.py
└── venv

---

📷 Screenshots

🔄 One-command Modal GPU Job


✅ Model Pushed to Hugging Face

🧪 CI + Prefect Pipelines

🎛️ Streamlit Inference UI

Model and Dataset Pushed to Hugging Face after CLI run

Terminal output after running the CLI command

🧪 Testing

pytest tests/

✅ All tests pass for ETL logic and CLI triggers

⚠️ Dev Notes

LLMForge was built on a 4GB GPU laptop:

Modal handles all remote fine-tuning
The 12-hr pipeline is currently disabled for cost reasons
Adapter merging was skipped for memory savings

Despite that:

It’s cloud-ready and reproducible
Works on real Reddit + HF datasets
Fully OSS and tweakable

🧠 Future Extensions

Merge adapters for complete model export
Auto-deploy Streamlit via Modal or Render
Reactivate scheduled flows (Prefect)
Add ROUGE/BLEU scoring post-finetune
Support data balancing + multi-source ingestion
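
To give a sense of what the post-finetune scoring extension might involve, here is a simplified unigram-overlap F1 in the spirit of ROUGE-1. It is a sketch only: it tokenizes on whitespace and skips stemming, so its numbers will not match an official ROUGE implementation, and the function name `rouge1_f1` is hypothetical.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1 style F1 based on lowercase whitespace tokens."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())

    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)
```

A real integration would more likely rely on an established metrics package; this version only shows the shape of the computation.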

🧑‍💻 Author

Hindol R. Choudhury
MLOps • LLM Infra • Applied AI
📫 LinkedIn
