DataTalksClub-Projects

Streamlit App

Demo video: DataTalksClub.Streamlit.mp4

Introduction

DataTalksClub-Projects automates the analysis of projects from DataTalksClub courses. It scrapes project submissions, generates descriptive titles using LLMs, and classifies deployment types (Batch/Streaming) and cloud providers (GCP/AWS/Azure).

Supported courses:

Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                           DataTalksClub Website                              │
│                    courses.datatalks.club/*/projects                         │
└─────────────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                         1. SCRAPE & DISCOVER                                 │
│  ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐          │
│  │ Course Discovery │───▶│  Web Scraping   │───▶│  CSV Generation │          │
│  │ (Auto-detect new │    │ (BeautifulSoup) │    │ (project URLs)  │          │
│  │  finished courses)│    └─────────────────┘    └─────────────────┘          │
│  └─────────────────┘                                                         │
└─────────────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                         2. MULTI-FILE FETCHING                               │
│  ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐          │
│  │   GitHub API    │───▶│  Repo Analyzer  │───▶│  Key Files:     │          │
│  │  (Tree + Files) │    │ (Prioritization)│    │  • README.md    │          │
│  └─────────────────┘    └─────────────────┘    │  • docker-compose│          │
│                                                 │  • *.tf (Terraform)│        │
│         Parallel fetching with ThreadPool       │  • requirements.txt│        │
│         (5 workers default, configurable)       │  • Dockerfile    │          │
│                                                 │  • dags/*.py     │          │
│                                                 └─────────────────┘          │
└─────────────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                      3. LLM CLASSIFICATION & TITLE GEN                       │
│  ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐          │
│  │  OpenRouter API │───▶│ Classification  │───▶│ Title Generation│          │
│  │ (Free LLM tier) │    │ • Deployment    │    │ (Domain-focused,│          │
│  └─────────────────┘    │   Type          │    │  tech-accurate) │          │
│                         │ • Cloud Provider│    └─────────────────┘          │
│                         └─────────────────┘                                  │
│                                                                              │
│  Classification runs FIRST → Title uses deployment context                   │
└─────────────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                            4. OUTPUT                                         │
│  ┌─────────────────────────────────────────────────────────────┐            │
│  │  Data/{course}/{year}/data.csv                               │            │
│  │  ├── project_url                                             │            │
│  │  ├── project_title    (LLM-generated, domain-specific)       │            │
│  │  ├── Deployment Type  (Batch, Streaming, Web Service)        │            │
│  │  ├── Reason           (Evidence from code files)             │            │
│  │  └── Cloud            (GCP, AWS, Azure, Other, Unknown)      │            │
│  └─────────────────────────────────────────────────────────────┘            │
└─────────────────────────────────────────────────────────────────────────────┘
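
To make stage 2 concrete, here is a minimal sketch of fetching a few key files from a repository in parallel with a thread pool and the GitHub contents API. The helper names, the shortened file list, and the GITHUB_TOKEN variable are illustrative assumptions, not the repository's actual code:

```python
import base64
import os
from concurrent.futures import ThreadPoolExecutor

import requests

# Shortened list for illustration; the pipeline fetches ~10 key files per repo
KEY_FILES = ["README.md", "docker-compose.yml", "requirements.txt", "Dockerfile"]

def fetch_file(repo: str, path: str) -> tuple[str, str | None]:
    """Fetch one file via the GitHub contents API; return (path, None) if it is missing."""
    url = f"https://api.github.com/repos/{repo}/contents/{path}"
    headers = {"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"}
    resp = requests.get(url, headers=headers, timeout=30)
    if resp.status_code != 200:
        return path, None
    # The contents API returns base64-encoded file content
    return path, base64.b64decode(resp.json()["content"]).decode("utf-8", errors="replace")

def fetch_key_files(repo: str, workers: int = 5) -> dict[str, str]:
    """Fetch all key files concurrently (worker count mirrors the configurable default of 5)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda path: fetch_file(repo, path), KEY_FILES)
    return {path: text for path, text in results if text is not None}

# Usage: fetch_key_files("some-user/some-repo")
```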

How It Works

Pipeline Steps

  1. Course Discovery - Automatically detects finished courses from DataTalksClub website
  2. Web Scraping - Extracts project submission URLs from course pages
  3. Multi-File Fetching - For each GitHub repo, fetches 10 key files (not just README):
    • docker-compose.yml → Shows Kafka, Spark, orchestrators
    • *.tf files → Definitive cloud provider indicator
    • dags/*.py → Airflow = Batch
    • requirements.txt → Dependencies
    • Dockerfile, Makefile, etc.
  4. LLM Classification - Analyzes actual code to determine (see the sketch after this list):
    • Deployment Type: Batch (Airflow, Kestra, Mage) or Streaming (Kafka, Flink)
    • Cloud Provider: GCP, AWS, Azure based on Terraform/SDK usage
  5. Title Generation - Creates descriptive titles based on:
    • Actual project functionality (not repo name)
    • Deployment type context (no "Real-Time" for Batch projects)
    • Domain focus (e.g., "NYC Taxi Analytics Pipeline")
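
As a rough sketch of step 4, classification can be done with a single call to OpenRouter's OpenAI-compatible chat completions endpoint, passing the fetched files as context. The model name, prompt wording, and environment variable are placeholders, not the repository's actual prompt or configuration:

```python
import os

import requests

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def classify_project(files: dict[str, str]) -> str:
    """Ask an LLM for deployment type and cloud provider based on the fetched key files."""
    context = "\n\n".join(f"--- {name} ---\n{text}" for name, text in files.items())
    payload = {
        "model": "meta-llama/llama-3.1-8b-instruct:free",  # placeholder free-tier model
        "messages": [
            {
                "role": "system",
                "content": "Classify this project's deployment type (Batch or Streaming) "
                           "and cloud provider (GCP, AWS, Azure, Other, Unknown). "
                           "Cite the file that justifies each choice.",
            },
            {"role": "user", "content": context[:20_000]},  # truncate to stay within context limits
        ],
    }
    resp = requests.post(
        OPENROUTER_URL,
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json=payload,
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```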

Key Features

  • Parallel Processing: 5-10x faster with configurable workers
  • Smart Skipping: Only processes new courses by default
  • Multi-File Context: Better accuracy than README-only analysis
  • Course-Specific Types: Each course has valid deployment types
  • Nested Project Support: Handles /tree/main/project URLs correctly (see the sketch below)
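
The nested-project handling boils down to splitting a /tree/<branch>/<subdir> URL into the repository and the project subdirectory, so that file lookups are rooted at the right folder. A sketch under that assumption (the regex and return shape are illustrative, not the repository's actual parser):

```python
import re

def split_project_url(url: str) -> tuple[str, str]:
    """Return ("owner/repo", "subdir/...") for plain and /tree/<branch>/<path> GitHub URLs."""
    m = re.match(r"https://github\.com/([^/]+/[^/]+)(?:/tree/[^/]+/(.+))?/?$", url)
    if not m:
        raise ValueError(f"Not a GitHub project URL: {url}")
    return m.group(1), m.group(2) or ""

# split_project_url("https://github.com/user/repo/tree/main/project")
# -> ("user/repo", "project")
```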

Performance

| Metric | Before (Sequential) | After (Parallel, 5 workers) |
|---|---|---|
| 381 projects | ~60 minutes | ~12-15 minutes |
| Throughput | ~0.1 proj/sec | ~0.5 proj/sec |
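
At ~0.5 projects per second, the 381-project run works out to roughly 381 / 0.5 ≈ 760 seconds, i.e. about 13 minutes, which is where the 12-15 minute figure comes from.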

Getting Started

Prerequisites

  • Docker and Docker Compose
  • GitHub Personal Access Token (create one) - for pipeline
  • OpenRouter API Key (get free tier) - for pipeline

Setup

git clone https://github.com/dimzachar/DataTalksClub-Projects.git
cd DataTalksClub-Projects

# For pipeline: copy and edit .env
cp .env.example .env

# Build Docker image
make docker-build
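
The .env file holds the two pipeline credentials listed in the prerequisites. The variable names below are an assumption for illustration; check .env.example for the exact names expected:

```
# .env (placeholder values; variable names may differ from .env.example)
GITHUB_TOKEN=ghp_your_personal_access_token
OPENROUTER_API_KEY=sk-or-your_openrouter_key
```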

View the Dashboard

The easiest way is to visit the live app: datatalksclub-projects.streamlit.app

Or run locally with Docker:

docker-compose up streamlit

Then open http://localhost:8501


Run the Data Pipeline

Docker Commands (Recommended)

| Make Command | Direct Docker Command | Description |
|---|---|---|
| make docker-build | docker-compose build | Build Docker image (run once) |
| make docker-discover | docker-compose run --rm pipeline python -m src.pipeline_runner --discover | See available courses |
| make docker-pipeline | docker-compose run --rm pipeline python -m src.pipeline_runner | Process new courses only |
| make docker-pipeline-all | docker-compose run --rm pipeline python -m src.pipeline_runner --all | Reprocess all courses |
| make docker-pipeline-single COURSE=dezoomcamp YEAR=2025 | docker-compose run --rm pipeline python -m src.pipeline_runner --course dezoomcamp --year 2025 | Process specific course |
| make docker-pipeline-test COURSE=dezoomcamp YEAR=2025 LIMIT=10 | docker-compose run --rm pipeline python -m src.pipeline_runner --course dezoomcamp --year 2025 --limit 10 | Test with limited projects |

Pipeline Options

| Option | Description |
|---|---|
| --discover | List available courses and their status |
| --all | Reprocess all courses (overwrites existing) |
| --course NAME | Process specific course |
| --year YEAR | Process specific year |
| --limit N | Limit to N projects (for testing) |
| --workers N | Parallel workers (default: 5) |

Testing

| Make Command | Direct Docker Command | Description |
|---|---|---|
| make docker-test | docker-compose run --rm pipeline python -m pytest tests/ -v | Run all tests in Docker |
| make docker-test-cov | docker-compose run --rm pipeline python -m pytest tests/ -v --cov=... | Run tests with coverage in Docker |
| make test | python -m pytest tests/ -v | Run all tests locally |
| make test-cov | python -m pytest tests/ -v --cov=... | Run tests with coverage locally |
| make test-unit | - | Run unit tests only |
| make test-e2e | - | Run E2E/integration tests only |

Quality Checks

| Make Command | Direct Docker Command | Description |
|---|---|---|
| make quality_checks | - | Run isort, black, pylint locally |
| make docker-quality-checks | docker-compose run --rm pipeline python -m isort . && ... | Run isort, black, pylint in Docker |

Local Development (without Docker)

Requires Python 3.11. Python 3.12+ has dependency issues.

Setup with uv

uv venv --python 3.11
.venv\Scripts\activate      # Windows
source .venv/bin/activate   # Linux/Mac
uv pip install -r requirements.txt

Setup with pip

python3.11 -m venv .venv
.venv\Scripts\activate      # Windows
source .venv/bin/activate   # Linux/Mac
pip install -r requirements.txt

Local Commands

| Make Command | Description |
|---|---|
| make streamlit | Run Streamlit dashboard |
| make pipeline | Process new courses |
| make pipeline-all | Reprocess all courses |
| make pipeline-discover | Show available courses |
| make pipeline-single COURSE=dezoomcamp YEAR=2025 | Process single course |

Output Data

Generated data is saved to Data/{course}/{year}/data.csv:

| Column | Description | Example |
|---|---|---|
| project_url | GitHub repository URL | https://github.com/user/repo |
| project_title | LLM-generated title | NYC Taxi Fare Analytics Pipeline |
| Deployment Type | Pipeline type | Batch, Streaming, Web Service |
| Reason | Classification evidence | Found Airflow DAG in dags/pipeline.py |
| Cloud | Cloud provider | GCP, AWS, Azure, Other, Unknown |
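
A quick way to explore one of these files locally, assuming pandas is available (the path and column names are taken from the table above):

```python
import pandas as pd

# Load one course/year of generated data
df = pd.read_csv("Data/dezoomcamp/2025/data.csv")

# Projects per cloud provider and deployment type
print(df.groupby(["Cloud", "Deployment Type"]).size())
```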

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make changes
  4. Run tests: make docker-test
  5. Run quality checks: make quality_checks
  6. Submit a pull request

CI/CD

  • Tests: Run automatically on every PR and push to main
  • Pipeline: Runs quarterly (Jan, Apr, Jul, Oct) to update course data
  • Coverage: Minimum 80% required for pipeline files

License

MIT License - see LICENSE file.

Contact

Connect on LinkedIn

Support this project

Donate with PayPal
