Demo video: `DataTalksClub.Streamlit.mp4`
- Introduction
- Architecture
- How It Works
- Getting Started
- View the Dashboard
- Run the Data Pipeline
- Testing
- Quality Checks
- Local Development
- Output Data
- Contributing
- License
- Contact
DataTalksClub-Projects automates the analysis of projects from DataTalksClub courses. It scrapes project submissions, generates descriptive titles using LLMs, and classifies deployment types (Batch/Streaming) and cloud providers (GCP/AWS/Azure).
Supported courses:
- DE Zoomcamp (dezoomcamp) → Batch, Streaming
- ML Zoomcamp (mlzoomcamp) → Batch, Web Service
- MLOps Zoomcamp (mlopszoomcamp) → Batch, Web Service
- LLM Zoomcamp (llmzoomcamp) → Batch, Web Service
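The course-to-deployment-type mapping above can be expressed as a small lookup table. The constant and function names below are illustrative, not the pipeline's actual module:

```python
# Valid deployment types per course (illustrative sketch, mirrors the
# supported-courses list above -- not the project's actual constants).
VALID_DEPLOYMENT_TYPES = {
    "dezoomcamp": {"Batch", "Streaming"},
    "mlzoomcamp": {"Batch", "Web Service"},
    "mlopszoomcamp": {"Batch", "Web Service"},
    "llmzoomcamp": {"Batch", "Web Service"},
}

def is_valid_classification(course: str, deployment_type: str) -> bool:
    """Reject LLM outputs that are impossible for the given course."""
    return deployment_type in VALID_DEPLOYMENT_TYPES.get(course, set())
```

A lookup like this is what lets the pipeline enforce course-specific deployment types instead of trusting the LLM output blindly.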
┌─────────────────────────────────────────────────────────────────────────────┐
│ DataTalksClub Website │
│ courses.datatalks.club/*/projects │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ 1. SCRAPE & DISCOVER │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Course Discovery │───▶│ Web Scraping │───▶│ CSV Generation │ │
│ │ (Auto-detect new │ │ (BeautifulSoup) │ │ (project URLs) │ │
│ │ finished courses)│ └─────────────────┘ └─────────────────┘ │
│ └─────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ 2. MULTI-FILE FETCHING │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ GitHub API │───▶│ Repo Analyzer │───▶│ Key Files: │ │
│ │ (Tree + Files) │ │ (Prioritization)│ │ • README.md │ │
│ └─────────────────┘ └─────────────────┘ │ • docker-compose│ │
│ │ • *.tf (Terraform)│ │
│ Parallel fetching with ThreadPool │ • requirements.txt│ │
│ (5 workers default, configurable) │ • Dockerfile │ │
│ │ • dags/*.py │ │
│ └─────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ 3. LLM CLASSIFICATION & TITLE GEN │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ OpenRouter API │───▶│ Classification │───▶│ Title Generation│ │
│ │ (Free LLM tier) │ │ • Deployment │ │ (Domain-focused,│ │
│ └─────────────────┘ │ Type │ │ tech-accurate) │ │
│ │ • Cloud Provider│ └─────────────────┘ │
│ └─────────────────┘ │
│ │
│ Classification runs FIRST → Title uses deployment context │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ 4. OUTPUT │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Data/{course}/{year}/data.csv │ │
│ │ ├── project_url │ │
│ │ ├── project_title (LLM-generated, domain-specific) │ │
│ │ ├── Deployment Type (Batch, Streaming, Web Service) │ │
│ │ ├── Reason (Evidence from code files) │ │
│ │ └── Cloud (GCP, AWS, Azure, Other, Unknown) │ │
│ └─────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
- Course Discovery - Automatically detects finished courses from DataTalksClub website
- Web Scraping - Extracts project submission URLs from course pages
- Multi-File Fetching - For each GitHub repo, fetches 10 key files (not just README):
  - `docker-compose.yml` → Shows Kafka, Spark, orchestrators
  - `*.tf` files → Definitive cloud provider indicator
  - `dags/*.py` → Airflow = Batch
  - `requirements.txt` → Dependencies
  - `Dockerfile`, `Makefile`, etc.
- LLM Classification - Analyzes actual code to determine:
- Deployment Type: Batch (Airflow, Kestra, Mage) or Streaming (Kafka, Flink)
- Cloud Provider: GCP, AWS, Azure based on Terraform/SDK usage
- Title Generation - Creates descriptive titles based on:
- Actual project functionality (not repo name)
- Deployment type context (no "Real-Time" for Batch projects)
- Domain focus (e.g., "NYC Taxi Analytics Pipeline")
- Parallel Processing: 5-10x faster with configurable workers
- Smart Skipping: Only processes new courses by default
- Multi-File Context: Better accuracy than README-only analysis
- Course-Specific Types: Each course has valid deployment types
- Nested Project Support: Handles `/tree/main/project` URLs correctly
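Nested submissions need the subdirectory split off the repo URL before querying the GitHub API. A minimal parser sketch (not the project's actual helper):

```python
from urllib.parse import urlparse

def parse_github_url(url: str) -> tuple[str, str, str]:
    """Split a GitHub URL into (owner, repo, subpath).

    Handles nested submissions such as
    https://github.com/user/repo/tree/main/project -> subpath "project".
    """
    parts = urlparse(url).path.strip("/").split("/")
    owner, repo = parts[0], parts[1]
    subpath = ""
    if len(parts) > 3 and parts[2] in ("tree", "blob"):
        # parts[3] is the branch name; everything after it is the
        # project subdirectory inside the repo
        subpath = "/".join(parts[4:])
    return owner, repo, subpath
```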
| Metric | Before (Sequential) | After (Parallel, 5 workers) |
|---|---|---|
| 381 projects | ~60 minutes | ~12-15 minutes |
| Throughput | ~0.1 proj/sec | ~0.5 proj/sec |
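The speedup in the table comes from parallelizing the I/O-bound GitHub fetches. A minimal sketch with `ThreadPoolExecutor`, using the 5-worker default above (illustrative; `fetch_one` stands in for the real GitHub API call):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_one(url: str) -> dict:
    # Placeholder for the real per-repo work (tree listing + key files).
    return {"url": url, "files": []}

def fetch_all(urls: list[str], workers: int = 5) -> list[dict]:
    # Fetching is I/O-bound, so threads give near-linear speedup
    # until GitHub rate limits dominate. map() preserves input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_one, urls))
```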
- Docker and Docker Compose
- GitHub Personal Access Token (create one) - for pipeline
- OpenRouter API Key (get free tier) - for pipeline
```shell
git clone https://github.com/dimzachar/DataTalksClub-Projects.git
cd DataTalksClub-Projects

# For pipeline: copy and edit .env
cp .env.example .env

# Build Docker image
make docker-build
```

The easiest way - just visit the live app: datatalksclub-projects.streamlit.app
Or run locally with Docker:
| Make Command | Direct Docker Command |
|---|---|
| `docker-compose up streamlit` | Same |
Then open http://localhost:8501
| Make Command | Direct Docker Command | Description |
|---|---|---|
| `make docker-build` | `docker-compose build` | Build Docker image (run once) |
| `make docker-discover` | `docker-compose run --rm pipeline python -m src.pipeline_runner --discover` | See available courses |
| `make docker-pipeline` | `docker-compose run --rm pipeline python -m src.pipeline_runner` | Process new courses only |
| `make docker-pipeline-all` | `docker-compose run --rm pipeline python -m src.pipeline_runner --all` | Reprocess all courses |
| `make docker-pipeline-single COURSE=dezoomcamp YEAR=2025` | `docker-compose run --rm pipeline python -m src.pipeline_runner --course dezoomcamp --year 2025` | Process specific course |
| `make docker-pipeline-test COURSE=dezoomcamp YEAR=2025 LIMIT=10` | `docker-compose run --rm pipeline python -m src.pipeline_runner --course dezoomcamp --year 2025 --limit 10` | Test with limited projects |
| Option | Description |
|---|---|
| `--discover` | List available courses and their status |
| `--all` | Reprocess all courses (overwrites existing) |
| `--course NAME` | Process specific course |
| `--year YEAR` | Process specific year |
| `--limit N` | Limit to N projects (for testing) |
| `--workers N` | Parallel workers (default: 5) |
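The options above map onto a standard `argparse` interface. This is a sketch of that interface, not necessarily the actual parser inside `src.pipeline_runner`:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Illustrative parser mirroring the documented CLI options.
    p = argparse.ArgumentParser(prog="src.pipeline_runner")
    p.add_argument("--discover", action="store_true",
                   help="List available courses and their status")
    p.add_argument("--all", action="store_true",
                   help="Reprocess all courses (overwrites existing)")
    p.add_argument("--course", metavar="NAME", help="Process specific course")
    p.add_argument("--year", type=int, help="Process specific year")
    p.add_argument("--limit", type=int, metavar="N",
                   help="Limit to N projects (for testing)")
    p.add_argument("--workers", type=int, default=5, metavar="N",
                   help="Parallel workers (default: 5)")
    return p
```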
| Make Command | Direct Docker Command | Description |
|---|---|---|
| `make docker-test` | `docker-compose run --rm pipeline python -m pytest tests/ -v` | Run all tests in Docker |
| `make docker-test-cov` | `docker-compose run --rm pipeline python -m pytest tests/ -v --cov=...` | Run tests with coverage in Docker |
| `make test` | `python -m pytest tests/ -v` | Run all tests locally |
| `make test-cov` | `python -m pytest tests/ -v --cov=...` | Run tests with coverage locally |
| `make test-unit` | - | Run unit tests only |
| `make test-e2e` | - | Run E2E/integration tests only |
| Make Command | Direct Docker Command | Description |
|---|---|---|
| `make quality_checks` | - | Run isort, black, pylint locally |
| `make docker-quality-checks` | `docker-compose run --rm pipeline python -m isort . && ...` | Run isort, black, pylint in Docker |
Requires Python 3.11. Python 3.12+ has dependency issues.
With uv:

```shell
uv venv --python 3.11
.venv\Scripts\activate     # Windows
source .venv/bin/activate  # Linux/Mac
uv pip install -r requirements.txt
```

Or with the standard venv module:

```shell
python3.11 -m venv .venv
.venv\Scripts\activate     # Windows
source .venv/bin/activate  # Linux/Mac
pip install -r requirements.txt
```

| Make Command | Description |
|---|---|
| `make streamlit` | Run Streamlit dashboard |
| `make pipeline` | Process new courses |
| `make pipeline-all` | Reprocess all courses |
| `make pipeline-discover` | Show available courses |
| `make pipeline-single COURSE=dezoomcamp YEAR=2025` | Process single course |
Generated data is saved to `Data/{course}/{year}/data.csv`:

| Column | Description | Example |
|---|---|---|
| `project_url` | GitHub repository URL | https://github.com/user/repo |
| `project_title` | LLM-generated title | NYC Taxi Fare Analytics Pipeline |
| `Deployment Type` | Pipeline type | Batch, Streaming, Web Service |
| `Reason` | Classification evidence | Found Airflow DAG in dags/pipeline.py |
| `Cloud` | Cloud provider | GCP, AWS, Azure, Other, Unknown |
- Fork the repository
- Create a feature branch
- Make changes
- Run tests: `make docker-test`
- Run `make quality_checks`
- Submit a pull request
- Tests: Run automatically on every PR and push to main
- Pipeline: Runs quarterly (Jan, Apr, Jul, Oct) to update course data
- Coverage: Minimum 80% required for pipeline files
MIT License - see LICENSE file.
Connect on LinkedIn
