Production-oriented ETL pipeline that collects articles from NewsAPI, validates/transforms them, and loads clean records into PostgreSQL.
- extract: fetches paginated articles from NewsAPI with timeout, retries, and backoff.
- transform: validates each article, normalizes fields and timestamps, and collects rejection stats.
- load: upserts articles and links them to user requests with deduplication.
- worker: atomically claims queued search requests from PostgreSQL and processes them safely.
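The extract stage's retry policy can be sketched roughly as below. The names (`get_with_retries`, `fetch`, `base_delay`) are illustrative, not the actual `src/extract.py` API; the `fetch` callable is injected so the backoff logic can be exercised without touching the network.

```python
import time

# Statuses worth retrying: rate limiting and transient server errors.
RETRYABLE_STATUSES = {429, 500, 502, 503, 504}

def get_with_retries(fetch, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() -> (status, body); retry 429/5xx with exponential backoff.

    In the real pipeline, fetch would wrap an HTTP GET against NewsAPI
    with a request timeout; here it is injected for testability.
    """
    last_status = None
    for attempt in range(max_retries + 1):
        status, body = fetch()
        if status == 200:
            return body
        last_status = status
        if status not in RETRYABLE_STATUSES:
            raise RuntimeError(f"non-retryable HTTP status {status}")
        if attempt < max_retries:
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
    raise RuntimeError(f"retries exhausted, last status {last_status}")
```

Injecting `sleep` as well keeps unit tests fast: a test can record the delays instead of actually waiting.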
project/
├── config/
│   └── config.py
├── data/
│   ├── raw/
│   └── clean/
├── notebooks/
│   └── 01_eda.ipynb
├── src/
│   ├── __init__.py
│   ├── db.py
│   ├── extract.py
│   ├── load.py
│   ├── pipeline.py
│   ├── transform.py
│   └── worker.py
├── .env.example
├── Dockerfile
├── docker-compose.yml
├── main.py
├── requirements.txt
└── requirements-dev.txt
Copy .env.example to .env and fill your values:
cp .env.example .env

Key variables:
- NEWSAPI_KEY: required for extract.
- DB_HOST, DB_PORT, DB_USER, DB_PASSWORD, DB_NEWS: database connection settings.
- Optional reliability settings: REQUEST_MAX_RETRIES, REQUEST_TIMEOUT_SECONDS, MAX_PAGES_PER_REQUEST, DB_CONNECT_TIMEOUT_SECONDS.
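`config/config.py` presumably reads these settings with fallbacks for the optional ones. A minimal sketch, assuming a helper named `env_int` and default values that are illustrative only (the project's actual config may differ):

```python
import os

def env_int(name, default):
    """Read an optional integer setting from the environment."""
    raw = os.getenv(name)
    return int(raw) if raw else default

# Illustrative defaults only; the real config.py may choose different ones.
NEWSAPI_KEY = os.getenv("NEWSAPI_KEY")  # required for extract
REQUEST_MAX_RETRIES = env_int("REQUEST_MAX_RETRIES", 3)
REQUEST_TIMEOUT_SECONDS = env_int("REQUEST_TIMEOUT_SECONDS", 10)
MAX_PAGES_PER_REQUEST = env_int("MAX_PAGES_PER_REQUEST", 5)
DB_CONNECT_TIMEOUT_SECONDS = env_int("DB_CONNECT_TIMEOUT_SECONDS", 5)
```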
python -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt

Initialize the database:

python main.py --init-only

Alternatively, for a one-shot run, add --bootstrap to any run command below.
Debug run:

python main.py --debug --keyword python --limit 20 --page_size 50 --language en

Run linked to a user and search request:

python main.py --keyword python --limit 20 --page_size 50 --language en --user_id 1 --search_request_id 1

Worker mode:

python main.py --worker --poll_interval 3

Run app + Postgres with Docker Compose:
docker compose up --build

By default, the app container runs in worker mode.
python -m compileall .
ruff check .
pytest -q
bandit -r src config main.py
pip-audit

Reliability features:
- Request retry with backoff and HTTP status handling (429, 5xx).
- Input validation in the transform layer (bad records are rejected with reason stats).
- Idempotent article upsert by URL.
- Atomic worker dequeue (FOR UPDATE SKIP LOCKED) to avoid duplicate processing across workers.
- DB connection timeout and statement timeout support.
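The atomic dequeue behind the worker can be sketched as below. The `search_requests` table and `status` column names, and the psycopg2-style connection, are assumptions about `src/worker.py`, not its actual code:

```python
# Hypothetical schema: search_requests(id, status, ...) with status
# transitioning 'queued' -> 'processing'. Real column names may differ.
DEQUEUE_SQL = """
UPDATE search_requests
SET status = 'processing'
WHERE id = (
    SELECT id FROM search_requests
    WHERE status = 'queued'
    ORDER BY id
    FOR UPDATE SKIP LOCKED
    LIMIT 1
)
RETURNING id;
"""

def claim_next_request(conn):
    """Atomically claim one queued request; return its id, or None if empty.

    FOR UPDATE SKIP LOCKED makes concurrent workers skip rows another
    transaction has already locked, so no two workers claim the same row.
    """
    with conn.cursor() as cur:
        cur.execute(DEQUEUE_SQL)
        row = cur.fetchone()
        conn.commit()
        return row[0] if row else None
```

The UPDATE wrapping a locked SELECT means the claim and the status change happen in one statement, so a crashed worker never leaves a half-claimed row visible to others mid-transaction.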
.env, .venv, __pycache__, and generated files under data/raw and data/clean are ignored by git.
- Keep secrets only in .env (never commit real keys).
Database 'news_db' does not exist:
- Run python main.py --init-only once, or add --bootstrap to your run command.

password authentication failed:
- Verify the .env values DB_HOST, DB_PORT, DB_USER, DB_PASSWORD.
- Manually test the same credentials with psql/pgAdmin against the same host/port/user.