This project scrapes downloadable DRG indicator files from drg.ro using Selenium and Chrome.
The workflow has two steps:
- Build a manifest (what periods exist and where to open them)
- Process that manifest in parallel (download files for each period)
generate_manifest.py scans the DRG indicators page and creates a manifest.json file with:
- years discovered on the website
- period codes for each year (months, trimesters, full year, and some special periods)
- direct URL for each period
- processing status flags (status, processed_at, forms_processed)
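As a rough illustration of the fields listed above, the sketch below builds one possible manifest layout. Only the flag names (status, processed_at, forms_processed) come from this README; the nesting by year and period code, the "url" key, the initial values, and the helper names make_period_entry and build_manifest are assumptions, and the period codes and URLs are placeholders.

```python
import json


def make_period_entry(url):
    """Build one manifest entry with its progress flags in an initial state."""
    return {
        "url": url,              # assumed key name for the period's direct URL
        "status": False,         # false means the period is not processed yet
        "processed_at": None,    # assumed to hold a timestamp once processed
        "forms_processed": 0,    # assumed to count the forms handled
    }


def build_manifest(periods_by_year):
    """Turn {year: {period_code: url}} into the assumed manifest layout."""
    return {
        str(year): {code: make_period_entry(url) for code, url in periods.items()}
        for year, periods in periods_by_year.items()
    }


# Placeholder period codes and URLs, for illustration only.
example = build_manifest({
    2024: {
        "period_a": "https://example.invalid/period_a",
        "period_b": "https://example.invalid/period_b",
    },
})
print(json.dumps(example, indent=2))
```

The actual manifest.json written by generate_manifest.py may be organized differently; open a generated file to check the real layout.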
This script does not download the data files. It prepares the plan for the downloader.
process_manifest.py reads manifest.json, finds periods where status is false, and processes them.
For each period, it:
- opens the period URL
- runs the full form processing logic from drg_scraper2.py
- saves downloaded files under a structured folder
- updates manifest.json so progress is resumable
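The steps above can be sketched as a simple loop. This is a minimal, sequential illustration and not the script's actual code: it assumes a manifest shaped as year -> period code -> entry with "url", "status", "processed_at" and "forms_processed" keys, and handle_period is a hypothetical callback standing in for the Selenium form processing. The real script spreads this work across parallel workers.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def pending_periods(manifest):
    """Yield (year, code, entry) for every period whose status is still false."""
    for year, periods in manifest.items():
        for code, entry in periods.items():
            if not entry.get("status"):
                yield year, code, entry


def process_all(manifest_path, handle_period):
    """Process pending periods one at a time, saving the manifest after each."""
    path = Path(manifest_path)
    manifest = json.loads(path.read_text(encoding="utf-8"))
    for year, code, entry in pending_periods(manifest):
        # handle_period is hypothetical: it opens the URL, downloads the
        # files, and returns the number of forms it processed.
        forms = handle_period(entry["url"], year, code)
        entry["status"] = True
        entry["processed_at"] = datetime.now(timezone.utc).isoformat()
        entry["forms_processed"] = forms
        # Writing after every period is what makes an interrupted run resumable.
        path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return manifest
```

Because the manifest is rewritten after each period, a second call to process_all skips everything already marked as processed.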
By default, downloads are saved to:
downloads/<year>/<period_code>/
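A small sketch of how that folder layout could be built with pathlib. The function names period_download_dir and ensure_period_dir are illustrative and are not taken from the project; only the downloads/<year>/<period_code>/ pattern comes from this README.

```python
from pathlib import Path


def period_download_dir(year, period_code, base_dir="downloads"):
    """Return the folder for one period: <base_dir>/<year>/<period_code>."""
    return Path(base_dir) / str(year) / str(period_code)


def ensure_period_dir(year, period_code, base_dir="downloads"):
    """Create the period folder (and any missing parents) and return its path."""
    target = period_download_dir(year, period_code, base_dir)
    target.mkdir(parents=True, exist_ok=True)
    return target
```

Passing a different base_dir corresponds to changing the base output folder with --download-dir.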
- Python 3
- Google Chrome installed
- Matching ChromeDriver available to Selenium
- Python dependencies (at least selenium)
If you use a virtual environment:
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
Generate for a specific year range (example: only 2024):
python generate_manifest.py --start-year 2024 --end-year 2024
Or generate the latest N years (default is 5):
python generate_manifest.py --years 5
Useful options:
- --output <file>: output path (default manifest.json)
- --headless / --no-headless: run with or without a visible browser
Run with 4 parallel workers in headless mode:
python process_manifest.py --workers 4 --headless
Useful options:
- --manifest <file>: manifest path (default manifest.json)
- --workers <n>: number of parallel processes
- --download-dir <dir>: base output folder (default downloads)
- --headless / --no-headless: run with or without a visible browser
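For readers who want to see how such options fit together, here is a hedged argparse sketch. The option names and the manifest.json and downloads defaults come from the list above; the default of 1 worker and headless being on by default are assumptions, and the real process_manifest.py may define its arguments differently. argparse.BooleanOptionalAction requires Python 3.9 or newer.

```python
import argparse


def build_parser():
    """Build a parser mirroring the options listed in this README."""
    parser = argparse.ArgumentParser(
        description="Process a DRG manifest (illustrative sketch)."
    )
    parser.add_argument("--manifest", default="manifest.json",
                        help="manifest path")
    # The default worker count is an assumption, not taken from the project.
    parser.add_argument("--workers", type=int, default=1,
                        help="number of parallel processes")
    parser.add_argument("--download-dir", default="downloads",
                        help="base output folder")
    # BooleanOptionalAction provides both --headless and --no-headless.
    # Defaulting to headless is an assumption.
    parser.add_argument("--headless", action=argparse.BooleanOptionalAction,
                        default=True,
                        help="run with or without a visible browser")
    return parser


args = build_parser().parse_args(["--workers", "4", "--no-headless"])
print(args.workers, args.headless, args.download_dir)  # 4 False downloads
```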
process_manifest.py updates manifest.json after each period.
If a run fails or is interrupted, re-run the same command and it continues with unprocessed periods.
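Resuming relies on manifest.json staying readable even if a run stops partway through a write. One common way to get that property, shown here as a sketch and not as a description of what process_manifest.py actually does, is to write to a temporary file in the same folder and then swap it into place. The function name save_manifest_atomically is hypothetical.

```python
import json
import os
import tempfile


def save_manifest_atomically(manifest, path):
    """Write the manifest to a temp file, then replace the target in one step."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory so os.replace stays on one
    # filesystem, where the rename is atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(manifest, handle, indent=2)
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the partial temp file; the previous manifest is untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With several worker processes updating the same file, some form of locking or a single writer process would also be needed; that part is outside this sketch.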
# Step 1: build manifest for 2024 only
python generate_manifest.py --start-year 2024 --end-year 2024
# Step 2: process and download using 4 workers
python process_manifest.py --workers 4 --headless
- The scraper depends on website structure; if the DRG page changes, selectors may need updates.
- Running with --no-headless is useful for debugging.