DRG Data Scraper

This project scrapes downloadable DRG indicator files from drg.ro using Selenium and Chrome.

The workflow has two steps:

1) Build a manifest (which periods exist and the URL for each)
2) Process that manifest in parallel (download the files for each period)

Scripts Overview

`generate_manifest.py`

Scans the DRG indicators page and creates a manifest.json file with:

years discovered on the website
period codes for each year (months, trimesters, full year, and some special periods)
direct URL for each period
processing status flags (status, processed_at, forms_processed)

This script does not download the data files. It prepares the plan for the downloader.
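To make the manifest contents concrete, here is a minimal sketch of how such a file could be laid out. Only the field names status, processed_at, and forms_processed come from this README. The year → period code → entry nesting, the "url" key, the "M01" period code, the example URL, and the helper names new_period_entry and build_manifest are all assumptions for illustration, and the real manifest.json may differ.

```python
import json


def new_period_entry(url):
    """Return a not-yet-processed manifest entry for one period (hypothetical layout)."""
    return {
        "url": url,
        "status": False,       # assumed to flip to True once the period is processed
        "processed_at": None,  # assumed to hold a timestamp after processing
        "forms_processed": 0,  # assumed to be a count; the real type is not documented
    }


def build_manifest(periods_by_year):
    """Build a manifest dict from a mapping of year -> {period_code: url}."""
    return {
        str(year): {code: new_period_entry(url) for code, url in periods.items()}
        for year, periods in periods_by_year.items()
    }


if __name__ == "__main__":
    # "M01" and the URL are placeholders, not real DRG period codes or links.
    sample = build_manifest({2024: {"M01": "https://example.invalid/2024/M01"}})
    print(json.dumps(sample, indent=2))
```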

`process_manifest.py`

Reads manifest.json, finds periods where status is false, and processes them.

For each period, it:

opens the period URL
runs the full form processing logic from drg_scraper2.py
saves downloaded files under a structured folder
updates manifest.json so progress is resumable

By default, downloads are saved to:

downloads/<year>/<period_code>/
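The selection and folder rules above can be sketched as follows. This is an illustration, not the code in process_manifest.py: it assumes a manifest nested as year → period code → entry, and the function names pending_periods and period_download_dir are hypothetical.

```python
from pathlib import Path


def pending_periods(manifest):
    """Yield (year, period_code, entry) for every entry whose status is false."""
    for year, periods in manifest.items():
        for code, entry in periods.items():
            if not entry.get("status"):
                yield year, code, entry


def period_download_dir(base_dir, year, period_code):
    """Return <base_dir>/<year>/<period_code> as a Path (the directory is not created)."""
    return Path(base_dir) / str(year) / period_code
```

A caller would typically create the directory with mkdir(parents=True, exist_ok=True) before starting the browser, so that each period's downloads land in its own folder.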

Prerequisites

Python 3
Google Chrome installed
Matching ChromeDriver available to Selenium
Python dependencies (at least selenium)

If you use a virtual environment:

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
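A quick way to check the prerequisites before running the scripts is sketched below. It is not part of the project, and the function name check_prerequisites is hypothetical. It only looks for the selenium package and a chromedriver executable on PATH; it does not look for Chrome itself, because the browser's executable name varies by platform. Recent Selenium releases may be able to locate a driver on their own, so a missing chromedriver on PATH is not necessarily fatal.

```python
import importlib.util
import shutil
import sys


def check_prerequisites():
    """Return a dict of prerequisite name -> bool for the current environment."""
    return {
        "python3": sys.version_info.major == 3,
        "selenium": importlib.util.find_spec("selenium") is not None,
        "chromedriver_on_path": shutil.which("chromedriver") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```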

How To Use

1) Generate the manifest

Generate for a specific year range (example: only 2024):

python generate_manifest.py --start-year 2024 --end-year 2024

Or generate the latest N years (default is 5):

python generate_manifest.py --years 5

Useful options:

--output <file>: output path (default manifest.json)
--headless / --no-headless: run with or without visible browser
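The two ways of choosing years could be resolved roughly as in the sketch below. The function name select_years is hypothetical, and two behaviours are assumptions rather than documented facts: that an explicit --start-year/--end-year range takes precedence over --years, and that "latest N years" means the N most recent years discovered on the website.

```python
def select_years(discovered, start_year=None, end_year=None, years=5):
    """Pick the years to include in the manifest from those found on the site."""
    ordered = sorted(set(discovered))
    if start_year is not None or end_year is not None:
        # Explicit range: keep discovered years inside the inclusive bounds.
        return [
            y for y in ordered
            if (start_year is None or y >= start_year)
            and (end_year is None or y <= end_year)
        ]
    # No range given: keep the most recent `years` entries.
    return ordered[-years:] if years > 0 else []
```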

2) Process/download from the manifest

Run with 4 parallel workers in headless mode:

python process_manifest.py --workers 4 --headless

Useful options:

--manifest <file>: manifest path (default manifest.json)
--workers <n>: number of parallel processes
--download-dir <dir>: base output folder (default downloads)
--headless / --no-headless: run with or without visible browser
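For reference, a command line with these options can be declared with argparse as sketched below. This is not the project's actual parser. The flag names and the manifest.json and downloads defaults come from the list above, while the default worker count and the default headless setting are assumptions. argparse.BooleanOptionalAction requires Python 3.9 or later.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented process_manifest.py options."""
    parser = argparse.ArgumentParser(description="Process periods listed in a manifest.")
    parser.add_argument("--manifest", default="manifest.json", help="manifest path")
    parser.add_argument(
        "--workers", type=int, default=1,  # default of 1 is an assumption
        help="number of parallel processes",
    )
    parser.add_argument("--download-dir", default="downloads", help="base output folder")
    parser.add_argument(
        "--headless",
        action=argparse.BooleanOptionalAction,  # also generates --no-headless
        default=True,  # assumed default; the README does not state it
        help="run with or without a visible browser",
    )
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```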

Resume Behavior

process_manifest.py updates manifest.json after each period.
If a run fails or is interrupted, re-run the same command: periods already marked as processed are skipped, and the run continues with the remaining ones.
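One way to keep the manifest consistent when a run is interrupted mid-write is to save it to a temporary file and then swap it into place. The sketch below illustrates that idea. It is not taken from process_manifest.py: the function name mark_processed and the year → period code → entry layout are assumptions. It also assumes a single writer, such as the parent process; several workers calling it at once would need a lock around the read-modify-write.

```python
import json
import os
import tempfile
from datetime import datetime, timezone


def mark_processed(manifest_path, year, period_code, forms_processed):
    """Mark one period as done and rewrite the manifest via a temp file."""
    with open(manifest_path, encoding="utf-8") as f:
        manifest = json.load(f)

    entry = manifest[year][period_code]
    entry["status"] = True
    entry["processed_at"] = datetime.now(timezone.utc).isoformat()
    entry["forms_processed"] = forms_processed

    # Write next to the target so os.replace stays on one filesystem.
    directory = os.path.dirname(os.path.abspath(manifest_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(manifest, tmp, indent=2)
        os.replace(tmp_path, manifest_path)  # swap the new file into place
    except BaseException:
        os.unlink(tmp_path)  # clean up the partial temp file, then re-raise
        raise
```

Because the old file is only replaced once the new one is fully written, an interruption leaves either the previous manifest or the updated one, never a truncated file.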

Typical End-to-End Example

# Step 1: build manifest for 2024 only
python generate_manifest.py --start-year 2024 --end-year 2024

# Step 2: process and download using 4 workers
python process_manifest.py --workers 4 --headless

Notes

The scraper depends on website structure; if the DRG page changes, selectors may need updates.
Running with --no-headless is useful for debugging.
