Two main features:
- Loading Discogs monthly dump data into a PostgreSQL database.
- Extracting additional data (sellers, sale history, etc.) not available via the API or the XML dump.

Other highlights:
- Low-memory XML loading (under 500 MB of memory for a 13 GB xml.gz file).
- Proxy rotation to manage request limits.
- Multi-threaded requests for efficient data extraction.
- PostgreSQL for data storage.
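The low-memory loading works by streaming the XML rather than building a full tree in memory. A minimal standard-library sketch of that approach (the project's own `XMLDataHandler` and parser classes are not shown; `stream_records`, `stream_gzipped`, and the `artist` tag are illustrative assumptions):

```python
import gzip
import xml.etree.ElementTree as ET

def stream_records(fileobj, tag):
    """Yield each matching element from an XML stream one at a time,
    clearing it afterwards so memory stays flat regardless of file size."""
    for _, elem in ET.iterparse(fileobj, events=("end",)):
        if elem.tag == tag:
            yield elem
            elem.clear()  # free the subtree before parsing continues

def stream_gzipped(path, tag):
    """Read a gzipped dump incrementally instead of decompressing it
    up front (hypothetical helper, not part of the project API)."""
    with gzip.open(path, "rb") as f:
        yield from stream_records(f, tag)
```

Because only one record's subtree is alive at any moment, a multi-gigabyte dump can be parsed with a small, roughly constant memory footprint.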
Requirements:
- Python 3.9+
- PostgreSQL
- Clone the repository and navigate into it:

  ```shell
  git clone git@github.com:rezaisrad/discogs.git
  cd discogs
  ```

- (Optional) Set up a virtual environment:

  ```shell
  python3 -m venv venv
  source venv/bin/activate
  ```

- Install dependencies:

  ```shell
  pip install -r requirements.txt
  ```

- Set up environment variables by copying `.env.example` to `.env` and adjusting values:

  ```shell
  cp .env.example .env
  ```
Run the SQL scripts located in `db/` to set up and initialize your database.
Load data from Discogs monthly dumps using `load.py`. This script downloads an XML.gz file, parses the relevant fields, and loads the data into PostgreSQL. Currently it supports loading releases and artists, storing each record with a primary key and a JSONB column named `data`.
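As a sketch of how those rows might be shaped for a batched insert (the `artists` table name, the `ON CONFLICT` clause, and the `to_rows` helper are illustrative assumptions, not the project's actual code):

```python
import json

def to_rows(records):
    """Flatten parsed record dicts into (primary key, JSONB payload)
    tuples matching the two-column layout described above."""
    return [(r["id"], json.dumps(r, sort_keys=True)) for r in records]

# With psycopg2 a batch could then be written in one round trip, e.g.:
#   from psycopg2.extras import execute_values
#   execute_values(cur,
#       "INSERT INTO artists (id, data) VALUES %s ON CONFLICT (id) DO NOTHING",
#       to_rows(records))
```

Keeping the full record in a single JSONB column avoids a wide, migration-heavy schema while still allowing indexed queries on extracted fields.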
Example usage for loading artist data:

```python
handler = XMLDataHandler(
    DATA_URL,
    DESTINATION_DIR,
    data_store,
    ArtistParser(),
)
```

- Use `main.py` to fetch additional information from Discogs based on a set of release IDs. Example query from `QUERY_PATH`:
```sql
SELECT id
FROM releases e
JOIN release_formats f ON f.release_id = e.id
WHERE format_name = 'Vinyl'
  AND release_date BETWEEN '2000-01-01' AND '2002-01-01'
```
- Iterate through the set of `id` values using the `scraper` object:

  ```python
  scraper = Scraper(URL, max_workers=MAX_WORKERS)
  ```
- Insert into your Postgres table using the `BATCH_SIZE` constant:
```python
for i in range(0, len(release_ids), BATCH_SIZE):
    batch_ids = release_ids[i : i + BATCH_SIZE]
    try:
        releases = scraper.run(batch_ids)
        write_to_postgres(p, releases)
    except Exception as e:
        logging.error(f"Error processing batch {i // BATCH_SIZE}: {e}")
```
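The slicing used to form each batch can be factored into a small helper; `chunked` is a hypothetical name for logic equivalent to the `release_ids[i : i + BATCH_SIZE]` slices above:

```python
def chunked(seq, size):
    """Split a sequence into consecutive batches of at most `size` items."""
    return [seq[i : i + size] for i in range(0, len(seq), size)]
```

Batching this way bounds both the number of concurrent requests per `scraper.run` call and the size of each Postgres write, and a failed batch only loses `BATCH_SIZE` records rather than the whole run.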
The `SessionManager` and `ProxyManager` classes ensure efficient and reliable extraction:
- `SessionManager` maintains a session for each thread, using proxies from `ProxyManager`.
- `ProxyManager` handles proxy rotation, selecting a new proxy if the current one fails.
Example:

```python
proxy_manager = ProxyManager(PROXIES_URL)
session_manager = SessionManager(proxy_manager)
scraper = Scraper(proxy_list_url=PROXIES_URL, max_workers=MAX_WORKERS)
```

Each thread created by the `Scraper` uses a unique session and proxy, managed by `SessionManager`. I have had success setting `MAX_WORKERS=32`.
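The rotation logic can be sketched roughly as follows. This is a pure-Python illustration with assumed names (`RoundRobinProxies`, `ThreadSessions`); the real `ProxyManager` fetches its proxy list from `PROXIES_URL`, and `SessionManager` wraps HTTP sessions rather than the plain dict used here:

```python
import threading

class RoundRobinProxies:
    """Hand out proxies in a cycle and retire one when a request
    through it fails (sketch of the ProxyManager behavior)."""

    def __init__(self, proxies):
        self._lock = threading.Lock()
        self._proxies = list(proxies)
        self._i = 0

    def get(self):
        with self._lock:
            proxy = self._proxies[self._i % len(self._proxies)]
            self._i += 1
            return proxy

    def mark_failed(self, proxy):
        with self._lock:
            if proxy in self._proxies:
                self._proxies.remove(proxy)

class ThreadSessions:
    """Give each worker thread its own session object (a plain dict
    stands in for an HTTP session here) bound to a proxy from the pool."""

    def __init__(self, pool):
        self._pool = pool
        self._local = threading.local()

    def get(self):
        if not hasattr(self._local, "session"):
            self._local.session = {"proxy": self._pool.get()}
        return self._local.session
```

Binding one session and proxy per thread avoids lock contention on the hot path: each worker only touches shared state when it first starts or when its proxy fails.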
Run unit tests using pytest:

```shell
pytest tests/
```