Two main features:
- Loading Discogs monthly dump data into a PostgreSQL database.
- Extracting additional data (sellers, sale history, etc.) not available via the API or the XML dump.

Other highlights:
- Low-memory XML loading (under 500 MB of memory for a 13 GB xml.gz file).
- Proxy rotation to manage request limits.
- Multi-threaded requests for efficient data extraction.
- PostgreSQL for data storage.
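The low-memory loading works by streaming the XML rather than building a full tree in memory. A minimal standard-library sketch of that approach (the project's own `XMLDataHandler` and parser classes are not shown; `stream_records`, `stream_gzipped`, and the `artist` tag are illustrative assumptions):

```python
import gzip
import xml.etree.ElementTree as ET

def stream_records(fileobj, tag):
    """Yield each matching element from an XML stream one at a time,
    clearing it afterwards so memory stays flat regardless of file size."""
    for _, elem in ET.iterparse(fileobj, events=("end",)):
        if elem.tag == tag:
            yield elem
            elem.clear()  # free the subtree before parsing continues

def stream_gzipped(path, tag):
    """Read a gzipped dump incrementally instead of decompressing it
    up front (hypothetical helper, not part of the project API)."""
    with gzip.open(path, "rb") as f:
        yield from stream_records(f, tag)
```

Because only one record's subtree is alive at any moment, a multi-gigabyte dump can be parsed with a small, roughly constant memory footprint.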
Requirements:
- Python 3.9+
- PostgreSQL
- Clone the repository and navigate into it:

  ```shell
  git clone git@github.com:rezaisrad/discogs.git
  cd discogs
  ```

- (Optional) Set up a virtual environment:

  ```shell
  python3 -m venv venv
  source venv/bin/activate
  ```

- Install dependencies:

  ```shell
  pip install -r requirements.txt
  ```

- Set up environment variables by copying `.env.example` to `.env` and adjusting values:

  ```shell
  cp .env.example .env
  ```
Run the SQL scripts located in `db/` to set up and initialize your database.
Load data from Discogs monthly dumps using `load.py`. This script downloads an XML.gz file, parses the relevant fields, and loads the data into PostgreSQL. Currently it supports loading releases and artists, storing each record with a primary key and a JSONB column named `data`.
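As a sketch of how those rows might be shaped for a batched insert (the `artists` table name, the `ON CONFLICT` clause, and the `to_rows` helper are illustrative assumptions, not the project's actual code):

```python
import json

def to_rows(records):
    """Flatten parsed record dicts into (primary key, JSONB payload)
    tuples matching the two-column layout described above."""
    return [(r["id"], json.dumps(r, sort_keys=True)) for r in records]

# With psycopg2 a batch could then be written in one round trip, e.g.:
#   from psycopg2.extras import execute_values
#   execute_values(cur,
#       "INSERT INTO artists (id, data) VALUES %s ON CONFLICT (id) DO NOTHING",
#       to_rows(records))
```

Keeping the full record in a single JSONB column avoids a wide, migration-heavy schema while still allowing indexed queries on extracted fields.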
Example usage for loading artist data:

```python
handler = XMLDataHandler(
    DATA_URL,
    DESTINATION_DIR,
    data_store,
    ArtistParser(),
)
```

- Use `main.py` to fetch additional information from Discogs based on a set of release IDs. Example query from `QUERY_PATH`:
```sql
SELECT id
FROM releases e
JOIN release_formats f ON f.release_id = e.id
WHERE format_name = 'Vinyl'
  AND release_date BETWEEN '2000-01-01' AND '2002-01-01'
```
- Iterate through the set of `id` values using the `scraper` object:

  ```python
  scraper = Scraper(URL, max_workers=MAX_WORKERS)
  ```
- Insert into your Postgres table using the `BATCH_SIZE` constant:
```python
for i in range(0, len(release_ids), BATCH_SIZE):
    batch_ids = release_ids[i : i + BATCH_SIZE]
    try:
        releases = scraper.run(batch_ids)
        write_to_postgres(p, releases)
    except Exception as e:
        logging.error(f"Error processing batch {i // BATCH_SIZE}: {e}")
```
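The slicing used to form each batch can be factored into a small helper; `chunked` is a hypothetical name for logic equivalent to the `release_ids[i : i + BATCH_SIZE]` slices above:

```python
def chunked(seq, size):
    """Split a sequence into consecutive batches of at most `size` items."""
    return [seq[i : i + size] for i in range(0, len(seq), size)]
```

Batching this way bounds both the number of concurrent requests per `scraper.run` call and the size of each Postgres write, and a failed batch only loses `BATCH_SIZE` records rather than the whole run.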
The `SessionManager` and `ProxyManager` classes ensure efficient and reliable extraction:
- `SessionManager` maintains a session for each thread, using proxies from `ProxyManager`.
- `ProxyManager` handles proxy rotation, selecting a new proxy if the current one fails.
Example:

```python
proxy_manager = ProxyManager(PROXIES_URL)
session_manager = SessionManager(proxy_manager)
scraper = Scraper(proxy_list_url=PROXIES_URL, max_workers=MAX_WORKERS)
```

Each thread created by the `Scraper` uses a unique session and proxy, managed by `SessionManager`. I have had success setting `MAX_WORKERS=32`.
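The rotation logic can be sketched roughly as follows. This is a pure-Python illustration with assumed names (`RoundRobinProxies`, `ThreadSessions`); the real `ProxyManager` fetches its proxy list from `PROXIES_URL`, and `SessionManager` wraps HTTP sessions rather than the plain dict used here:

```python
import threading

class RoundRobinProxies:
    """Hand out proxies in a cycle and retire one when a request
    through it fails (sketch of the ProxyManager behavior)."""

    def __init__(self, proxies):
        self._lock = threading.Lock()
        self._proxies = list(proxies)
        self._i = 0

    def get(self):
        with self._lock:
            proxy = self._proxies[self._i % len(self._proxies)]
            self._i += 1
            return proxy

    def mark_failed(self, proxy):
        with self._lock:
            if proxy in self._proxies:
                self._proxies.remove(proxy)

class ThreadSessions:
    """Give each worker thread its own session object (a plain dict
    stands in for an HTTP session here) bound to a proxy from the pool."""

    def __init__(self, pool):
        self._pool = pool
        self._local = threading.local()

    def get(self):
        if not hasattr(self._local, "session"):
            self._local.session = {"proxy": self._pool.get()}
        return self._local.session
```

Binding one session and proxy per thread avoids lock contention on the hot path: each worker only touches shared state when it first starts or when its proxy fails.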
Run unit tests using pytest:

```shell
pytest tests/
```