This repository contains Python scripts designed to scrape product data from e-commerce websites built on Shopify and WooCommerce platforms. The scraped data is then stored in a MySQL database for further analysis.
The primary goal of these scrapers is to collect detailed product information, including:
- **Shopify Scraper** (`shopify_scraper/`):
  - Product Title
  - Vendor
  - Price
  - Availability
  - Description (HTML)
  - Product Category/Type
  - Product URL
  - Store Name
- **WooCommerce Scraper** (`woo_scraper/`):
  - Product Title
  - Price
  - Tags
  - SKU
  - Product Category (derived from the scraped category page)
  - Product URL
  - Store Name (currently hardcoded for "Barefoot Buttons" but can be adapted)
This data serves as the input for a subsequent analysis pipeline that identifies top-performing products based on various metrics.
```
.
├── shopify_scraper/
│   ├── Scrapping_Shop.py          # Main script for Shopify scraping
│   ├── stores.json                # Configuration file listing Shopify store URLs
│   ├── Dockerfile                 # Dockerfile for containerizing the Shopify scraper
│   └── requirements.txt           # Python dependencies for the Shopify scraper
├── woo_scraper/
│   ├── Scrapping_Woo.py           # Main script for WooCommerce (Barefoot Buttons) scraping
│   ├── barefoot_categories.json   # WooCommerce categories to scrape (e.g., Barefoot Buttons)
│   ├── Dockerfile                 # Dockerfile for containerizing the WooCommerce scraper
│   └── requirements.txt           # Python dependencies for the WooCommerce scraper
└── README.md                      # This file
```
- Python 3.8+
- MySQL Server accessible by the scripts
- Docker (if running via containers)
- Internet access to reach the target e-commerce sites
```bash
git clone https://github.com/Smart-eCommerce-Intelligence/Web-Scrapping
cd Web-Scrapping
```

Ensure you have a MySQL database server running and accessible. The scripts will attempt to create the necessary tables if they don't exist. You'll need to provide database credentials.
- **Shopify Scraper:** expects a database (e.g., `shopify_data`) and will create a `products` table.
- **WooCommerce Scraper:** expects a database (e.g., `web_scraping_db`) and will create a `barefoot_products` table.
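For reference, a minimal sketch of how such a table can be created with `mysql-connector-python`. The column names and types below are assumptions inferred from the field list above, not necessarily the exact schema `Scrapping_Shop.py` creates:

```python
# Hypothetical sketch: create a products table like the one the Shopify
# scraper expects. Column names/types are assumptions, not the real schema.
import mysql.connector

conn = mysql.connector.connect(
    host="localhost", user="root", password="", database="shopify_data"
)
cursor = conn.cursor()
cursor.execute(
    """
    CREATE TABLE IF NOT EXISTS products (
        id INT AUTO_INCREMENT PRIMARY KEY,
        title VARCHAR(512),
        vendor VARCHAR(255),
        price DECIMAL(10, 2),
        available BOOLEAN,
        description TEXT,
        product_type VARCHAR(255),
        product_url VARCHAR(1024),
        store_name VARCHAR(255)
    )
    """
)
conn.commit()
cursor.close()
conn.close()
```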
It's recommended to use a virtual environment:
```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

Install dependencies for each scraper:
```bash
pip install -r shopify_scraper/requirements.txt
pip install -r woo_scraper/requirements.txt
```

`shopify_scraper/requirements.txt` should contain:
```
requests
mysql-connector-python
# argparse is part of the standard library
# json is part of the standard library
```
`woo_scraper/requirements.txt` should contain:
```
requests-html  # For HTML parsing and JavaScript rendering (the current script scrapes static HTML)
mysql-connector-python
# argparse is part of the standard library
# json is part of the standard library
```
- **Shopify Scraper** (`shopify_scraper/stores.json`): update this JSON file with a list of Shopify store base URLs you want to scrape. Example:

  ```json
  [
    "https://www.allbirds.com",
    "https://www.untuckit.com"
  ]
  ```
- **WooCommerce Scraper** (`woo_scraper/barefoot_categories.json`): this file is configured to scrape categories from a specific WooCommerce site (Barefoot Buttons by default). Each entry needs a `name` (the category label stored in the DB) and a `url` (the category page URL). Example:

  ```json
  [
    { "name": "Version 1", "url": "https://barefootbuttons.com/product-category/version-1/" },
    { "name": "Version 2", "url": "https://barefootbuttons.com/product-category/version-2/" }
  ]
  ```
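Both files are plain JSON, so a script can load them with the standard library. A minimal sketch (variable names here are illustrative, not necessarily those used in the scripts):

```python
# Illustrative sketch: load the two JSON config files shown above.
import json

with open("shopify_scraper/stores.json") as f:
    stores = json.load(f)  # a list of store base URLs

with open("woo_scraper/barefoot_categories.json") as f:
    categories = json.load(f)  # a list of {"name": ..., "url": ...} dicts

for category in categories:
    print(category["name"], "->", category["url"])
```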
**Shopify Scraper:**

Navigate to the `shopify_scraper` directory:

```bash
cd shopify_scraper
```

Run the script with the appropriate command-line arguments:

```bash
python Scrapping_Shop.py --db_host <your_db_host> --db_user <your_db_user> --db_password "<your_db_password>" --db_name <shopify_db_name> --stores_file_path stores.json
```

- Replace the placeholders with your actual database credentials and desired database name.
- If `--db_password` is empty, you can omit it or pass `""`.
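The flags above correspond to an `argparse` interface inside the script. A sketch of how such a CLI is typically defined (which flags are required, and the defaults shown, are assumptions):

```python
# Sketch of an argparse CLI matching the flags used above.
# Required/optional status and defaults are assumptions.
import argparse

parser = argparse.ArgumentParser(description="Scrape Shopify stores into MySQL.")
parser.add_argument("--db_host", required=True, help="MySQL host")
parser.add_argument("--db_user", required=True, help="MySQL user")
parser.add_argument("--db_password", default="", help="MySQL password (may be empty)")
parser.add_argument("--db_name", required=True, help="target database name")
parser.add_argument("--stores_file_path", default="stores.json",
                    help="path to the JSON file listing store URLs")
args = parser.parse_args()
print(f"Connecting to {args.db_host} as {args.db_user} ...")
```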
**WooCommerce Scraper:**

Navigate to the `woo_scraper` directory:

```bash
cd woo_scraper
```

Run the script:

```bash
python Scrapping_Woo.py --db_host <your_db_host> --db_user <your_db_user> --db_password "<your_db_password>" --db_name <woocommerce_db_name> --categories_file_path barefoot_categories.json
```

- Replace the placeholders accordingly.
Dockerfiles are provided for each scraper. These are typically built and run as part of a larger CI/CD pipeline (e.g., using Jenkins and Kubernetes).
To build an image (example for the Shopify scraper):

```bash
cd shopify_scraper
docker build -t shopify-scraper:latest .
```

To run a container (example for the Shopify scraper, connecting to a host MySQL):
Ensure your MySQL database is accessible from Docker containers. For Minikube or Docker Desktop, `host.docker.internal` (on Mac/Windows) or your host's actual IP usually works for `DB_HOST`.
```bash
docker run -it --rm \
  -e DB_HOST="host.docker.internal" \
  -e DB_USER="your_db_user" \
  -e DB_PASSWORD="your_db_password" \
  -e DB_NAME="shopify_data" \
  -e STORES_FILE_PATH_IN_CONTAINER="stores.json" \
  shopify-scraper:latest
```

- The `CMD` in the Dockerfile uses these environment variables to construct the command-line arguments for the Python script; a sketch of that pattern follows below.
- Adjust the environment variables (`-e`) as needed for your setup. The Dockerfiles define `_DEFAULT` values for these, which are used if the runtime environment variables are not set.
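To illustrate that pattern, here is a hypothetical `entrypoint.py` expressing the same env-var-with-`_DEFAULT`-fallback logic in Python (the actual Dockerfiles implement this in their `CMD` instead):

```python
# Hypothetical entrypoint.py: build the scraper command from environment
# variables, falling back to the *_DEFAULT values baked into the image.
import os
import subprocess
import sys

def env_or_default(name):
    # Prefer the runtime variable (e.g., DB_HOST), else DB_HOST_DEFAULT.
    return os.environ.get(name) or os.environ.get(f"{name}_DEFAULT", "")

cmd = [
    sys.executable, "Scrapping_Shop.py",
    "--db_host", env_or_default("DB_HOST"),
    "--db_user", env_or_default("DB_USER"),
    "--db_password", env_or_default("DB_PASSWORD"),
    "--db_name", env_or_default("DB_NAME"),
    "--stores_file_path", env_or_default("STORES_FILE_PATH_IN_CONTAINER"),
]
subprocess.run(cmd, check=True)
```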
Both scripts accept command-line arguments to configure database connections and input file paths. Use the `-h` or `--help` flag to see all available options:
```bash
python shopify_scraper/Scrapping_Shop.py --help
python woo_scraper/Scrapping_Woo.py --help
```

- **Ethical Scraping:** Always be respectful of the websites you are scraping.
  - Do not send too many requests in a short period. The scripts include `time.sleep()` calls, but you might need to adjust them.
  - Check the website's `robots.txt` file and Terms of Service to understand their policies on scraping.
  - Identify your scraper with a clear User-Agent string (see `REQUEST_HEADERS` in `Scrapping_Shop.py` and `HEADERS` in `Scrapping_Woo.py`); a polite-request sketch follows after this list.
- Website Structure Changes: Web scrapers are brittle. If the target website's HTML structure changes, the scrapers might break and will need to be updated.
- Error Handling: The scripts include basic error handling, but more robust mechanisms could be added for production use.
- Rate Limiting/IP Bans: Aggressive scraping can lead to your IP address being blocked. Consider using proxies or rotating IP addresses for large-scale scraping (though this is beyond the scope of these basic scripts).
- **Shopify `/products.json` Endpoint:** The Shopify scraper relies on the `/products.json` endpoint, which provides a structured way to get product data. Some stores might disable or protect this endpoint. A paging sketch follows after this list.
- **WooCommerce Scraping:** The WooCommerce scraper performs HTML scraping, which is more prone to breaking than API-based methods. If a WooCommerce site offers a REST API and you have access to it, using that would be more reliable.
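On the ethical-scraping points above, a minimal sketch of a descriptive User-Agent combined with throttling. The actual header values live in `REQUEST_HEADERS` / `HEADERS` inside the scripts; everything below (bot name, contact address, URLs, delay) is illustrative:

```python
# Illustrative only: a descriptive User-Agent plus a pause between requests.
# The real header values are defined in REQUEST_HEADERS / HEADERS.
import time

import requests

HEADERS = {
    "User-Agent": "ProductResearchBot/1.0 (contact: you@example.com)",  # example value
}

urls = ["https://example.com/page-1", "https://example.com/page-2"]
for url in urls:
    response = requests.get(url, headers=HEADERS, timeout=30)
    response.raise_for_status()
    # ... parse the response here ...
    time.sleep(2)  # polite delay between requests; tune as needed
```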
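And for the `/products.json` endpoint, a minimal paging sketch. The `limit`/`page` query parameters are common Shopify behavior but not guaranteed on every store, and this is not the exact logic of `Scrapping_Shop.py`:

```python
# Minimal sketch: page through a Shopify store's /products.json endpoint.
# Not the exact logic of Scrapping_Shop.py; error handling kept minimal.
import time

import requests

def fetch_all_products(base_url):
    """Return the combined product list from every page of /products.json."""
    products, page = [], 1
    while True:
        resp = requests.get(
            f"{base_url}/products.json",
            params={"limit": 250, "page": page},
            timeout=30,
        )
        resp.raise_for_status()
        batch = resp.json().get("products", [])
        if not batch:
            break  # an empty page means we've seen everything
        products.extend(batch)
        page += 1
        time.sleep(1)  # throttle between pages
    return products

print(len(fetch_all_products("https://www.allbirds.com")))
```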