Important
DISCLAIMER: FOR EDUCATIONAL PURPOSES ONLY. This project, including the associated website and API feed, is a technical research exercise exploring data normalization across e-commerce platforms for educational and data analysis purposes ONLY. This project does not collect user data, distribute malware, or facilitate commercial transactions.
A high-performance, concurrent web scraper to aggregate the best bra deals from major Indian e-commerce platforms.
- Concurrent Scraping: Scrapes 7 stores simultaneously using a `ThreadPoolExecutor` for maximum speed.
- Multi-Page Crawling: Scrapes multiple pages and search queries per store for comprehensive data collection.
- Centralized Config: Manage all store URLs, selectors, and performance settings in `config.py`.
- Structured Logging: Comprehensive logs in `scraper.log` for debugging and health monitoring.
- Unified Schema: Standardized output format for all products across different platforms.
- Smart Retries: Exponential backoff with jitter to handle network flakes and minor blocks.
- Premium Classification: Automatic premium product identification based on brand and price.
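As a rough illustration of how the concurrency and retry features above can fit together, here is a minimal sketch. It is not the project's actual code: the names `backoff_delay`, `with_retries`, and `scrape_all`, the "full jitter" variant of backoff, and the default values are all assumptions made for the example.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def backoff_delay(attempt, base=1.0, cap=30.0):
    """Full-jitter delay: uniform in [0, min(cap, base * 2**attempt)] seconds."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))


def with_retries(fn, max_attempts=4, sleep=time.sleep):
    """Call fn(); on any exception, wait with backoff and try again.

    The last failure is re-raised once max_attempts is exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(backoff_delay(attempt))


def scrape_all(scrapers, max_workers=7, max_attempts=4):
    """Run each store scraper in its own thread.

    `scrapers` maps a store name to a zero-argument callable (hypothetical
    interface). Returns (results, errors), both keyed by store name, so one
    failing store does not discard the data from the others.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(with_retries, fn, max_attempts): name
            for name, fn in scrapers.items()
        }
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:
                errors[name] = exc
    return results, errors
```

Threads suit this workload because each scraper spends most of its time waiting on network I/O. The jitter spreads retries out so that several stores failing at once do not all retry at the same instant.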
- Install dependencies: `pip install -r requirements.txt`
- Run the main scraper: `python src/bras_scraper.py`
- Check the results: Output is saved in `data/bras_deals.json`.
| Store | Strategy | Status |
|---|---|---|
| Amazon India | HTML Scraping (Search Results) | ✅ Active |
| Flipkart | JSON-LD + HTML Parsing | ✅ Active |
| Myntra | Internal API + Embedded Data | ✅ Active |
| Ajio | REST API | ✅ Active |
| Zivame | REST API + HTML Fallback | ✅ Active |
| Clovia | REST API | ✅ Active |
| Nykaa Fashion | REST API | ✅ Active |
| File | Description |
|---|---|
| `data/bras_deals.json` | Filtered deals (≥20% discount) |
| `data/premium_bras.json` | Premium brand deals only |
| `data/all_bras.json` | All scraped products |
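The relationship between the three output files can be sketched as follows. This is only an illustration under stated assumptions: the `Product` fields, the `PREMIUM_BRANDS` set, the `PREMIUM_MIN_PRICE` threshold, and the rule that a premium product must match on both brand and price are hypothetical. Only the 20% discount cutoff comes from the table above.

```python
from dataclasses import dataclass

MIN_DISCOUNT_PERCENT = 20.0     # cutoff stated in the table above
PREMIUM_BRANDS = {"branda"}     # hypothetical placeholder, lower-cased
PREMIUM_MIN_PRICE = 1500.0      # hypothetical price threshold


@dataclass
class Product:
    """One normalized record, whichever store it came from (assumed fields)."""
    store: str
    brand: str
    title: str
    price: float   # current selling price
    mrp: float     # listed maximum retail price
    url: str

    @property
    def discount_percent(self):
        # Guard against missing or inconsistent prices from a store.
        if self.mrp <= 0 or self.price > self.mrp:
            return 0.0
        return round((self.mrp - self.price) / self.mrp * 100, 1)


def is_premium(product):
    """Assumed rule: known premium brand AND price at or above the threshold."""
    brand = product.brand.strip().lower()
    return brand in PREMIUM_BRANDS and product.price >= PREMIUM_MIN_PRICE


def split_outputs(products):
    """Group products by the output file they would be written to."""
    deals = [p for p in products if p.discount_percent >= MIN_DISCOUNT_PERCENT]
    premium = [p for p in deals if is_premium(p)]
    return {"all_bras": list(products), "bras_deals": deals, "premium_bras": premium}
```

Each list could then be serialized with `json.dump` after converting the records with `dataclasses.asdict`.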
A pre-configured GitHub Actions workflow is available at `.github/workflows/scrape_deals.yml` to run the scraper daily and commit new deals automatically.
Warning
This scraper is for educational and personal use. Always respect robots.txt and the store's Terms of Service.
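One way to act on the robots.txt advice is to check each URL before requesting it. The sketch below uses the standard library's `urllib.robotparser`. The helper names `build_parser` and `allowed`, and the sample rules, are made up for the example and do not reflect any real store's policy.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt):
    """Build a parser from robots.txt content that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed(parser, url, user_agent="*"):
    """Return True if the rules permit `user_agent` to fetch `url`."""
    return parser.can_fetch(user_agent, url)


# Made-up rules for illustration only.
SAMPLE_RULES = """\
User-agent: *
Disallow: /checkout/
"""

parser = build_parser(SAMPLE_RULES)
print(allowed(parser, "https://example.com/search?q=deals"))
print(allowed(parser, "https://example.com/checkout/cart"))
```

Against a live site, `RobotFileParser.set_url()` followed by `read()` downloads the file directly. Parsing a string, as here, keeps the example offline.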
