ShivaniJo/Python-Workbook-

Ethical Web Scraping Project (Python)

📌 Overview

This project demonstrates three types of ethical Python web scrapers:

  • Static scraping using BeautifulSoup
  • Scalable scraping using Scrapy
  • Dynamic JS scraping using Playwright

All scripts are designed with built-in middleware to respect ethical and legal constraints, including robots.txt, User-Agent transparency, and rate limiting.
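As a rough illustration of the static case, the sketch below parses an inline HTML string with BeautifulSoup and pulls out link text and targets. It is not the repository's `beautifulsoup_scraper.py`; the sample markup, the `SAMPLE_HTML` constant, and the `extract_links` helper are all made up for this example.

```python
from bs4 import BeautifulSoup

# Inline sample page (hypothetical) so the sketch runs without contacting any website.
SAMPLE_HTML = """
<html><body>
  <h1>Example catalogue</h1>
  <ul>
    <li><a href="/item/1">First item</a></li>
    <li><a href="/item/2">Second item</a></li>
    <li><a>No target here</a></li>
  </ul>
</body></html>
"""


def extract_links(html):
    """Return (text, href) pairs for every anchor that carries an href attribute."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        (a.get_text(strip=True), a["href"])
        for a in soup.find_all("a", href=True)  # href=True skips anchors without a target
    ]


links = extract_links(SAMPLE_HTML)
for text, href in links:
    print(f"{text} -> {href}")
```

In a real run the HTML would come from an HTTP response, fetched only after the ethical checks described in this README have passed.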

📁 Project Structure

    web_scraping_project/
    ├── beautifulsoup_scraper.py   # Static HTML scraper
    ├── scrapy_spider.py           # Asynchronous crawler with Scrapy
    ├── playwright_scraper.py      # Dynamic scraper with Playwright
    ├── ethical_mw.py              # Middleware enforcing ethical rules
    ├── requirements.txt           # Python dependencies
    └── README.md                  # Project documentation

⚙️ Setup Instructions

  1. Clone or download this repository.
  2. Install required packages:

    pip install -r requirements.txt
    playwright install  # Needed for browser automation

For Jupyter Notebook users (the notebook already runs an asyncio event loop, so nested loops must be enabled first):

    import nest_asyncio
    nest_asyncio.apply()

🚀 Running the Scrapers

  • BeautifulSoup Scraper: python beautifulsoup_scraper.py
  • Scrapy Spider: scrapy runspider scrapy_spider.py
  • Playwright Scraper: python playwright_scraper.py

✅ Ethical Middleware Highlights (ethical_mw.py)

  • Respects robots.txt (via urllib.robotparser)
  • Delays requests (rate limit: 1 request per second)
  • Custom User-Agent: identifies scraper with contact URL/email
  • Logs scraping actions (can be extended to database or file)
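The four behaviours listed above can be sketched with the standard library alone. This is a minimal illustration, not the contents of `ethical_mw.py`: the `EthicalGate` class, its method names, the `USER_AGENT` string, and the `example.org` URLs are hypothetical, and the real module may be organised differently.

```python
import logging
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("ethical_mw_sketch")

# Hypothetical identity string; replace with a real contact URL or email.
USER_AGENT = "EthicalResearchBot/0.1 (+https://example.org/contact)"
MIN_DELAY = 1.0  # seconds between requests, i.e. at most 1 request per second


class EthicalGate:
    """Decides whether and when a URL may be fetched."""

    def __init__(self, user_agent=USER_AGENT, min_delay=MIN_DELAY):
        self.user_agent = user_agent
        self.min_delay = min_delay
        self._parsers = {}          # one robots.txt parser per site origin
        self._last_request = None   # monotonic timestamp of the previous request

    def load_rules(self, origin, robots_lines):
        """Register robots.txt rules from lines already in memory (no network)."""
        parser = RobotFileParser()
        parser.parse(robots_lines)
        self._parsers[origin] = parser

    def fetch_rules(self, origin):
        """Download and register robots.txt for an origin such as https://example.org."""
        parser = RobotFileParser(origin + "/robots.txt")
        parser.read()
        self._parsers[origin] = parser

    def allowed(self, url):
        """Check the URL against the origin's robots.txt and log the decision."""
        parts = urlsplit(url)
        origin = f"{parts.scheme}://{parts.netloc}"
        if origin not in self._parsers:
            self.fetch_rules(origin)
        ok = self._parsers[origin].can_fetch(self.user_agent, url)
        log.info("%s %s", "ALLOW" if ok else "BLOCK", url)
        return ok

    def wait_turn(self):
        """Sleep just long enough to keep min_delay seconds between requests."""
        if self._last_request is not None:
            remaining = self.min_delay - (time.monotonic() - self._last_request)
            if remaining > 0:
                time.sleep(remaining)
        self._last_request = time.monotonic()


# Offline demo: rules are supplied inline instead of being downloaded.
demo = EthicalGate()
demo.load_rules("https://example.org", ["User-agent: *", "Disallow: /private/"])
demo.wait_turn()
print(demo.allowed("https://example.org/index.html"))
print(demo.allowed("https://example.org/private/report.html"))
```

A scraper would call `wait_turn()` and `allowed(url)` before each request and send `USER_AGENT` in the request headers. Logging goes through the standard `logging` module here, so pointing it at a file or database handler is a configuration change rather than a code change.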

📚 Research Context

This project supports the academic paper “Advanced Web Scraping with Python: Ethical, Legal, and Technical Challenges”. It benchmarks popular scraping tools and proposes best practices for responsible automation.

👤 Author

Shivani Joisar
Master’s Student in Computer Science, IU International University of Applied Sciences

🪪 License

MIT License, for educational and research purposes only.


🛑 Disclaimer

This project is intended for ethical research and educational use only. Do not scrape websites without explicit permission or in violation of their terms of service.
