Kabi10/Webscraper

🕷️ Python Web Scraper

Python-based web scraping tool with ethical data collection practices. Features rate limiting, robots.txt compliance, and data validation.

A web scraping tool for automated data collection, built around reliable scheduled runs and ethical scraping practices.

✨ Key Features

🎯 Data Collection

  • Multi-source Scraping: Google Maps reviews and business data
  • Employment Focus: Intelligent detection of employment-related reviews
  • Rate Limiting: Respectful API usage with built-in delays
  • Error Handling: Robust error recovery and retry mechanisms
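The rate limiting and retry behaviour listed above could look roughly like the sketch below. It is not taken from this repository: `RateLimiter`, `fetch_with_retry`, and their parameters are hypothetical names, and exponential backoff is an assumed retry strategy.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class RateLimiter:
    """Hypothetical helper: enforce a minimum interval between calls."""

    def __init__(self, min_interval: float) -> None:
        self.min_interval = min_interval
        self._last_call: Optional[float] = None  # monotonic time of previous call

    def wait(self) -> None:
        """Sleep just long enough to respect min_interval, then record the call."""
        if self._last_call is not None:
            remaining = self.min_interval - (time.monotonic() - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
        self._last_call = time.monotonic()


def fetch_with_retry(fetch: Callable[[], T], limiter: RateLimiter,
                     max_attempts: int = 3, base_delay: float = 1.0) -> T:
    """Call fetch() up to max_attempts times, backing off exponentially."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        limiter.wait()
        try:
            return fetch()
        except Exception:
            # Broad catch for illustration only; real code would catch
            # the specific network errors it expects.
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Passing the request as a zero-argument callable keeps the sketch independent of whichever HTTP or API client the scraper actually uses.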

🤖 Automation & Reliability

  • GitHub Actions: Automated daily runs at midnight UTC
  • Scheduled Execution: Consistent data collection without manual intervention
  • Duplicate Prevention: Smart detection and filtering of duplicate entries
  • Data Validation: Quality checks and data integrity verification
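As an illustration of duplicate prevention, the sketch below fingerprints each review by its normalised company name and text, then skips fingerprints it has already seen. The `company` and `text` keys and both function names are assumptions for this example, not the project's actual schema or API.

```python
import hashlib
from typing import Dict, Iterable, List, Optional, Set


def review_key(review: Dict[str, str]) -> str:
    """Build a stable fingerprint from normalised company name and text."""
    company = review.get("company", "").strip().lower()
    # Collapse runs of whitespace so trivial formatting changes do not matter.
    text = " ".join(review.get("text", "").lower().split())
    return hashlib.sha256(f"{company}\x1f{text}".encode("utf-8")).hexdigest()


def drop_duplicates(reviews: Iterable[Dict[str, str]],
                    seen: Optional[Set[str]] = None) -> List[Dict[str, str]]:
    """Return reviews whose fingerprint is not already in `seen`."""
    seen = set() if seen is None else seen
    unique = []
    for review in reviews:
        key = review_key(review)
        if key not in seen:
            seen.add(key)
            unique.append(review)
    return unique
```

Passing a `seen` set loaded from a previous run would let daily runs skip reviews that were already collected.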

📊 Data Processing

  • CSV Export: Clean, structured output for analysis
  • Employment Scoring: Relevance scoring for employment-related content
  • Data Enrichment: Additional metadata and categorization
  • Analytics Ready: Formatted for immediate use in data analysis tools
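The README does not say how employment scoring is computed. One simple possibility, shown here only as a sketch, is a keyword-based score. The keyword set, the saturation threshold of three hits, and the `employment_score` name are all assumptions.

```python
import re

# Hypothetical keyword list; the project's real criteria are not documented here.
EMPLOYMENT_KEYWORDS = {
    "salary", "manager", "management", "coworkers", "benefits", "overtime",
    "interview", "hired", "shift", "employee", "employer", "workplace",
}


def employment_score(text: str, saturation: int = 3) -> float:
    """Return a 0.0-1.0 relevance score based on distinct keyword hits."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    hits = words & EMPLOYMENT_KEYWORDS
    # Score grows linearly with distinct hits and caps at 1.0.
    return min(1.0, len(hits) / saturation)
```

A review could then be flagged as employment-related when its score crosses a chosen threshold, for example `employment_score(text) >= 0.34`.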

🔒 Ethical Practices

  • Robots.txt Compliance: Respects website scraping policies
  • Rate Limiting: Prevents server overload with controlled request timing
  • Data Privacy: Responsible handling of collected information
  • Legal Compliance: Adheres to web scraping best practices
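A robots.txt check can be done with Python's standard `urllib.robotparser` module. The sketch below parses an inline example file so that it runs without network access. The user agent string, URLs, and `build_parser` helper are placeholders, not values used by this project.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Example robots.txt content (made up for illustration).
ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = build_parser(ROBOTS)
allowed = parser.can_fetch("example-scraper", "https://example.com/reviews")
blocked = parser.can_fetch("example-scraper", "https://example.com/private/data")
delay = parser.crawl_delay("example-scraper")  # None if no Crawl-delay is set
print(allowed, blocked, delay)
```

In a live scraper, `RobotFileParser.set_url()` followed by `read()` fetches the file directly, and the reported crawl delay can feed the rate limiter.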

Setup 🛠️

  1. Clone the repository
  2. Install dependencies:
    pip install -r requirements.txt
  3. Add your Google Maps API key to GitHub Secrets as GOOGLE_MAPS_API_KEY

Usage 🚀

The scraper runs automatically every day at midnight UTC via GitHub Actions. You can also trigger it manually from the Actions tab.

Manual Run

python test_scraper.py

Output 📊

Reviews are saved to company_reviews_new.csv with the following information:

  • Company name
  • Industry
  • Rating
  • Pros/Cons
  • Position
  • Timestamp
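A minimal sketch of writing rows with these fields using the standard `csv` module follows. The column names are guesses based on the list above and may not match the real header of company_reviews_new.csv.

```python
import csv
import io
from typing import Dict, Iterable, TextIO

# Assumed column names, derived from the field list in this README.
FIELDNAMES = ["company", "industry", "rating", "pros", "cons", "position", "timestamp"]


def write_reviews(rows: Iterable[Dict[str, object]], out: TextIO) -> None:
    """Write a header and one CSV line per review; missing fields stay empty."""
    writer = csv.DictWriter(out, fieldnames=FIELDNAMES, extrasaction="ignore")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


# Usage with an in-memory buffer; a real run would pass an open file instead.
buffer = io.StringIO()
write_reviews([{"company": "Acme", "rating": 4.5, "pros": "Friendly team"}], buffer)
print(buffer.getvalue())
```

When writing to a real file, open it with `newline=""` as the `csv` documentation recommends, so line endings are not doubled on Windows.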

Contributing 🤝

Feel free to open issues or submit pull requests!
