Welcome to the GitHub Scrapper, the ultimate tool for extracting real-world GitHub repositories and simulating user activity like a pro. Whether you're populating a dataset for analysis or building a Data Lake for testing purposes, this scrapper does the heavy lifting for you—efficiently, securely, and without cutting corners.
-
Scrape Repositories at Scale: Harvest data from GitHub like a boss. Configure it to pull projects, metadata, and more—perfect for testing, analytics, or just flexing your scrapping skills.
-
Dockerized for Portability: Spin it up anywhere with ease. Our Docker setup ensures consistency across environments, whether you’re working locally or deploying in the cloud.
-
Seamless S3 Integration: Automatically upload scraped repositories to your Amazon S3 Data Lake. Designed for scalability, this integration keeps your datasets centralized and ready for action.
-
Configurable and Flexible: Fine-tune parameters like repository types, search keywords, and scraping depth. Tailor it to your needs and extract only what matters.
- Docker: Make sure you’ve got Docker installed. If not, get it here.
- AWS Credentials: Ensure your AWS credentials are set up for accessing your S3 bucket.
-
Clone the repo:
git clone https://github.com/impoflow/github-scrapper.git cd github-scrapper -
Build the Docker image:
docker build -t github-scrapper . -
Run the container:
docker run -e AWS_ACCESS_KEY_ID=<your-access-key> -e AWS_SECRET_ACCESS_KEY=<your-secret-key> -e S3_BUCKET=<your-s3-bucket> github-scrapper
That’s it. The scrapper will start fetching projects and uploading them to your S3 bucket like a well-oiled machine.
- Simulate User Activity: Populate your systems with real-world projects for testing and scaling experiments.
- Data Analysis: Build datasets for machine learning, analytics, or research.
- Testing Workflows: Use the scrapped repositories to test pipelines, workflows, or cloud infrastructure.
This scrapper respects GitHub’s API rate limits and policies. Make sure to provide a valid GitHub Personal Access Token to avoid interruptions during scraping.
We welcome contributions! Here’s how you can help:
- Fork the repository.
- Create a feature branch:
git checkout -b my-new-feature. - Commit your changes:
git commit -am 'Add some feature'. - Push to the branch:
git push origin my-new-feature. - Open a Pull Request.
Developed and maintained by the brilliant team at Impoflow. If you have questions or suggestions, feel free to open an issue or drop us a message.
This project is licensed under the MIT License. See the LICENSE file for details.