A comprehensive web mapping application that discovers and maps connections between websites. Starting from a given URL, it crawls outward to find linked websites. It supports plugins that capture screenshots and detect bounding boxes, and it stores all of this information in a Neo4j database.
The application includes a web-based dashboard for monitoring crawling progress and viewing captured screenshots.
The project requires Python 3.12+. All dependencies are managed through pyproject.toml and can be installed using the Makefile commands.
The application runs as two application containers, a crawler and a dashboard, alongside Neo4j and Selenium services. Start all services using:
```bash
make compose
```
This starts:
- Dashboard: Web interface at http://localhost:8000
- Crawler: Background service for crawling websites
- Neo4j Database: Web interface at http://localhost:7474, database at bolt://localhost:7687
- Selenium: Chrome browser for screenshot capture
To stop all services:
```bash
make stop
```
For local development without Docker, you need to run two separate processes:
- Start a Neo4j database (using Docker or the devcontainer):
```bash
docker run -p 7474:7474 -p 7687:7687 -e NEO4J_AUTH=neo4j/password neo4j:latest
```
- Configure the environment:
```bash
cp .env.example .env
```
Edit .env with your Neo4j credentials:
```
NEO4J_URI="neo4j://localhost:7687"
NEO4J_USERNAME="neo4j"
NEO4J_PASSWORD="password"
```
- Install dependencies:
```bash
make install
```
In separate terminals, run:
Terminal 1 - Dashboard:
```bash
uvicorn dashboard.api:app --host 0.0.0.0 --port 8000
```
Terminal 2 - Crawler:
```bash
python src/webmap/main.py
```
The dashboard will be accessible at http://localhost:8000.
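Before starting the two processes, you can confirm that the database is reachable with the credentials from .env. This is a minimal sketch, assuming the official neo4j Python driver is available and that the NEO4J_* variables from .env are exported in your shell:

```python
import os

from neo4j import GraphDatabase

# Minimal connectivity check. Assumes the NEO4J_* values from .env are
# exported in your shell; substitute the values directly otherwise.
uri = os.environ.get("NEO4J_URI", "neo4j://localhost:7687")
auth = (
    os.environ.get("NEO4J_USERNAME", "neo4j"),
    os.environ.get("NEO4J_PASSWORD", "password"),
)

driver = GraphDatabase.driver(uri, auth=auth)
driver.verify_connectivity()  # raises an exception if Neo4j is unreachable
print(f"Neo4j is reachable at {uri}")
driver.close()
```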
Basic usage of the crawler:
```python
from webmap import Crawler

# Initialize crawler with starting URL
crawler = Crawler("https://example.com")

# Start crawling
crawler.run()
```
The crawler supports plugins for additional functionality:
```python
from webmap import Crawler
from webmap.boundingbox import BoundingBoxCapture

# Initialize crawler with starting URL
crawler = Crawler("https://example.com")

# Add bounding box capture plugin
def capture_bounding_boxes(url: str) -> None:
    capture = BoundingBoxCapture()
    capture.capture_and_save(url)

crawler.add(capture_bounding_boxes)

# Start crawling with plugins
crawler.run()
```
Note: if you also want these results to be visible on the web page, you will have to write a corresponding web plugin. See the screenshot and boundingbox plugins under dashboard/ for examples of how that is done.
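On the dashboard side, such a web plugin boils down to a set of routes mounted into the FastAPI app. The sketch below is hypothetical: the router prefix, route path, and response shape are assumptions, and dashboard/screenshot and dashboard/boundingbox show the real plugin contract:

```python
from fastapi import APIRouter

# Hypothetical web plugin: the prefix, route path, and response shape
# are assumptions; see dashboard/boundingbox for the real implementation.
router = APIRouter(prefix="/boundingbox", tags=["boundingbox"])

@router.get("/{page_id}")
def get_bounding_boxes(page_id: str) -> dict:
    # Look up the stored bounding boxes for a crawled page (stubbed here)
    return {"page_id": page_id, "boxes": []}

# In dashboard/api.py the router would then be mounted with:
#   app.include_router(router)
```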
The dashboard container provides a web interface accessible at http://localhost:8000 when running with Docker Compose. It offers real-time crawling statistics, crawler control, and screenshot viewing.
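The same interface can in principle be queried over HTTP for scripted access. This is a hedged sketch: the /stats endpoint is an assumption, not a documented route, so check dashboard/api.py for the routes the dashboard actually exposes:

```python
import json
import urllib.request

# Hypothetical endpoint: /stats is an assumption, not a documented route;
# see dashboard/api.py for the routes the dashboard actually exposes.
with urllib.request.urlopen("http://localhost:8000/stats") as response:
    stats = json.load(response)

print(stats)
```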
The tools/ directory contains utility scripts:
- boundingbox.py: Capture bounding box screenshots for a given URL
- save_screenshot.py: Retrieve and save screenshots from the database
- clean_database.py: Database maintenance utilities
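As a rough illustration of what save_screenshot.py does, the sketch below pulls a screenshot out of Neo4j with the official Python driver. The Page label, the screenshot property, and the base64 encoding are all assumptions; the script itself is the authoritative version:

```python
import base64

from neo4j import GraphDatabase

# Assumptions: a "Page" node label and a base64-encoded "screenshot"
# property; see tools/save_screenshot.py for the real schema and query.
driver = GraphDatabase.driver("neo4j://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    record = session.run(
        "MATCH (p:Page {url: $url}) RETURN p.screenshot AS screenshot",
        url="https://example.com",
    ).single()
driver.close()

if record and record["screenshot"]:
    with open("screenshot.png", "wb") as f:
        f.write(base64.b64decode(record["screenshot"]))
```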
See src/webmap/main.py for a complete example.
This project includes a devcontainer configuration for development in VS Code with Docker. This provides a consistent development environment with all dependencies pre-installed.
```
src/
├── webmap/          # Crawler application
│   ├── main.py
│   ├── screenshot/  # Screenshot capture
│   ├── boundingbox/ # Bounding box detection
│   └── database/    # Neo4j database integration
├── dashboard/       # Dashboard application
│   ├── api.py       # FastAPI web interface
│   ├── screenshot/  # Screenshot web plugin
│   └── boundingbox/ # Bounding box web plugin
containers/
├── crawler/         # Crawler Docker container
└── dashboard/       # Dashboard Docker container
tools/               # Utility scripts
```
Contributions of any kind are welcome; please follow reasonable standards and the conventions found in docs.
Author: Uplink036
This project is licensed under the MIT License.