This project automates the scraping, cleaning, deduplication, and visualization of New Hampshire Secretary of State (NH SOS) business records. It is designed to identify and track trucking-related businesses across New England by searching for industry-specific terms (e.g., truck, freight, towing, excavation), cleaning and standardizing records, and storing results in CSV files and a SQLite database.
The pipeline integrates Playwright for browser automation, pandas for data processing, and Datasette for interactive exploration.
- Automated Web Scraping: uses Playwright to query NH SOS with multiple trucking-related search terms.
- State Persistence: saves progress and resumes scraping if interrupted (`nh_scraper_state.json`).
- Data Cleaning: standardizes names, addresses, and agent information.
- Deduplication: removes duplicates by exact-match checking against:
  - Existing HubSpot company records (`hubspot.csv`)
  - Previous scraping outputs (`targets_*.csv`)
- CSV & Database Outputs:
  - `progress.csv`: tracks scraped businesses across search terms.
  - `matched.csv`: businesses matched against existing datasets.
  - `final_data.csv`: cleaned, deduplicated dataset of active businesses.
- Visualization: converts the final data into a SQLite database and serves it with Datasette.
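The state-persistence feature can be sketched as follows. This is a minimal illustration, not the actual contents of `state.py`: the JSON layout shown (completed page numbers keyed by search term) is an assumption.

```python
# Hypothetical sketch of the resume logic in state.py; the JSON layout
# (completed pages keyed by search term) is an assumption, not the real schema.
import json
import os

STATE_FILE = "nh_scraper_state.json"

def load_state():
    """Return saved progress, or a fresh state if no state file exists."""
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return json.load(f)
    return {"completed_pages": {}}

def save_state(state):
    """Persist progress so an interrupted run can pick up where it left off."""
    with open(STATE_FILE, "w") as f:
        json.dump(state, f, indent=2)
```

Because the state file is rewritten after each page, an interrupted run loses at most the page in flight.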
```
.
├── main.py                 # Entry point: orchestrates scraping + processing
├── scraper.py              # Core scraping logic
├── browser.py              # Browser automation with Playwright
├── state.py                # State persistence (progress across terms/pages)
├── processor.py            # Cleans, filters, deduplicates scraped data
├── cleaner.py              # Utility functions for cleaning names, addresses, agents
├── deduplicator.py         # Deduplicates against HubSpot + target datasets
├── visuals.py              # SQLite + Datasette visualization
├── config.py               # Central configuration (files, search terms, settings)
├── csvs/                   # Data storage (progress.csv, matched.csv, final_data.csv, targets_*.csv)
└── nh_scraper_state.json   # State file tracking completed pages
```
- Run the scraper:

  ```bash
  python main.py
  ```

  - Iterates through search terms defined in `config.py`.
  - Scrapes businesses from NH SOS with Playwright.
  - Saves intermediate results in `progress.csv`.
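The intermediate-results step can be sketched with the standard library. The column names below are illustrative only, not the actual `progress.csv` schema:

```python
# Sketch of appending scraped rows to progress.csv as each page completes;
# the column names here are assumptions, not the project's real schema.
import csv
import os

PROGRESS_CSV = "progress.csv"
FIELDS = ["search_term", "business_name", "status"]

def append_progress(rows):
    """Append scraped rows, writing the header only when the file is new."""
    is_new = not os.path.exists(PROGRESS_CSV)
    with open(PROGRESS_CSV, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerows(rows)
```

Appending after every page (rather than writing once at the end) means a crash mid-term still leaves all completed pages on disk.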
- Process scraped data (`processor.py`):
  - Filters for active businesses only (`Good Standing`, `Active`).
  - Cleans business names, addresses, and agent details.
  - Deduplicates against HubSpot and existing `targets_*.csv` files.
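The filtering and cleaning steps above might look like the following. These are hypothetical helpers in the spirit of `cleaner.py`; the project's exact normalization rules may differ:

```python
# Hypothetical normalization helpers; the actual rules in cleaner.py may differ.
import re

# Status values the pipeline keeps, per the processing step above.
ACTIVE_STATUSES = {"Good Standing", "Active"}

def clean_name(raw):
    """Collapse internal whitespace, uppercase, and drop trailing punctuation."""
    name = re.sub(r"\s+", " ", raw).strip().upper()
    return re.sub(r"[.,]+$", "", name)

def is_active(status):
    """True only for records that survive the active-business filter."""
    return status.strip() in ACTIVE_STATUSES
```

Normalizing names before comparison is what makes the exact-match deduplication in the next step reliable across differently formatted sources.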
- Generate outputs:
  - `progress.csv`: all scraped records with status tracking.
  - `matched.csv`: records matched to existing datasets.
  - `final_data.csv`: final cleaned, deduplicated dataset.
- Visualize data:

  ```bash
  python visuals.py
  ```

  - Loads `final_data.csv` into SQLite (`nh_sos_data.db`).
  - Launches Datasette for querying and visualization.
- Scraping the term "truck" collects 175 pages of businesses.
- State is stored in `nh_scraper_state.json` so the run can resume if stopped.
- Deduplication removes overlaps with `hubspot.csv` and previous `targets_*.csv` files.
- Cleaned results are written to `final_data.csv` for further use.
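The exact-match deduplication described above can be sketched as a set lookup over normalized names. The field name `"name"` is illustrative; `deduplicator.py` may key on different columns:

```python
# Exact-match deduplication sketch; the "name" field is an assumption about
# how deduplicator.py keys records against hubspot.csv and targets_*.csv.
def dedupe(scraped, existing_names):
    """Keep scraped rows whose normalized name is not already known."""
    seen = {n.strip().upper() for n in existing_names}
    fresh = []
    for row in scraped:
        key = row["name"].strip().upper()
        if key not in seen:
            fresh.append(row)
            seen.add(key)  # also drop duplicates within the scrape itself
    return fresh
```

Because matching is exact after normalization, near-duplicates like "Acme Trucking LLC" vs. "Acme Trucking, L.L.C." slip through; that gap is what the fuzzy-deduplication item under future improvements targets.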
- Python 3.9+
- Dependencies:

  ```bash
  pip install pandas playwright sqlite-utils datasette
  ```

- Playwright browser setup:

  ```bash
  playwright install
  ```
- Start a new scrape:

  ```bash
  python main.py
  ```

- Resume after interruption: state auto-loads from `nh_scraper_state.json`.
- Process and clean existing data:

  ```bash
  python processor.py
  ```

- Explore results in the browser:

  ```bash
  python visuals.py
  ```
- `progress.csv`: raw scrape progress.
- `matched.csv`: businesses matched against existing datasets.
- `final_data.csv`: cleaned, deduplicated, active businesses (main deliverable).
- `nh_sos_data.db`: SQLite database for visualization.
- Add fuzzy deduplication (beyond exact matches).
- Expand beyond NH SOS to other state registries.
- Build Power BI / dashboard integration for executives.