This repository was archived by the owner on Aug 7, 2025. It is now read-only.

msu-denver/COCrawlerWiki


Colorado Crawler Wikipedia

Welcome to the Colorado Crawler Wikipedia project, a tool for extracting and analyzing web data from Wikipedia pages related to Colorado. The extracted text can be used to train NLP models on vocabulary relevant to your needs.

Features

  • Web Scraping: Use wikipedia_scraper.py to crawl web pages and gather the data you need.
  • Easy Setup: Quick installation with a single command: pip install seleniumbase

Installing

  • Clone the repository and set up the environment:
git clone https://github.com/LukeFarch/COCrawlerWiki.git
cd COCrawlerWiki
  • Change the paths in the code as necessary to match your environment and needs.

Executing the Program

  • To start crawling Wikipedia for Colorado-related pages, run:
python wikipedia_crawler.py
  • Want a word count? word_count.py reports how many output files contain fewer than 10 words (i.e., failed crawls). Adjust the threshold as needed:
python word_count.py
  • Follow the prompts on screen to start crawling cities or counties
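The failed-file check described above can be sketched in a few lines. This is a minimal illustration, not the repository's actual word_count.py; the function name and the assumption that crawled pages live as .txt files in one output directory are mine:

```python
from pathlib import Path

def count_failed_files(output_dir, min_words=10):
    """Count .txt files in output_dir whose text has fewer than min_words words."""
    failed = 0
    for path in sorted(Path(output_dir).glob("*.txt")):
        # Split on whitespace to get a rough word count for the file.
        words = path.read_text(encoding="utf-8", errors="ignore").split()
        if len(words) < min_words:
            failed += 1
    return failed
```

Raising or lowering min_words changes what counts as a failed crawl.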

  • Do you have an ArcGIS API and want to scrape all of the parameters without doing it manually? Use this file and specify your Chrome driver path to run this Selenium bot.

  • Don't know where your chromedriver is? https://www.browserstack.com/guide/run-selenium-tests-using-selenium-chromedriver

  • Windows: Run this PowerShell command to find it: Get-ChildItem -Path "C:\" -Recurse -Filter "chromedriver.exe" -ErrorAction SilentlyContinue

python arcgis_finder.py
  • Follow the prompts on screen to start crawling cities or counties
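If chromedriver is already on your PATH, a quick cross-platform check from Python can find it without recursively searching the disk. This is a small sketch, not part of the repository:

```python
import shutil

# Look up "chromedriver" on the PATH; returns the full path, or None if absent.
driver_path = shutil.which("chromedriver")
print(driver_path or "chromedriver not found on PATH")
```

Whatever path this prints (or the one found by the PowerShell command above) is what the Selenium bot expects as its Chrome driver path.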

About

This crawls Wikipedia for Colorado-related keywords and saves the extracted text as .txt files.
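The two core ideas, matching pages against Colorado keywords and saving each page's text as a .txt file, can be sketched as below. The keyword list, function names, and filename scheme are hypothetical illustrations, not the repository's actual code:

```python
import re
from pathlib import Path

# Hypothetical example keywords; adjust to your needs.
COLORADO_KEYWORDS = ("colorado", "denver", "boulder")

def is_colorado_related(title):
    """Keep only pages whose title mentions a Colorado keyword."""
    t = title.lower()
    return any(kw in t for kw in COLORADO_KEYWORDS)

def save_page_text(title, text, out_dir="output"):
    """Write page text to <out_dir>/<safe_title>.txt and return the path."""
    # Collapse anything that isn't a word character or hyphen into "_".
    safe = re.sub(r"[^\w\-]+", "_", title).strip("_")
    path = Path(out_dir) / f"{safe}.txt"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return path
```

Filtering by title before fetching keeps the crawl focused and avoids downloading unrelated pages.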
