This is a Flask-based web service that categorizes websites by extracting their metadata, computing text embeddings, and matching them against predefined categories using cosine similarity.
✅ Extracts text and metadata from URLs or HTML files
✅ Uses machine learning embeddings to analyze website content
✅ Uses a caching system to avoid redundant processing
✅ Matches websites to categories with cosine similarity
✅ Handles web scraping challenges like Cloudflare protection (still experimental)
- Extract Content: Parses the website, retrieving metadata (title, description, keywords) and text.
- Compute Embeddings: Uses the Ollama `nomic-embed-text` model to generate text embeddings.
- Find Similar Categories: Compares embeddings against a predefined tag database to find the most relevant categories.
- Return Results: Responds with the top matching categories and similarity scores.
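The matching step above can be sketched in plain Python. The function and parameter names below (`find_most_similar_tags`, `threshold`) mirror those mentioned in the customization notes, but the exact signature and the structure of the tag database are assumptions; the field names follow the example response format:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot(a, b) / (|a| * |b|).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def find_most_similar_tags(page_embedding, tag_db, threshold=0.5, top_k=3):
    # tag_db: list of entries with precomputed embeddings, e.g.
    # {"tags": [...], "tags_description": "...", "embedding": [...]}
    # (this structure is an assumption for illustration).
    scored = [
        {
            "tags": entry["tags"],
            "tags_description": entry["tags_description"],
            "similarity_score": cosine_similarity(page_embedding, entry["embedding"]),
        }
        for entry in tag_db
    ]
    # Keep only matches above the sensitivity threshold, best first.
    matches = [s for s in scored if s["similarity_score"] >= threshold]
    matches.sort(key=lambda s: s["similarity_score"], reverse=True)
    return matches[:top_k]
```

In the real service the embeddings come from the `nomic-embed-text` model rather than being hand-written vectors; the comparison logic is the same.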
- Python 3.8+
- Ollama installed on your localhost with the 'nomic-embed-text' model.
- `pip` to install dependencies
- **Install Ollama**
  Ollama provides an API for running machine learning models locally. Follow these steps:
  - Visit the Ollama installation page and download the installer for your platform.
  - After downloading, follow the prompts to install Ollama.
- **Install the `nomic-embed-text` model**
  Once Ollama is installed, download the model by running:

  ```
  ollama pull nomic-embed-text
  ```
- **Verify the installation**
  To check that Ollama and the model were installed successfully, run:

  ```
  ollama list
  ```

  This shows the models available locally. Ensure `nomic-embed-text` is listed.
- **Clone the repository** and navigate into the project directory:

  ```
  git clone https://github.com/your-repo/website-categorizer.git
  cd website-categorizer
  ```

- **Create a virtual environment:**

  ```
  python3 -m venv venv
  ```
- **Activate the virtual environment:**
  - On Windows:

    ```
    venv\Scripts\activate
    ```

  - On macOS/Linux:

    ```
    source venv/bin/activate
    ```
- **Install the required dependencies:**

  ```
  pip install -r requirements.txt
  ```
- **Start the Flask application** after installing the dependencies:

  ```
  python3 web-numpy.app
  ```

  The application should now be running, and you can start using the API.
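Requests can then be sent with Python's standard library alone. This sketch assumes the app listens on Flask's default port 5000; adjust the endpoint if your configuration differs:

```python
import json
import urllib.request

def build_process_request(url, endpoint="http://localhost:5000/process"):
    # Build a POST request for the /process endpoint.
    # Port 5000 is Flask's default and is an assumption here.
    payload = json.dumps({"url": url}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def categorize_url(url, endpoint="http://localhost:5000/process"):
    # Send the request and decode the JSON list of matching categories.
    with urllib.request.urlopen(build_process_request(url, endpoint)) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `categorize_url("https://example.com")` against a running server returns the list of tag matches shown in the API section.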
To categorize a website using a URL, send a POST request to `/process`:

```
POST /process
Content-Type: application/json

{
  "url": "https://example.com"
}
```

To categorize a website by directly providing an HTML file, send a POST request with the HTML content:
```
POST /process
Content-Type: application/json

{
  "file": "<html>...</html>",
  "file_type": "html"
}
```

Example response:

```json
[
  {
    "tags": ["Technology", "AI"],
    "tags_description": "AI and machine learning news",
    "similarity_score": 0.85
  }
]
```

- Edit `tags.json` to modify categories and descriptions.
- Adjust the `threshold` in `find_most_similar_tags()` to control category matching sensitivity.
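For reference, a `tags.json` entry might look like the following; the exact schema is an assumption based on the fields in the example response:

```json
[
  {
    "tags": ["Technology", "AI"],
    "tags_description": "AI and machine learning news"
  },
  {
    "tags": ["Sports"],
    "tags_description": "Sports scores, teams, and match coverage"
  }
]
```

Embeddings for each description are computed at startup (or cached), so edits take effect the next time the tag database is processed.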
Pull requests and improvements are welcome! 🚀
