Website Categorizer

This is a Flask-based web service that categorizes websites by extracting their metadata, computing text embeddings, and matching them against predefined categories using cosine similarity.

Features

✅ Extracts text and metadata from URLs or HTML files
✅ Uses machine learning embeddings to analyze website content
✅ Uses a caching system to avoid redundant processing
✅ Matches websites to categories with cosine similarity
✅ Handles web scraping challenges like Cloudflare protection (still experimental)

How It Works

Extract Content: Parses the website, retrieving metadata (title, description, keywords) and text.
Compute Embeddings: Uses the Ollama 'nomic-embed-text' model to generate text embeddings.
Find Similar Categories: Compares embeddings against a predefined tag database to find the most relevant categories.
Return Results: Responds with the top matching categories and similarity scores.
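
The matching step can be illustrated with a minimal sketch. It assumes embeddings are plain lists of floats and that categories are held in a dictionary; the names `cosine_similarity` and `rank_categories` are hypothetical and are not taken from this project's code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def rank_categories(page_vector, category_vectors, top_k=3):
    """Return the top_k (name, score) pairs, highest score first."""
    scored = [
        (name, cosine_similarity(page_vector, vector))
        for name, vector in category_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 3-dimensional vectors for illustration; real embeddings are much longer.
categories = {
    "Technology": [0.9, 0.1, 0.0],
    "Cooking": [0.0, 0.2, 0.9],
}
page = [0.8, 0.2, 0.1]
ranking = rank_categories(page, categories, top_k=2)
print(ranking)
```

Cosine similarity compares direction rather than magnitude, so a long page and a short category description can still score highly if they point the same way in embedding space.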

Installation

Requirements

Python 3.8+
Ollama installed and running locally, with the 'nomic-embed-text' model pulled
pip to install dependencies

Setting up Ollama on localhost

Install Ollama
Ollama provides an API for running machine learning models locally. Follow these steps to install it:
- Visit the Ollama installation page and download the installer for your platform.
- After downloading, follow the prompts to install Ollama.
Install the 'nomic-embed-text' Model
Once Ollama is installed, download the 'nomic-embed-text' model:
```
ollama pull nomic-embed-text
```
Verify the Installation
To check if Ollama and the model were installed successfully, run:
```
ollama list
```
This will show the list of models available locally. Ensure 'nomic-embed-text' is listed.
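
Beyond `ollama list`, the model can also be exercised from Python. The sketch below assumes Ollama is listening on its default local address (`http://localhost:11434`) and uses its `/api/embeddings` endpoint; the helper names `build_embedding_request` and `fetch_embedding` are hypothetical, and the project's own code may call Ollama differently.

```python
import json
import urllib.request

# Assumption: Ollama is listening on its default local address.
OLLAMA_URL = "http://localhost:11434/api/embeddings"


def build_embedding_request(text, model="nomic-embed-text"):
    """Build (but do not send) a POST request asking for one embedding."""
    payload = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_embedding(text):
    """Send the request and return the embedding as a list of floats."""
    request = build_embedding_request(text)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["embedding"]
```

With Ollama running, `fetch_embedding("hello world")` should return a list of floats; a connection error usually means the Ollama server is not started.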

Set up the Environment

Clone the repository and navigate into the project directory:

git clone https://github.com/your-repo/website-categorizer.git
cd website-categorizer

Create a virtual environment:
```
python3 -m venv venv
```
Activate the virtual environment:
- On Windows:
```
venv\Scripts\activate
```
- On macOS/Linux:
```
source venv/bin/activate
```
Install the required dependencies:
```
pip install -r requirements.txt
```

Run the App

After installing the dependencies, start the Flask application:
```
python3 web-numpy.py
```

The application should now be running, and you can start using the API.

API Usage

1. Process a URL

To categorize a website using a URL, send a POST request to /process:

POST /process
Content-Type: application/json
{
    "url": "https://example.com"
}
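
A small Python client for this request might look like the following. It is a sketch that assumes the service is reachable at Flask's default development address (`http://127.0.0.1:5000`); adjust `SERVICE_URL` if the app binds elsewhere. The helper names are hypothetical.

```python
import json
import urllib.request

# Assumption: the service runs on Flask's default development host and port.
SERVICE_URL = "http://127.0.0.1:5000/process"


def build_process_request(payload, service_url=SERVICE_URL):
    """Build (but do not send) a JSON POST request for /process."""
    return urllib.request.Request(
        service_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def categorize_url(url):
    """Send a URL to the service and return the decoded JSON response."""
    request = build_process_request({"url": url})
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)
```

With the app running, `categorize_url("https://example.com")` should return the list described under Response Format.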

2. Process an HTML File

To categorize a website from raw HTML instead of a URL, send a POST request to /process with the HTML content in the "file" field:

POST /process
Content-Type: application/json
{
    "file": "<html>...</html>",
    "file_type": "html"
}
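
For HTML input, the metadata the service works from (title, description, keywords) can be pulled out with the standard library alone. This is a sketch of the idea, not the project's actual parser, which is not shown here; `MetadataParser` and `extract_metadata` are hypothetical names.

```python
from html.parser import HTMLParser


class MetadataParser(HTMLParser):
    """Collect <title> text and the description/keywords <meta> tags."""

    def __init__(self):
        super().__init__()
        self.metadata = {"title": "", "description": "", "keywords": ""}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attributes = dict(attrs)
            name = (attributes.get("name") or "").lower()
            if name in ("description", "keywords"):
                self.metadata[name] = attributes.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Title text may arrive in several chunks, so accumulate it.
        if self._in_title:
            self.metadata["title"] += data


def extract_metadata(html):
    """Return a dict with the page title, description, and keywords."""
    parser = MetadataParser()
    parser.feed(html)
    parser.close()
    parser.metadata["title"] = parser.metadata["title"].strip()
    return parser.metadata


sample = (
    "<html><head><title>AI Weekly</title>"
    '<meta name="description" content="AI and machine learning news">'
    '<meta name="keywords" content="ai, ml"></head><body></body></html>'
)
metadata = extract_metadata(sample)
print(metadata)
```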

Response Format

[
    {
        "tags": ["Technology", "AI"],
        "tags_description": "AI and machine learning news",
        "similarity_score": 0.85
    }
]
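
On the client side, a response in this shape can be consumed as follows. The sketch reuses the sample values above and assumes only the three fields shown; `best_match` is a hypothetical helper.

```python
import json

# Sample body copied from the response format above.
response_body = """
[
    {
        "tags": ["Technology", "AI"],
        "tags_description": "AI and machine learning news",
        "similarity_score": 0.85
    }
]
"""


def best_match(matches):
    """Return the entry with the highest similarity_score, or None if empty."""
    return max(matches, key=lambda match: match["similarity_score"], default=None)


matches = json.loads(response_body)
top = best_match(matches)
print(top["tags"], top["similarity_score"])
```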

Customization

Edit tags.json to modify categories and descriptions.
Adjust the threshold in find_most_similar_tags() to control how strict category matching is.
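
To see what a threshold does, consider the sketch below. It does not reproduce `find_most_similar_tags()`, whose signature is not shown in this README; `filter_by_threshold` is a hypothetical stand-in, the scores are made up, and the inclusive comparison (`>=`) is an assumption.

```python
def filter_by_threshold(scored_tags, threshold):
    """Keep (tag, score) pairs scoring at least threshold, best first."""
    kept = [(tag, score) for tag, score in scored_tags if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Made-up similarity scores for illustration only.
scored = [("Science", 0.62), ("Technology", 0.85), ("Cooking", 0.17)]

loose = filter_by_threshold(scored, 0.5)
strict = filter_by_threshold(scored, 0.8)
print(loose)
print(strict)
```

A higher threshold returns fewer but more confident categories; a lower one returns more categories at the cost of weaker matches.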

Contributing

Pull requests and improvements are welcome! 🚀
