Soleka — Text Classifier

Automated topic categorization of regulatory submissions using probabilistic NLP.
Built for the Communications Regulation Commission of Colombia (CRC).

Overview

Soleka is a Python-based text classification API designed to automate the routing and categorization of mobile device homologation requests — official regulatory submissions that telecom manufacturers must file to certify devices for the Colombian market.

Instead of manually triaging hundreds of documents, Soleka applies a Naive Bayes classifier trained on regulatory text to predict the procedural category of each submission. This reduces processing time, standardizes categorization, and creates a scalable foundation for more advanced NLP pipelines.
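To make the idea concrete, here is a minimal sketch of how a multinomial Naive Bayes classifier scores a submission. It is written with the standard library only so the arithmetic is visible. It is not Soleka's actual model code: the class name `TinyNaiveBayes`, the labels `homologacion` and `otro`, and the four training sentences are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a real pipeline would do more)."""
    return text.lower().split()


class TinyNaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # word frequencies per class
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def log_scores(self, text):
        """Return log P(class) + sum of log P(word | class) for each class."""
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokenize(text):
                if token not in self.vocab:
                    continue  # skip words never seen during training
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Hypothetical toy data, only for illustration.
train_texts = [
    "solicitud de homologación para dispositivo móvil",
    "homologación de equipo terminal móvil",
    "queja por facturación del servicio de internet",
    "reclamo por cobro del servicio de televisión",
]
train_labels = ["homologacion", "homologacion", "otro", "otro"]

model = TinyNaiveBayes().fit(train_texts, train_labels)
print(model.predict("solicitud de homologación de un dispositivo"))  # homologacion
```

Working in log space avoids numeric underflow on long documents, and the add-one smoothing keeps a word that was never seen in one class from zeroing out that class's score.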

Problem

The CRC receives a continuous flow of regulatory documents. Manually reading and categorizing each request is time-consuming and error-prone. A single misclassification can delay device certifications and create friction for manufacturers.

Soleka addresses this by treating document triage as a text classification problem — automating what was previously a human bottleneck.

Features

POST /predict — Submit text and receive a predicted regulatory category (homologación or not)
POST /demographics — Gender and age prediction from text input
Naive Bayes probabilistic classification
Decision Tree and Deep Learning models (analytics module)
Token-based user authentication
Connection to production regulatory databases
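The token-based authentication listed above could look roughly like the following. This is a sketch under assumptions, not the project's implementation: the `TokenStore` class, its in-memory dictionary, and the method names are hypothetical, and a deployed service would persist tokens and add expiry.

```python
import hmac
import secrets


class TokenStore:
    """Hypothetical in-memory store mapping usernames to API tokens."""

    def __init__(self):
        self._tokens = {}

    def issue(self, username):
        """Create and remember a random URL-safe token for a user."""
        token = secrets.token_urlsafe(32)
        self._tokens[username] = token
        return token

    def is_valid(self, username, presented):
        """Check a presented token using a constant-time comparison."""
        expected = self._tokens.get(username)
        if expected is None:
            return False
        return hmac.compare_digest(expected.encode(), presented.encode())


store = TokenStore()
token = store.issue("analyst")
print(store.is_valid("analyst", token))    # True
print(store.is_valid("analyst", "guess"))  # False
```

`hmac.compare_digest` is used instead of `==` so that the comparison time does not reveal how many leading characters of a guessed token were correct.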

Tech Stack

Layer	Technology
Language	Python 3
ML / NLP	scikit-learn (Naive Bayes, Decision Tree)
API	REST (POST endpoints)
Auth	Token-based authentication
Frontend	HTML templates
License	Apache 2.0

Project Structure

soleka-text-classifier/
├── soleka.py            # Main application entry point
├── models.py            # ML model definitions and training logic
├── config.py            # Configuration and environment settings
├── analytics/           # Experimental model comparisons (NB, DT, Deep Learning)
├── resources_v1/        # Training data and regulatory document resources
├── templates/           # HTML templates for the interface
└── instructions.txt     # Setup and usage notes

Quickstart

# Clone the repository
git clone https://github.com/jshenaop/soleka-text-classifier.git
cd soleka-text-classifier

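# Install dependencies (requirements.txt is not listed in the project structure above; if it is missing, check instructions.txt)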
pip install -r requirements.txt

# Run the app
python soleka.py

Then send a POST request with your regulatory text:

curl -X POST http://localhost:5000/predict \
  -H "Content-Type: application/json" \
  -d '{"text": "Solicitud de homologación para dispositivo móvil marca Samsung..."}'
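The same request can be sent from Python with the standard library. This sketch mirrors the curl call above. It assumes the service listens on `localhost:5000` and that the response body is JSON, which the README does not specify. The helper names `build_predict_request` and `predict` are hypothetical.

```python
import json
import urllib.request


def build_predict_request(text, base_url="http://localhost:5000"):
    """Build (but do not send) a POST request for the /predict endpoint."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(text, base_url="http://localhost:5000", timeout=10):
    """Send the request and decode the reply, assuming a JSON response body."""
    request = build_predict_request(text, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the app running locally, calling `predict("Solicitud de homologación para dispositivo móvil marca Samsung...")` would return whatever the server sends back. If the deployment enforces token authentication, the token would also need to be attached in whatever form the server expects.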

Models

Soleka was designed with a multi-model analytics layer to compare approaches:

Naive Bayes — Fast, probabilistic baseline. Well-suited for sparse regulatory text.
Decision Tree — Interpretable rule-based classification.
Deep Learning — Experimental neural approach for future versions.
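A side-by-side comparison of the first two approaches can be sketched with scikit-learn as follows. The training sentences and the labels `homologacion` and `otro` are hypothetical placeholders, not data from `resources_v1/`, and the actual code in `analytics/` may be organized differently.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data, only for illustration.
texts = [
    "solicitud de homologación para dispositivo móvil",
    "homologación de equipo terminal móvil",
    "queja por facturación del servicio de internet",
    "reclamo por cobro del servicio de televisión",
]
labels = ["homologacion", "homologacion", "otro", "otro"]

# Each candidate shares the same bag-of-words features and differs only
# in the estimator at the end of the pipeline.
candidates = {
    "naive_bayes": make_pipeline(CountVectorizer(), MultinomialNB()),
    "decision_tree": make_pipeline(
        CountVectorizer(), DecisionTreeClassifier(random_state=0)
    ),
}

sample = ["solicitud de homologación de un teléfono móvil"]
predictions = {}
for name, model in candidates.items():
    model.fit(texts, labels)
    predictions[name] = model.predict(sample)[0]

print(predictions)
```

Keeping the vectorizer identical across candidates means any difference in predictions comes from the estimator rather than from preprocessing. A real comparison would score each model on held-out submissions instead of a single sample.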

Roadmap

Version 1 (current)

Topic classification for homologación requests
Demographics prediction endpoint
Real database integration
User and token authentication

Version 2 (planned)

Multi-label classification across all CRC procedural categories
Active learning pipeline to improve with new submissions
Dashboard for classification analytics

Context

This project was built as a RegTech tool for public sector efficiency — applying NLP to reduce operational overhead in a government regulatory body. It demonstrates how probabilistic text classification can be practically deployed in Spanish-language institutional contexts.

Author

Juan Sebastián Henao
jshenaop@gmail.com
github.com/jshenaop

License

Distributed under the Apache 2.0 License.

SOLEKA - Version 2

(Waiting for ideas)
