Automated topic categorization of regulatory submissions using probabilistic NLP.
Built for the Communications Regulation Commission of Colombia (CRC).
Soleka is a Python-based text classification API designed to automate the routing and categorization of mobile device homologation requests — official regulatory submissions that telecom manufacturers must file to certify devices for the Colombian market.
Instead of manually triaging hundreds of documents, Soleka applies a Naive Bayes classifier trained on regulatory text to predict the procedural category of each submission. This reduces processing time, standardizes categorization, and creates a scalable foundation for more advanced NLP pipelines.
The CRC receives a continuous flow of regulatory documents. Manually reading and categorizing each request is time-consuming and error-prone. A single misclassification can delay device certifications and create friction for manufacturers.
Soleka addresses this by treating document triage as a text classification problem — automating what was previously a human bottleneck.
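To make the idea concrete, here is a minimal from-scratch sketch of multinomial Naive Bayes with add-one smoothing, using only the Python standard library. It is an illustration of the technique, not Soleka's actual model code (the project relies on scikit-learn). The class name `NaiveBayesSketch`, the label strings `homologacion` and `otro`, and the four training sentences are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens (accents are kept)."""
    return re.findall(r"\w+", text.lower())


class NaiveBayesSketch:
    """Illustrative multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.doc_counts = Counter(labels)        # number of documents per class
        self.word_counts = defaultdict(Counter)  # token counts per class
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        tokens = [t for t in tokenize(text) if t in self.vocab]  # skip unseen words
        scores = {}
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # log prior + sum of smoothed log likelihoods
            score = math.log(n_docs / total_docs)
            for token in tokens:
                score += math.log((counts[token] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)


# Hypothetical toy training set; real training data lives in resources_v1/.
train_texts = [
    "Solicitud de homologación para dispositivo móvil",
    "Homologación de terminal móvil marca Samsung",
    "Queja por facturación del servicio de internet",
    "Consulta sobre portabilidad numérica",
]
train_labels = ["homologacion", "homologacion", "otro", "otro"]

model = NaiveBayesSketch().fit(train_texts, train_labels)
print(model.predict("Solicitud de homologación de un teléfono móvil"))
```

Working in log space avoids numeric underflow on long documents, and the smoothing term keeps a single unseen word from zeroing out a class.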
- POST /predict: Submit text and receive a predicted regulatory category (homologación or not)
- POST /demographics: Gender and age prediction from text input
- Naive Bayes probabilistic classification
- Decision Tree and Deep Learning models (analytics module)
- Token-based user authentication
- Connection to production regulatory databases
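As a rough sketch, the `/predict` endpoint could be called from Python with the standard library alone. The URL and the `{"text": ...}` body mirror the curl example in this README; the helper names `build_predict_request` and `predict` are hypothetical. Because the response schema and the token header format are not documented here, the sketch returns the raw response body and sends no authentication header.

```python
import json
import urllib.request

# Host and port follow the curl example in this README; adjust for your deployment.
PREDICT_URL = "http://localhost:5000/predict"


def build_predict_request(text, url=PREDICT_URL):
    """Build (but do not send) a JSON POST request for the /predict endpoint."""
    body = json.dumps({"text": text}, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(text, url=PREDICT_URL, timeout=10):
    """Send the request and return the raw response body as a string.

    The response format is an unknown here, so it is returned unparsed.
    """
    request = build_predict_request(text, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Usage (requires a running Soleka instance):
# print(predict("Solicitud de homologación para dispositivo móvil marca Samsung..."))
```

Splitting request construction from sending keeps the payload easy to inspect before anything goes over the network.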
| Layer | Technology |
|---|---|
| Language | Python 3 |
| ML / NLP | scikit-learn (Naive Bayes, Decision Tree) |
| API | REST (POST endpoints) |
| Auth | Token-based authentication |
| Frontend | HTML templates |
| License | Apache 2.0 |
soleka-text-classifier/
├── soleka.py # Main application entry point
├── models.py # ML model definitions and training logic
├── config.py # Configuration and environment settings
├── analytics/ # Experimental model comparisons (NB, DT, Deep Learning)
├── resources_v1/ # Training data and regulatory document resources
├── templates/ # HTML templates for the interface
└── instructions.txt # Setup and usage notes
# Clone the repository
git clone https://github.com/jshenaop/soleka-text-classifier.git
cd soleka-text-classifier
# Install dependencies
pip install -r requirements.txt
# Run the app
python soleka.py
Then send a POST request with your regulatory text:
curl -X POST http://localhost:5000/predict \
-H "Content-Type: application/json" \
-d '{"text": "Solicitud de homologación para dispositivo móvil marca Samsung..."}'
Soleka was designed with a multi-model analytics layer to compare approaches:
- Naive Bayes — Fast, probabilistic baseline. Well-suited for sparse regulatory text.
- Decision Tree — Interpretable rule-based classification.
- Deep Learning — Experimental neural approach for future versions.
Version 1 (current)
- Topic classification for homologación requests
- Demographics prediction endpoint
- Real database integration
- User and token authentication
Version 2 (planned)
- Multi-label classification across all CRC procedural categories
- Active learning pipeline to improve with new submissions
- Dashboard for classification analytics
This project was built as a RegTech tool for public sector efficiency — applying NLP to reduce operational overhead in a government regulatory body. It demonstrates how probabilistic text classification can be practically deployed in Spanish-language institutional contexts.
Juan Sebastián Henao
jshenaop@gmail.com
github.com/jshenaop
Distributed under the Apache 2.0 License.
SOLEKA - Version 2
(Waiting for ideas)