jshenaop/soleka-text-classifier

Soleka — Text Classifier

Automated topic categorization of regulatory submissions using probabilistic NLP.
Built for the Communications Regulation Commission of Colombia (CRC).



Overview

Soleka is a Python-based text classification API designed to automate the routing and categorization of mobile device homologation requests — official regulatory submissions that telecom manufacturers must file to certify devices for the Colombian market.

Rather than having staff triage hundreds of documents by hand, Soleka applies a Naive Bayes classifier trained on regulatory text to predict the procedural category of each submission. The aim is to cut processing time, make categorization consistent from one submission to the next, and provide a baseline that more advanced NLP pipelines can build on.


Problem

The CRC receives a continuous flow of regulatory documents. Manually reading and categorizing each request is time-consuming and error-prone. A single misclassification can delay device certifications and create friction for manufacturers.

Soleka addresses this by treating document triage as a text classification problem — automating what was previously a human bottleneck.


Features

  • POST /predict: submit text and receive the predicted regulatory category, i.e. whether or not the text is a homologation (homologación) request
  • POST /demographics: predict gender and age from text input
  • Naive Bayes probabilistic classification
  • Decision Tree and Deep Learning models (analytics module)
  • Token-based user authentication
  • Connection to production regulatory databases
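
The README does not describe how tokens are issued or checked, so the following is only a minimal sketch of one common approach to token-based authentication. The `TokenStore` class, its method names, and the in-memory dictionary are all hypothetical and are not taken from the repository; a real deployment would persist tokens and expire them.

```python
import hmac
import secrets


class TokenStore:
    """In-memory token registry (hypothetical; not taken from the repository)."""

    def __init__(self):
        self._tokens = {}  # username -> token

    def issue(self, username):
        """Create, remember, and return a fresh random token for `username`."""
        token = secrets.token_urlsafe(32)
        self._tokens[username] = token
        return token

    def is_valid(self, username, presented):
        """Check a presented token using a constant-time comparison."""
        expected = self._tokens.get(username)
        if expected is None:
            return False
        return hmac.compare_digest(expected.encode(), presented.encode())


store = TokenStore()
token = store.issue("analyst")
print(store.is_valid("analyst", token))          # True
print(store.is_valid("analyst", "wrong-token"))  # False
```

The constant-time comparison via `hmac.compare_digest` avoids leaking how many leading characters of a guessed token were correct.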

Tech Stack

Layer     Technology
Language  Python 3
ML / NLP  scikit-learn (Naive Bayes, Decision Tree)
API       REST (POST endpoints)
Auth      Token-based authentication
Frontend  HTML templates
License   Apache 2.0

Project Structure

soleka-text-classifier/
├── soleka.py            # Main application entry point
├── models.py            # ML model definitions and training logic
├── config.py            # Configuration and environment settings
├── analytics/           # Experimental model comparisons (NB, DT, Deep Learning)
├── resources_v1/        # Training data and regulatory document resources
├── templates/           # HTML templates for the interface
└── instructions.txt     # Setup and usage notes

Quickstart

# Clone the repository
git clone https://github.com/jshenaop/soleka-text-classifier.git
cd soleka-text-classifier

# Install dependencies
pip install -r requirements.txt

# Run the app
python soleka.py

Then send a POST request with your regulatory text:

curl -X POST http://localhost:5000/predict \
  -H "Content-Type: application/json" \
  -d '{"text": "Solicitud de homologación para dispositivo móvil marca Samsung..."}'
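
The same request can be made from Python using only the standard library. This is a sketch under assumptions: the helper names `build_predict_request` and `predict` are illustrative, the URL and JSON body mirror the curl call above, and the response is assumed to be JSON. The README does not specify the response fields or whether a token header is required, so neither is handled here.

```python
import json
import urllib.request


def build_predict_request(text, base_url="http://localhost:5000"):
    """Build (but do not send) the POST /predict request for `text`."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(text, base_url="http://localhost:5000", timeout=10):
    """Send the request and return the decoded JSON body.

    Assumption: the service replies with a JSON document. The response
    fields are not documented, so none are accessed here.
    """
    request = build_predict_request(text, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# With the app running locally (python soleka.py), call for example:
#   predict("Solicitud de homologación para dispositivo móvil")
```

Splitting request construction from sending makes it possible to inspect the payload without a running server.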

Models

Soleka was designed with a multi-model analytics layer to compare approaches:

  • Naive Bayes: fast probabilistic baseline. Well suited to the sparse bag-of-words features extracted from regulatory text.
  • Decision Tree: interpretable, rule-based classification.
  • Deep Learning: experimental neural approach for future versions.
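
To make the Naive Bayes baseline concrete, here is a from-scratch multinomial Naive Bayes with add-one smoothing, written with the standard library only. It is an illustration of the technique, not the project's code: the repository uses scikit-learn, and the `TinyNaiveBayes` class, the label names, and the four training sentences below are invented for this sketch.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens (accents preserved)."""
    return re.findall(r"\w+", text.lower())


class TinyNaiveBayes:
    """Multinomial Naive Bayes with Laplace (add-one) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def log_scores(self, text):
        """Return the unnormalized log posterior of each class for `text`."""
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in tokenize(text):
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Made-up training examples, for illustration only.
train_texts = [
    "Solicitud de homologación para dispositivo móvil",
    "Homologación de equipo terminal móvil marca genérica",
    "Queja por facturación del servicio de internet",
    "Reclamo por cobro indebido en la factura",
]
train_labels = ["homologacion", "otro", "otro", "otro"]
train_labels[1] = "homologacion"  # first two examples are homologation requests

model = TinyNaiveBayes().fit(train_texts, train_labels)
prediction = model.predict("Solicitud de homologación de un teléfono móvil")
print(prediction)  # homologacion
```

Working in log space keeps the product of many small word probabilities from underflowing, and the add-one smoothing prevents a single unseen word from zeroing out a class.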

Roadmap

Version 1 (current)

  • Topic classification for homologación requests
  • Demographics prediction endpoint
  • Real database integration
  • User and token authentication

Version 2 (planned)

  • Multi-label classification across all CRC procedural categories
  • Active learning pipeline to improve with new submissions
  • Dashboard for classification analytics

Context

This project was built as a RegTech tool for public sector efficiency — applying NLP to reduce operational overhead in a government regulatory body. It demonstrates how probabilistic text classification can be practically deployed in Spanish-language institutional contexts.


Author

Juan Sebastián Henao
jshenaop@gmail.com
github.com/jshenaop


License

Distributed under the Apache 2.0 License.

SOLEKA - Version 2

(Waiting for ideas)
